SPSSX Discussion

aggregate

Classic

List

Threaded

10 messages Options

drfg2008

aggregate

How can I shorten the an aggregate command if I have hundreds of variables to aggregate (into a new file)?

Example:

AGGREGATE
/OUTFILE='AGG_country_a'
/BREAK=year month
/country1=SUM(country1)
/country2_sum=SUM(country2)

(...)

/country101=SUM(country101)
/country102_sum=SUM(country102).

Thanks

Dr. Frank Gaeth

David Marso

Re: aggregate

Administrator

AGG blah blah....
/Sum_C000001 TO TO Sum_C999999 = SUM (C0000001 TO C9999999).
This is in the FM!
If you are stuck on having the _Sum as a suffix then go play with your pet python.

drfg2008 wrote

How can I shorten the an aggregate command if I have hundreds of variables to aggregate (into a new file)?

Example:

AGGREGATE
/OUTFILE='AGG_country_a'
/BREAK=year month
/country1=SUM(country1)
/country2_sum=SUM(country2)

(...)

/country101=SUM(country101)
/country102_sum=SUM(country102).

Thanks

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

drfg2008

Re: aggregate

needs digits at the end.

This example would work:

input program.
loop a =1 to 10**3 by 1.
end case.
end loop.
end file.
end input program.
EXECUTE.
VECTOR v(100).
COMPUTE group = RV.BERNOULLI(0.5).
DO REPEAT #i = v1 to v100.
COMPUTE #i = RV.NORMAL(100,10).
END REPEAT.
EXECUTE.
DATASET DECLARE AGG.
AGGREGATE
/OUTFILE='AGG'
/BREAK=group
/v1 to v100 = sum(v1 to v100).

but what if the variables are pure strings without a common body - as in the example ... ?
and without python

Dr. Frank Gaeth

Andy W

Re: aggregate

The "TO" modifier on the right hand side for the aggregate function selects variables within the range of the variables, without regard to the actual strings. It is the renaming on the left hand side that needs to specify the numeric suffixes. BUT.... the suffixes need to have the same length as the TO range.

input program.
loop a =1 to 10**3 by 1.
end case.
end loop.
end file.
end input program.
dataset name base.
EXECUTE.
VECTOR v(100).
COMPUTE group = RV.BERNOULLI(0.5).
DO REPEAT #i = v1 to v100.
COMPUTE #i = RV.NORMAL(100,10).
END REPEAT.
EXECUTE.
*Renaming some variables in the v1 - 100 range.
RENAME VARIABLES (V3 = F)
(V15 = B)
(V67 = C)
(V98 = D).

*This works.
DATASET DECLARE AGG.
AGGREGATE
/OUTFILE='AGG'
/BREAK=group
/v1 to v100 = sum(v1 to v100).
*This however does not work.
DATASET ACTIVATE base.
COMPUTE Oops = 1.
MATCH FILES FILE = *
/KEEP group V1 to V99 Oops V100.
DATASET DECLARE AGG2.
AGGREGATE
/OUTFILE='AGG2'
/BREAK=group
/v1 to v100 = sum(v1 to v100).

Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

David Marso

Re: aggregate

Administrator

In reply to this post by drfg2008

"but what if the variables are pure strings without a common body - as in the example ... ?
and without python"

Well then you're SOL (SortaOutaLuck).
Maybe consider VARSTOCASES then an AGGREGATE using the
single new variable index and your previous break(s) as breaks on the new aggregate.
You can then rename as you wish and then do a CASESTOVARS.
All you need to do is feed the beast the first and last variable.
If the variables are not contiguous then better work that out or nukem later.
Frank, You are still thinking inside the box. Time to burn the box!!!
**---.
MATRIX.
SAVE UNIFORM(1000,9) / OUTFILE * /VARIABLES a,b,c,d,e,f,g,h,i .
END MATRIX.
COMPUTE BREAK_Var=TRUNC(RV.UNIFORM(1,10)).
/** Above is merely data simulation IGNORE!!!**/.

VARSTOCASES MAKE NewVar FROM a TO i / INDEX oldvar (newVar).
AGGREGATE OUTFILE * / BREAK Break_Var oldvar / S=SUM(NewVar).
ALTER TYPE OldVar (A20).
COMPUTE OldVar=CONCAT("SUM_",OldVar).
CASESTOVARS ID=Break_Var / INDEX=oldvar.
LIST.

BREAK_Var SUM_a SUM_b SUM_c SUM_d SUM_e SUM_f SUM_g SUM_h SUM_i

1.00 56.21 60.64 57.20 55.61 54.88 57.25 58.00 52.76 58.71
2.00 56.65 56.13 54.88 51.77 59.32 58.17 62.05 52.60 58.12
3.00 55.12 58.40 53.97 53.42 53.77 51.17 52.92 55.92 45.32
4.00 58.67 63.63 59.35 58.98 58.84 64.29 65.15 63.27 57.23
5.00 57.21 55.05 54.08 48.11 54.86 54.29 51.28 51.96 53.26
6.00 51.59 47.72 53.05 50.84 50.91 50.17 51.84 50.83 49.73
7.00 45.74 45.37 46.58 45.95 45.26 43.67 46.30 46.73 48.06
8.00 52.21 57.48 61.41 62.51 62.69 57.81 58.75 60.92 58.71
9.00 56.02 52.99 60.13 59.25 58.36 60.50 60.49 60.00 56.48

Number of cases read: 9 Number of cases listed: 9

drfg2008 wrote

needs digits at the end.

This example would work:

input program.
loop a =1 to 10**3 by 1.
end case.
end loop.
end file.
end input program.
EXECUTE.
VECTOR v(100).
COMPUTE group = RV.BERNOULLI(0.5).
DO REPEAT #i = v1 to v100.
COMPUTE #i = RV.NORMAL(100,10).
END REPEAT.
EXECUTE.
DATASET DECLARE AGG.
AGGREGATE
/OUTFILE='AGG'
/BREAK=group
/v1 to v100 = sum(v1 to v100).

but what if the variables are pure strings without a common body - as in the example ... ?
and without python

Andy W

Re: aggregate

Another solution could be to calculate the AGGREGATE statistics within the original dataset (CSUM or using lags) and then select the last case. Quick and dirty example below - this is amenable to just specifying a range of variables, but will be more difficult if you want to mix in different types of aggregate functions.

***********************************.
data list free / group test.
begin data
1 1
1 2
1 3
1 4
2 1
2 1
2 1
2 1
3 2
3 2
3 2
end data.
compute test2 = test.
numeric A B C.
do repeat let = A B C.
compute let = RV.NORMAL(0,1).
end repeat.

do repeat let = test2 A B C.
compute #v = let.
do if $casenum = 1 or lag(group) <> group.
compute let = #v.
else.
compute let = let + lag(let).
end if.
end repeat.
match files file = *
/last = flag
/by group.
select if flag = 1.
exe.
***********************************.

Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

David Marso

Re: aggregate

Administrator

I don't se how this helps. Frank's issue is he doesn't want to specify all those possibly hundreds of variable names. VARSTOCASES runs almost instantaneously.
AGGREGATE runs FAST. Variable names are tracked through the process via the INDEX subcommands.
Using LAG as you suggest within a DO REPEAT doesn't address this. It is not very scalable.
To convince yourself simulate 100000 cases with 1000 variables.
Run my code.
Run your code.
Compare stopwatches ;-)

Andy W wrote

Another solution could be to calculate the AGGREGATE statistics within the original dataset (CSUM or using lags) and then select the last case. Quick and dirty example below - this is amenable to just specifying a range of variables, but will be more difficult if you want to mix in different types of aggregate functions.

***********************************.
data list free / group test.
begin data
1 1
1 2
1 3
1 4
2 1
2 1
2 1
2 1
3 2
3 2
3 2
end data.
compute test2 = test.
numeric A B C.
do repeat let = A B C.
compute let = RV.NORMAL(0,1).
end repeat.

do repeat let = test2 A B C.
compute #v = let.
do if $casenum = 1 or lag(group) <> group.
compute let = #v.
else.
compute let = let + lag(let).
end if.
end repeat.
match files file = *
/last = flag
/by group.
select if flag = 1.
exe.
***********************************.

Andy W

Re: aggregate

I don't disagree, but I almost always want to keep other variables when using VARSTOCASES. If you need to duplicate a bunch of crap with VARSTOCASES and then unduplicate it with CASESTOVARS it won't be instantaneous with large datasets. (Background, most datasets I work with take MUCH LONGER to query than they do to perform transformations in SPSS, so I'm better off grabbing redundant per case info initially than having to make multiple queries.)

My suggestion can use the TO modifier on the DO REPEAT, so all you need to know are the first variable and last variable (ditto for your VARSTOCASES solution). Is doing a DO REPEAT on a 1000 variables a good solution? Probably not. (Should you be working with 1000s of variables to begin with? Probably not in most circumstances.)

Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

David Marso

Re: aggregate

Administrator

I guess as usual it is a matter of personal preference and to some degree a benchmarking issue.
I will let Frank (OP) work out the details here. I don't have time and don't care enough to bother.
My suspicion is that AGGREGATE can achieve the end with greater efficiency than COMPUTE.
CREATE using CSUM still doesn't resolve the naming issues.
I noticed my code ends up with an alphabetical list of variables rather than the original order.
Easy enough to resolve, but this has already consumed much more than the 15 minutes max that I typically cap for something that is a non issue for my own work, so I will let that sleeping dog lie....
Besides, as usual Frank does NOT really provide much context for the question ;-(
Are we dealing with millions of cases? 200?
------------

Andy W wrote

I don't disagree, but I almost always want to keep other variables when using VARSTOCASES. If you need to duplicate a bunch of crap with VARSTOCASES and then unduplicate it with CASESTOVARS it won't be instantaneous with large datasets. (Background, most datasets I work with take MUCH LONGER to query than they do to perform transformations in SPSS, so I'm better off grabbing redundant per case info initially than having to make multiple queries.)

My suggestion can use the TO modifier on the DO REPEAT, so all you need to know are the first variable and last variable (ditto for your VARSTOCASES solution). Is doing a DO REPEAT on a 1000 variables a good solution? Probably not. (Should you be working with 1000s of variables to begin with? Probably not in most circumstances.)

drfg2008

Re: aggregate

In reply to this post by Andy W

that's an interesting idea

Dr. Frank Gaeth