Aggregate command problem with sum function

Re: Aggregate command problem with sum function

Richard Ristow
At 12:59 PM 12/3/2012, Rick Oliver wrote:

>Aggregate does not require sorted data. PRESORTED should only be
>used when the data are already sorted and even then is only useful
>for very large data files.

True. More precisely, it's the size of the *output* file that
matters: roughly, the number of 'break groups' (the number of
different values of the set of variables named on /BREAK) times the
number of output variables.

AGGREGATE without PRESORTED tends to run fast, and about linearly in
the size of the input file, until the size of the output file gets
"too big"; as the output file grows beyond that point, AGGREGATE
fairly quickly slows to a crawl. When last I checked, on a machine
with 3/4 GB of main memory, AGGREGATE ran fine with up to 100,000
break groups and was very slow with 1,000,000; the slowdown set in
somewhere in between.

A particular case where you'll have many break groups, and may want
to sort and specify PRESORTED, is using AGGREGATE to check for
duplicated keys in a large file. If duplicates are rare, the output
file will be only slightly smaller than the input -- i.e., it will be big.
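
For what it's worth, a minimal sketch of that duplicate-key check might look like the following. The variable names are assumptions for illustration only: "id" stands in for whatever your key variable is, and "n_cases" is just a name I made up for the count. Note that OUTFILE=* replaces the active dataset with the aggregated file, so save your data first if you still need it.

```spss
* Sketch only: "id" is a hypothetical key variable and "n_cases" a hypothetical count name.
SORT CASES BY id.
AGGREGATE
  /OUTFILE=*
  /PRESORTED
  /BREAK=id
  /n_cases=N.
* Keep only the key values that occur more than once.
SELECT IF (n_cases > 1).
LIST.
```

If the key is made up of several variables, name them all on both SORT CASES and /BREAK, in the same order, since PRESORTED relies on the file already being in break-variable order.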
