File Size using AGGREGATE OUTFILE = 'c:\tmp.sav'


Marks, Jim
The SAVE command includes the subcommand /COMPRESSED.

 

Does AGGREGATE support a similar option when writing to a new file?

 

I am seeing the file size grow by roughly a factor of 2 (30 MB to 65 MB).

 

For processing speed, I would prefer not to conduct a data pass to
aggregate and a second data pass to save. For disk space I would prefer
to have the smaller file size.

 

--jim

 

 

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: File Size using AGGREGATE OUTFILE = 'c:\tmp.sav'

Richard Ristow
At 11:17 AM 10/8/2008, Marks, Jim wrote:

>The SAVE command includes the subcommand /COMPRESSED.
>Does AGGREGATE support a similar option when writing to a new file?

It doesn't look like there is one, either on AGGREGATE or on the other
commands (besides SAVE and XSAVE) that take an OUTFILE specification.

It's an interesting omission. I thought there might be a COMPRESSED
option for FILE HANDLE, which could solve this problem; but there
doesn't seem to be.

I'd thought that AGGREGATE would write a COMPRESSED file if that's
the system default (as it usually is), but no such luck: I just tried
an aggregation on my system (using v.14), and the output is not compressed.

(To check, use command SYSFILE INFO on the saved file; from the menus, use
File> Display Data File Information> External File...)
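
For instance, a minimal check on the file named in the subject line might
look like the following. The output should indicate whether the file is
compressed; the exact wording of that line may differ between releases.

```spss
* Display dictionary and file information for the saved aggregate file.
SYSFILE INFO FILE='c:\tmp.sav'.
```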


On the other hand, you write,

>I am seeing the file size grow by roughly a factor of 2 (30 MB to 65 MB).

That does surprise me, since aggregated files usually have both fewer
variables and fewer cases than the original file. Can you say what
your file structure and aggregation logic are?


Re: File Size using AGGREGATE OUTFILE = 'c:\tmp.sav'

Marks, Jim
Richard:

The file has about 100,000 records, of which 10,000 are grouped into
pairs that need to be combined-- 90% of the file is unchanged.

The aggregate uses multiple break variables, but only two define
groups-- id and date. The remaining break variables are constant across
id.

The aggregating function is
  /vara varb ... =SUM(vara varb ...)

-- the aggregated file has the same var list, but about 5,000 fewer
cases.

When I saw the file size doubled, I was afraid of an error, but the data
appears to be correct.

--jim


-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Richard Ristow
Sent: Wednesday, October 08, 2008 12:15 PM
To: [hidden email]
Subject: Re: File Size using AGGREGATE OUTFILE = 'c:\tmp.sav'

At 11:17 AM 10/8/2008, Marks, Jim wrote:

>The SAVE command includes the subcommand /COMPRESSED.
>Does AGGREGATE support a similar option when writing to a new file?

It doesn't look like there is one, nor on other commands (besides
SAVE and XSAVE) that take an OUTFILE specification.

It's an interesting omission. I thought there might be a COMPRESSED
option for FILE HANDLE, which could solve this problem; but there
doesn't seem to be.

I'd thought that AGGREGATE would write a COMPRESSED file if that's
the system default (as it usually is), but no such luck: I just tried
an aggregation on my system (using v.14), and the output is not
compressed.

(To check, use command SYSFILE INFO on the saved file; from the menus,
use
File> Display Data File Information> External File...)


On the other hand, you write,

>I am seeing the file size grow by roughly a factor of 2 (30 MB to 65 MB).

That does surprise me, since aggregated files usually have both fewer
variables and fewer cases than the original file. Can you say what
your file structure and aggregation logic are?


Re: File Size using AGGREGATE OUTFILE = 'c:\tmp.sav'

Richard Ristow
At 01:41 PM 10/8/2008, Marks, Jim wrote:

>The aggregate uses multiple break variables, but only two define
>groups-- id and date. The remaining break variables are constant across id.

Curious: Then, why do you have them on the 'break' list?

>The file has about 100,000 records, of which 10,000 are grouped into
>pairs that need to be combined-- 90% of the file is unchanged.

And, 90 variables or thereabouts?

>The aggregating function is
>   /vara varb ... =SUM(vara varb ...)
>
>-- the aggregated file has the same var list, but about 5,000 fewer cases.
>
>When I saw the file size doubled, I was afraid of an error, but the
>data appears to be correct.

From information in the concurrent thread 'What is "COMPRESSED" when
saving a file?', what you're seeing could well arise if many of your
variables are numeric with small integer values.
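
As a rough sketch of why that matters, assume the usual description of
the format: an uncompressed file stores every numeric value in 8 bytes,
while a compressed file stores an integer from -99 to 151 in 1 byte and
any other number in 9 bytes (a 1-byte code plus the 8-byte value). If a
fraction p of the values are such small integers, then

```latex
\frac{\text{uncompressed size}}{\text{compressed size}}
  \approx \frac{8}{p \cdot 1 + (1 - p) \cdot 9}
  = \frac{8}{9 - 8p}
```

With p = 0.7, a purely illustrative figure, the ratio is 8/3.4, or about
2.4, which is in the neighborhood of the growth reported (30 MB to 65 MB).
The estimate ignores string variables, the file header, and the small
change in the number of cases.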

But, earlier, you wrote,

>For processing speed, I would prefer not to conduct a data pass to
>aggregate and a second data pass to save.

See whether

AGGREGATE OUTFILE='c:\tmp.sav'
   /BREAK=...
   /...

is actually much faster than

AGGREGATE OUTFILE=*
   /BREAK=...
   /...

SAVE OUTFILE='c:\tmp.sav'/COMPRESSED.

I don't think the second form forces an additional data pass, at
least in modern releases of SPSS (since, I think, about 12.5).
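
A filled-in sketch of the second form, for anyone who wants to time it.
The names id, date, vara and varb are taken from the description earlier
in the thread and only stand in for the real break list and variable
list, which were not posted:

```spss
* Hypothetical sketch with id, date, vara and varb standing in for the real lists.
AGGREGATE OUTFILE=*
  /BREAK=id date
  /vara varb = SUM(vara varb).

* Write the aggregated active file in compressed form.
SAVE OUTFILE='c:\tmp.sav'
  /COMPRESSED.
```

Running SYSFILE INFO on the result afterwards is one way to confirm that
the saved file is compressed.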
