Massive SPSS files - is there a way to reduce?

Massive SPSS files - is there a way to reduce?

allitnils
Hi all,

I have a problem: some of my SPSS files are pretty massive, ranging from 4 GB to 10 GB in size.

Admittedly they do hold a lot of data, but given the number of files I have, I'm slowly running out of disk space! As they're all working files, I can't exactly ZIP them.

I was wondering whether I'm doing something inefficient within SPSS that is causing my files to bloat? I'm saving everything as .sav. I've not tried the COMPRESSED/UNCOMPRESSED save options, as I've heard they don't do much, but I could give them a go.

Does anyone have any tricks or tips for decreasing SPSS file sizes?

Re: Massive SPSS files - is there a way to reduce?

Jon Peck
Try using zsav format. In most cases it compresses much better than sav. Sav file compression is optimized for small integer values but isn't very effective with large or fractional values; I have seen cases where the zsav file is only 25% of the sav file's size.

Processing time may be either larger or smaller than for sav, depending on a number of factors.
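
For example, a minimal sketch (the path is illustrative; zsav requires Statistics 21 or later):

SAVE OUTFILE='C:\data\survey.zsav'
  /ZCOMPRESSED.

* Or make zsav compression the default for SAVE in this session.
SET ZCOMPRESSION=ON.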


Re: Massive SPSS files - is there a way to reduce?

Rich Ulrich
In reply to this post by allitnils

Jon gave the answer about compression. But you might be able to back up a step for a different solution.

If you really have that huge an amount of new, raw data, there is not much else to do. But another reason for a pile of files may be that you create many versions of the same data, either to add a few variables or scores or to edit some.

If that is the case, you might consider organizing your work so that what you /process/ in a given run is put together by MATCH FILES, combining the same big, permanent file with various smaller, modified datasets. I found that approach useful, also, for keeping track of which versions of what I was using. By re-using the same opening syntax, I could document the specific computations or selections in effect for a particular run right there, instead of needing to document a whole new file. (That is, not every variation of an analysis would need its own disk file: INSERT could specify a certain set of conditions and computations.)
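
A minimal sketch of that pattern, with illustrative file names and an assumed key variable ID (for MATCH FILES with /BY, both files must already be sorted by the key):

GET FILE='C:\data\master.sav'.
MATCH FILES
  /FILE=*
  /FILE='C:\data\new_scores.sav'
  /BY id.

* Run-specific selections and computations can be pulled in from a
  small syntax file with INSERT, rather than saved as a new data file.
INSERT FILE='C:\syntax\run_conditions.sps'.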


--
Rich Ulrich



From: SPSSX(r) Discussion <[hidden email]> on behalf of allitnils <[hidden email]>
Sent: Monday, August 7, 2017 12:37:44 AM
To: [hidden email]
Subject: Massive SPSS files - is there a way to reduce?
 
hi all,

i have a problem whereby some of my SPSS files are pretty massive.
ie, they range from 4GB to 10GB in size.

they do have, admittedly, a lot of data within, however given the amount of
files i have i'm sowly running out of disk space!
as they're all working files i can't exactly ZIP them..

i was wondering if i was perhaps doing something inefficient within SPSS
that is causing my files to bloat? i'm saving all as .sav. i've not tried
the /compress /uncompress flags as i've heard this doesn't do much, however
could potentially give it a go?
does anyone have any tricks or tips to follow to decrease SPSS file sizes?




--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Massive-SPSS-files-is-there-a-way-to-reduce-tp5734620.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Re: Massive SPSS files - is there a way to reduce?

Jon Peck
Merging happens on the next required data pass, so, assuming a simple add-variables merge with no sorting required, one would expect the elapsed times to be similar.

To check, I created two sav files of a million cases each, with 100 uniform random variables apiece, and ran DESCRIPTIVES on a MATCH FILES merge using an ID variable but no sorting. I then ran DESCRIPTIVES on the already-merged file; 200 variables in both cases. I have an SSD, so disk accesses are fast, and lots of RAM. All this on Windows 10.

For the pre-merged file, elapsed time was 0.1 minutes. With the in-stream merge, elapsed time was 0.12 minutes, about a 20% time penalty.

So, if Bruce's scenario is applicable, the overhead for the merge will be small, but it would require careful management of the data.

But the combined sav file was 1.8 GB, while the zsav file was only 18 MB, 1% of the size. This extreme ratio occurs because, with uniform random numbers, sav file compression is completely ineffective.
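
For reference, here is a sketch of how such a test file can be generated. This is a reconstruction of the setup, not the exact syntax used, and the paths are illustrative:

* Build one test file: a million cases, 100 uniform random variables.
INPUT PROGRAM.
LOOP id = 1 TO 1000000.
  END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
VECTOR x(100).
LOOP #i = 1 TO 100.
  COMPUTE x(#i) = RV.UNIFORM(0,1).
END LOOP.
EXECUTE.
SAVE OUTFILE='C:\temp\rand1.sav'.
* Save the same data as zsav for the size comparison.
SAVE OUTFILE='C:\temp\rand1.zsav' /ZCOMPRESSED.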


Re: Massive SPSS files - is there a way to reduce?

Jon Peck
Another point about long strings: automatic Unicode conversion and some other defaults can produce strings that are much longer than necessary. ALTER TYPE can automatically reduce them to the minimum required width. That alone will not reduce the sav file size, because sav compression is already effective for blank-padded strings; but if there are many repetitive string values, zsav will compress them dramatically, shrinking the file.
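
For example, the standard one-liner that trims every string variable to the minimum width its data actually need:

ALTER TYPE ALL (A = AMIN).
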
On Mon, Aug 7, 2017 at 3:35 PM Mario Giesel <[hidden email]> wrote:
Just an idea: Maybe you can convert string variables to numeric (AUTORECODE) and then delete the strings? That could change a lot.

GL,
  Mario
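
A minimal sketch of that idea, with an illustrative variable name (AUTORECODE builds value labels from the original string values automatically):

AUTORECODE VARIABLES=city
  /INTO city_num
  /PRINT.
DELETE VARIABLES city.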



Re: Massive SPSS files - is there a way to reduce?

allitnils
In reply to this post by Jon Peck
Quite amazing! I just saved a 3.2 GB file as zsav and it came out at 211 MB.
This will have an incredible impact on my network!

Re: Massive SPSS files - is there a way to reduce?

Rich Ulrich
In reply to this post by Jon Peck

Thanks for the timing info. I knew there was a little overhead, but I never worried about it, because so much was saved by reading /only/ the small set of selected and scored data needed at the time.

Data maintenance:

  Important demographics, etc. ==> checked, massaged, selected into one file.

  Other raw data ==> scored composites and selected items from 15+ scales. Result: three or four files, each with items to be used together.

Data analyses:

  Match two (or more) of the files, for any given analysis.

For clinical research, I found that I spent more time in maintenance than in analysis. Smaller was better back in the days when "big" meant "slow": maintaining sub-files was economical of time, and matching a small part of the total data was economical of time.
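
A sketch of that maintenance split, with illustrative file and variable names:

GET FILE='C:\data\master.sav'.
SAVE OUTFILE='C:\data\demographics.sav' /KEEP=id age sex educ.
SAVE OUTFILE='C:\data\scale_a.sav' /KEEP=id a1 TO a20.

* At analysis time, match only the pieces needed.
MATCH FILES
  /FILE='C:\data\demographics.sav'
  /FILE='C:\data\scale_a.sav'
  /BY id.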


--
Rich Ulrich





Re: Massive SPSS files - is there a way to reduce?

John F Hall

How about a 4 TB external drive?


John F Hall
[Retired academic survey researcher]
IBM-SPSS Academic author 9900074

Email:      johnfhall@...
Website:    http://surveyresearch.weebly.com/
Course:     http://surveyresearch.weebly.com/1-survey-analysis-workshop-spss.html
Research:   http://surveyresearch.weebly.com/3-subjective-social-indicators-quality-of-life.html
