Counter-intuitive statement about uncompressed vs compressed data files

Counter-intuitive statement about uncompressed vs compressed data files

Bruce Weaver
Administrator
In another thread, Jon informed us that the old SPSS Community at www.ibm.com/developerworks/spssdevcentral has been migrated to the new IBM SPSS Predictive Analytics Community at https://developer.ibm.com/predictiveanalytics/.  I took a quick look, and was a bit surprised by what I read in this post:

https://developer.ibm.com/predictiveanalytics/2015/09/30/spss-statistics-data-files-can-be-smaller-and-faster-if-saved-as-uncompressed/

The counter-intuitive bit (for me) was that an uncompressed file could actually be smaller.  

Figuring that many members of this group might not frequent the Predictive Analytics Community forum, I thought I should share the link.  

HTH.
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: Counter-intuitive statement about uncompressed vs compressed data files

Jon K Peck
The case where the uncompressed file could be slightly smaller is when the compression algorithm used in sav files is completely ineffective.  If none of the data consists of small integer values (and no string variables have trailing blanks or are empty), then no fields can be compressed, but the compression flag is still required if compression is on.  The futile compression overhead is small, however.  
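To make the overhead concrete, here is a minimal sketch of a simplified size model. It assumes one flag byte per 8-byte field and that only small integers compress down to that flag byte. The integer range used (-99 to 151) is an assumption taken from publicly documented descriptions of the sav format, not from this thread, and the function names are hypothetical. String handling, headers, and padding are ignored.

```python
def is_small_integer(value, low=-99, high=151):
    """Return True if value is a whole number in the assumed compressible range."""
    return float(value).is_integer() and low <= value <= high


def estimate_sizes(values):
    """Return (uncompressed_bytes, compressed_bytes) for a list of numbers.

    Simplified model: every number is an 8-byte double when uncompressed.
    With compression on, every field costs one flag byte, and any value
    that is not a small integer also stores its full 8 bytes.
    """
    uncompressed = 8 * len(values)
    compressed = 0
    for v in values:
        compressed += 1          # flag byte is always written
        if not is_small_integer(v):
            compressed += 8      # value could not be compressed
    return uncompressed, compressed


# Small integers compress well; arbitrary floats only add flag overhead.
print(estimate_sizes([1, 2, 3, 4]))
print(estimate_sizes([0.5, 1234.5, 1e6, 3.14]))
```

Under this model, a file made entirely of incompressible values comes out one byte per field larger with compression on than with it off, which is the "futile overhead" case described above.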

Zsav compression, introduced in V21, uses a general algorithm that is broadly effective.  For example, one year of the famous airline data is 853MB in sav format with compression but only 275MB with zsav.  Zsav files may process slightly slower, depending on a number of factors, but the compression works with all sorts of content.

The csv version of that airline data is 689MB, which is smaller than the compressed sav file.  In the csv, most of the values take only 1-4 bytes as text, while the sav format stores every number (before compression) as an 8-byte floating point value.
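The arithmetic behind these figures can be sketched as follows. The three file sizes are the ones quoted above; the sample field values are hypothetical, and the one-byte delimiter per csv field is an assumption made for illustration.

```python
import struct

# File sizes in MB as quoted in this post.
sizes_mb = {"sav (compressed)": 853, "zsav": 275, "csv": 689}


def ratio_to(baseline, sizes):
    """Return each size as a fraction of the baseline entry."""
    return {name: size / sizes[baseline] for name, size in sizes.items()}


def csv_field_bytes(text):
    """Bytes a value takes in csv: its characters plus one delimiter (assumed)."""
    return len(text.encode("ascii")) + 1


def sav_number_bytes(value):
    """Bytes a number takes as an uncompressed 8-byte double."""
    return len(struct.pack("<d", float(value)))


ratios = ratio_to("sav (compressed)", sizes_mb)
for name, r in ratios.items():
    print(f"{name}: {r:.2f} of the compressed sav size")

# Hypothetical short values of the kind described above (1-4 characters).
for text in ["7", "15", "830", "1530"]:
    print(text, csv_field_bytes(text), sav_number_bytes(text))
```

Even with a delimiter added, a 1-4 character value costs at most 5 bytes in csv under these assumptions, against 8 bytes for the same number stored as a double.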


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621




From:        Bruce Weaver <[hidden email]>
To:        [hidden email]
Date:        10/07/2015 09:07 AM
Subject:        [SPSSX-L] Counter-intuitive statement about uncompressed vs compressed data files
Sent by:        "SPSSX(r) Discussion" <[hidden email]>



