Large Data Files


Large Data Files

Marcos Sanches
Hi all,

I wonder if anybody has any suggestions for working with large data files in SPSS. My data has around 10 million observations and 30 variables, and everything I do takes a looooooong time...

Thanks a lot!

Marcos





Re: Large Data Files

mpirritano

I work with large files > 3 GB, > 4 million lines.

 

  1. For big data-processing jobs, use Python without the SPSS front end. It is much faster.
  2. Start with the main file. Eliminate all unnecessary variables and cases for each analysis, or if possible use AGGREGATE to pare down the size of the file (see the sketch below). The first step or two will take some time, but then the file gets smaller and things speed up.
  3. I've not tried this one, but I have read on the list that a 64-bit processor with multiple CPUs and maximum RAM speeds things up considerably.
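
A minimal sketch of suggestion 2 in SPSS syntax; the file path and variable names below are made-up placeholders, not anything from Marcos' data:

* Keep only the variables needed for this analysis.
GET FILE='C:\data\big_survey.sav' /KEEP=caseid region year income.
* Drop cases that are out of scope.
SELECT IF (year >= 2008).
* Collapse to one record per region/year, keeping the counts.
AGGREGATE OUTFILE=*
  /BREAK=region year
  /mean_income=MEAN(income)
  /n_cases=N.
SAVE OUTFILE='C:\data\big_survey_small.sav'.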

 

Thanks

Matt

 

Matthew Pirritano, Ph.D.

Research Analyst IV

Medical Services Initiative (MSI)

Orange County Health Care Agency

(714) 568-5648




Re: Large Data Files

Marcos Sanches
Thanks Matthew!

Yes, your second suggestion might help! I will try aggregating the file and using the "n_break" variable as a weight; I think this is a good idea. I will also eliminate some useless string variables...
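
A rough sketch of that idea (the break variables below are placeholders; "n_break" follows the naming in the previous paragraph):

AGGREGATE OUTFILE=*
  /BREAK=region gender agegroup
  /n_break=N.
WEIGHT BY n_break.
* Counts from the collapsed, weighted file now match the full file,
* but procedures only read one record per break combination.
FREQUENCIES VARIABLES=region gender agegroup.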

My SPSS is version 12, so I think your first suggestion does not work for me, but it might give me a good justification for upgrading...

Thanks a lot!

Marcos




Re: Large Data Files

J. R. Carroll
In reply to this post by mpirritano
I haven't looked into it too much, but I would imagine that a RAID 0 setup with higher-RPM hard drives (faster read/write) or "short-stroked" drives (faster seek time, I believe) would also increase speed.

Question: does anyone also know whether SET WORKSPACE increases performance? (I know the help files say to use it only when SPSS says it is out of memory, and that it only applies to certain procedures.)
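
For reference, a sketch of the command itself; the value is in kilobytes, and the number below is an arbitrary example rather than a recommendation:

* Raise the procedure workspace to roughly 512 MB (the value is in KB).
SET WORKSPACE=524288.
SHOW WORKSPACE.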

I've also read somewhere that the single-user (client) version of SPSS can only utilize a single core of a multi-core processor, meaning your home/work computer is probably running at a fraction of the processing speed it is capable of. For instance, my home DIY computer has 6 cores; SPSS can only utilize one of them, and the other 5 are left to other applications. The SPSS Server edition, by contrast, can utilize all cores of a processor, dramatically increasing processing speed. I am not sure, and was unable to find any Google-derived evidence to support this claim, but I am positive I read it just a few months back. Can anyone confirm this?

My files are not as large as the ones you are all using (measured in GB), but mine range into the hundreds of thousands of cases, sometimes with 1000+ variables (400-600 MB file sizes). I use both SPSS 15 and 17, in both client and server editions, and I know that if I run a procedure like CROSSTABS on the single-user version it takes about 5-10 minutes, whereas on the server edition it takes about 30 seconds.

//////

Some quick references:


Forum post on RAIDs: http://forums.hexus.net/hexus-hardware/130603-how-much-speed-difference-there-raid-0-a.html
Article by SPSS on Hardware recommendations (dated 2008):  http://www.spss.com/media/collateral/SSSWP-0608.pdf
Discussion a few months back on this Listserv: http://spssx-discussion.1045642.n5.nabble.com/Quad-Core-Processors-td1092004.html

//////

HTH,

J. R. Carroll
Grad. Student in Pre-Doc Psychology at CSUS
Research Assistant for Just About Everyone.
Email:  [hidden email]   -or-   [hidden email]
Phone:  (916) 628-4204



Re: Large Data Files

J. R. Carroll
Also, in terms of speeding things up:

Most computer users (here comes a dangerous superlative) treat 'speed' as if it were limited to one facet of their machine (e.g. just the software, the operating system, or the hardware), at least when discussing how to increase it. I find there is a substantial body of evidence that it is really a combination of many factors (though I won't post reference articles here).

For instance, I have personally witnessed dramatically slower processing times on my PC for processor-heavy applications (namely SPSS and a lot of my audio-mixing applications) when the processor, RAM, and several other components are running above a certain temperature. For the last few years I have used a liquid-cooled radiator on my processor, and with my newest PC build I have added substantially more case fans. With each addition I can keep my processor and other components cooler, and while I haven't formally tested it (though I do monitor temperatures whenever the PC is on), I believe there is a substantial difference between running the processor at around 60 degrees C with the stock OEM fan and running it at 24-32 degrees C with the liquid-cooled radiator.

While big things like a 64-bit OS, processor upgrades, the server edition of SPSS, and optimizing the way you work with your SPSS file (i.e. removing unneeded variables and cases) certainly contribute to speed, I think there is also something to the "little things", like watching and controlling PC temperatures :P.

Do any listers concur with this recommendation? *crickets?*

J. R. Carroll
Grad. Student in Pre-Doc Psychology at CSUS
Research Assistant for Just About Everyone.
Email:  [hidden email]   -or-   [hidden email]
Phone:  (916) 628-4204



Re: Large Data Files

Richard Ristow
In reply to this post by Marcos Sanches
At 05:04 PM 9/8/2010, Marcos Sanches wrote:

>I wonder if anybody has any suggestions for working with large data
>files in SPSS. My data has around 10 million observations and 30
>variables, and everything I do takes a looooooong time...

You've seen a lot of it: basically, minimize the amount of data
you're reading or writing, and the number of times you do it.
Specifics depend a lot on what operations you are doing, and how
you're doing them.

I'm glad that AGGREGATEing looks helpful.

You don't have any EXECUTE ('exe.') statements in your code, do you?
With rare, specific exceptions, EXECUTEs aren't needed, and every one
of them makes SPSS re-read your whole file.
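
As an illustration of the point (the variable names here are made up): each EXECUTE below forces an extra pass over all of the cases, while the version without them applies both transformations in the single pass triggered by the procedure.

* Slow: two extra data passes for no benefit.
COMPUTE income_k = income / 1000.
EXECUTE.
RECODE agegroup (1 THRU 3=1) (4 THRU 6=2) INTO agegroup2.
EXECUTE.
FREQUENCIES VARIABLES=agegroup2.

* Faster: let FREQUENCIES trigger one pass that applies both transformations.
COMPUTE income_k = income / 1000.
RECODE agegroup (1 THRU 3=1) (4 THRU 6=2) INTO agegroup2.
FREQUENCIES VARIABLES=agegroup2.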

Finally, depending on what you're doing, it may help to add a second
disk drive, and structure your jobs so data's read from one of them
while it's written to the other.
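
A sketch of that job structure, with hypothetical drive letters and file names:

* Source data sit on one physical drive...
GET FILE='D:\data\survey_10m.sav' /KEEP=caseid region income.
SELECT IF (NOT MISSING(income)).
* ...while the pared-down working copy is written to a different physical drive.
SAVE OUTFILE='E:\work\survey_subset.sav'.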


Re: Large Data Files

Matthew Pirritano
The only other thing I've done is use a RAM disk. It's a simple program (freeware up to the sizes most PC users need) that lets you dedicate a designated amount of your RAM to your paging file, which the computer would otherwise keep on the hard drive. Paging is much quicker from solid-state hardware like RAM than from a hard drive, which is limited by how fast its platters move.

Here's the site for RAMDisk:

http://memory.dataram.com/products-and-services/software/ramdisk


 
Matthew Pirritano, Ph.D.
Email: [hidden email]




Re: Large Data Files

Albert-Jan Roskam
In reply to this post by J. R. Carroll
Hi,

I think many of us have no influence over the computer hardware. Here are some things that I find useful (a minimal sketch of the first few appears after the list):

-Create a random sample of your data (SAMPLE, SET SEED) for debugging purposes.
-Get and use only the data that you need (GET /KEEP, SELECT IF).
-Avoid using EXECUTE.
-If the source data come from a database, do as much of the preprocessing as possible on the database server.
-If the data are on a network disk, consider using CACHE. EXECUTE.
-Use the Production Facility and/or run jobs at night.
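
A minimal sketch of the first three tips combined; the file and variable names are placeholders:

SET SEED=20100909.
GET FILE='C:\data\big_survey.sav' /KEEP=caseid region income weight.
* Debug the syntax on a 1% random subsample...
SAMPLE .01.
* ...then rerun the debugged job on the full file, with no EXECUTE in between.
FREQUENCIES VARIABLES=region.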

Regarding hardware: vacuum clean the innards of your computer and put some oil on the cpu fan ;-)

 
Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~




Re: Large Data Files

David Marso
Administrator
In reply to this post by Marcos Sanches
On Wed, 8 Sep 2010 17:04:06 -0400, Marcos Sanches <[hidden email]> wrote:

>Hi all,
>
>I wonder if anybody has any suggestions for working with large data files in
>SPSS. My data has around 10 million observations and 30 variables, and
>everything I do takes a looooooong time...
>
>Thanks a lot!
>
>Marcos
Hi Marcos,
I concur with the advice provided by several others:
1. Restrict variables to those required for immediate processing:
GET FILE /KEEP ...
2. Aggregate and weight where possible.
3. Ban EXE from your code except when running LAG and XSAVE.
4. Minimize data passes (use SPLIT FILE, for example).
Also use RECODE where possible rather than a scad of IF statements (a sketch follows below).
If using DO IF, place the most common occurrences at the top.
Same with IF ANY(...). (Not sure whether these last two help, but knowing that SPSS was
designed by intelligent monkeys, it should short-circuit these operations.)
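
A small sketch of the RECODE point, with made-up category values:

* One RECODE...
RECODE region (1,2,3=1) (4,5,6=2) (ELSE=9) INTO macro_region.
* ...instead of a scad of IF statements such as:
* IF (region=1 OR region=2 OR region=3) macro_region=1.
* IF (region=4 OR region=5 OR region=6) macro_region=2.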

I don't know EXACTLY how much things have changed over the years wrt
sorting, but years ago sorting HUGE files was a royal PITA, and sometimes a
DOOMED proposition.
If you ever have to sort such a beast, consider the following; it saved my
butt in the distant past. I am not sure whether it would be an advantage NOW,
because rumor has it that SPSS has become smarter wrt SORT, but it is certainly
worth trying to see if it improves performance.
Back in the golden era of tiny hard drives I had a 30 million case file that
I tried to sort repeatedly, allowing it to run overnight.
I returned 2 days in a row to a crashed SPSS and a cryptic, useless error message.
During the day the same sort ran ALL DAY without resolution!  With the
following logic the whole thing ran in about an hour.
The idea is to sort and merge five 6M-case files rather than one 30M-case file.
COMPUTE #RANDOM=UNIFORM(1).
DO IF RANGE(#RANDOM,0,.2).
+    XSAVE OUTFILE "T1" .
ELSE IF RANGE(#RANDOM,.2,.4).
+    XSAVE OUTFILE "T2" .
ELSE IF RANGE(#RANDOM,.4,.6).
+    XSAVE OUTFILE "T3" .
ELSE IF RANGE(#RANDOM,.6,.8).
+    XSAVE OUTFILE "T4" .
ELSE IF RANGE(#RANDOM,.8,1.0).
+    XSAVE OUTFILE "T5" .
END IF.
EXECUTE.
* Do the multiple open data set business here...
* I have an old version so can't verify...
GET FILE "T1".
SORT CASES BY KEY_Variables.
SAVE OUTFILE "T1".
* ... repeat the GET / SORT / SAVE step for "T2" through "T5" ...
ADD FILES
        / FILE="T1" / FILE="T2" / FILE="T3" / FILE="T4" / FILE="T5"
        / BY KEY_Variables.
* Why else do you need the file sorted???.
MATCH FILES
    / FILE=*
    / TABLE=whatever
    / BY Key_Variables.
Oh yeah..
SPLIT FILE BY Key_Variables.

Do the real work... ;-)
Basically the idea is DIVIDE and CONQUER!
Maybe someone with idle curiosity, lots of disk space and too much time on
their hands can run this up the flagpole and confirm or repudiate this
proposition.  (Report back please).
I KNOW someone out there would LOVE to prove me wrong ;-)

There are probably a lot of other things you can do to fine-tune your code
and minimize data passes, but this is what comes to mind at 9:00 AM
following a sleepless night of programming and design work.
HTH, David


Re: Large Data Files

Alina Schreiner
In reply to this post by Marcos Sanches
Good day Marcos.
I have found that when working with very big files, PSPP is faster than SPSS, especially if you run it from the command line rather than the graphical interface.

Alina

Re: Large Data Files

Marcos Sanches
Interesting, Alina, thanks for the hint. I am not very familiar with PSPP, but I might try it to see how it performs in my case!

Marcos


Re: Large Data Files

David Marso
Administrator
In reply to this post by Marcos Sanches
While I'm still thinking of this (private communication/reply to Marcos)
I figured I would share with the List!
---:
Hi Marcos,
Here is another bit of arcane and somewhat counterintuitive but VERY useful
knowledge when working
with HUGE (partially ordered) data sets.

If your data are "clumpy" wrt the break variables
(i.e. you have LOTS of the same values contiguous in the file) you can do the
following "trick"
(NO PRESORT REQUIRED).

AGGREGATE OUTFILE=*    /PRESORTED    /BREAK=breaks
   / summary functions ......
   /N=N.
SORT CASES BY breaks.
WEIGHT BY N.
AGGREGATE OUTFILE=*    /PRESORTED .....

One might think SPSS (I REFUSE TO REFER TO IT AS PASW - WTF? Pshaw) would
throw a screaming hissy fit
and barf out a cryptic message about an unsorted file, or an access violation
in some backwater rectocranial-inversion procedure...
BUT NO ;-)
It rolls on with a smiley face like a goose-stepping Star Wars stormtrooper,
busting through the time/space data clog like lye through a backed-up loo,
like beer through a Cubs fan.

It basically rips through and aggregates all of the contiguous break values.
Then you have a disordered but summarized file which should be much smaller
than the raw file.
Then sort the smaller file, weight it, and reaggregate it.
This totally depends upon which summary functions you need (MIN, MAX, MEAN, etc. are
fine).  MEDIAN IS NOT!
(SD won't work without additional work, i.e. converting to sums of squares:
SS = SD**2 * (N-1).
AGG ...
   /SS=SUM(SS) ..... /N=SUM(N).
SD = SQRT(SS/(N-1)).)

For the mean?
AggMean(i) = Sum(X(i)) / N(i).

Weighted aggregated mean = Sum( Ni*(SumXi/Ni) ) / Sum(Ni) = Sum(SumXi)/Sum(Ni).
So the mean of the weighted means = the overall mean. ;-))
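
A concrete sketch of the two-pass mean, with placeholder names (region as the break variable, income as the measure):

* First pass: /PRESORTED on the unsorted file collapses each contiguous run separately.
AGGREGATE OUTFILE=* /PRESORTED
  /BREAK=region
  /run_mean=MEAN(income)
  /n=N.
SORT CASES BY region.
WEIGHT BY n.
* Second pass: the N-weighted mean of the run means is the overall region mean,
* and the weighted count is the overall region N.
AGGREGATE OUTFILE=* /PRESORTED
  /BREAK=region
  /mean_income=MEAN(run_mean)
  /n_total=N.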

HTH, David


Implementation of AGGREGATE (was, re: Large Data Files)

Richard Ristow
At 03:03 AM 9/12/2010, David Marso wrote:

>If your data are "clumpy" wrt the break variables (i.e you have LOTS
>of the same values contiguous in the file) you can do the following
>"trick" (NO PRESORT REQUIRED).
>
>AGGREGATE OUTFILE *    / PRESORTED    / BREAK breaks
>    / Summary functions......
>    / N=N.
>SORT CASES BY breaks.
>WEIGHT BY N.
>AGGREGATE OUTFILE *    / PRESORTED .....
>
>One might think SPSS would throw a screaming hissy fit about an
>unsorted file BUT NO ;-) ...  It rips through and aggregates all of
>the contiguous break values. Then you have a disordered but
>summarized file which should be much smaller than the raw file.

The records with the same break values don't even need to be
contiguous; my recent post "Re: sorting out a nested data structure"
(Thu, 9 Sep 2010 01:18:32 -0400) uses that.  And the resulting file
IS sorted by the break groups, unless you use MODE=ADDVARIABLES which
effectively de-sorts it again.

As I understand from Jon Peck's explanations (*), the default
behavior of AGGREGATE is to set up a hash table with an entry for
each set of BREAK values encountered, and accumulate aggregated
values in those table records. So, AGGREGATE can always run without
pre-sorting the file.

However, if there are very many break groups (like hundreds of
thousands), the hash table may fill up memory and start paging, and
performance slows by what may be orders of magnitude. When that
happens, pre-sorting and specifying /PRESORTED on AGGREGATE will run
much faster.
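
A sketch of that pattern, with a placeholder key variable:

* With hundreds of thousands of distinct keys, sort first so AGGREGATE
* can stream the break groups instead of building a huge in-memory table.
SORT CASES BY id.
AGGREGATE OUTFILE=* /PRESORTED
  /BREAK=id
  /n_records=N.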
======================
(*) Date:  Mon, 13 Sep 2004 08:01:41 -0500
From:     "Peck, Jon" <[hidden email]>
Subject:  Re: factors influencing speed
To:       [hidden email]

   and, off-list,
Date:     Fri, 15 Oct 2004 07:26:39 -0500
From:     "Peck, Jon" <[hidden email]>
To:       "Raynald Levesque" <[hidden email]>, <[hidden email]>
Subject:  Re: [SPSSX-L] SORT CASES algorithm

Sorting before running AGGREGATE is in most cases NOT a performance
improvement.  AGGREGATE doesn't need it except in situations where
there are very many distinct break values.  Most of the time,
AGGREGATE without the /PRESORTED subcommand will be faster than
sorting followed by AGGREGATE with presort.


Re: Implementation of AGGREGATE (was, re: Large Data Files)

David Marso
Administrator
On Mon, 13 Sep 2010 23:07:24 -0400, Richard Ristow <[hidden email]>
wrote:

>At 03:03 AM 9/12/2010, David Marso wrote:
>
>>If your data are "clumpy" wrt the break variables (i.e you have LOTS
>>of the same values contiguous in the file) you can do the following
>>"trick" (NO PRESORT REQUIRED).
>>
>>AGGREGATE OUTFILE *    / PRESORTED    / BREAK breaks
>>    / Summary functions......
>>    / N=N.
>>SORT CASES BY breaks.
>>WEIGHT BY N.
>>AGGREGATE OUTFILE *    / PRESORTED .....
>>
>>One might think SPSS would throw a screaming hissy fit about an
>>unsorted file BUT NO ;-) ...  It rips through and aggregates all of
>>the contiguous break values. Then you have a disordered but
>>summarized file which should be much smaller than the raw file.
>
>The records with the same break values don't even need to be
>contiguous; my recent post "Re: sorting out a nested data structure"
>(Thu, 9 Sep 2010 01:18:32 -0400) uses that.
This is a partial view of that post as rendered in Firefox, logged in at
http://listserv.uga.edu/cgi-bin/wa:

> We can, using concatenation and duplicate functions, identify
> duplicate children across agencies (e.g. with different family ids);
> this unique child id is a string variable:
> 2. Is there an "assign" function that will automate assigning unique
> ids to children using the string variable we have constructed?

I'm missing what you need, here. Do you already have "a unique child
id, [which] is a string variable", or do you need to construct one?

If the latter, given that you can identify duplicate children (I take it,
that means you can recognize when two records represent the same child,
even though the records are from different agencies), then if you sort
the data so all records for each child are together in the file (I
presume that's possible), and you have a way (with LAG or something) to
determine when a record represents the same child as its predecessor,
then something like (untested)

NUMERIC Our_Child_ID(F6).
LEAVE   Our_Child_ID.
DO IF   $CASENUM EQ 1.
.  COMPUTE Our_Child_ID = 1.
ELSE IF NOT <same child as previous record>.
.  COMPUTE Our_Child_ID = Our_Child_ID + 1.
END IF.

That assigns a *numeric* ID, which is easier to calculate. You can
convert it to a string using the STRING function, but I'd see no need to,
unless I'm misunderstanding your needs.

> 1. Because the dataset is so large, manually combing thru the
> duplicates to assign our own unique family or child identifiers is not
> practical.

For child identifiers, see above.

To construct family identifiers, I'd start with the view that you have a
single family identifying key, consisting of an agency ID and *that
agency's* family identifier. Your problem then is, a family may be in
the file under several keys, and you want to recognize which (different)
keys refer to the same family.

> 3. Once we figure out how to assign unique child identifiers we are
> still faced with the problem of finding some automated way of grouping
> all the children within families so each family has a unique
> identifier, regardless of how many individual agencies/programs are
> providing services to the children w/i that family. Since the family
> is our unit of analysis this is critical. Any suggestions?

<SNIP>


>And the resulting file
>IS sorted by the break groups, unless you use MODE=ADDVARIABLES which
>effectively de-sorts it again.
>
You are *NOT* using a PRESORTED subcommand in your code, so of course it will
come back ordered.  I don't believe you processed the subtlety of my
posting.  I was describing a situation where your HUGE data file is LARGELY
"grouped" but is NOT sorted specifically by these groups.
By ***NOT*** sorting the file but indicating ***PRESORTED***, one ends up
with a partially aggregated file which is out of order and has multiple
records for a given GROUP.  By WEIGHTING and reaggregating, one can obtain a
final aggregated file.  I suspect that the implementation of AGGREGATE has
changed significantly.  Back in the good old days (if I'm not mistaken) it
used to sort the file internally, and this presorted trick worked wonders by
making the file to be sorted smaller in cases where blocks of cases are
contiguous.  It is really a function of the data.  If your data are NOT
grouped then my "trick" will probably hurt performance.  OTOH, it is good to
know that plain old AGGREGATE without a dedicated SORT will run efficiently.
To verify my proposal, one would need to construct various HUGE files with
differing characteristics wrt "clumping", the number of distinct key
combinations, and the number of variables in the aggregate functions.
I don't have the time to spend on it and rarely encounter these sorts of
files, so it's really not worth my time to pursue.


>As I understand from Jon Peck's explanations (*), the default
>behavior of AGGREGATE is to set up a hash table with an entry for
>each set of BREAK values encountered, and accumulate aggregated
>values in those table records. So, AGGREGATE can always run without
>pre-sorting the file.

I wonder what the *NONDEFAULT* behavior is (i.e. if one specifies /PRESORTED).
I'll bet it doesn't create the hash table (why would it?).  It probably just
rolls along and spits out cases as it hits a new set of break values.
OTOH, I searched online for the SPSS algorithms and they don't seem to be easy to
find, if available at all.

So, think about it, Richard: if I tell it the file is PRESORTED but don't
sort the file, it should bypass creating the hash table.  I'll end up with a
much smaller file which I can then work with further.
Consider a HUGE file consisting of millions of cases where the file is 50
state files concatenated together but is not sorted by STATE.
Rather than sorting, you can aggregate /PRESORTED ...
I suspect that the current version of AGGREGATE is simply implemented in a
smart way where the previous incarnation was NOT, and we intrepid ones broke
the 'rules' to get the job done ;-).
