I have a population of 1.5 Million records that I want to use in SPSS. Will it work? What platform will I need to handle something like this? I will be taking random samples but will need to get descriptive statistics from the entire population.
|
Administrator
|
1.5M records is *NOT* large WRT capacity.
IIRC, the limits are something like 2 billion cases and 2 billion variables. You are pretty much limited only by disk space.
---
The following simulates 1.5M records with 120 variables in a few minutes and generates descriptives in about 1 minute on my MacBook Pro (circa 2007).
* Simulate raw data *.
INPUT PROGRAM.
LOOP CASEID=1 TO 1500000.
DO REPEAT v=v001 TO v120.
COMPUTE v=TRUNC(UNIFORM(100)).
END REPEAT.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.
DESCRIPTIVES VARIABLES=ALL.
----
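For the original question (descriptives on the whole population plus random samples), a minimal sketch might look like the following. It assumes the simulated file above is the active dataset; the seed value and the 1% sampling fraction are arbitrary choices for illustration, not recommendations.

```spss
* Descriptives for the entire population of simulated cases.
DESCRIPTIVES VARIABLES=v001 TO v120
  /STATISTICS=MEAN STDDEV MIN MAX.

* Fix the random number seed so the sample is reproducible.
SET SEED=20120626.

* TEMPORARY restricts the SAMPLE to the next procedure only.
TEMPORARY.
SAMPLE .01.
DESCRIPTIVES VARIABLES=v001 TO v120
  /STATISTICS=MEAN STDDEV MIN MAX.
```

SAMPLE .01 draws approximately 1% of the cases. If an exact sample size is needed, the form SAMPLE 15000 FROM 1500000 can be used instead.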
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
In reply to this post by btafoya
Take a look at SPSS Complex Samples too.
Max
|
In reply to this post by David Marso
It took about a minute to create the data and run DESC on David's example using a five-year-old 64-bit ThinkPad. No problems running 3.5 billion records with two dozen variables on the same PC.
|
Administrator
|
In reply to this post by btafoya
And of course you can always download a trial version, beat the hell out of it, and see whether it suffices for you!
|
In reply to this post by btafoya
1.5M records is not a large file by SPSS standards.
Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621
|
Administrator
|
Hi Jon,
Please (when you have a moment) restate the actual *ABSOLUTE* max cases X max vars for both 32 bit and 64 bits maybe even 132 bit systems just for the hell of it (assuming infinite disk space and RAM). Would love a core dump of the theory.
|
Hi,
See the line "How many variables and cases can SPSS for windows handle?" on http://www.spsstools.net/FAQ.htm
Not sure what the differences are on a 64-bit architecture.
Regards,
Albert-Jan
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
I don't think the information on that page
is entirely accurate. I'm not aware of any theoretical limits on either
cases or variables. Your computing environment will determine practical
limits.
Rick Oliver
Senior Information Developer
IBM Business Analytics (SPSS)
E-mail: [hidden email]
Phone: 312.893.4922 | T/L: 206-4922
|
In reply to this post by David Marso
Information on absolute limits on cases and variables in Statistics isn't readily available, because those limits are far beyond the practical constraints imposed by the hardware and OS. IOW, you can assume that no such limits exist on the part of Statistics. However, there are a few procedures, mainly time-series oriented, that require the data to be held in memory and are thus likely to run out of resources sooner than the general case. Since Statistics mostly does not hold all the cases in memory, you will run out of patience before you run out of memory.
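Because most procedures pass through the data on disk rather than holding it in memory, one common way to save patience is to read only the variables an analysis needs. The sketch below assumes a hypothetical saved file named population.sav that contains the variables from the simulation earlier in this thread; the file name and the choice of ten variables are illustrative only.

```spss
* File name and variable selection are hypothetical.
* Read only the variables needed for this analysis.
GET FILE='population.sav'
  /KEEP=caseid v001 TO v010.

* Descriptives now run against the reduced variable set.
DESCRIPTIVES VARIABLES=v001 TO v010.
```

The KEEP subcommand affects only the active dataset; the saved file on disk is unchanged.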
Our QA testbed goes up to 66,000 variables and more than 4 billion cases. That is an absurd number of variables, but there you are. Certainly, keeping the number of variables down to a reasonable number will speed performance. And I often wonder why people are so averse to taking random samples of vast datasets, except in situations where they are searching for needle-in-a-haystack, very rare events.
BTW, I learned an interesting fact recently about the Watson system that won the Jeopardy contest. It had 155 terabytes of physical memory. Don't try that at home.
Jon Peck
|
Jon,
Perhaps you would comment on this, please. Recently somebody asked me off-list about display problems when opening a 37 GB SPSS file with 90.546M records (rows). This person said he got a pop-up message that SPSS can ‘only’ display 90,500,000 rows. Are there display limits and, if so, are those limits determined by the specifications of the computer? What specifications go into determining the limits, and are there settings that can be tweaked to increase them?
Thanks,
Gene Maguin
|
In reply to this post by Jon K Peck
I believe the aversion to random sampling (as I experience it) is that people often have a tendency to hoard. There is reluctance to give up the data, given the time and money spent on collecting it. Often there is also a belief that if your sample is actually a population, then you should analyze the entire population instead of a sample of it, because it’s somehow more valid. There doesn’t seem to be a clear understanding that you can infer with near-total confidence everything you need from just a sample, as compared with the entire population. Added to that are the common misconceptions with regard to drawing multiple random samples to confirm results. People often don’t understand the statistics they want or use, and don’t understand why we can’t just give them a single number. I believe it’s related to the same mindset that has led to numerous requests for how to do a multiple imputation and save out a single data set.
To echo what Jon has mentioned on some practical limits, some of my analysis work has required that I allow the program to run for a few hours or overnight, and increased numbers of variables and cases are frequently the cause. One example you could run into is ridge regression. Another, which I use far more often, is mixed modeling (MLM/HLM, etc.). With regard to MLM, I have found that the most common cause of overnight run times is a misspecification of the model in which I attempted to allow too many variables to be random, to the point of absurdity. Occasionally this will happen with a reasonably specified model, where my a priori theory that some factor be allowed to vary randomly is reasonable, but where the number of levels and cases is so large that it takes quite a long time to converge on a final model. Again, the main cause is a large number of levels in the model and many blocks within higher levels. This is not an SPSS problem (maybe the efficiency of the algorithm, but I don’t know that others are better); rather, it’s a hardware resource problem.
Matthew J Poes
Research Data Specialist
Center for Prevention Research and Development
University of Illinois
510 Devonshire Dr.
Champaign, IL 61820
Phone: 217-265-4576
email: [hidden email]
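As a rough illustration of the mixed-model point above, here is a sketch in MIXED syntax. The variable names (score, ses, school) are hypothetical and do not come from this thread. The first specification has a random intercept only; the second also lets the slope of ses vary by school with an unstructured covariance matrix, so there are more covariance parameters to estimate. Each additional random effect adds parameters, which on large files with many groups may lengthen the time to convergence.

```spss
* Variable names score, ses and school are hypothetical.
* Random intercept only, giving one random-effect variance.
MIXED score WITH ses
  /FIXED=ses | SSTYPE(3)
  /METHOD=REML
  /PRINT=SOLUTION TESTCOV
  /RANDOM=INTERCEPT | SUBJECT(school) COVTYPE(VC).

* Random intercept and slope, giving two variances and a covariance.
MIXED score WITH ses
  /FIXED=ses | SSTYPE(3)
  /METHOD=REML
  /PRINT=SOLUTION TESTCOV
  /RANDOM=INTERCEPT ses | SUBJECT(school) COVTYPE(UN).
```

Starting from the simpler specification and adding random effects one at a time is one way to notice when a model stops being worth its run time.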
|
In reply to this post by Maguin, Eugene
That is a display issue with the Data Editor.
In some older versions of SPSS/Statistics (maybe still present, but
I think it is gone), the DE is limited in how many rows it will display
as a memory management issue. It has nothing to do with processing
capabilities in other parts of the system. Obviously no one is going
to page through huge numbers of rows. I don't think the 90 million
number is correct, though. I think the limit was more like 90,000,
but it was configurable (virtual rows). With Server, it is possible
to suppress the DE altogether, which is appropriate for very large datasets.
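Since nobody pages through millions of rows anyway, a very large file can be inspected from syntax instead of the Data Editor. A minimal sketch, assuming a hypothetical file named population.sav with variables named as in the simulation earlier in this thread:

```spss
* File name is hypothetical.
* Show dictionary information without making the file the active dataset.
SYSFILE INFO FILE='population.sav'.

* Open the file and print only the first 10 cases to the Viewer.
GET FILE='population.sav'.
LIST VARIABLES=caseid v001 TO v005
  /CASES=FROM 1 TO 10.
```

This does not change any Data Editor display limit; it simply avoids relying on the grid to check the contents of the file.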
Jon Peck