I am trying to find out how much data SPSS can handle. Can it do something like Hadoop map/reduce to process very large data sets in parallel? Can it handle more than a few million cases, or hundreds of GB of data? What is the current state of the art?
What I have found so far is only anecdotal [1]. Thank you. Geetika

[1] http://www.stat.columbia.edu/~cook/movabletype/archives/2007/08/they_started_me.html

"Not sure that any language can really allow you to load up something like 10 G data file into it's memory and process away on it."

James | February 24, 2010 5:45 PM | Reply

"At least on my machine, SPSS/PASW is painfully slow with large datasets. By "large" I mean over 200,000 rows and 1,000 columns. Once I go over that limit even simple things like frequencies slow down dramatically. Simple Copy and Paste functions are also practically broken in SPSS/PASW for anything beyond a few thousand cases, which I find really annoying."

Cori | May 25, 2010 6:50 AM | Reply

"James, above, is right. I've been trying to get SPSS/PASW v18 to cope with a data set with 12 variables across 3 million cases, and it's just useless. If I want to simply paste a full stop into, say, a million of those cases I have to leave my 32 core, 16 Gig PC on overnight, and into most of the next day. Ugh."
Geetika et al.,

I have found that version 19 seems to cope with larger data sets much better than previous versions, but I have found the following:

· The data editor doesn't like files with 25 columns and over 95 million rows, and you will get a warning to that effect if you try to load that many rows.

· I am able to work with files of over 36 million records (7 Gbytes) without too many problems; sorting and aggregating takes a few minutes, but that's much faster than the mainframe version I started on!

· On my large data sets I don't even attempt to use copy & paste or to edit more than one cell; good old syntax comes to the fore (see the syntax sketch after this message)!

· As I have a number of these 7 Gbyte files (one for each year I am studying), I aggregate the data into new files and merge them; again this does not take too long.

Before anyone asks, I'm running on a two-year-old 3.16 GHz machine with 2 Gig of memory and a 100 Gbyte disc, so it is not the latest super system.

Now, at the risk of starting another 'hot topic': when the 'Aggregate' dialog says "For large data sets" and prompts you to state that the data are "already sorted", or to "sort before aggregating":

· What is a 'large data set'?

· Why can you check / tick both boxes?

Perhaps Jon or someone else from SPSS can enlighten us?

Best Wishes
John S. Lemon
DIT (Directorate of Information Technology) - Student Liaison Officer
University of Aberdeen
Edward Wright Building: Room G86a
Tel: +44 1224 273350
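For readers following along, here is a minimal sketch of the syntax-driven workflow John describes. The file and variable names (year2008.sav, amount, region) are invented placeholders for illustration, not taken from his data:

    * Bulk edits are done in syntax rather than by copy and paste in the Data Editor.
    GET FILE='C:\data\year2008.sav'.

    * Setting a value to system-missing for a million cases is a one-line transformation.
    IF (amount > 9000) amount = $SYSMIS.
    EXECUTE.

    * Aggregate the yearly file down to one record per region.
    AGGREGATE OUTFILE='C:\data\agg2008.sav'
      /BREAK=region
      /total_amount = SUM(amount)
      /n_cases = N.

    * After repeating this for each year, stack the small aggregate files and save.
    ADD FILES FILE='C:\data\agg2008.sav'
      /FILE='C:\data\agg2009.sav'.
    SAVE OUTFILE='C:\data\agg_all_years.sav'.

Because transformations run in a single data pass, this avoids the interactive overhead that makes Data Editor edits on multi-million-case files so slow.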
From: "Lemon, John" <[hidden email]> To: [hidden email] Date: 01/27/2011 08:35 AM Subject: Re: [SPSSX-L] SPSS and large data sets? Sent by: "SPSSX(r) Discussion" <[hidden email]> Geetika et al. I have found that version 19 seems to cope with larger data sets much better than previous versions but I have found the following: · The data editor doesn’t like files with 25 columns and over 95 million rows and you will get a warning to that effect if you try and load that many rows. >>>Using the DE with enormous files is really not sensible. Such files are not interactive-friendly. But Statistics will process the files correctly in all the backend operations. With recent versions, some procedures are multi-threaded, which speeds up processing · I am able to work with files of over 36 million records ( 7 Gbytes ) without too many problems, sorting and aggregating takes a few minutes but that’s much faster than the mainframe version I started on ! · On my large data sets I don’t even attempt to use copy & paste or edit more than one cell; good old syntax comes to the fore !! >>>As above. The system work required for random copy and paste operations on such files is painful to contemplate. · As I have a number the 7 Gbytes files ( one for each year I am studying ) I aggregate the data into new files and merge them; again this is not too long. Before anyone asks I’m running on a two year old 3.16 Ghz machine with 2 Gig of memory and a 100 Gbytes disc so it is not the latest super system. Now at risk of starting another ‘hot topic’ - when in the ‘Aggregate’ dialog it says “For large data sets” and prompts you to state that it is “already sorted”, or “sort before aggregating”: · What is a ‘large data set’ · Why can you check / tick both boxes ? >>>These checkboxes are rarely needed, and the circumstances where they are are very hardware and data dependent. AGGREGATE builds an internal hash table of break values, and if this fits in memory, especially physical memory, which is almost always the case for sensible breaks, you shouldn't check these boxes, although in a few cases, a presorted file will aggregate faster. But sorting is much slower than the hash table approach, so you wouldn't sort just for AGGREGATE unless you have to. Both boxes generate the /PRESORTED subcommand for AGGREGATE. Checking both does generate a SORT command, but SORT will quit after one pass if it discovers that the data are already sorted. The implicit third state is don't sort and don't assume the data are already sorted. I suppose this would be better as three radio buttons, but it's unlikely that the current design would confuse anybody. Regards, Jon Peck Perhaps Jon or someone else from SPSS can enlighten us ? Best Wishes John S. Lemon DIT ( Directorate of Information Technology ) - Student Liaison Officer University of Aberdeen Edward Wright Building: Room G86a Tel: +44 1224 273350
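To make the distinction concrete, here is a sketch of the two forms of AGGREGATE syntax involved. The file and variable names (bigfile.sav, region, amount) are invented for illustration:

    * Default: AGGREGATE builds an in-memory hash table of break values.
    * No prior sort is needed; for a modest number of break groups this is fastest.
    GET FILE='C:\data\bigfile.sav'.
    AGGREGATE OUTFILE='C:\data\agg.sav'
      /BREAK=region
      /mean_amount = MEAN(amount)
      /n_cases = N.

    * With the sort option, the dialog generates a SORT followed by /PRESORTED.
    * SORT quits after one pass if the data are already in order.
    SORT CASES BY region.
    AGGREGATE OUTFILE='C:\data\agg.sav'
      /PRESORTED
      /BREAK=region
      /mean_amount = MEAN(amount)
      /n_cases = N.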
At 11:30 AM 1/27/2011, Jon K Peck wrote, responding to Lemon, John, who had asked (01/27/2011 08:35 AM):

>> When in the "Aggregate" dialog it says "For large data sets" and prompts you to state that it is "already sorted", or "sort before aggregating":

> These checkboxes are rarely needed, and the circumstances where they are are very hardware and data dependent. AGGREGATE builds an internal hash table of break values, and if this fits in memory, especially physical memory, which is almost always the case for sensible breaks, you shouldn't check these boxes, although in a few cases, a presorted file will aggregate faster. But sorting is much slower than the hash table approach, so you wouldn't sort just for AGGREGATE unless you have to. Both boxes generate the /PRESORTED subcommand for AGGREGATE.

That is absolutely correct, but let me expand a little.

The caution about "large data sets" is a little misleading. The size of the hash table (see above) doesn't depend on the size of the data set, but on the number of statistics requested and the number of 'break groups', or distinct values of the set of BREAK variables.

I last looked hard at this on a 750 megabyte (memory) Windows machine, and found that, when requesting a half-dozen statistics or fewer, /PRESORTED started being faster when there were a few hundred thousand break groups. (The turning point may be much higher on larger machines.) Below that level, AGGREGATE without /PRESORTED is more efficient, often much more efficient, for any size file.

Jon wrote that this is "almost always the case for sensible breaks". That's true up to a point, but it's well to remember about /PRESORTED, because AGGREGATE with a great many break groups is sometimes useful -- for example, to create a file in which keys are known to be unique, with a count of occurrences to flag which keys were originally duplicated.
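A minimal sketch of that many-break-groups use, with an invented key variable (keyvar) purely for illustration; the output file has one case per distinct key, and n_dups greater than 1 flags keys that were duplicated in the input:

    SORT CASES BY keyvar.
    * With very many break groups, /PRESORTED avoids building a huge hash table.
    AGGREGATE OUTFILE='C:\data\unique_keys.sav'
      /PRESORTED
      /BREAK=keyvar
      /n_dups = N.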
In reply to this post by gtlakshm
Anyone gotten hadoop map/reduce running with SPSS?