I am trying to find out how much data SPSS can handle. Can it do something like Hadoop map/reduce to process very large data sets in parallel? Can it handle more than a few million cases, or hundreds of GB of data? What is the current state of the art?
What I have found so far is only anecdotal [1]. Thank you. Geetika

[1] http://www.stat.columbia.edu/~cook/movabletype/archives/2007/08/they_started_me.html

"Not sure that any language can really allow you to load up something like 10 G data file into it's memory and process away on it."

James | February 24, 2010 5:45 PM | Reply

"At least on my machine, SPSS/PASW is painfully slow with large datasets. By "large" I mean over 200,000 rows and 1,000 columns. Once I go over that limit even simple things like frequencies slow down dramatically. Simple Copy and Paste functions are also practically broken in SPSS/PASW for anything beyond a few thousand cases, which I find really annoying."

Cori | May 25, 2010 6:50 AM | Reply

"James, above, is right. I've been trying to get SPSS/PASW v18 to cope with a data set with 12 variables across 3 million cases, and it's just useless. If I want to simply paste a full stop into, say, a million of those cases I have to leave my 32 core, 16 Gig PC on overnight, and into most of the next day. Ugh."
Geetika et al.,

I have found that version 19 seems to cope with larger data sets much better than previous versions, but I have found the following:

· The data editor doesn't like files with 25 columns and over 95 million rows, and you will get a warning to that effect if you try to load that many rows.

· I am able to work with files of over 36 million records (7 Gbytes) without too many problems; sorting and aggregating takes a few minutes, but that's much faster than the mainframe version I started on!

· On my large data sets I don't even attempt to use copy & paste or to edit more than one cell; good old syntax comes to the fore (see the syntax sketch after this message)!

· As I have a number of these 7 Gbyte files (one for each year I am studying), I aggregate the data into new files and merge them; again this does not take too long.

Before anyone asks, I'm running on a two-year-old 3.16 GHz machine with 2 Gig of memory and a 100 Gbyte disc, so it is not the latest super system.

Now, at the risk of starting another 'hot topic': when the 'Aggregate' dialog says "For large data sets" and prompts you to state that the data are "already sorted", or to "sort before aggregating":

· What is a 'large data set'?

· Why can you check / tick both boxes?

Perhaps Jon or someone else from SPSS can enlighten us?

Best Wishes
John S. Lemon
DIT (Directorate of Information Technology) - Student Liaison Officer
University of Aberdeen
Edward Wright Building: Room G86a
Tel: +44 1224 273350
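For readers following along, here is a minimal sketch of the syntax-driven workflow John describes. The file and variable names (year2008.sav, amount, region) are invented placeholders for illustration, not taken from his data:

    * Bulk edits are done in syntax rather than by copy and paste in the Data Editor.
    GET FILE='C:\data\year2008.sav'.

    * Setting a value to system-missing for a million cases is a one-line transformation.
    IF (amount > 9000) amount = $SYSMIS.
    EXECUTE.

    * Aggregate the yearly file down to one record per region.
    AGGREGATE OUTFILE='C:\data\agg2008.sav'
      /BREAK=region
      /total_amount = SUM(amount)
      /n_cases = N.

    * After repeating this for each year, stack the small aggregate files and save.
    ADD FILES FILE='C:\data\agg2008.sav'
      /FILE='C:\data\agg2009.sav'.
    SAVE OUTFILE='C:\data\agg_all_years.sav'.

Because transformations run in a single data pass, this avoids the interactive overhead that makes Data Editor edits on multi-million-case files so slow.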
From: "Lemon, John" <[hidden email]> To: [hidden email] Date: 01/27/2011 08:35 AM Subject: Re: [SPSSX-L] SPSS and large data sets? Sent by: "SPSSX(r) Discussion" <[hidden email]> Geetika et al. I have found that version 19 seems to cope with larger data sets much better than previous versions but I have found the following: · The data editor doesn’t like files with 25 columns and over 95 million rows and you will get a warning to that effect if you try and load that many rows. >>>Using the DE with enormous files is really not sensible. Such files are not interactive-friendly. But Statistics will process the files correctly in all the backend operations. With recent versions, some procedures are multi-threaded, which speeds up processing · I am able to work with files of over 36 million records ( 7 Gbytes ) without too many problems, sorting and aggregating takes a few minutes but that’s much faster than the mainframe version I started on ! · On my large data sets I don’t even attempt to use copy & paste or edit more than one cell; good old syntax comes to the fore !! >>>As above. The system work required for random copy and paste operations on such files is painful to contemplate. · As I have a number the 7 Gbytes files ( one for each year I am studying ) I aggregate the data into new files and merge them; again this is not too long. Before anyone asks I’m running on a two year old 3.16 Ghz machine with 2 Gig of memory and a 100 Gbytes disc so it is not the latest super system. Now at risk of starting another ‘hot topic’ - when in the ‘Aggregate’ dialog it says “For large data sets” and prompts you to state that it is “already sorted”, or “sort before aggregating”: · What is a ‘large data set’ · Why can you check / tick both boxes ? >>>These checkboxes are rarely needed, and the circumstances where they are are very hardware and data dependent. AGGREGATE builds an internal hash table of break values, and if this fits in memory, especially physical memory, which is almost always the case for sensible breaks, you shouldn't check these boxes, although in a few cases, a presorted file will aggregate faster. But sorting is much slower than the hash table approach, so you wouldn't sort just for AGGREGATE unless you have to. Both boxes generate the /PRESORTED subcommand for AGGREGATE. Checking both does generate a SORT command, but SORT will quit after one pass if it discovers that the data are already sorted. The implicit third state is don't sort and don't assume the data are already sorted. I suppose this would be better as three radio buttons, but it's unlikely that the current design would confuse anybody. Regards, Jon Peck Perhaps Jon or someone else from SPSS can enlighten us ? Best Wishes John S. Lemon DIT ( Directorate of Information Technology ) - Student Liaison Officer University of Aberdeen Edward Wright Building: Room G86a Tel: +44 1224 273350
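To make the distinction concrete, here is a sketch of the two forms of AGGREGATE syntax involved. The file and variable names (bigfile.sav, region, amount) are invented for illustration:

    * Default: AGGREGATE builds an in-memory hash table of break values.
    * No prior sort is needed; for a modest number of break groups this is fastest.
    GET FILE='C:\data\bigfile.sav'.
    AGGREGATE OUTFILE='C:\data\agg.sav'
      /BREAK=region
      /mean_amount = MEAN(amount)
      /n_cases = N.

    * With the sort option, the dialog generates a SORT followed by /PRESORTED.
    * SORT quits after one pass if the data are already in order.
    SORT CASES BY region.
    AGGREGATE OUTFILE='C:\data\agg.sav'
      /PRESORTED
      /BREAK=region
      /mean_amount = MEAN(amount)
      /n_cases = N.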
At 11:30 AM 1/27/2011, Jon K Peck wrote, responding to Lemon, John, who had asked (01/27/2011 08:35 AM):

>> When in the "Aggregate" dialog it says "For large data sets" and prompts you to state that it is "already sorted", or "sort before aggregating":

> These checkboxes are rarely needed, and the circumstances where they are are very hardware and data dependent. AGGREGATE builds an internal hash table of break values, and if this fits in memory, especially physical memory, which is almost always the case for sensible breaks, you shouldn't check these boxes, although in a few cases, a presorted file will aggregate faster. But sorting is much slower than the hash table approach, so you wouldn't sort just for AGGREGATE unless you have to. Both boxes generate the /PRESORTED subcommand for AGGREGATE.

That is absolutely correct, but let me expand a little.

The caution about "large data sets" is a little misleading. The size of the hash table (see above) doesn't depend on the size of the data set, but on the number of statistics requested and the number of 'break groups', or distinct values of the set of BREAK variables.

I last looked hard at this on a 750 megabyte (memory) Windows machine, and found that, when requesting a half-dozen statistics or fewer, /PRESORTED started being faster when there were a few hundred thousand break groups. (The turning point may be much higher on larger machines.) Below that level, AGGREGATE without /PRESORTED is more efficient, often much more efficient, for any size file.

Jon wrote that this is "almost always the case for sensible breaks". That's true up to a point, but it's well to remember about /PRESORTED, because AGGREGATE with a great many break groups is sometimes useful -- for example, to create a file in which keys are known to be unique, with a count of occurrences to flag which keys were originally duplicated.
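A minimal sketch of that many-break-groups use, with an invented key variable (keyvar) purely for illustration; the output file has one case per distinct key, and n_dups greater than 1 flags keys that were duplicated in the input:

    SORT CASES BY keyvar.
    * With very many break groups, /PRESORTED avoids building a huge hash table.
    AGGREGATE OUTFILE='C:\data\unique_keys.sav'
      /PRESORTED
      /BREAK=keyvar
      /n_dups = N.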
In reply to this post by gtlakshm
Anyone gotten hadoop map/reduce running with SPSS?