DEALING WITH LARGE DATASETS


DEALING WITH LARGE DATASETS

kalyannjoing
Hi All,

      I am very new to SPSS and tried to import a 30 GB file. The problem is that it is taking a long time to import. Could you give me an estimate of the import time for the system configuration below?


Processor        : Intel(R) Core(TM) i5-3330 CPU @ 3.00 GHz
RAM              : 8.00 GB
Operating system : Windows 7 Professional, 32-bit
File size        : 30 GB


Thanks and Regards
Kalyan

Re: DEALING WITH LARGE DATASETS

David Marso
Administrator
Define:
"taking lot of time..."?? Minutes? Hours? Days? Is the case counter running?
Import what? XLS? Text?
"estimate time to import"
ROFL... Are you serious?
Nobody is going to take the time to replicate your vague description and burn whatever run time it takes to hand you some fictitious number. There are a bajillion possible variables, and it is pretty much impossible to provide any sort of estimate.
Suggestion? Just let it run through, save the system file, and carry on.
FWIW: 30 GB is not a particularly LARGE data set.
OTOH: If your hard drive is stuffed to capacity or in need of a tune-up, the import may take longer than necessary.
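In SPSS syntax, that "import once, save the system file" workflow might look like the sketch below. The path and the three variable definitions are hypothetical stand-ins (the real file has 453 variables):

```spss
* Read the CSV once (hypothetical path and variables).
GET DATA
  /TYPE=TXT
  /FILE='C:\data\bigfile.csv'
  /DELIMITERS=","
  /QUALIFIER='"'
  /FIRSTCASE=2
  /VARIABLES=
    id F10.0
    region F2.0
    score F8.2.
* Save a system file so the slow text parse never has to be repeated.
SAVE OUTFILE='C:\data\bigfile.sav' /COMPRESSED.
* Later sessions start from the fast binary file instead.
GET FILE='C:\data\bigfile.sav'.
```

The point is to pay the CSV-parsing cost exactly once; every subsequent session opens the .sav file directly.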

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Do not give what is holy to dogs, nor cast your pearls before swine, lest they trample them under their feet."
"When the demons you cast out possessed the pigs, did they go leaping off a bloody cliff into the abyss?"

Re: DEALING WITH LARGE DATASETS

kalyannjoing
thanks for responding..
           It's still running after 3 hours. The case counter is running (100 million cases read so far). It's a CSV file containing 453 variables.

thanks
Kalyan

Re: DEALING WITH LARGE DATASETS

Michael Kruger
I have frequently worked with CSV-delimited text files of 4-5 million
cases and approximately 200 variables, and I can tell you right now
you are going to drive yourself crazy working with data files as large
as 30 GB on a PC. A data file of that size requires much more
computing power than you have.

Michael Kruger


--

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: DEALING WITH LARGE DATASETS

Art Kendall
In reply to this post by kalyannjoing
Are you reading it from a local disk or from a remote server?
If from a remote server, did you use the CACHE command?

How many cases do you anticipate there are in the input file?

Before you started, did you make sure that there was plenty of room for the output file on a disk?

Are your variables strings, long numbers, single digit numbers, etc.?
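A minimal sketch of the CACHE idea, with a hypothetical server path:

```spss
* Reading from a remote server: pull a local working copy first.
GET FILE='\\server\share\bigfile.sav'.
CACHE.
EXECUTE.
* Subsequent data passes read the local cache, not the network.
```

CACHE marks the data for local caching at the next data pass; EXECUTE forces that pass immediately.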


Art Kendall
Social Research Consultants
On 1/7/2013 4:51 AM, kalyannjoing wrote:
--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/DEALING-WITH-LARGE-DATASETS-tp5717247p5717251.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.


Re: DEALING WITH LARGE DATASETS

David Marso
Administrator
In reply to this post by kalyannjoing
In addition to the suggestion of locating more powerful hardware (perhaps not an option, especially if you would also need to acquire an SPSS license for a non-PC version), you REALLY need to ask yourself whether you need access to ALL 453 variables and/or ALL 100-million-plus records of data.
See the SELECT IF and SAMPLE commands.

Dealing with CSV files efficiently is a true PITA, so you might see whether the data can be restructured into a fixed format. If so, you can use tools such as INPUT PROGRAM to hit the file with tweezers rather than an unwieldy hammer (this requires intermediate to advanced SPSS knowledge; subtlety over brute force comes with a price).

Before even going after a "huge" file, you need to determine what you want and need to do with the data and plan out an efficient game plan; otherwise you will be battling extreme hairline recession and a spinning hard disk. You had also better do your research to calculate the eventual size of the resulting SAV file on disk and make sure there is adequate space.

Naively approaching a large file with the first thing that comes to mind (my guess in this case: DATA LIST LIST, reading everything in the file whether you need it or not) is likely to be an incredibly frustrating experience.
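A sketch of that trimming approach; the paths, variable names, sampling fraction, and cutoff are all made up for illustration:

```spss
* Start from a previously saved system file (hypothetical path).
GET FILE='C:\data\bigfile.sav'.
* Take a 1% random sample of cases.
SAMPLE .01.
* And/or keep only the cases you actually need.
SELECT IF (year GE 2010).
* Save only the variables you need, not all 453.
SAVE OUTFILE='C:\data\subset.sav'
  /KEEP=id year score1 score2
  /COMPRESSED.
```

Cutting both rows (SAMPLE, SELECT IF) and columns (/KEEP) early keeps every later data pass proportionally cheaper.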
