Working with Large Data Files

Working with Large Data Files

Joe-256
I am having difficulty with some large claims data files (5 GB+).  I cannot
even get them to open most of the time, and when they do, running simple
frequencies can take up to 10 minutes.  A colleague of mine has no problems
working with these datasets in Stata on an Acer laptop.  I am working on a
Dell XPS 430 with an Intel Quad Core 2.5 GHz processor and 6.00 GB of RAM.
The operating system is 32-bit Vista.

Is this an SPSS problem or a problem with my machine's capabilities?  Does
anyone work with datasets of this size?  If so, what kind of setup do you
have?  I understand purchasing SPSS server would most likely fix these
issues, but I was quoted a price of $30,000, and I work for a small business.

Thanks

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Working with Large Data Files

Richard Ristow
At 12:58 PM 12/18/2009, Joe wrote:

I am having difficulty with some large claims data files (5 GB+).  I cannot even get them to open most of the time, and when they do, running simple frequencies can take up to 10 minutes. I am working on a Dell XPS 430 with an Intel Quad Core 2.5 GHz processor with 6.00 GB of RAM.  The operating system is 32 bit Vista.

SPSS should handle that size file without trouble, and your machine is more than powerful enough.

Conceivably, there are Vista issues; I can't speak to that. Some more questions:

  • What version of SPSS are you running? It shouldn't matter, but versions do have their little quirks.
  • How much free disk space do you have? You should have several times the size of the file you're working with. (But with modern disk sizes, I'd be surprised if this problem arises.)
  • How many variables are in the file, and how many cases? Generally, SPSS handles any number of cases in time approximately linear in the number of cases. It can get awkward with a great many variables. (But your large RAM will reduce the chance of many variables slowing SPSS down.)
  • How many different categories are in your frequency table, counting all categories for all variables you're including in the FREQUENCIES command? A very large number of categories can bog down SPSS. (But very large counts within categories will not bog down SPSS.)
  • Finally, you write, "I cannot even get [the files] to open most of the time". Are you using the Data Editor? Loading very large files in the Data Editor can be cumbersome; in this instance, a large number of cases can slow processing.
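
To illustrate the point above about categories versus counts, here is a rough sketch in Python (not SPSS syntax, and not a description of how SPSS works internally). It assumes a frequency table is built by tallying values in a single pass; the function name and the sample data are made up for the example.

```python
from collections import Counter
from typing import Hashable, Iterable


def stream_frequencies(values: Iterable[Hashable]) -> Counter:
    """Tally values one at a time in a single pass over the data."""
    counts: Counter = Counter()
    for value in values:      # one pass: work grows linearly with the number of cases
        counts[value] += 1    # a repeated category only increments an existing entry
    return counts


# Hypothetical data: one million cases, but only three distinct categories.
few_categories = stream_frequencies(i % 3 for i in range(1_000_000))

# Hypothetical data: far fewer cases, but every case is its own category.
many_categories = stream_frequencies(range(100_000))

print(len(few_categories), len(many_categories))
```

In this sketch the first table holds three entries however many cases are read, while the second needs one entry per case. That is the sense in which a very large number of categories can be costly even when large counts within a category are not.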

If you are using the Data Editor, what happens if you skip it, and run through syntax only? For example (untested),

GET FILE='My5Gigs.sav'.
FREQUENCIES VARIABLES=VariableOfInterest.

replacing the names of the file and variable(s) by the ones for your project.

-Best of luck to you,
 Richard