SPSSX Discussion

Using a Ram Disk

David Futrell

Using a Ram Disk

Fellow Listers:

I have a syntax file that I run frequently. It reads an employee database (40,000 records), looks at the employee-to-supervisor relationship, and ultimately creates a hierarchy, so that the resulting file contains every supervisor above a certain level in the organization, everyone who reports to him, and anyone else below him in the organization.

The program works fine, but it's a brute-force method that requires re-reading the same file hundreds of times and creating many hundreds of interim files along the way. It takes about 25 minutes to run.

I thought that I could speed this up significantly by using a RAM disk, so I purchased a software package to do this and modified the syntax so that all of the files being read and written were on the RAM drive.

Unfortunately, although the RAM disk appears to work fine, it doesn't really speed up the processing much and it looks like SPSS is still accessing the hard drive frequently during the program execution.

I'd like some advice from someone who understands the inner workings of SPSS on how I might solve this problem and get the program to do all this work without accessing the hard drive.

Thanks,

David Futrell
Workforce Research Consultant
Eli Lilly and Company

Jon K Peck

Re: Using a Ram Disk

The first thing to consider is that Windows itself caches a lot of things in memory already. It tends to use unallocated memory for extra i/o buffers and to keep loaded modules around in memory as long as the memory isn't needed for something else. So I doubt that there are many situations where a RAM disk would help. A RAM disk would partition off some of the memory specifically for file contents, but the tradeoff is that that memory would not be available for other purposes, so it might induce more paging to disk in other areas.

I haven't used a RAM disk myself in quite a few years, but my guess is that it is no help in most situations.

There are two things that might help. First, a 64-bit OS with 64-bit SPSS would allow addressing more memory and, with more physical memory, could speed up processing.

Second, a rewrite of the job using Python programmability could probably eliminate most of the data passes and build the hierarchical relationships much more efficiently. 40,000 cases is not a lot of data, so it's likely that all the hierarchy traversals could be built in memory.

So the tradeoff is between throwing more hardware at the problem and throwing more programming resources at it.
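As a rough illustration of the in-memory approach described above, here is a minimal Python sketch. It is not the poster's code: the function name `all_subordinates`, the `(employee, supervisor)` pair layout, and the sample IDs are all assumptions made for the example, and it says nothing about how the data would be read from or written back to SPSS.

```python
from collections import defaultdict


def all_subordinates(pairs):
    """Map each supervisor to the set of everyone below them, direct or indirect.

    `pairs` is an iterable of (employee, supervisor) tuples; a supervisor of
    None marks someone at the top. This layout is assumed for illustration.
    """
    # One pass over the data: supervisor -> list of direct reports.
    direct = defaultdict(list)
    for employee, supervisor in pairs:
        if supervisor is not None:
            direct[supervisor].append(employee)

    result = {}
    for boss in list(direct):
        seen = set()
        stack = list(direct[boss])
        while stack:
            person = stack.pop()
            if person in seen:
                continue  # guards against cycles in dirty data
            seen.add(person)
            stack.extend(direct.get(person, ()))
        seen.discard(boss)  # a cycle should not make someone their own report
        result[boss] = seen
    return result


# Hypothetical sample: A is at the top, B and E report to A, C and D report to B.
pairs = [("B", "A"), ("C", "B"), ("D", "B"), ("E", "A"), ("A", None)]
reports = all_subordinates(pairs)
```

The file is read once to build the lookup table, and every traversal after that happens in memory, so no interim files are needed. Restricting the output to supervisors above a certain level would be an extra filter on `result`.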

HTH,
Jon Peck
SPSS, an IBM Company
[hidden email]
312-651-3435

From:	David Futrell <[hidden email]>
To:	[hidden email]
Date:	01/29/2010 07:44 AM
Subject:	[SPSSX-L] Using a Ram Disk
Sent by:	"SPSSX(r) Discussion" <[hidden email]>

Richard Ristow

Re: Using a Ram Disk

In reply to this post by David Futrell

At 09:37 AM 1/29/2010, David Futrell wrote:

>I have a syntax file that I run frequently that reads an employee
>database (40,000 records), looks at the employee to supervisor
>relationship, and ultimitately creates an hierarchy such that the
>resulting file contains every supervisor above a certain level in
>the organization and everyone who reports to him and anyone else
>below him in the organization.

That's called 'transitive closure': in your case, if A supervises B
and B supervises C, then A supervises C. I'm not digging out the last
code I wrote to do it, but the number of passes it takes is about the
length of the longest chain.

That's still a multiple-pass loop, and Python's a good choice to run
it, including determining when it's time to stop. Alternatively, Jon
Peck may be thinking of stepping aside from native SPSS altogether,
and managing the data entirely in Python.
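To make the multiple-pass idea concrete, here is a small Python sketch of a transitive closure computed by repeated passes, stopping when a pass adds nothing new. It is not the code referred to above; the function name `transitive_closure` and the `(supervisor, employee)` pair layout are assumptions for the example.

```python
def transitive_closure(supervises):
    """Return (closure, passes) for a set of (supervisor, employee) pairs.

    Each pass extends every known pair by one more direct reporting step;
    the loop stops on the first pass that adds no new pair.
    """
    # supervisor -> set of direct reports, built once.
    direct = {}
    for boss, employee in supervises:
        direct.setdefault(boss, set()).add(employee)

    closure = set(supervises)
    passes = 0
    while True:
        passes += 1
        new_pairs = {
            (boss, lower)
            for boss, employee in closure
            for lower in direct.get(employee, ())
        } - closure
        if not new_pairs:
            return closure, passes
        closure |= new_pairs


# Hypothetical chain: A supervises B, B supervises C, C supervises D.
chain = {("A", "B"), ("B", "C"), ("C", "D")}
closure, passes = transitive_closure(chain)
```

Because each pass only reaches one level further down, the pass count grows with the depth of the reporting chain rather than with the number of employees.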

-Looping onward,
Richard

Jon Fry

Re: Using a Ram Disk

In reply to this post by David Futrell

In addition to the files you name in your commands, Statistics uses lots of temporary disk files, which would account for the hard drive activity you are still seeing. They are all located in the same directory, and you can specify that directory. Look in Edit->Options->File Locations for the temporary folder name.

Jonathan Fry
