Do number of variables affect the speed of calculations ?

Do number of variables affect the speed of calculations ?

Mark Webb-3
My file has about 500 variables & 800k cases.

Computations are slow.

Would it help to make a subset of the relevant variables, do the
computations, and merge the results back into the full database?

Regards

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Do number of variables affect the speed of calculations ?

Mark A Davenport MADAVENP
It's been several versions since I last had to deal with datasets this big,
but I seem to recall that there was a way to force SPSS to use more memory
(page size) to work more efficiently.  One would guess that this is
automatic now, but I wonder if something like this might help.  One of the
SPSS techies may have to chime in on this.  One other thing you might try
is going to Edit, Options, and then to the Data tab.  Set the system to
calculate only as needed, not immediately.  This may not be an issue, but
it is worth a try.  BTW, it would help if we knew whether you are using v15 or
16.  16 seems to be a bit slower regardless.

Mark

***************************************************************************************************************************************************************
Mark A. Davenport Ph.D.
Senior Research Analyst
Office of Institutional Research
The University of North Carolina at Greensboro
336.256.0395
[hidden email]

'An approximate answer to the right question is worth a good deal more
than an exact answer to an approximate question.' --a paraphrase of J. W.
Tukey (1962)







Re: Do number of variables affect the speed of calculations ?

Fry, Jonathan B.
In reply to this post by Mark Webb-3
Try doing a simple Descriptives for just the mean of one variable.  That statistic takes almost no time to calculate, so the time required is just the time to pass the data.

If that seems long, create an extract with the variables needed for some analysis and try the same experiment.

Jonathan Fry
SPSS Inc.



Re: Do number of variables affect the speed of calculations ?

Richard Ristow
In reply to this post by Mark Webb-3
At 12:57 AM 11/2/2007, Mark Webb wrote:

>My file has about 500 variables & 800k cases. Computations are slow.
>
>Would it help to make a sub-set of the relevant variables, do the
>computations, and merge the results back into the total data base ?

That shouldn't help with the computations themselves. What it CAN help
with is the speed of reading your file, which is the limiting factor in
most computations.

But that leads me to: What computations are you doing? And: do you have
any EXECUTE statements? This is exactly the case where they may slow you
down badly.

(Working on a subset of the variables does, as I say, make it faster to
read the file; so if you do have to read the file multiple times, it
will help. But first, see if you can avoid reading it multiple times.)
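
[Editor's illustration] A rough back-of-the-envelope sketch of why a
subset is faster to read. It assumes every variable is numeric and
stored uncompressed at 8 bytes per value, with no per-case overhead;
the 20-variable extract is a made-up figure. Real .sav files are often
compressed, so treat the numbers as orders of magnitude only.

```python
# Rough estimate of the bytes read in one pass over the data.
# Assumptions (not from the original posts): all variables numeric,
# 8 bytes per value, no compression, no per-case overhead.
BYTES_PER_VALUE = 8


def pass_size_bytes(n_cases: int, n_vars: int) -> int:
    """Bytes that must be read to pass the whole file once."""
    return n_cases * n_vars * BYTES_PER_VALUE


full = pass_size_bytes(800_000, 500)   # the file described in the question
subset = pass_size_bytes(800_000, 20)  # hypothetical 20-variable extract

print(f"full file : {full / 1e9:.2f} GB per pass")
print(f"subset    : {subset / 1e9:.3f} GB per pass")
print(f"ratio     : {full / subset:.0f}x")
```

Under these assumptions the saving is multiplied by the number of data
passes, which is why removing unnecessary passes comes first.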

-Best of luck,
  Richard



Below is a FAQ (still in draft) on number of variables and cases, with
more on this issue.


FAQ: How many variables and cases are allowed in SPSS?

Below is a discussion by Jon Peck of SPSS, Inc., which applies to all
SPSS versions that are likely to be in use - back to release 9, at
least.

I add to what Jon wrote:

. For most operations, increasing the number of cases will increase the
running time about in proportion. Usually, SPSS can handle a great many
cases gracefully. Jon, below, notes some operations for which many
cases may slow SPSS badly, but often a workaround can be found even for
these.

. Increasing the number of variables will generally increase the
running time about in proportion, even if you're not using them all,
because the running time is dominated by the time to read the file from
disk, i.e. by the total file size.

. After some point hard to estimate (though larger if the machine has
more RAM), increasing the number of variables will increase the running
time out of all proportion, because putting the whole dictionary and
data for one case in RAM may require paging.

. I emphasize Jon's point that "modern database practice would be to
break up your variables into cohesive subsets", i.e. to restructure
with more cases and fewer variables. A typical example is changing from
one record per entity with data for many years, to one record per
entity per year. I've posted a number of solutions in which data is
given such a 'long' representation with many cases, instead of a 'wide'
representation with many variables.
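
[Editor's illustration] The 'wide' to 'long' change described above can
be sketched in a few lines of Python. The variable names (id, inc2005,
and so on) and the values are hypothetical, invented only for this
example; in SPSS itself this kind of restructuring is what the
VARSTOCASES command is for.

```python
# Wide layout: one record per entity, one variable per year.
# All names and values here are hypothetical illustration data.
wide = [
    {"id": 1, "inc2005": 100, "inc2006": 110, "inc2007": 125},
    {"id": 2, "inc2005": 80, "inc2006": 85, "inc2007": 90},
]


def wide_to_long(records, prefix="inc"):
    """Return one record per entity per year from year-suffixed variables."""
    long_rows = []
    for rec in records:
        for name, value in rec.items():
            if name.startswith(prefix):
                year = int(name[len(prefix):])  # "inc2005" -> 2005
                long_rows.append({"id": rec["id"], "year": year, "income": value})
    return long_rows


long_rows = wide_to_long(wide)
for row in long_rows:
    print(row)
```

Two wide records with three yearly variables each become six long
records with three variables (id, year, income), so the file gets
longer but each case gets narrower.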

At 10:25 AM 6/5/2003, Peck, Jon [of SPSS, Inc.] wrote:

>There are several points to make regarding very wide files and huge
>datasets.
>
>First, the theoretical SPSS limits are
>
>Number of variables: (2**31) -1
>Number of cases: (2**31) - 1
>
>In calculating these limits, count one for each 8 bytes or part
>thereof of a string variable.  An a10 variable counts as two
>variables, for example.
>
>Approaching the theoretical limit on the number of variables, however,
>is a very bad idea in practice for several reasons.
>
>1. These are the theoretical limits in that you absolutely cannot go
>beyond them.  But there are other environmentally imposed limits that
>you will surely hit first.  For example, Windows applications are
>absolutely limited to 2GB of addressable memory, and 1GB is a more
>practical limit.  Each dictionary entry requires about 100 bytes of
>memory, because in addition to the variable name, other variable
>properties also have to be stored.  (On non-Windows platforms, SPSS
>Server could, of course, face different environmental
>limits.)  Numerical variable values take 8 bytes as they are held as
>double precision floating point values.
>
>2. The overhead of reading and writing extremely wide cases when you
>are doubtless not using more than a small fraction of them will limit
>performance.  And you don't want to be paging the variable
>dictionary.  If you have lots of RAM, you can probably reach between
>32,000 and 100,000 variables before memory paging degrades performance
>seriously.
>
>3. Dialog boxes cannot display very large variable lists.  You can use
>variable sets to restrict the lists to the variables you are really
>using, but lists with thousands of variables will always be awkward.
>
>4. Memory usage is not just about the dictionary.  The operating
>system will almost always be paging code and data between memory and
>disk.  (You can look at paging rates via the Windows Task
>Manager).  The more you page, the slower things get, but the variable
>dictionary is only one among many objects that the operating system is
>juggling.  However, there is another effect.  On NT and later, Windows
>automatically caches files (code or data) in memory so that it can
>retrieve it quickly.  This cache occupies memory that is otherwise
>surplus, so if any application needs it, portions of the cache are
>discarded to make room.  You can see this effect quite clearly if you
>start SPSS or any other large application; then shut it down and start
>it again.  It will load much more quickly the second time, because it
>is retrieving the code modules needed at startup from memory rather
>than disk.  The Windows cache, unfortunately, will not help data
>access very much unless most of the dataset stays in memory, because
>the cache will generally hold the most recently accessed data.  If you
>are reading cases sequentially, the one you just finished with is the
>LAST one you will want again.
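
[Editor's illustration] The counting rule and the memory figures quoted
above can be turned into a small calculator. This is only a sketch: it
takes "about 100 bytes" per dictionary entry and 8 bytes per numeric
value at face value, and it assumes one dictionary entry per variable.
The function names are invented for this example.

```python
import math

MAX_COUNT = 2**31 - 1       # theoretical limit on variables (and on cases)
BYTES_PER_VALUE = 8         # numeric values are double-precision floats
DICT_BYTES_PER_ENTRY = 100  # "about 100 bytes" per entry, per the post


def counted_variables(string_width=None):
    """Variables counted toward the limit.

    A numeric variable (string_width=None) counts as one; a string
    counts one for each 8 bytes or part thereof.
    """
    if string_width is None:
        return 1
    return math.ceil(string_width / 8)


def dictionary_bytes(n_entries):
    """Approximate dictionary memory, assuming one entry per variable."""
    return n_entries * DICT_BYTES_PER_ENTRY


print("a10 counts as", counted_variables(10), "variables")
print("theoretical limit:", MAX_COUNT)
for n in (500, 32_000, 100_000):
    print(f"{n:>7} numeric variables: "
          f"{dictionary_bytes(n) / 1e6:.2f} MB dictionary, "
          f"{n * BYTES_PER_VALUE} bytes per case")
```

On these figures the dictionary for a 500-variable file is tiny; it is
the width of each case, read on every pass, that dominates.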
