massive numbers of variables


Rebecca Barber
I have a data set of around 210k records.  I want to do what is
called "fixed effects" modeling, which basically allows me to do a
regression controlling for all those things about the person that I didn't
know before.  The way I've been told to do this is to create a dummy
variable for each and every person, on each and every record.  Something
like this:

identifier   person1  person2  person3  person4 ...
person1        1        0        0        0
person2        0        1        0        0
person3        0        0        1        0
person4        0        0        0        1
...

I have a dataset of around 20 people and tested creating all these dummies
using the data/restructure function - it worked fine.  (The variable person
contained a 1 for each case, so that the conversion would produce the dummy.)
Syntax from my test:
CASESTOVARS
 /ID = p_ID2 person
 /INDEX = personen
 /GROUPBY = VARIABLE
 /VIND.

However, I started it at about 5:15pm yesterday on my big file, and as of
10:45 this morning it was still running.  The memory usage keeps going up
slowly and the CPU usage is constant (I have a dual-core system and this
is using one core, so 50% of the overall system CPU).  The system has
plenty of temp space and memory (3G; SPSS is currently using just under
400M).  I created a data set that had just my identifiers in it to run
this process against, which was under 5M, so clearly this isn't a resource
issue.

Can I expect this thing to ever finish?

If not, what options do I have other than breaking this file into much
smaller pieces and trying to run this process against those?  Any ideas
how small they will need to be to complete?

Is there a better way to create all these variables than using the
restructure?  (One alternative is sketched at the end of this post.)

Is there some limit in SPSS that I need to know about that would prevent
me from adding 210k dummy variables to my cases?  (I have around 350 other
variables, which is nothing in comparison.)
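
For illustration only, a minimal sketch of one way to build such dummies
without restructuring, using VECTOR and LOOP.  It assumes a hypothetical
numeric index variable pindex that numbers the people 1 through 210000
(not in the data as described); whether building 210k dummies is advisable
at all is taken up in the reply below.

* Sketch only: pindex is a hypothetical 1..210000 person index.
VECTOR person(210000, F1.0).
LOOP #i = 1 TO 210000.
  COMPUTE person(#i) = 0.
END LOOP.
COMPUTE person(pindex) = 1.
EXECUTE.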

Re: massive numbers of variables

Richard Ristow
At 12:56 PM 3/3/2007, Rebecca Barber wrote:

>I have a data set of around 210k records.  I want to do what is called
>"fixed effects" modeling, which basically allows me to do a regression
>controlling for all those things about the person that I didn't know
>before.  The way I've been told to do this is to create a dummy
>variable for each and every person, on each and every
>record.  Something like this:
>
>identifier   person1  person2  person3  person4 ...
>person1        1        0        0        0
>person2        0        1        0        0
>person3        0        0        1        0
>person4        0        0        0        1
>...
>
>I have a dataset of around 20 people and tested creating all these
>dummies using the data/restructure function - it worked fine. [...]
>Syntax from my test:
>CASESTOVARS
>  /ID = p_ID2 person
>  /INDEX = personen
>  /GROUPBY = VARIABLE
>  /VIND.
>
>However I started it at about 5:15pm yesterday on my big file and as
>of 10:45 this morning it was still running.  The Memory usage keeps
>going up. [...] Clearly this isn't a resource issue.

FIRST, I think it is a resource issue. Running CASESTOVARS to create
hundreds of thousands of variables is inherently a huge user of memory,
and inherently very slow(1). Given that all you want to do is create
dummy variables, and you know how many of them you'll need, there are
much quicker ways.

SECOND, I'm not going to post a 'much quicker way' at this point,
because I think the approach is inherently wrong.

It sounds like you're going to enter all 210K dummy variables into your
model. 210K variables for 210K cases isn't just an inadequate sample size;
the model is underdetermined: with at least as many parameters as cases,
you can fit the data exactly without even using any variables but the
dummies.

THIRD, if instead of having one observation per person you have a
reasonable number for each (maybe 10-20), then you have a legitimate
ANOVA or ANCOVA problem with a great many cells on one of the
dimensions. Forget about modelling any interactions with that dimension
(person number), but otherwise I think it's a proper analysis.

Caveats:

- Don't take my word for the propriety of this; ANOVA experts on the
list (Marta? Stephen [Statisticsdoc]?) should weigh in. I think it'll take
care in interpretation, at the least. Don't rely on the p-value for the
F-test; I think it's likely to come out strongly significant for an effect
too small to mean anything.

- Does anybody know how to run such a problem - ANOVA with many, many
cells on one dimension - in SPSS? The procedure she was thinking of, with
a dummy variable for every cell (except one, or with the constant
suppressed), is mathematically correct, but it'll be very slow at best,
with 210K+ cells in memory. (A nominal specification is sketched below.)

I had a problem like this, but it was a half-dozen years ago. We did it
in SYSTAT, and I don't have the code. I recall that if the data is
sorted by cell number, it can be reasonably efficient; it's not
necessary to store all the cell data at once. But I'm rusty on the
math, and don't know a way to do it in SPSS (though I'd like one).
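
For illustration only, a minimal sketch of how such a model would
nominally be specified in SPSS: treat the person identifier as a factor
in GLM/UNIANOVA rather than creating any dummies.  The names here are
hypothetical - y for the outcome, x1 and x2 for covariates, pindex for
the person factor - and whether the procedure can actually cope with
200K+ factor levels is exactly the open question above.

* Sketch only: y, x1, x2 and pindex are hypothetical variable names.
UNIANOVA y BY pindex WITH x1 x2
  /DESIGN = pindex x1 x2.

This is equivalent to the all-the-dummies regression (less one reference
category), but it leaves the coding to the procedure instead of storing
210K+ extra variables in the data file.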
..........

It's an interesting problem, if more subtle than you may have thought.

-Good wishes and good luck,
  Richard
.................................................
(1) CASESTOVARS, as it applies to your situation.

CASESTOVARS can't know how many variables will be in the output file
until it has read the whole input, so it can't write *anything* out until
then. It probably reads the input and builds a table of the new
variables as it 'learns' about them, and keeps the table in memory. (I
don't know whether it also keeps the whole input file. Making a second
pass through the input, once it 'knows' what the output will look like,
would save a great deal of memory at a small cost in time.)

In a case like yours, as the run progresses,

a. The table gets bigger and bigger as new variables are added; hence,
more memory used.

b. CASESTOVARS looks up every value it reads in that table, to see
whether the corresponding new variable is already there or must be
added. As the table gets bigger, that lookup gets slower and slower.
Possibly the lookup is written with small tables in mind and isn't fast
with very big ones. In any case, the corresponding lookup in AGGREGATE
(which builds a similar table) slows down dramatically when the number
of entries (BREAK groups) is very large.