I have a data set of around 210k records. I want to do what is called
"fixed effects" modeling, which basically allows me to do a regression
controlling for all those things about the person that I didn't know
before. The way I've been told to do this is to create a dummy variable
for each and every record on each and every record. Something like this:

identifier  person1  person2  person3  person4  ...
person1     1        0        0        0
person2     0        1        0        0
person3     0        0        1        0
person4     0        0        0        1
...

I have a dataset of around 20 people and tested creating all these
dummies using the data/restructure function - it worked fine. (The
variable person contained a 1 for each case, so that the restructure
would turn it into the dummy indicator when converted.) Syntax from my
test:

CASESTOVARS
  /ID = p_ID2 person
  /INDEX = personen
  /GROUPBY = VARIABLE
  /VIND.

However, I started it at about 5:15pm yesterday on my big file, and as
of 10:45 this morning it was still running. The memory usage keeps
going up slowly and the CPU usage is constant (I have a dual-core
system and this is using one of them, so 50% of the overall system
CPU). The system has plenty of temp space and memory (3G; SPSS is
currently using just under 400M). I created a data set that had just my
identifiers in it to run this process against, which was under 5M, so
clearly this isn't a resource issue.

Can I expect this thing to ever finish? If not, what options do I have
other than breaking this file into much smaller pieces and trying to
run this process against those? Any ideas how small they will need to
be to complete? Is there a better way to create all these variables
than using the restructure? Is there some limit in SPSS that I need to
know about that would prevent me from adding 210k dummy variables on to
my cases? (I have around 350 other variables, which is nothing in
comparison.)
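Spelled out, the whole test job was roughly the following. (A sketch
only: the SORT step is there because CASESTOVARS wants the file sorted
by the ID variables, and the COMPUTE step is the "1 for each case" flag
described above; otherwise it is the same syntax as in my test.)

* person is just a constant 1 on every case, so the restructure has a
* value to spread into the per-person dummy indicators.
COMPUTE person = 1.
EXECUTE.

* CASESTOVARS expects the file to be sorted by the ID variables.
SORT CASES BY p_ID2 person.

CASESTOVARS
  /ID = p_ID2 person
  /INDEX = personen
  /GROUPBY = VARIABLE
  /VIND.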
At 12:56 PM 3/3/2007, Rebecca Barber wrote:
>I have a data set of around 210k records. I want to do what is called
>"fixed effects" modeling, which basically allows me to do a regression
>controlling for all those things about the person that I didn't know
>before. The way I've been told to do this is to create a dummy
>variable for each and every record on each and every record.
>Something like this:
>
>identifier  person1  person2  person3  person4  ...
>person1     1        0        0        0
>person2     0        1        0        0
>person3     0        0      	 1        0
>person4     0        0        0        1
>...
>
>I have a dataset of around 20 people and tested creating all these
>dummies using the data/restructure function - it worked fine. [...]
>Syntax from my test:
>CASESTOVARS
>  /ID = p_ID2 person
>  /INDEX = personen
>  /GROUPBY = VARIABLE
>  /VIND.
>
>However I started it at about 5:15pm yesterday on my big file and as
>of 10:45 this morning it was still running. The memory usage keeps
>going up. [...] Clearly this isn't a resource issue.

FIRST, I think it is a resource issue. Running CASESTOVARS to create
hundreds of thousands of variables is inherently a huge user of memory,
and inherently very slow (1). Given that all you want to do is create
dummy variables, and you know how many of them you'll need, there are
much quicker ways.

SECOND, I'm not going to post a "much quicker way" at this point,
because I think the approach is inherently wrong. It sounds like you're
going to enter all 210K dummy variables into your model. 210K variables
for 210K cases isn't just an inadequate sample size; the model is
underdetermined. You can fit the data exactly without even using any
variables but the dummies, which leaves nothing for the other
predictors to explain.

THIRD, if instead of having one observation per person you have a
reasonable number for each (maybe 10-20), then you have a legitimate
ANOVA or ANCOVA problem with a great many cells on one of the
dimensions. Forget about modelling any interactions with that dimension
(person number), but otherwise I think it's a proper analysis.

Caveats:

- Don't take my word for the propriety of this. ANOVA experts on the
  list (Marta? Stephen [Statisticsdoc]?) may want to weigh in. I think
  it will require care in interpretation, at the least. Don't rely on
  the p-value for the F-test; I think it's likely to come out strongly
  significant even for an effect too small to mean anything.

- Does anybody know how to run such a problem - ANOVA with many, many
  cells on one dimension - in SPSS? The procedure she was thinking of,
  with a dummy variable for every cell (except one, or suppress the
  constant), is mathematically correct, but it'll be very slow at best,
  with 210K+ cells in memory. I had a problem like this, but it was a
  half-dozen years ago. We did it in SYSTAT, and I don't have the code.
  I recall that if the data is sorted by cell number, it can be
  reasonably efficient; it's not necessary to store all the cell data
  at once. But I'm rusty on the math, and don't know a way to do it in
  SPSS (though I'd like one).

..........

It's an interesting problem, if more subtle than you may have thought.

-Good wishes and good luck,
 Richard

.................................................
(1) CASESTOVARS, as it applies to your situation. CASESTOVARS can't
know how many variables will be in the output file until it has read
the whole input, so it can't write *anything* out until then. It
probably reads the input and builds a table of the new variables as it
"learns" about them, and keeps that table in memory. (I don't know
whether it also keeps the whole input file. Making a second pass
through the input, once it "knows" what the output will look like,
would save a great deal of memory at a small cost in time.)
In a case like yours, as the run progresses:

a. The table gets bigger and bigger as new variables are added; hence,
   more memory is used.
b. CASESTOVARS looks up every value it reads in that table, to see
   whether the corresponding new variable is already there or must be
   added. As the table gets bigger, that lookup gets slower and slower.
   Possibly the lookup is written with small tables in mind and isn't
   fast with very big ones. In any case, the corresponding lookup in
   AGGREGATE (which builds a similar table) slows down dramatically
   when the number of entries (BREAK groups) is very large.
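To make the AGGREGATE comparison concrete, the kind of job I have in
mind is sketched below - an illustration only, using Rebecca's
identifier p_ID2 as the single BREAK variable (so the table AGGREGATE
builds has one entry per person, about 210K of them) and a made-up
output file name. The second form assumes the file has first been
sorted by that identifier; with PRESORTED, AGGREGATE can form a new
group each time the identifier changes instead of holding the whole
table at once.

* Unsorted input: AGGREGATE builds an in-memory table with one entry
* per BREAK group (about 210K here) and looks every case up in it.
AGGREGATE
  /OUTFILE='person_counts.sav'
  /BREAK=p_ID2
  /n_records=N.

* Alternative: sort first and declare the file PRESORTED, so groups
* are formed sequentially, one at a time.
SORT CASES BY p_ID2.
AGGREGATE
  /OUTFILE='person_counts.sav'
  /PRESORTED
  /BREAK=p_ID2
  /n_records=N.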