At 04:01 PM 4/3/2014, bwyker wrote:
>The syntax is pretty simple. Just a restructure, case to vars: > >CASESTOVARS > /ID=ID > /GROUPBY=VARIABLE. > >>Useful info here would be to know the actual number of unique ID's, the max >>number of Rx values and dates. Is the file PRESORTED by ID and Date? > >I understand there are going to be A LOT of variables. At most, >probably somewhere around 100 - 200 IDs. Most of the 45 variables >are actually the same for all IDs... This is a shot in the dark, but the next thing I'd try is CASESTOVARS /ID=ID /GROUPBY = VARIABLE /FIXED = <list of variables you know don't vary> /AUTOFIX = NO. AUTOFIX requires checking each variable in the file to see whether it varies or not, and I can see it taking a lot of time in a large file. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
In reply to this post by bwyker
Well, it seems that anything you do will take a significant amount of
time. However, if you chop up the file into different files based on number of ids, you can run several analyses concurrently if you have the additional PCs to use. Depending upon what resources are available to you, this type of distributed processing could save a significant amount of time. -Mike Palij New York University [hidden email] ----- Original Message ----- From: "bwyker" <[hidden email]> To: <[hidden email]> Sent: Thursday, April 03, 2014 4:04 PM Subject: Re: Cases to vars with a very large dataset > Thanks Mike. I have thought of this. The trouble is, anything I do > with this > data take a long time, including cutting it up into chunks. I may be > forced > to do this, but I'm hoping there is a way to do it without having to > spend > the time slicing and dicing datasets. > > > > -- > View this message in context: > http://spssx-discussion.1045642.n5.nabble.com/Cases-to-vars-with-a-very-large-dataset-tp5725171p5725248.html > Sent from the SPSSX Discussion mailing list archive at Nabble.com. > > ===================== > To manage your subscription to SPSSX-L, send a message to > [hidden email] (not to SPSSX-L), with no body text except > the > command. To leave the list, send the command > SIGNOFF SPSSX-L > For a list of commands to manage subscriptions, send the command > INFO REFCARD ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
In reply to this post by David Marso
Thanks for the assistance. I took a sample of a few thousand cases and the maximum number of rows per ID is 103. So, using the previous example, the variable list for this sample should look like this after the transpose:
ID RXDate1 RXDate2 ... RXDate103 RXCode1 RXCode2 ... RXCode103 So, it sounds like what you and others are suggesting makes sense. If I must do this transpose, it's best to do it in pieces and then stitch it back together. I think I'll take out the static variables and then see what's left. Thanks again for the help. |
In reply to this post by Richard Ristow
Ahh...I did not know about the Autofix. This is helpful, thanks!
|
At 01:47 PM 4/4/2014, bwyker wrote:
>Ahh...I did not know about the Autofix. This is helpful, thanks! You'd written, >I'm struggling restructure an enormous file in SPSS. The main file >is a merge of data from 14 states, totaling over 411 million rows of >data (45 variables) in a file that is 124 GB in size. I also have >these data separated by state and year (2007 and 2008), which brings >the file down to about 12 GB. I'll be very interested in how it goes, if you try one of the 12GB files with AUTOFIX = NO. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Richard,
Yes, the AUTOFIX appears to be the hiccup that was keeping the restructure from working. I ran the CASESTOVARS command with the fixed variables defined and I had a transposed dataset in about an hour. I can't thank you enough - huge help! I did end up with a set of 840 variables for each transposed (e.g., var1.1 var1.2...var1.804). This sounds like a data cleaning issue, but, overall, I think I'm in good shape. Thanks again, Brett |
At 03:03 PM 4/7/2014, bwyker wrote:
>Yes, the AUTOFIX appears to be the hiccup that was keeping the >restructure from working. I ran the CASESTOVARS command with the >fixed variables defined and I had a transposed dataset in about an >hour. I can't thank you enough - huge help! Great! I'm really glad. >I did end up with a set of 840 variables for each transposed (e.g., >var1.1 var1.2...var1.804). This sounds like a data cleaning issue, >but, overall, I think I'm in good shape. This is a place where it's important to look at the underlying meaning of your input variables. From your original posting, it seems clear that you have the ID; multiple pairs of RXDate and RXCode for each ID; and 40-some other variables. Your variables should fall into two categories, and you have a data-structure problem if they don't: . Those that are constant for any value of ID (which should be named on the FIXED= list for CASESTOVARS) . Those that pertain to the prescriptions. It should be clear from each variable's definition which category it falls into; if it isn't, or if there are any variables that aren't in one of those categories, there's a problem with understanding the meaning of your data, not with SPSS coding. And if any variables that should be FIXED within ID turn out not to be so, you're right, it's a data-cleaning issue. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Administrator
|
"And if any variables that should be FIXED within ID turn out not to
be so, you're right, it's a data-cleaning issue. " Which will be horrible without a few shortcuts/techniques. For variables which should be fixed within ID consider AGGREGATE breaking on ID and getting MIN and MAX for each of them. If these are not equal within a variable then you have a data issue. DATASET DECLARE agg. AGGREGATE OUTFILE agg / BREAK ID / <fixed vars min list> = MIN(fixed var list) / <fixed vars max list> = MAX(fixed var list) / COUNT=N. DATASE ACTIVATE agg. DO REPEAT min=<fixed vars min list>/max=<fixed vars max list>/var=<list of quoted fixed varnames>. DO IF (min NE max). PRINT / ID var min max. END IF. EXECUTE. /* or other data passing command */
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Free forum by Nabble | Edit this page |