I have data that is arranged in the following way:
ID Month Value 1 1 34 1 2 44 ... 1 23 44 2 1 55 2 2 43 ... 2 23 45 ... I would typically pivot the data using CASESTOVARS so that one ID is one case and the values go across as 23 variables (Value.1 Value.2 ...Value.23). My issue right now is that I have 4 million rows that need to be pivoted and even after I tried to split the file into 5 datasets it is taking very long to do so. Does anybody know whether this is a linear progress in SPSS (or whatever the technical term) in that 4 million rows take 10 times as long as 400,000 rows or is this a nonlinear relationship? If it is a non-linear relationship, where do things start to become slow? Or is there another suggestion how to pivot the data? ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
I don't know that this method will avoid the problem you are having with VtoC and I don't know that it will be faster but it is the old alternative, apropos of David's comments this morning. There's always a chance that there might be syntax problems, so it'd be good to see the syntax.
numeric Value(23). Do repeat v=value1 to value23/m=1 to 23. + if (month eq m) v=value. End repeat. * I specify 'presorted' because I assume your file is presorted. If not, take it out. Run time will go up but I * have no idea how much. Aggregate outfile=*/presorted/break=id/value1 to value23=first(value1 to value23). Execute. -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Matt Kretschmer Sent: Friday, September 28, 2012 3:16 PM To: [hidden email] Subject: Pivot via CASESTOVARS I have data that is arranged in the following way: ID Month Value 1 1 34 1 2 44 ... 1 23 44 2 1 55 2 2 43 ... 2 23 45 ... I would typically pivot the data using CASESTOVARS so that one ID is one case and the values go across as 23 variables (Value.1 Value.2 ...Value.23). My issue right now is that I have 4 million rows that need to be pivoted and even after I tried to split the file into 5 datasets it is taking very long to do so. Does anybody know whether this is a linear progress in SPSS (or whatever the technical term) in that 4 million rows take 10 times as long as 400,000 rows or is this a nonlinear relationship? If it is a non-linear relationship, where do things start to become slow? Or is there another suggestion how to pivot the data? ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Administrator
|
In reply to this post by Eric Black
My guess is it would be linear. What do you mean by very long?
One guess is that SPSS needs to scan the data to determine the maximum month and then construct the dictionary. Not sure if this will be faster but alternatively (old SKOOL). VECTOR value_ (23). COMPUTE value_(Month)=Value. AGGREGATE OUTFILE * / PRESORTED/ BREAK ID / Value.1 TO Value.23=MAX(Value_1 TO Value_23). Perhaps report back to let us know if it is faster or slower?
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Administrator
|
In reply to this post by Maguin, Eugene
Much better to use the VECTOR approach posted below for reasons of efficiency.
i.e. Using the month as a 'pointer' into the vector. --- **This first line will issue error ** !. numeric Value(23). ** You are doing 23 checks to place a single value **. Do repeat v=value1 to value23/m=1 to 23. + if (month eq m) v=value. End repeat.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
In reply to this post by Eric Black
Thanks so much. Let me check how that works. Right now, 1/5 of the data is
still running after a couple of hours. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
In reply to this post by Eric Black
Much faster, not even a minute.
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Administrator
|
In reply to this post by Eric Black
I simulated some 5 million rows and both the C2V and the V+A ran in less than 1 minute.
Something must be weird on your system. Are you reading the file over a network? -- CASESTOVARS /ID=id /INDEX=month /GROUPBY=VARIABLE. Resources Processor Time 00:00:12.54 Elapsed Time 00:00:13.38
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
In reply to this post by Eric Black
Are you perhaps
operating on a remote server?
These times were produced by the syntax below. Processing Statistics Cases In 4000000 Cases Out 160000 Cases In/Cases Out 25.0 Variables In 3 Variables Out 26 Index Values 25 {use fixed format to see clearly] 'start casesto vars' 3768952:43:13.379 'end timer' 3768952:43:18.698 This syntax new file. input program. loop id = 1 to 160000. leave id. loop month = 1 to 25. compute myvalue = rnd(rv.normal(50,10)). end case. end loop. end loop. end file. end input program. DO IF $CASENUM=1. PRINT /"'start generation'" $time (time20.3). END IF. execute. do IF $CASENUM=1. PRINT /"'start casesto vars'" $time (time20.3). END IF. CASESTOVARS /ID=id /INDEX=month /GROUPBY=VARIABLE. DO IF $CASENUM=1. PRINT /"'end timer'" $time (time20.3). END IF. execute. Art Kendall Social Research ConsultantsOn 9/28/2012 3:15 PM, Matt Kretschmer wrote: I have data that is arranged in the following way: ID Month Value 1 1 34 1 2 44 ... 1 23 44 2 1 55 2 2 43 ... 2 23 45 ... I would typically pivot the data using CASESTOVARS so that one ID is one case and the values go across as 23 variables (Value.1 Value.2 ...Value.23). My issue right now is that I have 4 million rows that need to be pivoted and even after I tried to split the file into 5 datasets it is taking very long to do so. Does anybody know whether this is a linear progress in SPSS (or whatever the technical term) in that 4 million rows take 10 times as long as 400,000 rows or is this a nonlinear relationship? If it is a non-linear relationship, where do things start to become slow? Or is there another suggestion how to pivot the data? ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Art Kendall
Social Research Consultants |
Free forum by Nabble | Edit this page |