Pivot via CASESTOVARS

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Pivot via CASESTOVARS

Eric Black
I have data that is arranged in the following way:

ID Month Value
1 1 34
1 2 44
...
1 23 44
2 1 55
2 2 43
...
2 23 45
...

I would typically pivot the data using CASESTOVARS so that one ID is one
case and the values go across as 23 variables (Value.1 Value.2 ...Value.23).
My issue right now is that I have 4 million rows that need to be pivoted and
even after I tried to split the file into 5 datasets it is taking very long
to do so.

Does anybody know whether this is a linear progress in SPSS (or whatever the
technical term) in that 4 million rows take 10 times as long as 400,000 rows
or is this a nonlinear relationship?
If it is a non-linear relationship, where do things start to become slow?
Or is there another suggestion how to pivot the data?

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Pivot via CASESTOVARS

Maguin, Eugene
I don't know that this method will avoid the problem you are having with VtoC and I don't know that it will be faster but it is the old alternative, apropos of David's comments this morning. There's always a chance that there might be syntax problems, so it'd be good to see the syntax.

numeric Value(23).
Do repeat v=value1 to value23/m=1 to 23.
+  if (month eq m) v=value.
End repeat.

*   I specify 'presorted' because I assume your file is presorted. If not, take it out. Run time will go up but I
*   have no idea how much.

Aggregate outfile=*/presorted/break=id/value1 to value23=first(value1 to value23).
Execute.



-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Matt Kretschmer
Sent: Friday, September 28, 2012 3:16 PM
To: [hidden email]
Subject: Pivot via CASESTOVARS

I have data that is arranged in the following way:

ID Month Value
1 1 34
1 2 44
...
1 23 44
2 1 55
2 2 43
...
2 23 45
...

I would typically pivot the data using CASESTOVARS so that one ID is one case and the values go across as 23 variables (Value.1 Value.2 ...Value.23).
My issue right now is that I have 4 million rows that need to be pivoted and even after I tried to split the file into 5 datasets it is taking very long to do so.

Does anybody know whether this is a linear progress in SPSS (or whatever the technical term) in that 4 million rows take 10 times as long as 400,000 rows or is this a nonlinear relationship?
If it is a non-linear relationship, where do things start to become slow?
Or is there another suggestion how to pivot the data?

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Pivot via CASESTOVARS

David Marso
Administrator
In reply to this post by Eric Black
My guess is it would be linear.  What do you mean by very long?
One guess is that SPSS needs to scan the data to determine the maximum month and then construct the dictionary.

Not sure if this will be faster but alternatively (old SKOOL).

VECTOR value_ (23).
COMPUTE value_(Month)=Value.
AGGREGATE OUTFILE * / PRESORTED/ BREAK ID / Value.1 TO Value.23=MAX(Value_1 TO Value_23).

Perhaps report back to let us know if it is faster or slower?

matti wrote
I have data that is arranged in the following way:

ID Month Value
1 1 34
1 2 44
...
1 23 44
2 1 55
2 2 43
...
2 23 45
...

I would typically pivot the data using CASESTOVARS so that one ID is one
case and the values go across as 23 variables (Value.1 Value.2 ...Value.23).
My issue right now is that I have 4 million rows that need to be pivoted and
even after I tried to split the file into 5 datasets it is taking very long
to do so.

Does anybody know whether this is a linear progress in SPSS (or whatever the
technical term) in that 4 million rows take 10 times as long as 400,000 rows
or is this a nonlinear relationship?
If it is a non-linear relationship, where do things start to become slow?
Or is there another suggestion how to pivot the data?

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Pivot via CASESTOVARS

David Marso
Administrator
In reply to this post by Maguin, Eugene
Much better to use the VECTOR approach posted below for reasons of efficiency.
i.e. Using the month as a 'pointer' into the vector.
---
**This first line will issue error  ** !.
numeric Value(23).
** You are doing 23 checks to place a single value **.
Do repeat v=value1 to value23/m=1 to 23.
+  if (month eq m) v=value.
End repeat.

Maguin, Eugene wrote
I don't know that this method will avoid the problem you are having with VtoC and I don't know that it will be faster but it is the old alternative, apropos of David's comments this morning. There's always a chance that there might be syntax problems, so it'd be good to see the syntax.

numeric Value(23).
Do repeat v=value1 to value23/m=1 to 23.
+  if (month eq m) v=value.
End repeat.

*   I specify 'presorted' because I assume your file is presorted. If not, take it out. Run time will go up but I
*   have no idea how much.

Aggregate outfile=*/presorted/break=id/value1 to value23=first(value1 to value23).
Execute.



-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Matt Kretschmer
Sent: Friday, September 28, 2012 3:16 PM
To: [hidden email]
Subject: Pivot via CASESTOVARS

I have data that is arranged in the following way:

ID Month Value
1 1 34
1 2 44
...
1 23 44
2 1 55
2 2 43
...
2 23 45
...

I would typically pivot the data using CASESTOVARS so that one ID is one case and the values go across as 23 variables (Value.1 Value.2 ...Value.23).
My issue right now is that I have 4 million rows that need to be pivoted and even after I tried to split the file into 5 datasets it is taking very long to do so.

Does anybody know whether this is a linear progress in SPSS (or whatever the technical term) in that 4 million rows take 10 times as long as 400,000 rows or is this a nonlinear relationship?
If it is a non-linear relationship, where do things start to become slow?
Or is there another suggestion how to pivot the data?

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Pivot via CASESTOVARS

Eric Black
In reply to this post by Eric Black
Thanks so much.  Let me check how that works.  Right now, 1/5 of the data is
still running after a couple of hours.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Pivot via CASESTOVARS

Eric Black
In reply to this post by Eric Black
Much faster, not even a minute.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Pivot via CASESTOVARS

David Marso
Administrator
In reply to this post by Eric Black
I simulated some 5 million rows and both the C2V and the V+A ran in less than 1 minute.
Something must be weird on your system.  Are you reading the file over a network?
--
CASESTOVARS
  /ID=id
  /INDEX=month
  /GROUPBY=VARIABLE.

Resources Processor Time 00:00:12.54
        Elapsed Time 00:00:13.38
matti wrote
Thanks so much.  Let me check how that works.  Right now, 1/5 of the data is
still running after a couple of hours.







=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Pivot via CASESTOVARS

Art Kendall
In reply to this post by Eric Black
Are you perhaps operating on a remote server?
These times were produced by the syntax below.
Processing Statistics   
Cases In    4000000
Cases Out    160000
Cases In/Cases Out    25.0
Variables In    3
Variables Out    26
Index Values    25
{use fixed format to see clearly]
'start casesto vars'   3768952:43:13.379
'end timer'            3768952:43:18.698




This syntax
new file.
input program.
   loop id = 1 to 160000.
      leave id.
      loop month = 1 to 25.
         compute myvalue = rnd(rv.normal(50,10)).
         end case.
      end loop.
   end loop.
   end file.
end input program.
DO IF $CASENUM=1.
   PRINT /"'start generation'"   $time (time20.3).
END IF.
execute.
do IF $CASENUM=1.
   PRINT /"'start casesto vars'"   $time (time20.3).
END IF.
CASESTOVARS
  /ID=id
  /INDEX=month
  /GROUPBY=VARIABLE.
DO IF $CASENUM=1.
   PRINT /"'end timer'"   $time (time20.3).
END IF.
execute.



Art Kendall
Social Research Consultants
On 9/28/2012 3:15 PM, Matt Kretschmer wrote:
I have data that is arranged in the following way:

ID Month Value
1 1 34
1 2 44
...
1 23 44
2 1 55
2 2 43
...
2 23 45
...

I would typically pivot the data using CASESTOVARS so that one ID is one
case and the values go across as 23 variables (Value.1 Value.2 ...Value.23).
My issue right now is that I have 4 million rows that need to be pivoted and
even after I tried to split the file into 5 datasets it is taking very long
to do so.

Does anybody know whether this is a linear progress in SPSS (or whatever the
technical term) in that 4 million rows take 10 times as long as 400,000 rows
or is this a nonlinear relationship?
If it is a non-linear relationship, where do things start to become slow?
Or is there another suggestion how to pivot the data?

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD


===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Art Kendall
Social Research Consultants