SPSSX Discussion

Pivot via CASESTOVARS

Classic

List

Threaded

8 messages Options

Eric Black

Pivot via CASESTOVARS

I have data that is arranged in the following way:

ID Month Value
1 1 34
1 2 44
...
1 23 44
2 1 55
2 2 43
...
2 23 45
...

I would typically pivot the data using CASESTOVARS so that one ID is one
case and the values go across as 23 variables (Value.1 Value.2 ...Value.23).
My issue right now is that I have 4 million rows that need to be pivoted and
even after I tried to split the file into 5 datasets it is taking very long
to do so.

Does anybody know whether this is a linear progress in SPSS (or whatever the
technical term) in that 4 million rows take 10 times as long as 400,000 rows
or is this a nonlinear relationship?
If it is a non-linear relationship, where do things start to become slow?
Or is there another suggestion how to pivot the data?

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Maguin, Eugene

Re: Pivot via CASESTOVARS

I don't know that this method will avoid the problem you are having with VtoC and I don't know that it will be faster but it is the old alternative, apropos of David's comments this morning. There's always a chance that there might be syntax problems, so it'd be good to see the syntax.

numeric Value(23).
Do repeat v=value1 to value23/m=1 to 23.
+ if (month eq m) v=value.
End repeat.

* I specify 'presorted' because I assume your file is presorted. If not, take it out. Run time will go up but I
* have no idea how much.

Aggregate outfile=*/presorted/break=id/value1 to value23=first(value1 to value23).
Execute.

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Matt Kretschmer
Sent: Friday, September 28, 2012 3:16 PM
To: [hidden email]
Subject: Pivot via CASESTOVARS

I have data that is arranged in the following way:

ID Month Value
1 1 34
1 2 44
...
1 23 44
2 1 55
2 2 43
...
2 23 45
...

I would typically pivot the data using CASESTOVARS so that one ID is one case and the values go across as 23 variables (Value.1 Value.2 ...Value.23).
My issue right now is that I have 4 million rows that need to be pivoted and even after I tried to split the file into 5 datasets it is taking very long to do so.

Does anybody know whether this is a linear progress in SPSS (or whatever the technical term) in that 4 million rows take 10 times as long as 400,000 rows or is this a nonlinear relationship?
If it is a non-linear relationship, where do things start to become slow?
Or is there another suggestion how to pivot the data?

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

David Marso

Re: Pivot via CASESTOVARS

Administrator

In reply to this post by Eric Black

My guess is it would be linear. What do you mean by very long?
One guess is that SPSS needs to scan the data to determine the maximum month and then construct the dictionary.

Not sure if this will be faster but alternatively (old SKOOL).

VECTOR value_ (23).
COMPUTE value_(Month)=Value.
AGGREGATE OUTFILE * / PRESORTED/ BREAK ID / Value.1 TO Value.23=MAX(Value_1 TO Value_23).

Perhaps report back to let us know if it is faster or slower?

matti wrote

I have data that is arranged in the following way:

ID Month Value
1 1 34
1 2 44
...
1 23 44
2 1 55
2 2 43
...
2 23 45
...

I would typically pivot the data using CASESTOVARS so that one ID is one
case and the values go across as 23 variables (Value.1 Value.2 ...Value.23).
My issue right now is that I have 4 million rows that need to be pivoted and
even after I tried to split the file into 5 datasets it is taking very long
to do so.

Does anybody know whether this is a linear progress in SPSS (or whatever the
technical term) in that 4 million rows take 10 times as long as 400,000 rows
or is this a nonlinear relationship?
If it is a non-linear relationship, where do things start to become slow?
Or is there another suggestion how to pivot the data?

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

David Marso

Re: Pivot via CASESTOVARS

Administrator

In reply to this post by Maguin, Eugene

Much better to use the VECTOR approach posted below for reasons of efficiency.
i.e. Using the month as a 'pointer' into the vector.
---
**This first line will issue error ** !.
numeric Value(23).
** You are doing 23 checks to place a single value **.
Do repeat v=value1 to value23/m=1 to 23.
+ if (month eq m) v=value.
End repeat.

Maguin, Eugene wrote

I don't know that this method will avoid the problem you are having with VtoC and I don't know that it will be faster but it is the old alternative, apropos of David's comments this morning. There's always a chance that there might be syntax problems, so it'd be good to see the syntax.

numeric Value(23).
Do repeat v=value1 to value23/m=1 to 23.
+ if (month eq m) v=value.
End repeat.

* I specify 'presorted' because I assume your file is presorted. If not, take it out. Run time will go up but I
* have no idea how much.

Aggregate outfile=*/presorted/break=id/value1 to value23=first(value1 to value23).
Execute.

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Matt Kretschmer
Sent: Friday, September 28, 2012 3:16 PM
To: [hidden email]
Subject: Pivot via CASESTOVARS

I have data that is arranged in the following way:

ID Month Value
1 1 34
1 2 44
...
1 23 44
2 1 55
2 2 43
...
2 23 45
...

I would typically pivot the data using CASESTOVARS so that one ID is one case and the values go across as 23 variables (Value.1 Value.2 ...Value.23).
My issue right now is that I have 4 million rows that need to be pivoted and even after I tried to split the file into 5 datasets it is taking very long to do so.

Does anybody know whether this is a linear progress in SPSS (or whatever the technical term) in that 4 million rows take 10 times as long as 400,000 rows or is this a nonlinear relationship?
If it is a non-linear relationship, where do things start to become slow?
Or is there another suggestion how to pivot the data?

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Eric Black

Re: Pivot via CASESTOVARS

In reply to this post by Eric Black

Thanks so much. Let me check how that works. Right now, 1/5 of the data is
still running after a couple of hours.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Eric Black

Re: Pivot via CASESTOVARS

In reply to this post by Eric Black

Much faster, not even a minute.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

David Marso

Re: Pivot via CASESTOVARS

Administrator

In reply to this post by Eric Black

I simulated some 5 million rows and both the C2V and the V+A ran in less than 1 minute.
Something must be weird on your system. Are you reading the file over a network?
--
CASESTOVARS
/ID=id
/INDEX=month
/GROUPBY=VARIABLE.

Resources Processor Time 00:00:12.54
Elapsed Time 00:00:13.38

matti wrote

Thanks so much. Let me check how that works. Right now, 1/5 of the data is
still running after a couple of hours.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Art Kendall

Re: Pivot via CASESTOVARS

In reply to this post by Eric Black

Are you perhaps operating on a remote server?
These times were produced by the syntax below.
Processing Statistics
Cases In    4000000
Cases Out    160000
Cases In/Cases Out    25.0
Variables In    3
Variables Out    26
Index Values    25
{use fixed format to see clearly]
'start casesto vars'   3768952:43:13.379'end timer'    3768952:43:18.698

This syntax
new file.
input program.
   loop id = 1 to 160000.
      leave id.
      loop month = 1 to 25.
         compute myvalue = rnd(rv.normal(50,10)).
         end case.
      end loop.
   end loop.
   end file.
end input program.
DO IF $CASENUM=1.
   PRINT /"'start generation'"   $time (time20.3).
END IF.
execute.
do IF $CASENUM=1.
   PRINT /"'start casesto vars'"   $time (time20.3).
END IF.
CASESTOVARS
/ID=id
/INDEX=month
/GROUPBY=VARIABLE.
DO IF $CASENUM=1.
   PRINT /"'end timer'"   $time (time20.3).
END IF.
execute.

Art Kendall
Social Research Consultants

On 9/28/2012 3:15 PM, Matt Kretschmer wrote:

I have data that is arranged in the following way:

ID Month Value
1 1 34
1 2 44
...
1 23 44
2 1 55
2 2 43
...
2 23 45
...

I would typically pivot the data using CASESTOVARS so that one ID is one
case and the values go across as 23 variables (Value.1 Value.2 ...Value.23).
My issue right now is that I have 4 million rows that need to be pivoted and
even after I tried to split the file into 5 datasets it is taking very long
to do so.

Does anybody know whether this is a linear progress in SPSS (or whatever the
technical term) in that 4 million rows take 10 times as long as 400,000 rows
or is this a nonlinear relationship?
If it is a non-linear relationship, where do things start to become slow?
Or is there another suggestion how to pivot the data?

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Art Kendall
Social Research Consultants