SPSSX Discussion

VARSTOCASES and CASESTOVARS: speed issues

Richard Ristow

VARSTOCASES and CASESTOVARS: speed issues

VARSTOCASES and CASESTOVARS are extraordinarily useful commands for
handling data with more than a simple 'flat' structure. It's worth
emphasizing, then (and worth reporting to the IBM suggestion line, as
I'm doing here), that recent exchanges on the SPSSX-L mailing list have
demonstrated that both have serious speed issues when working on large files.

A. VARSTOCASES and propagated variables(1): Having a significant
number of propagated variables (those that are copied intact from the
source to the output) can dramatically slow a large VARSTOCASES. Andy
Wheeler wrote(2),

>when the VARSTOCASES happens, it creates a lot of redundant data for
>any variable that is a not on a /MAKE command

If that's so, it's a serious infelicity, because it should be easy to
propagate variables without making those multiple copies.

B. CASESTOVARS and AUTOFIX(3): On a large dataset with a significant
number of fixed variables, AUTOFIX can slow CASESTOVARS to the point
of unusability.

Taking the default options on the two commands runs straight into
these problems. Indeed, the menu interface for CASESTOVARS gives no
indication that AUTOFIX is in effect, and no way to disable it.
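
A rough sketch in plain Python (not SPSS syntax, and only an assumption about the kind of work involved, not IBM's actual implementation) may help show why AUTOFIX could be costly. Deciding which candidate variables are "fixed" means checking whether each one is constant within every ID group, so every candidate variable in every case has to be examined. The names classify_fixed, cases, wave, sex and score are invented for the illustration.

```python
def classify_fixed(cases, id_var, candidates):
    """Return the candidate variables whose value never changes within an id group."""
    first_seen = {}          # (id value, variable name) -> first value seen
    fixed = set(candidates)  # assume fixed until a conflicting value turns up
    for case in cases:       # one full pass over the long file
        for var in candidates:
            key = (case[id_var], var)
            if key not in first_seen:
                first_seen[key] = case[var]
            elif first_seen[key] != case[var]:
                fixed.discard(var)  # the value varies inside this id group
    return fixed


# Hypothetical long-format data: two cases per id.
cases = [
    {"id": 1, "wave": 1, "sex": "F", "score": 10},
    {"id": 1, "wave": 2, "sex": "F", "score": 12},
    {"id": 2, "wave": 1, "sex": "M", "score": 9},
    {"id": 2, "wave": 2, "sex": "M", "score": 9},
]
fixed = classify_fixed(cases, "id", ["sex", "score"])
```

In this sketch the number of comparisons grows with (number of cases) x (number of candidate variables), which would at least be consistent with the slowdown reported in (3) when many variables are candidates.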

==================================
(1) VARSTOCASES and propagated variables:
Thread "varstocases extremely slow on big datasets";
initial posting,
Date: Wed, 27 Nov 2013 11:57:22 -0500
From: Michaela Stubbers <[hidden email]>
Subject: varstocases extremely slow on big datasets
To: [hidden email]
(2) and resolution,
Date: Sun, 1 Dec 2013 18:19:37 -0800
From: Andy W <[hidden email]>
Subject: Re: varstocases extremely slow on big datasets
To: [hidden email]

(3) CASESTOVARS and AUTOFIX:
Thread "Cases to vars with a very large dataset";
initial posting,
Date: Wed, 2 Apr 2014 08:10:43 -0700
From: bwyker <[hidden email]>
Subject: Cases to vars with a very large dataset
To: [hidden email]
and resolution,
Date: Thu, 3 Apr 2014 16:30:04 -0400
From: Richard Ristow <[hidden email]>
Subject: Re: Cases to vars with a very large dataset
To: [hidden email]
and
Date: Mon, 7 Apr 2014 12:03:09 -0700
From: bwyker <[hidden email]>
Subject: Re: Cases to vars with a very large dataset
To: [hidden email]

>Yes, the AUTOFIX appears to be the hiccup that was keeping the
>restructure from working.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Bruce Weaver

Re: VARSTOCASES and CASESTOVARS: speed issues

Administrator

To make life a bit easier for those who explore these archives later on (who knows, it could be me!), I'm inserting links to the posts that Richard cites. See below in the quoted material.

Richard Ristow wrote

==================================
(1) VARSTOCASES and propagated variables:
Thread "varstocases extremely slow on big datasets";
initial posting,
Date: Wed, 27 Nov 2013 11:57:22 -0500
From: Michaela Stubbers <[hidden email]>
Subject: varstocases extremely slow on big datasets

LINK: http://spssx-discussion.1045642.n5.nabble.com/varstocases-extremely-slow-on-big-datasets-tp5723346.html

(2) and resolution,
Date: Sun, 1 Dec 2013 18:19:37 -0800
From: Andy W <[hidden email]>
Subject: Re: varstocases extremely slow on big datasets

LINK: http://spssx-discussion.1045642.n5.nabble.com/varstocases-extremely-slow-on-big-datasets-tp5723346p5723380.html

(3) CASESTOVARS and AUTOFIX:
Thread "Cases to vars with a very large dataset";
initial posting,
Date: Wed, 2 Apr 2014 08:10:43 -0700
From: bwyker <[hidden email]>
Subject: Cases to vars with a very large dataset

LINK: http://spssx-discussion.1045642.n5.nabble.com/Cases-to-vars-with-a-very-large-dataset-tp5725171.html

and resolution,
Date: Thu, 3 Apr 2014 16:30:04 -0400
From: Richard Ristow <[hidden email]>
Subject: Re: Cases to vars with a very large dataset

LINK: http://spssx-discussion.1045642.n5.nabble.com/Cases-to-vars-with-a-very-large-dataset-tp5725171p5725255.html

and
Date: Mon, 7 Apr 2014 12:03:09 -0700
From: bwyker <[hidden email]>
Subject: Re: Cases to vars with a very large dataset

LINK: http://spssx-discussion.1045642.n5.nabble.com/Cases-to-vars-with-a-very-large-dataset-tp5725171p5725349.html

>Yes, the AUTOFIX appears to be the hiccup that was keeping the
>restructure from working.


--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Andy W

Re: VARSTOCASES and CASESTOVARS: speed issues

In reply to this post by Richard Ristow

For the VARSTOCASES I wasn't speaking to the underlying mechanism by which the data is actually reshaped; I was speaking to what happens in the data before and after.

So if I have a one-line dataset:

Id Var1 Var2 Var3

By near necessity, when you reshape Var1 to Var3 into one column you will duplicate the Id variable:

Id Var1
Id Var2
Id Var3

What was previously held in 4 cells is now held in 6, and the Id variable is replicated 3 times. I say "near" because you don't necessarily need to keep the Id variable, but in most applications where you want to process the data further you will probably need at least one key field to match to another table down the line.
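
The 4-cells-to-6 arithmetic can be checked with a small plain-Python sketch. This is only an analogy for the reshape, not SPSS syntax; the function names vars_to_cases and cell_count and the output column name "value" are made up for the example.

```python
def vars_to_cases(rows, make_from, keep):
    """Stack the make_from columns into one 'value' column,
    copying the keep columns onto every output row."""
    out = []
    for row in rows:
        for name in make_from:
            new_row = {k: row[k] for k in keep}  # copied once per stacked variable
            new_row["value"] = row[name]
            out.append(new_row)
    return out


def cell_count(rows):
    """Total number of cells (rows x columns) held in the table."""
    return sum(len(r) for r in rows)


wide = [{"Id": 1, "Var1": 10, "Var2": 20, "Var3": 30}]
long = vars_to_cases(wide, ["Var1", "Var2", "Var3"], keep=["Id"])
print(cell_count(wide), cell_count(long))  # 4 6
```

Every variable listed in keep is written once per stacked variable, which is where the duplication comes from.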

My initial statement was a guess at what was happening, namely that the individual had something like:

Id Var1 Var2 Var3 Junk1 ...... Junk100

So when the reshape happens, you then have:

Id Var1 Junk1 ...... Junk100
Id Var2 Junk1 ...... Junk100
Id Var3 Junk1 ...... Junk100

SPSS needs to write that redundant data to disk (no way around that), so if you have a lot of junk and a big dataset to begin with, then yes, it is going to be an expensive operation. Clearly in such a situation you should keep just the one Id variable for the VARSTOCASES and drop the other junk beforehand. Then later on merge in Junk1 ...... Junk100 via MATCH FILES if you really need them. (This is basically just a "normalize your database properly" argument under the guise of dealing with separate SPSS files.)
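
To put rough numbers on that, here is a plain-Python sketch of the two strategies. It is an analogy only (not SPSS syntax), with 1,000 hypothetical cases and 100 junk variables; vars_to_cases, cell_count, merge_back and junk_table are invented names, and merge_back stands in for a keyed table lookup of the kind MATCH FILES would do.

```python
def vars_to_cases(rows, make_from, keep):
    """Stack make_from columns into 'value'; copy keep columns onto each output row."""
    return [
        {**{k: row[k] for k in keep}, "value": row[name]}
        for row in rows
        for name in make_from
    ]


def cell_count(rows):
    return sum(len(r) for r in rows)


def merge_back(long_rows, table, key):
    """Look up the dropped variables by key, only for the rows that need them."""
    return [{**r, **table[r[key]]} for r in long_rows]


junk = [f"Junk{i}" for i in range(1, 101)]
stack = ["Var1", "Var2", "Var3"]
wide = [
    {"Id": i, "Var1": 1, "Var2": 2, "Var3": 3, **{j: 0 for j in junk}}
    for i in range(1, 1001)
]

# Strategy 1: carry the junk through the reshape.
carried = vars_to_cases(wide, stack, keep=["Id"] + junk)

# Strategy 2: reshape with the key only and store the junk once, keyed by Id.
slim = vars_to_cases(wide, stack, keep=["Id"])
junk_table = {row["Id"]: {j: row[j] for j in junk} for row in wide}

carried_cells = cell_count(carried)  # 3000 rows x 102 columns
slim_cells = cell_count(slim)        # 3000 rows x 2 columns
table_cells = sum(1 + len(v) for v in junk_table.values())  # 1000 rows x 101 columns

# Merge the junk back only where it is actually needed.
merged = merge_back(slim[:3], junk_table, "Id")
```

With these made-up sizes, carrying the junk writes 306,000 cells, while the slim reshape plus the separate junk table comes to 107,000.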

In short, I don't see this as a problem with functionality within SPSS; some individuals with large data are just having growing pains in learning how to deal with it.

Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

Jon K Peck

Re: VARSTOCASES and CASESTOVARS: speed issues

Remember also that although CtoV and VtoC are very useful, they are wrenchingly disruptive of the dataset and require a lot of work to carry out, since the dataset cannot be assumed to fit into memory. That's why I avoid using them unless absolutely necessary.

P.S., Richard,
Posting comments on the X list is not a guaranteed way to get issues to SPSS. The official channels should be used whenever possible: either Technical Support or the suggest email ([hidden email]), which appears on the SPSS Community front page.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621


David Marso

Re: VARSTOCASES and CASESTOVARS: speed issues

Administrator

In reply to this post by Andy W

"Clearly in such a situation you should just keep the one Id variable for the VARSTOCASES and drop the other junk beforehand. Then later on merge in Junk1 ...... Junk100 via MATCH FILES if you really need them. (This is basically just a "normalize your database" properly argument under the guise of dealing with separate SPSS files.)".

Great minds think alike ;-)
I don't see why so many people have so much difficulty comprehending this simple fact of data organization. Data is like clay! Shape it to fit your immediate requirements, store it in as many pieces as necessary.

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

David Marso

Re: VARSTOCASES and CASESTOVARS: speed issues

Administrator

In reply to this post by Jon K Peck

That may be, but there aren't good alternatives.
Besides, the whole point of the exercise is to "wrenchingly disrupt the dataset".
The alternatives:
V2C: XSAVE in a LOOP, or RESHAPE in MATRIX.
C2V: VECTOR after constructing an INDEX, then AGGREGATE.
In my experience CtoV and VtoC are fairly efficient.
I ran a test earlier, and V2C on 1,000,000 cases with 5 variables to 'flip' and 5 'fixed' ran very quickly (less than a second).
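
For anyone unfamiliar with the C2V alternative, the idea can be sketched in plain Python. This is one reading of the "INDEX, VECTOR, AGGREGATE" recipe expressed as an analogy, not SPSS syntax and not tested SPSS code; cases_to_vars and its arguments are hypothetical names, and the sketch assumes no id has more than width cases.

```python
def cases_to_vars(long_rows, id_var, value_var, width):
    """Spread value_var across numbered columns, one output row per id."""
    slots_by_id = {}  # id -> list of slots (plays the role of the VECTOR)
    next_index = {}   # id -> running index within the id (the constructed INDEX)
    for row in long_rows:
        key = row[id_var]
        idx = next_index.get(key, 0)
        next_index[key] = idx + 1
        slots = slots_by_id.setdefault(key, [None] * width)
        slots[idx] = row[value_var]  # raises IndexError if an id has more than width cases
    # Collapse to one row per id (the AGGREGATE step).
    return [
        {id_var: key, **{f"{value_var}{i + 1}": v for i, v in enumerate(slots)}}
        for key, slots in slots_by_id.items()
    ]


long = [
    {"Id": 1, "value": 10},
    {"Id": 1, "value": 20},
    {"Id": 1, "value": 30},
    {"Id": 2, "value": 40},
    {"Id": 2, "value": 50},
]
wide = cases_to_vars(long, "Id", "value", width=3)
```

Ids with fewer than width cases are padded with None, much as unfilled columns would be left missing after a restructure.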
