VARSTOCASES and CASESTOVARS are extraordinarily useful commands for
handling data with more than a simple 'flat' structure. It's worth emphasizing, then (and worth reporting to the IBM suggestion line, as I'm doing here), that recent exchanges on the SPSSX-L mailing list have demonstrated that both have serious speed issues when working on large files.

A. VARSTOCASES and propagated variables(1): Having a significant number of propagated variables (those that are copied intact from the source to the output) can dramatically slow a large VARSTOCASES. Andy Wheeler wrote(2),

>when the VARSTOCASES happens, it creates a lot of redundant data for
>any variable that is not on a /MAKE command

If that's so, it's a serious infelicity, because it should be easy to propagate variables without making those multiple copies.

B. CASESTOVARS and AUTOFIX(3): On a large dataset with a significant number of fixed variables, AUTOFIX can slow CASESTOVARS to the point of unusability. Taking the default options on the two commands runs straight into these problems. Indeed, the menu interface for CASESTOVARS gives no indication that AUTOFIX is in effect, and no way to disable it. (A sketch of the syntax workaround appears after the notes below.)

==================================
(1) VARSTOCASES and propagated variables: thread "varstocases extremely slow on big datasets"; initial posting by Michaela Stubbers <[hidden email]>, Wed, 27 Nov 2013 11:57:22 -0500.
(2) Resolution by Andy W <[hidden email]>, "Re: varstocases extremely slow on big datasets", Sun, 1 Dec 2013 18:19:37 -0800.
(3) CASESTOVARS and AUTOFIX: thread "Cases to vars with a very large dataset"; initial posting by bwyker <[hidden email]>, Wed, 2 Apr 2014 08:10:43 -0700; resolution by Richard Ristow <[hidden email]>, Thu, 3 Apr 2014 16:30:04 -0400; and confirmation by bwyker, Mon, 7 Apr 2014 12:03:09 -0700:

>Yes, the AUTOFIX appears to be the hiccup that was keeping the
>restructure from working.
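A minimal sketch of that workaround, on my reading of the thread: declare the fixed variables explicitly and turn AUTOFIX off, so CASESTOVARS need not scan the whole file to classify them. The names here (Id, Wave, Fixed1 TO Fixed5) are hypothetical.

* Hedged sketch: name the fixed variables yourself and disable AUTOFIX.
SORT CASES BY Id Wave.
CASESTOVARS
  /ID = Id
  /INDEX = Wave
  /FIXED = Fixed1 TO Fixed5
  /AUTOFIX = NO.

With /AUTOFIX = NO, any variable not named on /ID, /INDEX, or /FIXED is restructured; and if a variable declared fixed actually varies within an Id, only one value per Id survives, so the /FIXED list deserves care.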
To make life a bit easier for those who explore these archives later on (who knows, it could be me!), I'm inserting links to the posts that Richard cites. See below in the quoted material.
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."
In reply to this post by Richard Ristow
For the VARSTOCASES, I wasn't speaking about the underlying mechanism by which the data is actually reshaped; I was speaking about what happens to the data before and after.
So if I have a one-line dataset:

Id Var1 Var2 Var3

then by near necessity, when you reshape Var1 to Var3 into one column, you will duplicate the Id variable:

Id Var1
Id Var2
Id Var3

What was previously held in 4 cells is now held in 6, and the Id variable is replicated 3 times. I say "near" because you don't necessarily need to keep the Id variable, but in most applications in which you want to further process the data, you will probably need at least one key field to match to another table further down the line.

The point I was making with my initial statement was that my guess as to what was happening was that the individual had something like:

Id Var1 Var2 Var3 Junk1 ...... Junk100

So when the reshape happens, you then have:

Id Var1 Junk1 ...... Junk100
Id Var2 Junk1 ...... Junk100
Id Var3 Junk1 ...... Junk100

SPSS needs to write that redundant data to disk (no way around that), so if you have a lot of junk and a big dataset to begin with, then yes, it is going to be an expensive operation. Clearly in such a situation you should just keep the one Id variable for the VARSTOCASES and drop the other junk beforehand. Then later on, merge in Junk1 ...... Junk100 via MATCH FILES if you really need them; a sketch follows below. (This is basically just a "normalize your database properly" argument under the guise of dealing with separate SPSS files.)

In short, I don't see this as a problem with functionality within SPSS; it's just that some individuals with large data are having growing pains in learning how to deal with it.
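A minimal sketch of that drop-then-merge workflow, with hypothetical file and variable names; 'original.sav' is assumed to hold one case per Id and to be sorted by Id:

* Keep only the key during the restructure; Junk1 to Junk100 stay behind.
GET FILE = 'original.sav'.
VARSTOCASES
  /MAKE Value FROM Var1 Var2 Var3
  /INDEX = VarNum
  /KEEP = Id.
* Merge the junk back in by Id, only at the point where it is needed.
SORT CASES BY Id.
MATCH FILES
  /FILE = *
  /TABLE = 'original.sav'
  /BY Id.
EXECUTE.

The /TABLE lookup attaches each Id's junk variables to the long-format rows without multiplying cases during the restructure itself, which is exactly the normalize-then-join pattern described above.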
Remember also that although CtoV and VtoC
are very useful, they are wrenchingly disruptive of the dataset and require
a lot of work to carry out, since the dataset cannot be assumed to fit
into memory. That's why I avoid using them unless absolutely necessary.
P.S. Richard, posting comments on the SPSSX-L list is not a guaranteed way to get issues to SPSS. The official channels should be used whenever possible: either Technical Support or the suggest email ([hidden email]), which appears on the SPSS Community front page.

Jon Peck (no "h"), aka Kim
Senior Software Engineer, IBM
phone: 720-342-5621
In reply to this post by Andy W
"Clearly in such a situation you should just keep the one Id variable for the VARSTOCASES and drop the other junk beforehand. Then later on merge in Junk1 ...... Junk100 via MATCH FILES if you really need them. (This is basically just a "normalize your database" properly argument under the guise of dealing with separate SPSS files.)".
Great minds think alike ;-) I don't see why so many people have so much difficulty comprehending this simple fact of data organization. Data is like clay! Shape it to fit your immediate requirements, store it in as many pieces as necessary.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?
In reply to this post by Jon K Peck
That may be, but there aren't good alternatives.
Besides, the whole point of the exercise is to "wrenchingly disrupt" the dataset. The roll-your-own alternatives: for V2C, XSAVE in a LOOP, or RESHAPE in MATRIX; for C2V, VECTOR after constructing an INDEX, then AGGREGATE (sketches below). In my experience, CtoV and VtoC are fairly efficient: I ran a test earlier, and V2C on 1,000,000 cases with 5 variables to 'flip' and 5 'fixed' ran very quickly (less than a second).
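For concreteness, minimal sketches of those roll-your-own alternatives, with hypothetical names; they assume a numeric Id, Var1 TO Var3 contiguous in the file, and at most 3 records per Id in the C2V direction:

* V2C: XSAVE inside a LOOP writes one long-format case per variable.
VECTOR v = Var1 TO Var3.
LOOP #i = 1 TO 3.
  COMPUTE VarNum = #i.
  COMPUTE Value = v(#i).
  XSAVE OUTFILE = 'long.sav'
    /KEEP = Id VarNum Value.
END LOOP.
EXECUTE.

* C2V: build a within-Id index, spread Value across a vector, then AGGREGATE.
GET FILE = 'long.sav'.
SORT CASES BY Id.
COMPUTE idx = 1.
IF (Id = LAG(Id)) idx = LAG(idx) + 1.
VECTOR Val(3).
COMPUTE Val(idx) = Value.
AGGREGATE OUTFILE = *
  /BREAK = Id
  /Val1 TO Val3 = MAX(Val1 TO Val3).

These are sketches of the technique, not drop-in replacements: VARSTOCASES and CASESTOVARS handle the bookkeeping (index values, variable names and labels, varying record counts per Id) that these leave to the user.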