Cases to vars with a very large dataset

classic Classic list List threaded Threaded
28 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Re: Cases to vars with a very large dataset

Richard Ristow
At 04:01 PM 4/3/2014, bwyker wrote:

>The syntax is pretty simple. Just a restructure, case to vars:
>
>CASESTOVARS
>   /ID=ID
>   /GROUPBY=VARIABLE.
>
>>Useful info here would be to know the actual number of unique ID's, the max
>>number of Rx values and dates.  Is the file PRESORTED by ID and Date?
>
>I understand there are going to be A LOT of variables. At most,
>probably somewhere around 100  - 200 IDs. Most of the 45 variables
>are actually the same for all IDs...

This is a shot in the dark, but the next thing I'd try is

CASESTOVARS
   /ID=ID
   /GROUPBY = VARIABLE
   /FIXED   = <list of variables you know don't vary>
   /AUTOFIX = NO.

AUTOFIX requires checking each variable in the file to see whether it
varies or not, and I can see it taking a lot of time in a large file.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Cases to vars with a very large dataset

Mike
In reply to this post by bwyker
Well, it seems that anything you do will take a significant amount of
time.  However, if you chop up the file into different files based on
number of ids, you can run several analyses concurrently if you
have the additional PCs to use.  Depending upon what resources
are available to you, this type of distributed processing could save
a significant amount of time.

-Mike Palij
New York University
[hidden email]


----- Original Message -----
From: "bwyker" <[hidden email]>
To: <[hidden email]>
Sent: Thursday, April 03, 2014 4:04 PM
Subject: Re: Cases to vars with a very large dataset


> Thanks Mike. I have thought of this. The trouble is, anything I do
> with this
> data take a long time, including cutting it up into chunks. I may be
> forced
> to do this, but I'm hoping there is a way to do it without having to
> spend
> the time slicing and dicing datasets.
>
>
>
> --
> View this message in context:
> http://spssx-discussion.1045642.n5.nabble.com/Cases-to-vars-with-a-very-large-dataset-tp5725171p5725248.html
> Sent from the SPSSX Discussion mailing list archive at Nabble.com.
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body text except
> the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Cases to vars with a very large dataset

bwyker
In reply to this post by David Marso
Thanks for the assistance. I took a sample of a few thousand cases and the maximum number of rows per ID is 103. So, using the previous example, the variable list for this sample should look like this after the transpose:

ID          RXDate1          RXDate2    ...      RXDate103      RXCode1          RXCode2     ...     RXCode103              

So, it sounds like what you and others are suggesting makes sense. If I must do this transpose, it's best to do it in pieces and then stitch it back together. I think I'll take out the static variables and then see what's left.

Thanks again for the help.

Reply | Threaded
Open this post in threaded view
|

Re: Cases to vars with a very large dataset

bwyker
In reply to this post by Richard Ristow
Ahh...I did not know about the Autofix. This is helpful, thanks!
Reply | Threaded
Open this post in threaded view
|

Re: Cases to vars with a very large dataset

Richard Ristow
At 01:47 PM 4/4/2014, bwyker wrote:

>Ahh...I did not know about the Autofix. This is helpful, thanks!

You'd written,
>I'm struggling restructure an enormous file in SPSS. The main file
>is a merge of data from 14 states, totaling over 411 million rows of
>data (45 variables) in a file that is 124 GB in size.  I also have
>these data separated by state and year (2007 and 2008), which brings
>the file down to about 12 GB.

I'll be very interested in how it goes, if you try one of the 12GB
files with AUTOFIX = NO.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Cases to vars with a very large dataset

bwyker
Richard,

Yes, the AUTOFIX appears to be the hiccup that was keeping the restructure from working. I ran the CASESTOVARS command with the fixed variables defined and I had a transposed dataset in about an hour. I can't thank you enough - huge help!

I did end up with a set of 840 variables for each transposed (e.g., var1.1 var1.2...var1.804). This sounds like a data cleaning issue, but, overall, I think I'm in good shape.

Thanks again,
Brett
Reply | Threaded
Open this post in threaded view
|

Re: Cases to vars with a very large dataset

Richard Ristow
At 03:03 PM 4/7/2014, bwyker wrote:

>Yes, the AUTOFIX appears to be the hiccup that was keeping the
>restructure from working. I ran the CASESTOVARS command with the
>fixed variables defined and I had a transposed dataset in about an
>hour. I can't thank you enough - huge help!

Great! I'm really glad.

>I did end up with a set of 840 variables for each transposed (e.g.,
>var1.1 var1.2...var1.804). This sounds like a data cleaning issue,
>but, overall, I think I'm in good shape.

This is a place where it's important to look at the underlying
meaning of your input variables.

 From your original posting, it seems clear that you have the ID;
multiple pairs of RXDate and RXCode for each ID; and 40-some other variables.

Your variables should fall into two categories, and you have a
data-structure problem if they don't:
. Those that are constant for any value of ID (which should be named
on the FIXED= list for CASESTOVARS)
. Those that pertain to the prescriptions.

It should be clear from each variable's definition which category it
falls into; if it isn't, or if there are any variables that aren't in
one of those categories, there's a problem with understanding the
meaning of your data, not with SPSS coding.

And if any variables that should be FIXED within ID turn out not to
be so, you're right, it's a data-cleaning issue.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Cases to vars with a very large dataset

David Marso
Administrator
"And if any variables that should be FIXED within ID turn out not to
be so, you're right, it's a data-cleaning issue. "

Which will be horrible without a few shortcuts/techniques.
For variables which should be fixed within ID consider AGGREGATE breaking on ID and getting MIN and MAX for each of them.  If these are not equal within a variable then you have a data issue.

DATASET DECLARE agg.
AGGREGATE OUTFILE agg
 / BREAK ID
 / <fixed vars min list> = MIN(fixed var list)
 / <fixed vars max list> = MAX(fixed var list)
 / COUNT=N.
DATASE ACTIVATE agg.
DO REPEAT min=<fixed vars min list>/max=<fixed vars max list>/var=<list of quoted fixed varnames>.
DO IF (min NE max).
PRINT / ID var min max.
END IF.
EXECUTE. /* or other data passing command */


Richard Ristow wrote
At 03:03 PM 4/7/2014, bwyker wrote:

>Yes, the AUTOFIX appears to be the hiccup that was keeping the
>restructure from working. I ran the CASESTOVARS command with the
>fixed variables defined and I had a transposed dataset in about an
>hour. I can't thank you enough - huge help!

Great! I'm really glad.

>I did end up with a set of 840 variables for each transposed (e.g.,
>var1.1 var1.2...var1.804). This sounds like a data cleaning issue,
>but, overall, I think I'm in good shape.

This is a place where it's important to look at the underlying
meaning of your input variables.

 From your original posting, it seems clear that you have the ID;
multiple pairs of RXDate and RXCode for each ID; and 40-some other variables.

Your variables should fall into two categories, and you have a
data-structure problem if they don't:
. Those that are constant for any value of ID (which should be named
on the FIXED= list for CASESTOVARS)
. Those that pertain to the prescriptions.

It should be clear from each variable's definition which category it
falls into; if it isn't, or if there are any variables that aren't in
one of those categories, there's a problem with understanding the
meaning of your data, not with SPSS coding.

And if any variables that should be FIXED within ID turn out not to
be so, you're right, it's a data-cleaning issue.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
12