Cases to vars with a very large dataset

Cases to vars with a very large dataset

bwyker
Hello all,

I've been struggling for months to restructure an enormous file in SPSS and would greatly appreciate any help. The main file is a merge of data from 14 states, totaling over 411 million rows of data (45 variables) in a file that is 124 GB in size.  I also have these data separated by state and year (2007 and 2008), which brings the file down to about 12 GB.

In each file, each participant has several rows of data, which looks something like this:

ID           RXDate                              RXCode…..
1001      1/1/2007                            1234
1001      7/12/2007                          5678
9999      2/12/2007                          1369
9999      8/1/2007                            2468
9999      9/20/2007                          1234

I’m trying to perform a relatively simple task of transposing the rows of data into columns, grouped by ID. It should look something like this:

ID          RXDate1          RXDate2          RXDate3          RXCode1          RXCode2          RXCode3              
1001      1/1/2007         7/12/2007       .                     1234                5678                .                    
9999      2/12/2007       8/1/2007         9/20/2007      1369                 2468                1234                    

I’ve run the command to do this in SPSS several times. It executes the command, but never ends – both on the large 124 GB file and the smaller 12 GB files. I literally had the large file running around the clock for more than a month without completing.

It's not looking promising for SPSS Statistics, but maybe I'm missing something. I also have SPSS Modeler. I'm not that familiar with it, but it seems some combination of the aggregate, transpose and/or restructure nodes might make this work.

Any suggestions?

Thanks in advance.
Brett

Re: Cases to vars with a very large dataset

Bruce Weaver
Administrator
Hello Brett.  Why do you want one row per ID?  What type of analysis (or analyses) are you planning on doing?  I ask, because perhaps there's a way to do them with the long file format.  



--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: Cases to vars with a very large dataset

bwyker
Hi Bruce,
Thanks for the reply. Good point. Unfortunately, I do not know what analyses will be run. I've just been given the task of merging and structuring the data.
Brett

Re: Cases to vars with a very large dataset

Andy W
In reply to this post by bwyker
1 - Are you sure the code is correct on a subset of smaller data?

2 - Have you dropped all unneeded variables from the dataset before the reshaping?

FYI, since you are introducing missing cells in the wide format, the data size will be larger than in the long format - potentially much larger - depending on the maximum and average number of repeats each individual has.
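One way to measure that blow-up before committing to the reshape (a sketch only; it assumes the long file is the active dataset and ID is the case identifier):

```spss
* Count rows per ID into a separate dataset, then summarize the counts.
DATASET DECLARE counts.
AGGREGATE OUTFILE=counts
  /BREAK=ID
  /nrec=N.
DATASET ACTIVATE counts.
* Mean and maximum rows per ID tell you how wide the reshaped file would be.
DESCRIPTIVES VARIABLES=nrec /STATISTICS=MEAN MAX.
```

The maximum of nrec is the number of RXDate/RXCode slots the wide file would need per variable.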
Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

Re: Cases to vars with a very large dataset

David Marso
Administrator
In reply to this post by bwyker
Please read the TIPS on posting questions to SPSS list topic I bumped yesterday.
In this case you should have posted your syntax.
Useful info here would be to know the actual number of unique IDs, and the max number of Rx values and dates.  Is the file PRESORTED by ID and Date?
Maybe your hardware isn't up to the file size?  How much free disk space?
Maybe break into several sets of IDs, run task, then ADD files?
Maybe see VECTOR/AGGREGATE approach (in the FM).
IMNSHO denormalizing the data IS a stupid thing to do.
Are you sure the data base people on the project aren't rectocranially inverted?
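The VECTOR/AGGREGATE approach David mentions can be sketched roughly like this (my sketch, not tested at this scale; it assumes 10 slots per ID suffice - substitute the true maximum from a record count):

```spss
* Number the rows within each ID (file must be sorted by ID and date).
SORT CASES BY ID RXDate.
COMPUTE seq = 1.
IF (ID = LAG(ID)) seq = LAG(seq) + 1.
* Scatter each row's values into its own slot of a wide vector.
VECTOR date(10) code(10).
COMPUTE date(seq) = RXDate.
COMPUTE code(seq) = RXCode.
* Collapse to one row per ID; MAX picks the single filled slot per column.
AGGREGATE OUTFILE=*
  /BREAK=ID
  /date1 TO date10 = MAX(date1 TO date10)
  /code1 TO code10 = MAX(code1 TO code10).
```

This trades CASESTOVARS for a single AGGREGATE pass, which can behave better on very large files.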

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: Cases to vars with a very large dataset

hillel vardi
In reply to this post by bwyker
Shalom

When doing the transposing that you are attempting, you have to remember that the new table will have your 45 variables (codes) times the maximum number of lines per ID. If one of your IDs has 100 lines and the average is 5 lines per ID, your new table will be almost 20 times bigger. I think you should check the maximum number of lines per ID, and maybe drop cases with too many lines.

Hillel Vardi





=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Cases to vars with a very large dataset

JosephYoungblood
In reply to this post by bwyker
Brett,

I first started using SPSS in 1995, and even with all the hardware
limitations at the time, I was able to load and analyze data from a 10-year
longitudinal study.

If all you need to do is shrink the size of the data set, I'd recommend
looking at the data "Type" for each variable.  The amount of space SPSS
allocates to each variable is based on how the user defines it (e.g., Numeric,
Dot, Comma or string).  Also, as Bruce mentioned, think about the analyses
and results.  Not every analysis will need or require ALL VARS, so perhaps
you could determine a set of analyses that are not related, even if some
share the same variables, and strip out all data not relevant to each
particular analysis.

Good luck,

Joseph


Re: Cases to vars with a very large dataset

Mike
In reply to this post by bwyker
I'd suggest getting a count of records for each ID and splitting the
data into different files based on the number of records (e.g.,
all IDs that have only one record would go into one file and would only
need their variable names changed).  I assume that SPSS chokes when it
has a lot of records per ID, so it would make sense to run those
separately from the rest. Then merge the files together.
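Mike's split could be sketched like this (hypothetical file names; it assumes a per-ID record count nrec has already been added, e.g. via AGGREGATE MODE=ADDVARIABLES /BREAK=ID /nrec=N):

```spss
* IDs with a single record need no restructuring, only renaming.
TEMPORARY.
SELECT IF (nrec = 1).
SAVE OUTFILE='singles.sav'.
* Restructure the multi-record IDs separately, then ADD FILES the pieces.
TEMPORARY.
SELECT IF (nrec > 1).
SAVE OUTFILE='multiples.sav'.
```

TEMPORARY limits each SELECT IF to the following SAVE, so the full long file stays intact between steps.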

-Mike Palij
New York University
[hidden email]




Re: Cases to vars with a very large dataset

David Marso
Administrator
In reply to this post by JosephYoungblood
"The amount of space SPSS allocates
each variable is based on how the user defines each variable (e.g., Numeric,
Dot, Comma or string)."

Actually, all NUMERIC fields take the same amount of space (8 bytes): Dot, Comma, Date, F, N, etc.
Strings take a multiple of 8 bytes [(A9) actually takes 16 bytes].  Maybe double or triple that due to Unicode; I'm sure Jon will chime in to correct me if that is incorrect.
BUT my point is that all numeric values are created equal WRT memory requirements!


Re: Cases to vars with a very large dataset

JosephYoungblood
In reply to this post by bwyker

Thank you for clarifying, David. While I admit I don't know why this is
the case, I have been able to reduce the total file size of a very large
data set by as much as 20% by changing data types.
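Given David's point that all numerics are 8 bytes, the reduction Joseph describes most plausibly comes from trimming over-wide string variables. A sketch of that kind of change (hypothetical variable name and widths; check the longest actual value before narrowing):

```spss
* Narrow an over-declared string, e.g. from A20 down to A8.
ALTER TYPE RXCode (A8).
* Or, if the codes are purely numeric, convert the string to numeric (8 bytes).
ALTER TYPE RXCode (F8.0).
```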

Cheers,

Joseph


Re: Cases to vars with a very large dataset

Jon K Peck
In reply to this post by David Marso
David is right that all numerics are 8 bytes, and string space is allocated in multiples of 8 bytes.  But the A size is in bytes, so An takes the same number of bytes whether in Unicode or code page mode.

sav file compression depends on the actual contents, with small integer values and blank strings heavily compressed, but that does not affect memory usage when passing the data.
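Related, though not raised in the thread: newer Statistics versions (21 and later, if I recall correctly) can also write zlib-compressed .zsav files, which are often considerably smaller on disk than compressed .sav:

```spss
* Write the active dataset as a zlib-compressed .zsav (hypothetical file name).
SAVE OUTFILE='bigfile.zsav' /ZCOMPRESSED.
```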


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621





Re: Cases to vars with a very large dataset

hillel vardi
In reply to this post by hillel vardi
Shalom

After some more thinking, it looks to me that you are using the date
variable as the index variable in CASESTOVARS. In that case you will
have your 45 variables times the number of days from the minimum date to
the maximum date. If your average is 5 lines per ID and you have data
for 5 years, then you are trying to create a table some 365 times bigger
than your original table - a table of some petabytes. Maybe you don't
have the memory for such a big table.

Hillel Vardi


Re: Cases to vars with a very large dataset

David Marso
Administrator
"It looks to me that you are using the date
variable as the index variable in casesstovars   "

I don't see where you get this idea.
My only advice to OP is to confront whoever is suggesting this denormalized format and demand some sort of justification.  Review downstream analytical requirements before building out this monster.
In any case the show stopper is the possibly large number of records for possibly a small number of people which will blow up the problem size.  Note also that SPSS needs to evaluate the entire dataset to create the dictionary before building out the resulting dataset.  Is it necessary to have all variables in the file per case or is it ok to have them in separate data files?  So, moving forward?  What needs to be done with this data?  Just because some manager/hack decided that all the data needs to be on one record per ID doesn't mean that is a best or even good solution (especially if it has been running for a month ;-).
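For what it's worth, if the reshape does have to happen, an explicit within-ID sequence number as the CASESTOVARS index (a sketch; assumes the hardware can hold the result at all) guarantees the width is driven by the maximum number of rows per ID, not by anything date-related:

```spss
* Build a 1, 2, 3... counter within each ID.
SORT CASES BY ID RXDate.
COMPUTE seq = 1.
IF (ID = LAG(ID)) seq = LAG(seq) + 1.
FORMATS seq (F4.0).
* Use the counter, not the date, as the index.
CASESTOVARS
  /ID=ID
  /INDEX=seq
  /GROUPBY=VARIABLE.
```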


Re: Cases to vars with a very large dataset

Richard Ristow
In reply to this post by bwyker

A question I haven't seen asked, in the responses to your posting:

You write of 45 variables in the file, but your problem depends on only
three: ID, RXDate, RXCode. What are the other variables -
other prescriptions, or something different? What do you want to do
with them when you've gone from many records for each ID to only one?


Re: Cases to vars with a very large dataset

Andreas Sprenger
In reply to this post by bwyker
Coming back to Brett's primary question: how to reorganize the data (cases to vars)? For a very large dataset I would use different software (e.g. Matlab) to prepare the data, then bring it into SPSS for analysis. I would guess 20 lines of Matlab would be sufficient to read, transpose and write the data.

Andreas
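For what it's worth, the read-transpose-write pass Andreas describes can be sketched in any language with a streaming CSV reader; here is a Python standard-library sketch rather than Matlab. It is an illustration, not a drop-in solution: it assumes a CSV with only the three example columns (ID, RXDate, RXCode), already sorted by ID, and that the maximum records per ID is known from a prior counting pass; the function name and column layout are my own.

```python
import csv
from itertools import groupby
from operator import itemgetter

def cases_to_vars(in_path, out_path, max_records):
    """Stream a long CSV (sorted by ID) and write one wide row per ID.

    Because the input is sorted, each ID's records arrive together and
    only one ID's rows are ever held in memory -- unlike loading the
    whole 124 GB file at once.
    """
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.reader(fin)
        next(reader)  # skip header: ID, RXDate, RXCode
        writer = csv.writer(fout)
        header = ["ID"]
        for i in range(1, max_records + 1):
            header += ["RXDate%d" % i, "RXCode%d" % i]
        writer.writerow(header)
        # groupby only works here because the file is presorted by ID
        for pid, rows in groupby(reader, key=itemgetter(0)):
            wide = [pid]
            for _, date, code in rows:
                wide += [date, code]
            wide += [""] * (len(header) - len(wide))  # pad IDs with fewer records
            writer.writerow(wide)
```

Note this pairs each date with its code (RXDate1, RXCode1, RXDate2, ...) rather than grouping all dates before all codes as CASESTOVARS /GROUPBY=VARIABLE does; reordering columns afterwards is cheap once the file is wide.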

Re: Cases to vars with a very large dataset

David Marso
Administrator
Really?  You base this on what amount of experience with SPSS?
I'll see your BS and raise you two cow pies.

Andreas Sprenger wrote
Coming back to Brett's primary question: how to reorganize data (cases to vars)? In case of a very large dataset I would use a different software (e.g. Matlab) and prepare the data for SPSS to be analyzed. I would guess 20 lines in Matlab will be sufficient to read, transpose and write the data.

Andreas

Re: Cases to vars with a very large dataset

bwyker
In reply to this post by David Marso
> Please read the TIPS on posting questions to SPSS list topic I bumped yesterday.

Will do. Just found this forum this morning. Using for the first time, apologies.

> In this case you should have posted your syntax.

The syntax is pretty simple. Just a restructure, case to vars:

CASESTOVARS
  /ID=ID
  /GROUPBY=VARIABLE.

>Useful info here would be to know the actual number of unique ID's, the max number of Rx values and dates.  Is the file PRESORTED by ID and Date?

Sorry I did not provide this info. It varies, but the short answer is I understand there are going to be A LOT of variables. At most, probably somewhere around 100-200 IDs. Most of the 45 variables are actually the same for all IDs (e.g. state of the respondent), so not every variable will be duplicated by the number of IDs. At any rate, as I understand it, there isn't a limit to the number of variables SPSS will include in a dataset, even though there are other reasons not to have thousands of variables in a dataset.

> Maybe your hardware isn't up to the file size?  How much free disk space?

I've tried on several devices and have quite a bit of space. I don't believe this is a hardware problem.

> Maybe break into several sets of IDs, run task, then ADD files?

I've definitely considered the brute force option as a last resort. However, this would take a considerable amount of time and manpower.

> Maybe see VECTOR/AGGREGATE approach (in the FM).

I'm not familiar, but will look into this. Thanks.

> IMNSHO denormalizing the data IS a stupid thing to do.
> Are you sure the data base people on the project aren't rectocranially inverted?

Not my call, unfortunately.

Re: Cases to vars with a very large dataset

bwyker
In reply to this post by Mike
Thanks Mike. I have thought of this. The trouble is, anything I do with this data takes a long time, including cutting it up into chunks. I may be forced to do this, but I'm hoping there is a way to do it without having to spend the time slicing and dicing datasets.

Re: Cases to vars with a very large dataset

David Marso
Administrator
In reply to this post by bwyker
"Sorry I did not provide this info. It varies, but the short answer is I understand there are going to be A LOT of variables. At most, probably somewhere around 100  - 200 IDs."

So this megamonster is going to end up sitting in a mere 100-200 rows?
I'm afraid you have lost me here.
Or are you switching references, and mean 100-200 records per ID, as defined in the CASESTOVARS code?
Another concern is the usability of such a file in any analytical context.
---
This might take a while to run, but it will give some context on the problem size:

DATASET DECLARE agg.
AGGREGATE OUTFILE=agg /BREAK=ID /RecordCountsPerId=N.
DATASET ACTIVATE agg.
DESCRIPTIVES VARIABLES=RecordCountsPerId /STATISTICS=MEAN MIN MAX STDDEV.
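If the long file can be exported to CSV, the same records-per-ID diagnostic can be run outside SPSS in a single streaming pass. A Python standard-library sketch (the function name is my own; it assumes the ID is the first column):

```python
import csv
from collections import Counter
from statistics import mean, stdev

def records_per_id(path):
    """One pass over a long CSV (ID in first column): distribution of records per ID."""
    counts = Counter()
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)                      # skip header row
        for row in reader:
            counts[row[0]] += 1
    n = list(counts.values())
    return {"ids": len(n), "min": min(n), "max": max(n),
            "mean": mean(n), "sd": stdev(n) if len(n) > 1 else 0.0}
```

On the five-row example from the original post this reports 2 IDs, with a minimum of 2 and a maximum of 3 records per ID.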

"The syntax is pretty simple. Just a restructure, case to vars:
CASESTOVARS  /ID=ID  /GROUPBY=VARIABLE."...

Deceptively simple...
SPSS has to rip through the entire file first, determining a LOT of information:
which variables are constant for all records within an ID -- for all IDs -- (so it can treat them as FIXED), and the maximum number of records across all IDs for each variable, in order to build the dictionary.
Only then can it actually read and flatten the data...
Enough said.
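That first pass (which variables are FIXED, and what is the maximum record count?) can be sketched concretely. A Python standard-library illustration of the same bookkeeping, assuming a CSV input with the ID in the first column (function name and details are my own, not how CASESTOVARS is actually implemented):

```python
import csv
from collections import defaultdict

def scan_dictionary_info(path, id_col=0):
    """First pass: per-ID record counts, and which columns are constant
    within every ID -- the information needed before any flattening."""
    counts = defaultdict(int)
    first_row = {}          # first record seen for each ID
    varying = set()         # column indexes that differ within some ID
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        for row in reader:
            pid = row[id_col]
            counts[pid] += 1
            if pid not in first_row:
                first_row[pid] = row
            else:
                for i, val in enumerate(row):
                    if val != first_row[pid][i]:
                        varying.add(i)
    fixed = [header[i] for i in range(len(header))
             if i != id_col and i not in varying]
    return max(counts.values()), fixed
```

The maximum count sizes the wide dictionary (RXDate1..RXDateN), and the "fixed" columns are the ones CASESTOVARS would carry across as single variables instead of duplicating per record. Note this alone is a full read of the 124 GB file, before the second pass that writes the flattened data.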
---------------------------------------



Re: Cases to vars with a very large dataset

Rich Ulrich
In reply to this post by Andreas Sprenger
[re-posting through Nabble after failure]

Here is a practical consideration: Wouldn't using Matlab assume
that you have a terabyte of memory to hold all the data?

I thought that efficiency would be served by the advice to create
separate files for IDs with different lengths of created data.  (Assuming
the project is worth doing at all, which seems a bit doubtful.)

Another practical consideration: Most PCs don't have internal drives
with multi-terabyte storage.  For an external drive: Is your transfer port
using the latest (fastest) possible transfer rate?

Another practical consideration:  Your users will want a smaller file
available (assuming they have any experience at all) for testing new
syntax, transformations, etc.  Start now by creating a test file for them,
in order to prod them into finding whether there is really a practical
demand for this non-normalized data file.  I suggest creating a typical
set of IDs that uses 1 part per 1440 of the original data:  This way, the
eventual execution time may be estimated by simply re-reading minutes
as days.
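One way to cut such a test file while keeping every sampled ID's records together is to hash the ID rather than sample rows at random. A Python standard-library sketch (function name is my own; it assumes a CSV with the ID in the first column, and the 1-in-1440 rate follows Rich's suggestion):

```python
import csv
from zlib import crc32

def sample_ids(in_path, out_path, denom=1440):
    """Keep every record whose ID hashes into 1 bucket out of `denom`.

    Hashing the ID (instead of sampling rows) guarantees that a sampled
    ID keeps ALL of its rows, so the test file has realistic per-ID
    record counts for timing a CASESTOVARS run.
    """
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader, writer = csv.reader(fin), csv.writer(fout)
        writer.writerow(next(reader))  # copy the header row
        for row in reader:
            if crc32(row[0].encode()) % denom == 0:
                writer.writerow(row)
```

It is also a single streaming pass, so it works on the full file without sorting or loading it into memory.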

--
Rich Ulrich