Hello all,
I've been struggling for months to restructure an enormous file in SPSS and would greatly appreciate any help. The main file is a merge of data from 14 states, totaling over 411 million rows of data (45 variables) in a file that is 124 GB in size. I also have these data separated by state and year (2007 and 2008), which brings each file down to about 12 GB.

In each file, each participant has several rows of data, which look something like this:

ID     RXDate      RXCode ...
1001   1/1/2007    1234
1001   7/12/2007   5678
9999   2/12/2007   1369
9999   8/1/2007    2468
9999   9/20/2007   1234

I'm trying to perform a relatively simple task: transposing the rows of data into columns, grouped by ID. It should look something like this:

ID     RXDate1     RXDate2     RXDate3     RXCode1   RXCode2   RXCode3
1001   1/1/2007    7/12/2007   .           1234      5678      .
9999   2/12/2007   8/1/2007    9/20/2007   1369      2468      1234

I've run the command to do this in SPSS several times. It executes, but it never ends -- on both the large 124 GB file and the smaller 12 GB files. I literally had the large file running around the clock for more than a month without completing.

It's not looking promising for SPSS Statistics, but maybe I'm missing something. I also have SPSS Modeler. I'm not that familiar with it, but it seems some combination of the Aggregate, Transpose, and/or Restructure nodes might make this work.

Any suggestions?

Thanks in advance.
Brett
Hello Brett. Why do you want one row per ID? What type of analysis (or analyses) are you planning to do? I ask because perhaps there's a way to do them with the long file format.
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."
Hi Bruce,
Thanks for the reply. Good point. Unfortunately, I do not know what analyses will be run. I've just been given the task of merging and structuring the data.

Brett
In reply to this post by bwyker
1 - Are you sure the code is correct? Have you tested it on a smaller subset of the data?
2 - Have you dropped all unneeded variables from the dataset before the reshaping?

FYI: since you are introducing missing cells in the wide format, the data will be larger than in the long format -- potentially much larger -- depending on the maximum and average number of repeats each individual has.
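For point 2, a minimal sketch of the pre-reshape trim (the file and variable names here are hypothetical):

* Keep only the variables the reshape actually needs.
GET FILE='rx_long.sav'.
SAVE OUTFILE='rx_slim.sav'
  /KEEP=ID RXDate RXCode.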
In reply to this post by bwyker
Please read the "TIPS on posting questions to the SPSS list" topic I bumped yesterday.

In this case you should have posted your syntax. Useful info here would be the actual number of unique IDs and the max number of Rx values and dates. Is the file PRESORTED by ID and date? Maybe your hardware isn't up to the file size -- how much free disk space? Maybe break into several sets of IDs, run the task on each, then ADD FILES the results? Maybe see the VECTOR/AGGREGATE approach (in the FM). IMNSHO, denormalizing the data IS a stupid thing to do. Are you sure the database people on the project aren't rectocranially inverted?
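For what it's worth, a minimal sketch of that VECTOR/AGGREGATE idiom -- untested at this scale, and assuming RXCode is numeric and the maximum number of records per ID is known in advance (here, 3):

* Number each record within its ID.
SORT CASES BY ID RXDate.
IF ($CASENUM EQ 1 OR ID NE LAG(ID)) seq = 1.
IF (ID EQ LAG(ID)) seq = LAG(seq) + 1.
* Park each record's values in its own slot of a vector.
VECTOR rxd(3) /rxc(3).
COMPUTE rxd(seq) = RXDate.
COMPUTE rxc(seq) = RXCode.
* Collapse to one row per ID; MAX simply picks up the one filled slot per record.
AGGREGATE OUTFILE=*
  /BREAK=ID
  /rxd1 TO rxd3 = MAX(rxd1 TO rxd3)
  /rxc1 TO rxc3 = MAX(rxc1 TO rxc3).
FORMATS rxd1 TO rxd3 (ADATE10).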
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
In reply to this post by bwyker
Shalom
When doing the transposing that you are attempting, you have to remember that the new table will have your 45 variables (codes) times the maximum number of lines per ID. If one of your IDs has 100 lines and the average is 5 lines per ID, your new table will be almost 20 times bigger. I think you should check the maximum number of lines per ID, and maybe drop cases that have too many lines.

Hillel Vardi
In reply to this post by bwyker
Brett,
I first started using SPSS in 1995, and even with all the hardware limitations at the time, I was able to load and analyze data from a 10-year longitudinal study.

If all you need to do is shrink the size of the data set, I'd recommend looking at the data "Type" of each variable. The amount of space SPSS allocates to each variable is based on how the user defines it (e.g., Numeric, Dot, Comma, or String). Also, as Bruce mentioned, think about the analyses and results. Not every analysis will need or require ALL VARS, so perhaps you could determine sets of analyses that are not related, even if some share the same variables, and strip out all data not relevant to each particular analysis.

Good luck,

Joseph
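If over-wide strings turn out to be part of the problem, one hedged sketch (the variable name is hypothetical, and ALTER TYPE requires Statistics 16 or later):

* Inspect the declared types and widths first.
DISPLAY DICTIONARY.
* Shrink an over-wide string to the width the data actually uses.
ALTER TYPE state (A2).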
In reply to this post by bwyker
I'd suggest getting a count of records for each ID and splitting the data into different files based on the number of records (e.g., all IDs that have only one record would go into one file and would only need their variable names changed). I assume that SPSS chokes when it has a lot of records per ID, so it would make sense to run those separately from the rest. Then merge the files together.

-Mike Palij
New York University
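A rough sketch of that split (assumes a numeric ID; the file names are placeholders):

SORT CASES BY ID.
* Attach a per-ID record count to every row.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK=ID
  /nrec=N.
* Save each stratum without disturbing the working file.
TEMPORARY.
SELECT IF (nrec EQ 1).
SAVE OUTFILE='rx_onerec.sav'.
TEMPORARY.
SELECT IF (nrec GT 1).
SAVE OUTFILE='rx_multirec.sav'.
* After restructuring each piece, stack the wide files back together.
ADD FILES FILE='rx_onerec_wide.sav' /FILE='rx_multirec_wide.sav'.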
In reply to this post by JosephYoungblood
"The amount of space SPSS allocates
each variable is based on how the user defines each variable (e.g., Numeric, Dot, Comma or string)." Actually, all NUMERIC fields take the same amount of space (8 bytes). Dot, Comma, Date, F, N etc. Strings a multiple of 8 bytes [(A9) takes actually 16 bytes]. Actually maybe double or triple that due to Unicode. I'm sure Jon will chime in to correct me if that is incorrect. BUT my point is that all numeric values are created equal WRT memory requirements!
In reply to this post by bwyker
Thank you for clarifying, David. While I admit I don't know why this is the case, I have been able to reduce the total file size of a very large data set by as much as 20% by changing data types.

Cheers,

Joseph
In reply to this post by David Marso
David is right that all numerics are 8 bytes, and string space is allocated in multiples of 8 bytes. But the A size is in bytes, so An takes the same number of bytes whether in Unicode or code page mode.

Sav file compression depends on the actual contents -- small integer values and blank strings compress heavily -- but that does not affect memory usage when passing the data.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
phone: 720-342-5621
In reply to this post by hillel vardi
Shalom
After some more thinking, it looks to me that you are using the date variable as the index variable in CASESTOVARS. In that case you will have 45 variables times the number of days from the minimum date to the maximum date. If your average is 5 lines per ID and you have data for 5 years, then you are trying to create a table roughly 365 times bigger than your original -- a table of some petabytes. You may not have the memory for such a big table.

Hillel Vardi
"It looks to me that you are using the date
variable as the index variable in casesstovars " I don't see where you get this idea. My only advice to OP is to confront whoever is suggesting this denormalized format and demand some sort of justification. Review downstream analytical requirements before building out this monster. In any case the show stopper is the possibly large number of records for possibly a small number of people which will blow up the problem size. Note also that SPSS needs to evaluate the entire dataset to create the dictionary before building out the resulting dataset. Is it necessary to have all variables in the file per case or is it ok to have them in separate data files? So, moving forward? What needs to be done with this data? Just because some manager/hack decided that all the data needs to be on one record per ID doesn't mean that is a best or even good solution (especially if it has been running for a month ;-).
In reply to this post by bwyker
A question I haven't seen asked in the responses to your posting: you write of 45 variables in the file, but your problem depends on only three -- ID, RXDate, and RXCode. What are the other variables -- other prescriptions, or something different? And what do you want to do with them once you've gone from many records per ID to only one?
In reply to this post by bwyker
Coming back to Brett's primary question of how to reorganize the data (cases to vars): with a very large dataset I would use different software (e.g., Matlab) to prepare the data for analysis in SPSS. I would guess 20 lines of Matlab would be sufficient to read, transpose, and write the data.

Andreas
Really? You base this on what amount of experience with SPSS?
I'll see your BS and raise you two cow pies.
In reply to this post by David Marso
> Please read the TIPS on posting questions to SPSS list topic I bumped yesterday.
Will do. Just found this forum this morning and am using it for the first time -- apologies.

> In this case you should have posted your syntax.

The syntax is pretty simple. Just a restructure, cases to vars:

CASESTOVARS
  /ID=ID
  /GROUPBY=VARIABLE.

> Useful info here would be to know the actual number of unique ID's, the max number of Rx values and dates. Is the file PRESORTED by ID and Date?

Sorry I did not provide this info. It varies, but the short answer is I understand there are going to be A LOT of variables. At most, probably somewhere around 100 - 200 IDs. Most of the 45 variables are actually the same for all IDs (e.g., state of the respondent), so not every variable will be duplicated by the number of IDs. At any rate, as I understand it, there isn't a limit to the number of variables SPSS will include in a dataset, even though there are other reasons not to have thousands of variables in a dataset.

> Maybe your hardware isn't up to the file size? How much free disk space?

I've tried on several devices and have quite a bit of space. I don't believe this is a hardware problem.

> Maybe break into several sets of IDs, run task, then ADD files?

I've definitely considered the brute-force option as a last resort. However, this would take a considerable amount of time and manpower.

> Maybe see VECTOR/AGGREGATE approach (in the FM).

I'm not familiar with it, but will look into this. Thanks.

> IMNSHO denormalizing the data IS a stupid thing to do. Are you sure the data base people on the project aren't rectocranially inverted?

Not my call, unfortunately.
In reply to this post by Mike
Thanks, Mike. I have thought of this. The trouble is, anything I do with this data takes a long time, including cutting it up into chunks. I may be forced to do this, but I'm hoping there is a way to do it without having to spend the time slicing and dicing datasets.
In reply to this post by bwyker
"Sorry I did not provide this info. It varies, but the short answer is I understand there are going to be A LOT of variables. At most, probably somewhere around 100 - 200 IDs."
So this megamonster is going to end up sitting in a mere 100-200 rows? I'm afraid you have lost me here. Or are you switching references, and it's 100-200 records per ID as defined in the CASESTOVARS code? Another concern is the usability of such a file in any analytical context.

This might take a while to run, but it will give some context on the problem size:

DATASET DECLARE agg.
AGGREGATE OUTFILE=agg
  /BREAK=ID
  /RecordCountsPerId=N.
DATASET ACTIVATE agg.
DESCRIPTIVES RecordCountsPerId
  /STATISTICS MEAN MIN MAX STDDEV.

"The syntax is pretty simple. Just a restructure, cases to vars: CASESTOVARS /ID=ID /GROUPBY=VARIABLE."

Deceptively simple... SPSS has to rip through the entire file, determining a LOT of information: which variables are constant for all records within an ID -- for all IDs -- so it can treat them as FIXED, and the maximum number of records across all IDs for each variable, in order to build the dictionary. Only then can it actually read and flatten the data. Enough said.
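If you stick with CASESTOVARS, spelling some of that out yourself may spare part of the scan. A hedged sketch, assuming you already know which variables are constant within ID (the names on /FIXED and /DROP are hypothetical):

CASESTOVARS
  /ID=ID
  /FIXED=state year
  /DROP=payer county
  /GROUPBY=VARIABLE.

Anything on /FIXED is written once per ID instead of being spread into numbered copies, and anything on /DROP is left out of the wide file entirely.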
In reply to this post by Andreas Sprenger
Here is a practical consideration: wouldn't using Matlab assume that you have a terabyte of memory to hold all the data? I thought efficiency would be served by the earlier advice to create separate files for IDs with different numbers of records. (Assuming the project is worth doing at all, which seems a bit doubtful.)

Another practical consideration: most PCs don't have internal drives with multi-terabyte storage. For an external drive: is your transfer port using the latest (fastest) possible transfer rate?

Another practical consideration: your users will want a smaller file available (assuming they have any experience at all) for testing new syntax, transformations, etc. Start now by creating a test file for them, in order to prod them into finding out whether there is really a practical demand for this non-normalized data file. I suggest creating a typical set of IDs that uses 1 part per 1440 of the original data: this way, the eventual execution time may be estimated by simply re-reading minutes as days.

--
Rich Ulrich
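A hedged sketch of cutting such a test file (assumes a numeric ID; keeping whole IDs preserves the long-to-wide structure, and MOD(ID, 1440) keeps roughly 1 ID in 1440 if the ID values are evenly distributed; the output file name is a placeholder):

* Keep every record for roughly 1 in 1440 IDs.
TEMPORARY.
SELECT IF (MOD(ID, 1440) EQ 0).
SAVE OUTFILE='rx_test.sav'.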