I'm cleaning up some old data sets and converting item level responses to response strings. This requires concatenating hundreds of variables (~500) for thousands of cases (60-500k). I have a short macro that runs for smaller jobs but is a dog the scope of this project. i'm wondering if there is a way to leverage python to concatenate the variables using .join() and pass the results back to SPSS quicker. I'm just not sure how you would go about doing so or if it would work.
Conceptually something like.... begin program. #List of vars to concatenate vars=[v1, v2, ...... v500] #Compute new variable in python and pass back to SPSS....some how
SPSS_OUTPUT_VAR="".join(vars) end program. Thoughts? |
I would do this by generating COMPUTE statements
with the concat function using Python. That would be much faster.
It is not clear what you are concatenating: all 500 into 1? pairs? other sets? Jon Peck (no "h") aka Kim Senior Software Engineer, IBM [hidden email] phone: 720-342-5621 From: Craig J <[hidden email]> To: [hidden email], Date: 12/13/2013 05:34 PM Subject: [SPSSX-L] Concatenate Variables with Python Sent by: "SPSSX(r) Discussion" <[hidden email]> I'm cleaning up some old data sets and converting item level responses to response strings. This requires concatenating hundreds of variables (~500) for thousands of cases (60-500k). I have a short macro that runs for smaller jobs but is a dog the scope of this project. i'm wondering if there is a way to leverage python to concatenate the variables using .join() and pass the results back to SPSS quicker. I'm just not sure how you would go about doing so or if it would work. Conceptually something like.... begin program. #List of vars to concatenate vars=[v1, v2, ...... v500] #Compute new variable in python and pass back to SPSS....some how SPSS_OUTPUT_VAR="".join(vars) end program. Thoughts? |
In reply to this post by Craig Johnson
this seems like an
unusual thing to do.
What is the purpose of this effort? Are you trying to have a fixed format file to pass to other software? There may be other ways of reaching your goal. Art Kendall Social Research ConsultantsOn 12/13/2013 7:34 PM, Craig Johnson [via SPSSX Discussion] wrote:
Art Kendall
Social Research Consultants |
Administrator
|
In reply to this post by Craig Johnson
"I'm cleaning up some old data sets and converting item level responses to response strings..."
This seems like a rather bizarre practice at best. As Art and Jon have asked, WHY do you want to do this? To what end? Maybe the macro which is a dog is poorly written and needs dog food or medical attention? Assuming what you are doing has some reasonable purpose : see WRITE and DATA LIST commands. --
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
In reply to this post by Jon K Peck
Thanks Jon, creating the compute statement with python is probably the most parsimonious method. I'll give it a shot.
Answers to other questions: It is not clear what you are concatenating: all 500 into 1? pairs? other sets?
*The 500 variables are being concatenated into a single string variable. Eventually the standardized variable will be stored in a database. What is the purpose of this effort? *Data standardization and storage. We have an infinite number of assessments with differing numbers of items. All items are numeric values. Rather than creating a unique table for each assessment or constantly adjusting database structure, the responses are stored as a response string ('13451'). A single table can then handle all assessments with three variables (Case ID, assessmentid, response string). When the data is pulled for analysis it's substringed into N variables. This process puts the data manipulation emphasis on the analyst rather than the data storage.
Are you trying to have a fixed format file to pass to other software? *The data will be stored in a relational database.
This seems like a rather bizarre practice at best. *This is a data storage issue rather than an analysis issue and it's an important distinction. For analysis we obviously use all single 500 variables. For data storage we store all responses as a single string say "1010101." The structure is very common in the industry...probably the norm. Although it may seem odd the structure has advantages for data management which is why it is used. Single simple flexible table structure, performance gains for retrieving/storing data, and even data cleaning. Depending on the purpose the data structure is extremely handy. What about trying to handle the data live? There is basically no need. Assessments are taken, handled, and stored. Any retrieval of the data requires all responses and the frequency is very low so it's not a performance issue.
Thanks to all for the different suggestions! |
Administrator
|
WOW! Why do people get blindsided by python when there are trivial native PARSIMONIOUS ways of doing things? *DATA SIMULATION (omit for your data dump). MATRIX. SAVE TRUNC(UNIFORM(100000,500)*10) / OUTFILE */ VARIABLES V001 TO V500. END MATRIX. * "Concatenate" 500 variables . WRITE OUTFILE "C:\TEMP\CRAP.dat" / V001 TO V500 (500F1.0). EXECUTE. * Confirm result *. DATA LIST FILE "C:\TEMP\CRAP.dat" / X1 (A500).
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
In reply to this post by Craig Johnson
At 11:29 AM 12/15/2013, Craig J wrote:
>We have an [infinite] unbounded number of assessments with differing >numbers of items. All items are numeric values. Rather than >creating a unique table for each assessment or constantly adjusting >database structure, the responses are stored as a response string >('13451'). A single table can then handle all assessments with three >variables (Case ID, assessmentid, response string). When the data >is pulled for analysis it's substringed into N variables. In that case, assuming your responses are contiguous in the SPSS dataset and named Response1 to Response75, how about STRING ResponseString (A500) /* Maximum number of response in any study */. STRING #RespChar (A1). DO REPEAT Answer = Response1 TO Response75. . COMPUTE #RespChar = STRING(Answer,F1). . COMPUTE ResponseString = CONCAT(RTRIM(ResponseString),#RespChar). END REPEAT. I don't see that anything you could do in Python would be easier than this. This may be a case where an assigning SUBSTR would be faster -- but this requires you NOT be in Unicode mode: DO REPEAT Answer = Response1 TO Response75 /AnsNum = 1 TO 75. . COMPUTE #RespChar = STRING(Answer,F1). . COMPUTE SUBSTR(ResponseString,AnsNum,1) = #RespChar. END REPEAT. In either case, then SAVE OUTFILE=Crunched /KEEP= Case ID assessmentid ResponseString. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
In reply to this post by David Marso
I agree, no need for Python in this simple
case. I was expecting a more complicated problem.
OTOH, you could do this. begin program. import spss spss.Submit(r"""WRITE OUTFILE "C:\TEMP\CRAP.dat" / V001 TO V500 (500F1.0).""") end program. :-) Jon Peck (no "h") aka Kim Senior Software Engineer, IBM [hidden email] phone: 720-342-5621 From: David Marso <[hidden email]> To: [hidden email], Date: 12/15/2013 09:57 AM Subject: Re: [SPSSX-L] Concatenate Variables with Python Sent by: "SPSSX(r) Discussion" <[hidden email]> WOW! Why do people get blindsided by python when there are trivial native PARSIMONIOUS ways of doing things? * * "Concatenate" 500 variables . WRITE OUTFILE "C:\TEMP\CRAP.dat" / V001 TO V500 (500F1.0). EXECUTE.* * Confirm result *. DATA LIST FILE "C:\TEMP\CRAP.dat" / X1 (A500). Craig Johnson wrote > Thanks Jon, creating the compute statement with python is probably the > most > parsimonious method. I'll give it a shot. > > Answers to other questions: > It is not clear what you are concatenating: all 500 into 1? pairs? other > sets? > *The 500 variables are being concatenated into a single string variable. > Eventually the standardized variable will be stored in a database. > > What is the purpose of this effort? > *Data standardization and storage. We have an infinite number of > assessments with differing numbers of items. All items are numeric values. > Rather than creating a unique table for each assessment or constantly > adjusting database structure, the responses are stored as a response > string > ('13451'). A single table can then handle all assessments with three > variables (Case ID, assessmentid, response string). When the data is > pulled for analysis it's substringed into N variables. This process puts > the data manipulation emphasis on the analyst rather than the data > storage. > > > > Are you trying to have a fixed format file to pass to other software? > *The data will be stored in a relational database. > > This seems like a rather bizarre practice at best. > *This is a data storage issue rather than an analysis issue and it's an > important distinction. For analysis we obviously use all single 500 > variables. For data storage we store all responses as a single string say > "1010101." The structure is very common in the industry...probably the > norm. Although it may seem odd the structure has advantages for data > management which is why it is used. Single simple flexible table > structure, > performance gains for retrieving/storing data, and even data cleaning. > Depending on the purpose the data structure is extremely handy. What > about > trying to handle the data live? There is basically no need. Assessments > are taken, handled, and stored. Any retrieval of the data requires all > responses and the frequency is very low so it's not a performance issue. > > Thanks to all for the different suggestions! ----- Please reply to the list and not to my personal email. Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" -- View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Concatenate-Variables-with-Python-tp5723631p5723645.html Sent from the SPSSX Discussion mailing list archive at Nabble.com. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Administrator
|
;-)
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Administrator
|
In reply to this post by Richard Ristow
If there were an [infinite] number of assessments you will want to run it on an OS that would NEVER require a reboot and hope that someone would look in on it every hundred years to be sure it was still doing the job. ;-))))))))))))
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Thanks for the example Richard. I think I'll give that one shot tomorrow and see how we end up. Glad to see there was a more simplistic answer. On Sun, Dec 15, 2013 at 6:15 PM, David Marso <[hidden email]> wrote: If there were an [infinite] number of assessments you will want to run it on |
Note that in Unicode mode there is no need
for the RTRIM.
Jon Peck (no "h") aka Kim Senior Software Engineer, IBM [hidden email] phone: 720-342-5621 From: Craig J <[hidden email]> To: [hidden email], Date: 12/15/2013 08:35 PM Subject: Re: [SPSSX-L] Concatenate Variables with Python Sent by: "SPSSX(r) Discussion" <[hidden email]> Thanks for the example Richard. I think I'll give that one shot tomorrow and see how we end up. Glad to see there was a more simplistic answer. On Sun, Dec 15, 2013 at 6:15 PM, David Marso <david.marso@...> wrote: If there were an [infinite] number of assessments you will want to run it on an OS that would NEVER require a reboot and hope that someone would look in on it every hundred years to be sure it was still doing the job. ;-)))))))))))) Richard Ristow wrote > At 11:29 AM 12/15/2013, Craig J wrote: > >>We have an [infinite] unbounded number of assessments with differing >>numbers of items. All items are numeric values. Rather than >>creating a unique table for each assessment or constantly adjusting >>database structure, the responses are stored as a response string >>('13451'). A single table can then handle all assessments with three >>variables (Case ID, assessmentid, response string). When the data >>is pulled for analysis it's substringed into N variables. > > In that case, assuming your responses are contiguous in the SPSS > dataset and named Response1 to Response75, how about > > STRING ResponseString (A500) /* Maximum number of response in any study > */. > > STRING #RespChar (A1). > > DO REPEAT Answer = Response1 TO Response75. > . COMPUTE #RespChar = STRING(Answer,F1). > . COMPUTE ResponseString = CONCAT(RTRIM(ResponseString),#RespChar). > END REPEAT. > > I don't see that anything you could do in Python would be easier than > this. > > This may be a case where an assigning SUBSTR would be faster -- but > this requires you NOT be in Unicode mode: > > > DO REPEAT Answer = Response1 TO Response75 > /AnsNum = 1 TO 75. > . COMPUTE #RespChar = STRING(Answer,F1). > . COMPUTE SUBSTR(ResponseString,AnsNum,1) = #RespChar. > END REPEAT. > > In either case, then > > SAVE OUTFILE=Crunched > /KEEP= Case ID assessmentid ResponseString. > > ===================== > To manage your subscription to SPSSX-L, send a message to > LISTSERV@.UGA > (not to SPSSX-L), with no body text except the > command. To leave the list, send the command > SIGNOFF SPSSX-L > For a list of commands to manage subscriptions, send the command > INFO REFCARD ----- Please reply to the list and not to my personal email. Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" -- View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Concatenate-Variables-with-Python-tp5723631p5723650.html Sent from the SPSSX Discussion mailing list archive at Nabble.com. ===================== To manage your subscription to SPSSX-L, send a message to LISTSERV@... (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
In reply to this post by David Marso
At 09:15 PM 12/15/2013, David Marso wrote:
>If there were an [infinite] number of assessments you will want to >run it on an OS that would NEVER require a reboot and hope that >someone would look in on it every hundred years to be sure it was >still doing the job. No point in checking in on it. After all, if there were an infinite number of assessments, the job would be no nearer finished after 100 years than at the beginning. The distinction between an 'infinite' and an 'unbounded' quantity is easy to mistake, especially as 'unbounded' (a finite quantity, but one that can be increased without limit) is an inherently dynamic notion -- there's no such thing as an 'unbounded' static quantity. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Free forum by Nabble | Edit this page |