Concatenate Variables with Python

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

Concatenate Variables with Python

Craig Johnson
I'm cleaning up some old data sets and converting item level responses to response strings.  This requires concatenating hundreds of variables (~500) for thousands of cases (60-500k).  I have a short macro that runs for smaller jobs but is a dog the scope of this project.  i'm wondering if there is a way to leverage python to concatenate the variables using .join() and pass the results back to SPSS quicker.  I'm just not sure how you would go about doing so or if it would work. 

Conceptually something like....

begin program. 

#List of vars to concatenate
vars=[v1, v2, ...... v500]

#Compute new variable in python and pass back to SPSS....some how
SPSS_OUTPUT_VAR="".join(vars)

end program.

 Thoughts?
Reply | Threaded
Open this post in threaded view
|

Re: Concatenate Variables with Python

Jon K Peck
I would do this by generating COMPUTE statements with the concat function using Python.  That would be much faster.
It is not clear what you are concatenating: all 500 into 1? pairs? other sets?



Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621




From:        Craig J <[hidden email]>
To:        [hidden email],
Date:        12/13/2013 05:34 PM
Subject:        [SPSSX-L] Concatenate Variables with Python
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




I'm cleaning up some old data sets and converting item level responses to response strings.  This requires concatenating hundreds of variables (~500) for thousands of cases (60-500k).  I have a short macro that runs for smaller jobs but is a dog the scope of this project.  i'm wondering if there is a way to leverage python to concatenate the variables using .join() and pass the results back to SPSS quicker.  I'm just not sure how you would go about doing so or if it would work. 

Conceptually something like....


begin program. 

#List of vars to concatenate
vars=[v1, v2, ...... v500]

#Compute new variable in python and pass back to SPSS....some how
SPSS_OUTPUT_VAR="".join(vars)

end program.

 Thoughts?
Reply | Threaded
Open this post in threaded view
|

Re: Concatenate Variables with Python

Art Kendall
In reply to this post by Craig Johnson
this seems like an unusual thing to do.

What is the purpose of this effort?
Are you trying to have a fixed format file to pass to other software?

There may be other ways of reaching your goal.
Art Kendall
Social Research Consultants
On 12/13/2013 7:34 PM, Craig Johnson [via SPSSX Discussion] wrote:
I'm cleaning up some old data sets and converting item level responses to response strings.  This requires concatenating hundreds of variables (~500) for thousands of cases (60-500k).  I have a short macro that runs for smaller jobs but is a dog the scope of this project.  i'm wondering if there is a way to leverage python to concatenate the variables using .join() and pass the results back to SPSS quicker.  I'm just not sure how you would go about doing so or if it would work. 

Conceptually something like....

begin program. 

#List of vars to concatenate
vars=[v1, v2, ...... v500]

#Compute new variable in python and pass back to SPSS....some how
SPSS_OUTPUT_VAR="".join(vars)

end program.

 Thoughts?



If you reply to this email, your message will be added to the discussion below:
http://spssx-discussion.1045642.n5.nabble.com/Concatenate-Variables-with-Python-tp5723631.html
To start a new topic under SPSSX Discussion, email [hidden email]
To unsubscribe from SPSSX Discussion, click here.
NAML

Art Kendall
Social Research Consultants
Reply | Threaded
Open this post in threaded view
|

Re: Concatenate Variables with Python

David Marso
Administrator
In reply to this post by Craig Johnson
"I'm cleaning up some old data sets and converting item level responses to response strings..."
This seems like a rather bizarre practice at best.
As Art and Jon have asked, WHY do you want to do this?  To what end?
Maybe the macro which is a dog is poorly written and needs dog food or medical attention?
Assuming what you are doing has some reasonable purpose :
see WRITE and DATA LIST commands.
--
Craig Johnson wrote
I'm cleaning up some old data sets and converting item level responses to
response strings.  This requires concatenating hundreds of variables (~500)
for thousands of cases (60-500k).  I have a short macro that runs for
smaller jobs but is a dog the scope of this project.  i'm wondering if
there is a way to leverage python to concatenate the variables using
.join() and pass the results back to SPSS quicker.  I'm just not sure how
you would go about doing so or if it would work.

Conceptually something like....

begin program.

#List of vars to concatenate
vars=[v1, v2, ...... v500]

#Compute new variable in python and pass back to SPSS....some how
SPSS_OUTPUT_VAR="".join(vars)

end program.

 Thoughts?
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Concatenate Variables with Python

Craig Johnson
In reply to this post by Jon K Peck
Thanks Jon, creating the compute statement with python is probably the most parsimonious method. I'll give it a shot.

Answers to other questions: 
It is not clear what you are concatenating: all 500 into 1? pairs? other sets?
*The 500 variables are being concatenated into a single string variable.  Eventually the standardized variable will be stored in a database. 

What is the purpose of this effort?
*Data standardization and storage.  We have an infinite number of assessments with differing numbers of items. All items are numeric values.  Rather than creating a unique table for each assessment or constantly adjusting database structure, the responses are stored as a response string ('13451'). A single table can then handle all assessments with three variables (Case ID, assessmentid, response string).  When the data is pulled for analysis it's substringed into N variables. This process puts the data manipulation emphasis on the analyst rather than the data storage.  


Are you trying to have a fixed format file to pass to other software?
*The data will be stored in a relational database. 

This seems like a rather bizarre practice at best.
*This is a data storage issue rather than an analysis issue and it's an important distinction.  For analysis we obviously use all single 500 variables.  For data storage we store all responses as a single string say "1010101."  The structure  is very common in the industry...probably the norm.  Although it may seem odd the structure has advantages for data management which is why it is used. Single simple flexible table structure, performance gains for retrieving/storing data, and even data cleaning.  Depending on the purpose the data structure is extremely handy. What about trying to handle the data live?  There is basically no need.  Assessments are taken, handled, and stored.  Any retrieval of the data requires all responses and the frequency is very low so it's not a performance issue.

Thanks to all for the different suggestions!




Reply | Threaded
Open this post in threaded view
|

Re: Concatenate Variables with Python

David Marso
Administrator

WOW! Why do people get blindsided by python when there are trivial native PARSIMONIOUS ways of doing things?

*DATA SIMULATION (omit for your data dump).
MATRIX.
SAVE TRUNC(UNIFORM(100000,500)*10) / OUTFILE */ VARIABLES V001 TO V500.
END MATRIX.


* "Concatenate" 500 variables .
WRITE OUTFILE "C:\TEMP\CRAP.dat" / V001 TO V500 (500F1.0).
EXECUTE.


* Confirm result *.
DATA LIST FILE "C:\TEMP\CRAP.dat" / X1 (A500).


Craig Johnson wrote
Thanks Jon, creating the compute statement with python is probably the most
parsimonious method. I'll give it a shot.

Answers to other questions:
It is not clear what you are concatenating: all 500 into 1? pairs? other
sets?
*The 500 variables are being concatenated into a single string variable.
 Eventually the standardized variable will be stored in a database.

What is the purpose of this effort?
*Data standardization and storage.  We have an infinite number of
assessments with differing numbers of items. All items are numeric values.
 Rather than creating a unique table for each assessment or constantly
adjusting database structure, the responses are stored as a response string
('13451'). A single table can then handle all assessments with three
variables (Case ID, assessmentid, response string).  When the data is
pulled for analysis it's substringed into N variables. This process puts
the data manipulation emphasis on the analyst rather than the data storage.



Are you trying to have a fixed format file to pass to other software?
*The data will be stored in a relational database.

This seems like a rather bizarre practice at best.
*This is a data storage issue rather than an analysis issue and it's an
important distinction.  For analysis we obviously use all single 500
variables.  For data storage we store all responses as a single string say
"1010101."  The structure  is very common in the industry...probably the
norm.  Although it may seem odd the structure has advantages for data
management which is why it is used. Single simple flexible table structure,
performance gains for retrieving/storing data, and even data cleaning.
 Depending on the purpose the data structure is extremely handy. What about
trying to handle the data live?  There is basically no need.  Assessments
are taken, handled, and stored.  Any retrieval of the data requires all
responses and the frequency is very low so it's not a performance issue.

Thanks to all for the different suggestions!
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Concatenate Variables with Python

Richard Ristow
In reply to this post by Craig Johnson
At 11:29 AM 12/15/2013, Craig J wrote:

>We have an [infinite] unbounded number of assessments with differing
>numbers of items. All items are numeric values.  Rather than
>creating a unique table for each assessment or constantly adjusting
>database structure, the responses are stored as a response string
>('13451'). A single table can then handle all assessments with three
>variables (Case ID, assessmentid, response string).  When the data
>is pulled for analysis it's substringed into N variables.

In that case, assuming your responses are contiguous in the SPSS
dataset and named Response1 to Response75, how about

STRING ResponseString (A500) /* Maximum number of response in any study */.

STRING #RespChar      (A1).

DO REPEAT Answer = Response1 TO Response75.
.  COMPUTE #RespChar = STRING(Answer,F1).
.  COMPUTE ResponseString = CONCAT(RTRIM(ResponseString),#RespChar).
END REPEAT.

I don't see that anything you could do in Python would be easier than this.

This may be a case where an assigning SUBSTR would be faster -- but
this requires you NOT be in Unicode mode:


DO REPEAT Answer = Response1 TO Response75
          /AnsNum =     1     TO   75.
.  COMPUTE #RespChar = STRING(Answer,F1).
.  COMPUTE SUBSTR(ResponseString,AnsNum,1) = #RespChar.
END REPEAT.

In either case, then

SAVE OUTFILE=Crunched
     /KEEP= Case ID assessmentid ResponseString.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Concatenate Variables with Python

Jon K Peck
In reply to this post by David Marso
I agree, no need for Python in this simple case.  I was expecting a more complicated problem.

OTOH, you could do this.

begin program.
import spss
spss.Submit(r"""WRITE OUTFILE "C:\TEMP\CRAP.dat" / V001 TO V500 (500F1.0).""")
end program.


:-)


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621




From:        David Marso <[hidden email]>
To:        [hidden email],
Date:        12/15/2013 09:57 AM
Subject:        Re: [SPSSX-L] Concatenate Variables with Python
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




WOW! Why do people get blindsided by python when there are trivial native
PARSIMONIOUS ways of doing things?

*
* "Concatenate" 500 variables .
WRITE OUTFILE "C:\TEMP\CRAP.dat" / V001 TO V500 (500F1.0).
EXECUTE.*

* Confirm result *.
DATA LIST FILE "C:\TEMP\CRAP.dat" / X1 (A500).



Craig Johnson wrote
> Thanks Jon, creating the compute statement with python is probably the
> most
> parsimonious method. I'll give it a shot.
>
> Answers to other questions:
> It is not clear what you are concatenating: all 500 into 1? pairs? other
> sets?
> *The 500 variables are being concatenated into a single string variable.
>  Eventually the standardized variable will be stored in a database.
>
> What is the purpose of this effort?
> *Data standardization and storage.  We have an infinite number of
> assessments with differing numbers of items. All items are numeric values.
>  Rather than creating a unique table for each assessment or constantly
> adjusting database structure, the responses are stored as a response
> string
> ('13451'). A single table can then handle all assessments with three
> variables (Case ID, assessmentid, response string).  When the data is
> pulled for analysis it's substringed into N variables. This process puts
> the data manipulation emphasis on the analyst rather than the data
> storage.
>
>
>
> Are you trying to have a fixed format file to pass to other software?
> *The data will be stored in a relational database.
>
> This seems like a rather bizarre practice at best.
> *This is a data storage issue rather than an analysis issue and it's an
> important distinction.  For analysis we obviously use all single 500
> variables.  For data storage we store all responses as a single string say
> "1010101."  The structure  is very common in the industry...probably the
> norm.  Although it may seem odd the structure has advantages for data
> management which is why it is used. Single simple flexible table
> structure,
> performance gains for retrieving/storing data, and even data cleaning.
>  Depending on the purpose the data structure is extremely handy. What
> about
> trying to handle the data live?  There is basically no need.  Assessments
> are taken, handled, and stored.  Any retrieval of the data requires all
> responses and the frequency is very low so it's not a performance issue.
>
> Thanks to all for the different suggestions!





-----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/Concatenate-Variables-with-Python-tp5723631p5723645.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD


Reply | Threaded
Open this post in threaded view
|

Re: Concatenate Variables with Python

David Marso
Administrator
;-)
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Concatenate Variables with Python

David Marso
Administrator
In reply to this post by Richard Ristow
If there were an [infinite] number of assessments you will want to run it on an OS that would NEVER require a reboot and hope that someone would look in on it every hundred years to be sure it was still doing the job. ;-))))))))))))

Richard Ristow wrote
At 11:29 AM 12/15/2013, Craig J wrote:

>We have an [infinite] unbounded number of assessments with differing
>numbers of items. All items are numeric values.  Rather than
>creating a unique table for each assessment or constantly adjusting
>database structure, the responses are stored as a response string
>('13451'). A single table can then handle all assessments with three
>variables (Case ID, assessmentid, response string).  When the data
>is pulled for analysis it's substringed into N variables.

In that case, assuming your responses are contiguous in the SPSS
dataset and named Response1 to Response75, how about

STRING ResponseString (A500) /* Maximum number of response in any study */.

STRING #RespChar      (A1).

DO REPEAT Answer = Response1 TO Response75.
.  COMPUTE #RespChar = STRING(Answer,F1).
.  COMPUTE ResponseString = CONCAT(RTRIM(ResponseString),#RespChar).
END REPEAT.

I don't see that anything you could do in Python would be easier than this.

This may be a case where an assigning SUBSTR would be faster -- but
this requires you NOT be in Unicode mode:


DO REPEAT Answer = Response1 TO Response75
          /AnsNum =     1     TO   75.
.  COMPUTE #RespChar = STRING(Answer,F1).
.  COMPUTE SUBSTR(ResponseString,AnsNum,1) = #RespChar.
END REPEAT.

In either case, then

SAVE OUTFILE=Crunched
     /KEEP= Case ID assessmentid ResponseString.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Concatenate Variables with Python

Craig Johnson
Thanks for the example Richard.  I think I'll give that one shot tomorrow and see how we end up.  Glad to see there was a more simplistic answer.  




On Sun, Dec 15, 2013 at 6:15 PM, David Marso <[hidden email]> wrote:
If there were an [infinite] number of assessments you will want to run it on
an OS that would NEVER require a reboot and hope that someone would look in
on it every hundred years to be sure it was still doing the job.
;-))))))))))))


Richard Ristow wrote
> At 11:29 AM 12/15/2013, Craig J wrote:
>
>>We have an [infinite] unbounded number of assessments with differing
>>numbers of items. All items are numeric values.  Rather than
>>creating a unique table for each assessment or constantly adjusting
>>database structure, the responses are stored as a response string
>>('13451'). A single table can then handle all assessments with three
>>variables (Case ID, assessmentid, response string).  When the data
>>is pulled for analysis it's substringed into N variables.
>
> In that case, assuming your responses are contiguous in the SPSS
> dataset and named Response1 to Response75, how about
>
> STRING ResponseString (A500) /* Maximum number of response in any study
> */.
>
> STRING #RespChar      (A1).
>
> DO REPEAT Answer = Response1 TO Response75.
> .  COMPUTE #RespChar = STRING(Answer,F1).
> .  COMPUTE ResponseString = CONCAT(RTRIM(ResponseString),#RespChar).
> END REPEAT.
>
> I don't see that anything you could do in Python would be easier than
> this.
>
> This may be a case where an assigning SUBSTR would be faster -- but
> this requires you NOT be in Unicode mode:
>
>
> DO REPEAT Answer = Response1 TO Response75
>           /AnsNum =     1     TO   75.
> .  COMPUTE #RespChar = STRING(Answer,F1).
> .  COMPUTE SUBSTR(ResponseString,AnsNum,1) = #RespChar.
> END REPEAT.
>
> In either case, then
>
> SAVE OUTFILE=Crunched
>      /KEEP= Case ID assessmentid ResponseString.
>
> =====================
> To manage your subscription to SPSSX-L, send a message to

> LISTSERV@.UGA

>  (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD





-----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Concatenate-Variables-with-Python-tp5723631p5723650.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Reply | Threaded
Open this post in threaded view
|

Re: Concatenate Variables with Python

Jon K Peck
Note that in Unicode mode there is no need for the RTRIM.


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621




From:        Craig J <[hidden email]>
To:        [hidden email],
Date:        12/15/2013 08:35 PM
Subject:        Re: [SPSSX-L] Concatenate Variables with Python
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




Thanks for the example Richard.  I think I'll give that one shot tomorrow and see how we end up.  Glad to see there was a more simplistic answer.  




On Sun, Dec 15, 2013 at 6:15 PM, David Marso <david.marso@...> wrote:
If there were an [infinite] number of assessments you will want to run it on
an OS that would NEVER require a reboot and hope that someone would look in
on it every hundred years to be sure it was still doing the job.
;-))))))))))))


Richard Ristow wrote

> At 11:29 AM 12/15/2013, Craig J wrote:
>
>>We have an [infinite] unbounded number of assessments with differing
>>numbers of items. All items are numeric values.  Rather than
>>creating a unique table for each assessment or constantly adjusting
>>database structure, the responses are stored as a response string
>>('13451'). A single table can then handle all assessments with three
>>variables (Case ID, assessmentid, response string).  When the data
>>is pulled for analysis it's substringed into N variables.
>
> In that case, assuming your responses are contiguous in the SPSS
> dataset and named Response1 to Response75, how about
>
> STRING ResponseString (A500) /* Maximum number of response in any study
> */.
>
> STRING #RespChar      (A1).
>
> DO REPEAT Answer = Response1 TO Response75.
> .  COMPUTE #RespChar = STRING(Answer,F1).
> .  COMPUTE ResponseString = CONCAT(RTRIM(ResponseString),#RespChar).
> END REPEAT.
>
> I don't see that anything you could do in Python would be easier than
> this.
>
> This may be a case where an assigning SUBSTR would be faster -- but
> this requires you NOT be in Unicode mode:
>
>
> DO REPEAT Answer = Response1 TO Response75
>           /AnsNum =     1     TO   75.
> .  COMPUTE #RespChar = STRING(Answer,F1).
> .  COMPUTE SUBSTR(ResponseString,AnsNum,1) = #RespChar.
> END REPEAT.
>
> In either case, then
>
> SAVE OUTFILE=Crunched
>      /KEEP= Case ID assessmentid ResponseString.
>

> =====================
> To manage your subscription to SPSSX-L, send a message to

> LISTSERV@.UGA

>  (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD





-----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
--

View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Concatenate-Variables-with-Python-tp5723631p5723650.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to

LISTSERV@... (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD


Reply | Threaded
Open this post in threaded view
|

Re: Concatenate Variables with Python

Richard Ristow
In reply to this post by David Marso
At 09:15 PM 12/15/2013, David Marso wrote:

>If there were an [infinite] number of assessments you will want to
>run it on an OS that would NEVER require a reboot and hope that
>someone would look in on it every hundred years to be sure it was
>still doing the job.

No point in checking in on it. After all, if there were an infinite
number of assessments, the job would be no nearer finished after 100
years than at the beginning.

The distinction between an 'infinite' and an 'unbounded' quantity is
easy to mistake, especially as 'unbounded' (a finite quantity, but
one that can be increased without limit) is an inherently dynamic
notion -- there's no such thing as an 'unbounded' static quantity.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD