Using Python to process SPSS data

Using Python to process SPSS data

mgriffiths
I've inherited several files containing thousands of lines of SPSS syntax for processing a dataset of responses to a complex survey. Most of the code derives new output variables on the basis of case-by-case data (mainly using DO IF and COMPUTE) and it doesn't use any complex SPSS procedures - mainly just MATCH, FREQUENCIES and CROSSTABS. The syntax outputs various files along the way.

I've made some improvements by using SPSS macros (e.g. I wrote a macro to annualise amounts depending on the corresponding period code, rather than copying and pasting a chunk of syntax throughout the code). But I now have SPSS Python Essentials (v21) and I am wondering if I would be better off starting to reimplement the code in Python.

In particular, one issue is that for each year of the survey, the input variables change slightly as the questionnaire is updated. I'd like to be able to use Python to separate out functions to derive each output variable from the raw inputs, so that the code can easily be updated when the inputs change. At the moment, the process of updating the code is rather laborious.

Does anyone have advice on how I might approach this task, if SPSS Python programming is indeed a sensible way to go about it? Is the best way to process case-by-case data to use the spssdata class? I realise this is a rather general query, but thanks for any help.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Using Python to process SPSS data

Jon K Peck
Python certainly has the potential to simplify and generalize jobs.  Here are some resources that may help.

A blog post on the SPSS Community site
Using SPSSINC PROGRAM and generalizing your code vs writing macros in SPSS Statistics
https://www.ibm.com/developerworks/community/blogs/ab16c38e-2f7b-4912-a47e-85682d124d32/entry/using_spssinc_program_and_generalizing_your_code_vs_writing_macros_in_spss_statistics?lang=en

The Programming and Data Management book
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/We70df3195ec8_4f95_9773_42e448fa9029/page/Books%20and%20Articles

Some extension commands are implemented in Python but work like regular syntax, and they can be a big help in generalization.
In particular, the SPSSINC SELECT VARIABLES command generates macro definitions from variable metadata such as patterns in names, variable type, measurement level, and custom attributes.
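
As a rough illustration of that idea (a sketch only, not the extension command's own implementation or syntax), the plain Python below picks variable names by a regular-expression pattern and builds a DEFINE ... !ENDDEFINE macro from them. The function name define_macro and the variable names are made up for the example; inside SPSS the names would come from the data dictionary rather than a hard-coded list.

```python
import re


def define_macro(macro_name, variable_names, pattern):
    """Return SPSS DEFINE syntax listing the variables whose whole name matches pattern."""
    # Anchor the pattern at the end so that only full-name matches are selected.
    regex = re.compile(r"(?:" + pattern + r")\Z")
    selected = [name for name in variable_names if regex.match(name)]
    return "DEFINE {0}() {1} !ENDDEFINE.".format(macro_name, " ".join(selected))


# Hypothetical variable names, standing in for the active dataset's dictionary.
names = ["id", "inc_amt", "inc_per", "rent_amt", "rent_per", "region"]

# Collect every "amount" variable into one macro.
amounts_macro = define_macro("!amounts", names, r".*_amt")
print(amounts_macro)
```

Once a macro like this is defined, later syntax can refer to !amounts instead of a hand-maintained variable list, so a change in the questionnaire only affects the selection step.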

I will send you, offline, a PowerPoint I wrote a few years ago entitled
"Increasing productivity with SPSS Statistics: Generalization and Automation".

 

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621




From:        Martin Griffiths <[hidden email]>
To:        [hidden email]
Date:        02/12/2015 07:45 AM
Subject:        [SPSSX-L] Using Python to process SPSS data
Sent by:        "SPSSX(r) Discussion" <[hidden email]>





Re: Using Python to process SPSS data

Richard Ristow
In reply to this post by mgriffiths
At 09:37 AM 2/12/2015, Martin Griffiths wrote:

>I've inherited several files containing thousands of lines of SPSS
>syntax for processing a dataset of responses to a complex survey.
>Most of the code derives new output variables on the basis of
>case-by-case data. ... One issue is that for each year of the
>survey, the input variables change slightly as the questionnaire is
>updated. I'd like to be able [to use Python] to separate out
>functions to derive each output variable from the raw inputs, so
>that the code can easily be updated when the inputs change.
>
>Does anyone have advice on how I might approach this task, if SPSS
>Python programming is indeed a sensible way to go about it? Is the
>best way to process case-by-case data to use the spssdata class?

There are at least two quite different ways to use Python. One is
what (I think) you're talking about: using Python code *instead of*
SPSS code, through the spssdata module. The other is using Python as a
super-macro tool *to generate* SPSS code, doing the actual processing
in SPSS.

From what you describe, your problem fits the latter approach
better: You already have SPSS code to do what you need to do; the
problem is, it needs to be changed for each new survey. If you use
Python (or, possibly, macros) to generate the changes, then:
- You'll be able to use most of the code you already have, rather
  than re-implementing its functionality in Python.
- Your code will be mostly native SPSS code. It'll be readable by
  other SPSS users who haven't gone through the additional step of
  becoming fluent in Python.
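
To make the "generate SPSS code" approach concrete, here is a minimal sketch that takes the annualising example from the original question. Everything specific in it is an assumption for illustration: the period codes and multipliers, the variable names, and the helper names annualise_syntax and build_syntax are all hypothetical. The point is only that the per-year differences live in one small table, while the SPSS logic is written once.

```python
# Hypothetical period codes and multipliers; the real survey's coding may differ.
PERIOD_MULTIPLIERS = {1: 52, 2: 12, 3: 1}  # 1 = weekly, 2 = monthly, 3 = annual

# Hypothetical input names per survey year: output -> (amount var, period var).
INPUTS_BY_YEAR = {
    2014: {"annual_rent": ("rent_amt", "rent_per")},
    2015: {"annual_rent": ("rentamt2", "rentper2")},
}


def annualise_syntax(out_var, amount_var, period_var):
    """Return DO IF / COMPUTE syntax that annualises one amount variable."""
    lines = []
    for i, (code, mult) in enumerate(sorted(PERIOD_MULTIPLIERS.items())):
        keyword = "DO IF" if i == 0 else "ELSE IF"
        lines.append("{0} ({1} = {2}).".format(keyword, period_var, code))
        lines.append("COMPUTE {0} = {1} * {2}.".format(out_var, amount_var, mult))
    lines.append("END IF.")
    return "\n".join(lines)


def build_syntax(year):
    """Generate the derivation syntax for every output variable in one survey year."""
    blocks = [
        annualise_syntax(out_var, amount_var, period_var)
        for out_var, (amount_var, period_var) in sorted(INPUTS_BY_YEAR[year].items())
    ]
    return "\n".join(blocks)


syntax_2015 = build_syntax(2015)
print(syntax_2015)
```

Run inside SPSS (for example in a BEGIN PROGRAM block), the generated text could then be passed to spss.Submit so that SPSS itself does the processing. Updating for a new year would then mean adding one entry to the table rather than editing the syntax by hand.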

Now, you write:
>For each year of the survey, the input variables change slightly as
>the questionnaire is updated. I'd like to be able ... to derive each
>output variable from the raw inputs, so that the code can easily be
>updated when the inputs change.

The question is, how extensive are the changes? Is it as simple as
the questions being the same each year, with variable names changing
-- like Q1.2014 and Q2.2014 on last year's survey, becoming Q1.2015
and Q2.2015 for this year's survey? If so, the changed code could
certainly be generated in Python, but since the change can be made
without accessing the data dictionary, it's well within the capability
of a macro.
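
For that simple case, one possible tactic (an assumption on my part, not something the existing code is known to do) is to strip the year suffix up front, so the rest of the syntax can refer to stable names. A short sketch, with the helper name rename_syntax made up for the example:

```python
def rename_syntax(question_numbers, year):
    """Return RENAME VARIABLES syntax mapping Qn.<year> names to plain Qn names."""
    old_names = ["Q{0}.{1}".format(q, year) for q in question_numbers]
    new_names = ["Q{0}".format(q) for q in question_numbers]
    return "RENAME VARIABLES ({0} = {1}).".format(
        " ".join(old_names), " ".join(new_names)
    )


rename_2015 = rename_syntax([1, 2], 2015)
print(rename_2015)
```

Only the year argument changes from one survey to the next, which is exactly the kind of substitution a macro argument could also handle.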

Could you post what kind of changes need to be made for each
year?  That would help us give more informed answers to your questions.
