Python success story

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Python success story

Daniel Robertson
I just saved myself a bunch of time and effort by using Python, and so
like a happy little kid I have to share it with the list:

We have a common data extract from one of our legacy mainframe systems
in which each row represents a student, and each student can have a
record for one or more of the 23 SAT Subject tests. Unfortunately, in
the extract file the test scores are arranged in columns labeled SCORE1,
SCORE2, etc, for up to 18 different tests. The name of the test is
identified in another column labeled CODE1, CODE2, etc. So to make any
sense of these data, each score has to be matched up with it
accompanying test code. Worse, students can retake tests any number of
times, so a particular test score may appear multiple times in a
student's record, and only the highest score is usually what we want to
analyze.

So, at lunch time today I set about writing a Python script to try to
recode these data so that each test has its own variable and only the
highest score for a test is represented. I'm still pretty new to Python
and wasn't sure how to proceed, but within only about two hours I had a
fully working solution. With only three lines of Python code I was able
to accomplish what it would have taken 432 lines of SPSS syntax to do. I
realize this could have been done using a macro, but I also realize I'm
only scratching the surface. Absolutely amazing! Thanks SPSS!

--
Daniel Robertson
Senior Research and Planning Associate
Institutional Research and Planning
Cornell University, Ithaca NY 14853-2801
607.255.9642 / irp.cornell.edu
Reply | Threaded
Open this post in threaded view
|

Re: Python success story

Hal 9000
Impressive - Congrats!

This sounds like a very manageable task with SPSS syntax, however. In fact,
it could be done in just one line of code (provided that the line is very
LONG....alright, that's not actually true, but it sounded funny). But
seriously, it would be great if you would share your code with the list so
we can learn from what you did.

-Gary


On 6/7/07, Daniel Robertson <[hidden email]> wrote:

>
> I just saved myself a bunch of time and effort by using Python, and so
> like a happy little kid I have to share it with the list:
>
> We have a common data extract from one of our legacy mainframe systems
> in which each row represents a student, and each student can have a
> record for one or more of the 23 SAT Subject tests. Unfortunately, in
> the extract file the test scores are arranged in columns labeled SCORE1,
> SCORE2, etc, for up to 18 different tests. The name of the test is
> identified in another column labeled CODE1, CODE2, etc. So to make any
> sense of these data, each score has to be matched up with it
> accompanying test code. Worse, students can retake tests any number of
> times, so a particular test score may appear multiple times in a
> student's record, and only the highest score is usually what we want to
> analyze.
>
> So, at lunch time today I set about writing a Python script to try to
> recode these data so that each test has its own variable and only the
> highest score for a test is represented. I'm still pretty new to Python
> and wasn't sure how to proceed, but within only about two hours I had a
> fully working solution. With only three lines of Python code I was able
> to accomplish what it would have taken 432 lines of SPSS syntax to do. I
> realize this could have been done using a macro, but I also realize I'm
> only scratching the surface. Absolutely amazing! Thanks SPSS!
>
> --
> Daniel Robertson
> Senior Research and Planning Associate
> Institutional Research and Planning
> Cornell University, Ithaca NY 14853-2801
> 607.255.9642 / irp.cornell.edu
>
Reply | Threaded
Open this post in threaded view
|

Re: Python success story

Daniel Robertson
I need to apologize for my giddiness yesterday... I'm not a programmer
by any means, so I was pretty amazed I managed to make this work at all.
And I'm sure there are better ways to do this (even with Syntax), but
fwiw here's my code:

BEGIN PROGRAM.

import spss

listTest =
["EB","MB","CH","CL","EP","FL","FR","GL","GM","IT","JL","KL","LR","LT","M1","M2","MH","PH","SL","SP","UH","WH","WR"]

for j in range(0,len(listTest)):
    spss.Submit("numeric sat" + listTest[j] + " (f8.0).")

for i in range(0,len(listTest)):
    for p in range(1,18):
        spss.Submit("if(code" + str(p) + " = '" + listTest[i] + "' and
(score" + str(p)  + " > sat" + listTest[i] + " or sysmis(sat" +
listTest[i] + ")=1)) sat" + listTest[i] + " = score" + str(p)  + " .")

spss.Submit("exe.")

END PROGRAM.

The first FOR statement declares a new test-specific variable for each
test name contained in listTest. This was necessary because the second
FOR statement references the new variable.

The second FOR statement loops through the i test names contained in
listTest and, within each test name loop, loops through the p existing
test score variables. If it finds a test name matching the current value
of listTest[i] and if the value of the associated existing test score is
greater than the current value of the new test score variable (or the
value of the new test score variable is system missing) it moves the
existing test score to the correct new test score variable. It takes
about 30 seconds to run on my machine on a file with 30,000 cases.

Any comments are appreciated. Out of curiosity, how would this be done
with syntax?

best,
Dan R.

Hal 9000 wrote:

> Impressive - Congrats!
>
> This sounds like a very manageable task with SPSS syntax, however. In
> fact, it could be done in just one line of code (provided that the
> line is very LONG....alright, that's not actually true, but it sounded
> funny). But seriously, it would be great if you would share your code
> with the list so we can learn from what you did.
>
> -Gary
>
>
> On 6/7/07, *Daniel Robertson* <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     I just saved myself a bunch of time and effort by using Python, and so
>     like a happy little kid I have to share it with the list:
>
>     We have a common data extract from one of our legacy mainframe systems
>     in which each row represents a student, and each student can have a
>     record for one or more of the 23 SAT Subject tests. Unfortunately, in
>     the extract file the test scores are arranged in columns labeled
>     SCORE1,
>     SCORE2, etc, for up to 18 different tests. The name of the test is
>     identified in another column labeled CODE1, CODE2, etc. So to make any
>     sense of these data, each score has to be matched up with it
>     accompanying test code. Worse, students can retake tests any number of
>     times, so a particular test score may appear multiple times in a
>     student's record, and only the highest score is usually what we
>     want to
>     analyze.
>
>     So, at lunch time today I set about writing a Python script to try to
>     recode these data so that each test has its own variable and only the
>     highest score for a test is represented. I'm still pretty new to
>     Python
>     and wasn't sure how to proceed, but within only about two hours I
>     had a
>     fully working solution. With only three lines of Python code I was
>     able
>     to accomplish what it would have taken 432 lines of SPSS syntax to
>     do. I
>     realize this could have been done using a macro, but I also
>     realize I'm
>     only scratching the surface. Absolutely amazing! Thanks SPSS!
>
>     --
>     Daniel Robertson
>     Senior Research and Planning Associate
>     Institutional Research and Planning
>     Cornell University, Ithaca NY 14853-2801
>     607.255.9642 / irp.cornell.edu <http://irp.cornell.edu>
>
>

--
Daniel Robertson
Senior Research and Planning Associate
Institutional Research and Planning
Cornell University, Ithaca NY 14853-2801
607.255.9642 / irp.cornell.edu