|
I just saved myself a bunch of time and effort by using Python, and so
like a happy little kid I have to share it with the list: We have a common data extract from one of our legacy mainframe systems in which each row represents a student, and each student can have a record for one or more of the 23 SAT Subject tests. Unfortunately, in the extract file the test scores are arranged in columns labeled SCORE1, SCORE2, etc, for up to 18 different tests. The name of the test is identified in another column labeled CODE1, CODE2, etc. So to make any sense of these data, each score has to be matched up with it accompanying test code. Worse, students can retake tests any number of times, so a particular test score may appear multiple times in a student's record, and only the highest score is usually what we want to analyze. So, at lunch time today I set about writing a Python script to try to recode these data so that each test has its own variable and only the highest score for a test is represented. I'm still pretty new to Python and wasn't sure how to proceed, but within only about two hours I had a fully working solution. With only three lines of Python code I was able to accomplish what it would have taken 432 lines of SPSS syntax to do. I realize this could have been done using a macro, but I also realize I'm only scratching the surface. Absolutely amazing! Thanks SPSS! -- Daniel Robertson Senior Research and Planning Associate Institutional Research and Planning Cornell University, Ithaca NY 14853-2801 607.255.9642 / irp.cornell.edu |
|
Impressive - Congrats!
This sounds like a very manageable task with SPSS syntax, however. In fact, it could be done in just one line of code (provided that the line is very LONG....alright, that's not actually true, but it sounded funny). But seriously, it would be great if you would share your code with the list so we can learn from what you did. -Gary On 6/7/07, Daniel Robertson <[hidden email]> wrote: > > I just saved myself a bunch of time and effort by using Python, and so > like a happy little kid I have to share it with the list: > > We have a common data extract from one of our legacy mainframe systems > in which each row represents a student, and each student can have a > record for one or more of the 23 SAT Subject tests. Unfortunately, in > the extract file the test scores are arranged in columns labeled SCORE1, > SCORE2, etc, for up to 18 different tests. The name of the test is > identified in another column labeled CODE1, CODE2, etc. So to make any > sense of these data, each score has to be matched up with it > accompanying test code. Worse, students can retake tests any number of > times, so a particular test score may appear multiple times in a > student's record, and only the highest score is usually what we want to > analyze. > > So, at lunch time today I set about writing a Python script to try to > recode these data so that each test has its own variable and only the > highest score for a test is represented. I'm still pretty new to Python > and wasn't sure how to proceed, but within only about two hours I had a > fully working solution. With only three lines of Python code I was able > to accomplish what it would have taken 432 lines of SPSS syntax to do. I > realize this could have been done using a macro, but I also realize I'm > only scratching the surface. Absolutely amazing! Thanks SPSS! > > -- > Daniel Robertson > Senior Research and Planning Associate > Institutional Research and Planning > Cornell University, Ithaca NY 14853-2801 > 607.255.9642 / irp.cornell.edu > |
|
I need to apologize for my giddiness yesterday... I'm not a programmer
by any means, so I was pretty amazed I managed to make this work at all. And I'm sure there are better ways to do this (even with Syntax), but fwiw here's my code: BEGIN PROGRAM. import spss listTest = ["EB","MB","CH","CL","EP","FL","FR","GL","GM","IT","JL","KL","LR","LT","M1","M2","MH","PH","SL","SP","UH","WH","WR"] for j in range(0,len(listTest)): spss.Submit("numeric sat" + listTest[j] + " (f8.0).") for i in range(0,len(listTest)): for p in range(1,18): spss.Submit("if(code" + str(p) + " = '" + listTest[i] + "' and (score" + str(p) + " > sat" + listTest[i] + " or sysmis(sat" + listTest[i] + ")=1)) sat" + listTest[i] + " = score" + str(p) + " .") spss.Submit("exe.") END PROGRAM. The first FOR statement declares a new test-specific variable for each test name contained in listTest. This was necessary because the second FOR statement references the new variable. The second FOR statement loops through the i test names contained in listTest and, within each test name loop, loops through the p existing test score variables. If it finds a test name matching the current value of listTest[i] and if the value of the associated existing test score is greater than the current value of the new test score variable (or the value of the new test score variable is system missing) it moves the existing test score to the correct new test score variable. It takes about 30 seconds to run on my machine on a file with 30,000 cases. Any comments are appreciated. Out of curiosity, how would this be done with syntax? best, Dan R. Hal 9000 wrote: > Impressive - Congrats! > > This sounds like a very manageable task with SPSS syntax, however. In > fact, it could be done in just one line of code (provided that the > line is very LONG....alright, that's not actually true, but it sounded > funny). But seriously, it would be great if you would share your code > with the list so we can learn from what you did. > > -Gary > > > On 6/7/07, *Daniel Robertson* <[hidden email] > <mailto:[hidden email]>> wrote: > > I just saved myself a bunch of time and effort by using Python, and so > like a happy little kid I have to share it with the list: > > We have a common data extract from one of our legacy mainframe systems > in which each row represents a student, and each student can have a > record for one or more of the 23 SAT Subject tests. Unfortunately, in > the extract file the test scores are arranged in columns labeled > SCORE1, > SCORE2, etc, for up to 18 different tests. The name of the test is > identified in another column labeled CODE1, CODE2, etc. So to make any > sense of these data, each score has to be matched up with it > accompanying test code. Worse, students can retake tests any number of > times, so a particular test score may appear multiple times in a > student's record, and only the highest score is usually what we > want to > analyze. > > So, at lunch time today I set about writing a Python script to try to > recode these data so that each test has its own variable and only the > highest score for a test is represented. I'm still pretty new to > Python > and wasn't sure how to proceed, but within only about two hours I > had a > fully working solution. With only three lines of Python code I was > able > to accomplish what it would have taken 432 lines of SPSS syntax to > do. I > realize this could have been done using a macro, but I also > realize I'm > only scratching the surface. Absolutely amazing! Thanks SPSS! > > -- > Daniel Robertson > Senior Research and Planning Associate > Institutional Research and Planning > Cornell University, Ithaca NY 14853-2801 > 607.255.9642 / irp.cornell.edu <http://irp.cornell.edu> > > -- Daniel Robertson Senior Research and Planning Associate Institutional Research and Planning Cornell University, Ithaca NY 14853-2801 607.255.9642 / irp.cornell.edu |
| Free forum by Nabble | Edit this page |
