Does anyone have a Python script for automatic dummy coding of a variable?

The problem is that the coding of the variable does not necessarily start with 1, 2, ...:

1 = age 18-25
2 = age 26-30
3 = age 31-36
...

but could be, e.g.:

11 = age 18-25
12 = age 26-30
13 = age 31-36
...

In addition, the values may not be consecutive (11, 12, 14, 17, 18, 20, ...), and missing values could be a problem.

Thanks
Dr. Frank Gaeth
|
AUTORECODE followed by VECTOR and COMPUTE. Will leave it to you to RTFM the details!
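For readers who want to see what that recipe does before opening the manual, here is a minimal pure-Python sketch of the same idea. It is not SPSS code: the function name `dummy_code` and the sample values are made up for illustration, missing values are represented by `None`, and giving missing cases an all-zero row is an assumption about the desired treatment.

```python
def dummy_code(values):
    """Mimic AUTORECODE + VECTOR + COMPUTE for one variable.

    Distinct non-missing values are ranked 1..n (the AUTORECODE step),
    then each case gets a row of n indicators (the VECTOR/COMPUTE step).
    """
    # AUTORECODE step: map each distinct value to a consecutive code 1..n.
    levels = sorted(set(v for v in values if v is not None))
    code_of = dict((v, i + 1) for i, v in enumerate(levels))

    rows = []
    for v in values:
        row = [0] * len(levels)        # VECTOR dummy(n), initialised to 0
        if v is not None:              # missing cases keep an all-zero row
            row[code_of[v] - 1] = 1    # COMPUTE dummy(code) = 1
        rows.append(row)
    return levels, rows


levels, rows = dummy_code([11, 14, None, 12, 17, 11])
print(levels)   # [11, 12, 14, 17]
print(rows[1])  # [0, 0, 1, 0]
```

The point of the intermediate ranking is that codes such as 11, 12, 14, 17 become valid vector subscripts 1 to 4, so no loop over possible values is needed.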
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Thanks, Dave.

I did RTFM (see below) and I do have a working program. However, the program does not use the value labels as the labels of the new variables. Also, I thought Python might offer a more elegant method, e.g. putting all value labels into an array and reading the array into the dummies.

Frank

Here is my code:

*------------------- Data ---------------------------------------------.
input program.
loop a = 1 to 1000 by 1.
end case.
end loop.
end file.
end input program.
COMPUTE v=RV.UNIFORM(0,10).
COMPUTE variable=TRUNC(v).
EXECUTE.
DELETE VARIABLES a v.
EXECUTE.

*--------------------Begin Python ---------------------------------------.
AUTORECODE VARIABLES=variable /INTO b.

BEGIN PROGRAM.
import spss, spssaux
# Number of distinct codes = number of value labels on the autorecoded variable.
n = len(spssaux.VariableDict(['b'])[0].ValueLabels)
print(n)
spss.Submit(r"""
VECTOR dummy("""+str(n)+""").
""")
# One IF per code: set the matching dummy to 1.
i = 0
while i < n:
    i+=1
    spss.Submit(r"""if (b = """+str(i)+""") dummy"""+str(i)+""" = 1.
EXECUTE .
""")
# Everything not set to 1 becomes 0; then drop the helper variable.
spss.Submit(r"""
RECODE dummy1 to dummy"""+str(n)+""" (MISSING=0).
EXECUTE.
DELETE VARIABLES b.
""")
END PROGRAM.
Dr. Frank Gaeth
|
I'm baffled about exactly what you want to do here. If you could spell it out, I'm sure that there would be a solution.
Jon Peck
Senior Software Engineer, IBM
(in reply to drfg2008, 06/09/2011 12:40 PM, Subject: Re: [SPSSX-L] automatic dummy coding)
|
In reply to this post by drfg2008
Hi Frank,
Observation: You have access to the labels via spssaux.VariableDict(['b'])[0].ValueLabels but throw them away by only accessing the length of the array (len). I can't test this, but the following idea should work.

<TOTALLY UNTESTED SEAT OF THE PANTS AIRCODE FROM HELL>
AUTORECODE VARIABLES=variable /INTO b.

BEGIN PROGRAM.
import spss, spssaux
# Don't know if this direct assignment will work (haven't touched my Python books in a while).
labels = spssaux.VariableDict(['b'])[0].ValueLabels
n = len(labels)
# DON'T NEED TO LOOP AND COMPARE. Just slam it in by indexing into the vector.
spss.Submit(r"""
VECTOR dummy("""+str(n)+""").
COMPUTE dummy(b)=1.
""")
i = 0
while i < n:
    i+=1
    # Resolve quotes here: double any single quote inside the label.
    # Assumes the dictionary keys are strings such as "1".
    label = labels[str(i)].replace("'", "''")
    spss.Submit("VARIABLE LABELS dummy"+str(i)+" '"+label+"'.")
# There are likely more elegant looping constructs than while.
END PROGRAM.
</TOTALLY UNTESTED SEAT OF THE PANTS AIRCODE FROM HELL>

BTW, you should PURGE your code of the EXECUTE statements. They are almost always UNNECESSARY!

HTH, David
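The labelling step of this idea can be tried without a running SPSS session by building only the syntax text. What follows is an illustration under assumptions: `build_label_syntax` is a hypothetical helper, the dictionary is assumed to map the autorecoded codes (as strings such as "1") to labels, and single quotes inside a label are doubled so that the label stays one quoted string.

```python
def build_label_syntax(labels, stem="dummy"):
    """Return VARIABLE LABELS commands, one per autorecoded code.

    labels: dict mapping codes (assumed to be strings like "1") to labels.
    stem:   prefix of the dummy variables created by VECTOR.
    """
    commands = []
    # Sort numerically so that dummy2 comes before dummy10.
    for code in sorted(labels, key=int):
        # Double embedded single quotes so the label stays one string literal.
        safe = labels[code].replace("'", "''")
        commands.append("VARIABLE LABELS %s%s '%s'." % (stem, code, safe))
    return "\n".join(commands)


example = {"1": "age 18-25", "2": "age 26-30", "3": "don't know"}
print(build_label_syntax(example))
```

Inside BEGIN PROGRAM the returned text could be handed to spss.Submit in one call instead of submitting each command separately.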
|
In reply to this post by Jon K Peck
My question was obviously misleading. Sorry. I just can't find a way to take the value labels of the original variable and use them as variable labels for the dummy variables. I thought there might be a solution in Python using an array (similar to what is described on the SPSS-IBM website, see below).

array = []
i = 0
FileN = spss.GetVariableCount()  # number of variables (assumed meaning of FileN)
while i < FileN:
    label = spss.GetVariableLabel(i)
    array.append(label)
    i += 1
print(array)

Does something like spss.GetValueLabel(i) exist?

Thanks and sorry again.
Dr. Frank Gaeth
|
vls = spssaux.VariableDict(['x'])[0].ValueLabels

returns a Python dictionary of the value labels for variable x (the name is case sensitive). You could then iterate through that via

for value, label in vls.items():

and do whatever you want with those labels. You can also use the Dataset class for full dictionary access.

I'm still not sure of the goal, but note that the SPSSINC CREATE DUMMIES extension command will create a vector of dummy variables for a variable and will assign the value labels of the input variable as the variable labels of the output.

HTH,
Jon Peck
Senior Software Engineer, IBM
(in reply to drfg2008, 06/09/2011 01:59 PM)
|
In reply to this post by David Marso
Thanks David,

-> Observation: You have access to the labels via spssaux.VariableDict(['b'])[0].ValueLabels but throw them away by only accessing the length of the array (len).

I didn't even realize that I already had the solution. Thanks for your advice. Well, here's the result (in case anyone wants to integrate a script for dummy coding and cannot use SPSSINC CREATE DUMMIES). The label replacement is a bit awkward.

Cheers,
Frank

*
* Dummy-Coding with automatic labelling
* www.frag-einen-statistiker.de
*
* Comment: Script starts at: "Begin Python"
* Change "variable" into the name of the variable that is to be dummy-coded
* AUTORECODE VARIABLES=variable <-here !
* /INTO b.

*------------------- Example Data ---------------------------------------------.
input program.
loop a = 1 to 1000 by 1.
end case.
end loop.
end file.
end input program.
COMPUTE v=RV.UNIFORM(1,5).
COMPUTE variable=TRUNC(v).
EXECUTE.
DELETE VARIABLES a v.
EXECUTE.

*--------------------Begin Python (here starts the script)---------------------------------------.
*GET FILE='C:\<path>\filename.SAV'.
AUTORECODE VARIABLES=variable /INTO b.

BEGIN PROGRAM.
import spss, spssaux
n = len(spssaux.VariableDict(['b'])[0].ValueLabels)
print(n)
spss.Submit(r"""VECTOR dummy("""+str(n)+""").
""")
i = 0
while i < n:
    i+=1
    spss.Submit(r"""if (b = """+str(i)+""") dummy"""+str(i)+""" = 1.
""")
spss.Submit(r"""
RECODE dummy1 to dummy"""+str(n)+""" (MISSING=0).
EXECUTE.
""")
mydict = spssaux.VariableDict(['b'])[0].ValueLabels
for code, label in mydict.items():
    # print "Code: %s Label: %s" % (code, label)
    # Turn the value label into something usable as a variable name.
    label=label.replace(" ", "_")
    label=label.replace("ä", "ae")
    label=label.replace("ö", "oe")
    label=label.replace("ü", "ue")
    label=label.replace("ß", "ss")
    label=label.replace("!", "_")
    label=label.replace("?", "_")
    label=label.replace("µ", "_")
    label=label.replace("$", "S")
    label=label.replace("€", "Eur")
    label=label.replace("@", "a")
    label=label.replace("(", "_")
    label=label.replace(")", "_")
    label=label.replace("[", "_")
    label=label.replace("]", "_")
    label=label.replace("}", "_")
    label=label.replace("{", "_")
    label=label.replace("&", "_")
    label=label.replace("/", "_")
    label=label.replace("%", "_")
    label=label.replace("§", "_")
    label=label.replace("'", "_")
    label=label.replace("*", "_")
    label=label.replace("+", "_")
    label=label.replace("-", "_")
    label=label.replace(" ", "_")
    label=label.replace("*", "_")
    label=label.replace("#", "_")
    label=label.replace("~", "_")
    label=label.replace(":", "_")
    label=label.replace(",", "_")
    label=label.replace(";", "_")
    label=label.replace("<", "_")
    label=label.replace(">", "_")
    label=label.replace("|", "_")
    label = "v_"+ label
    befehl = "RENAME VARIABLES dummy"+code+" = "+label+"."
    print(befehl)
    spss.Submit(befehl)
END PROGRAM.

DELETE VARIABLES b.
Dr. Frank Gaeth
|
Use a regular expression to clean up the invalid characters:

import re
label = re.sub(r"[list-all-the-invalid-characters]", "_", label)

Include the square brackets. That expression means: match any occurrence of a character in the list and replace it with _. E.g., re.sub(r"[abc]", "_", "aAxxxc") would return '_Axxx_'.

Jon Peck
Senior Software Engineer, IBM
(in reply to drfg2008, 06/10/2011 10:18 AM)
|
Thanks, but it doesn't work:

label = re.sub(r"[!§$%&/()=?´´*#'µ<>-µ€@öäüÖÄÜ^°]", "_", label)

Error message:

Traceback (most recent call last):
  File "<string>", line 25, in <module>
NameError: name 're' is not defined
Dr. Frank Gaeth
|
Frank,
You need the import re (I bet you don't have that in your code).

General observation:
Variable names can be 64 bytes long. Value labels can be 120 bytes long. DING DING DING!!!
Multiple variables may have the same set of value labels. DING DING DING!!!
A variable named b might already exist in the data file. DING DING DING!!!
Very long variable names are unwieldy and a total pain in the ass to work with. DING DING DING!!!
---------
ergo...
Consider scrapping your current RENAME approach and simply apply the VALUE labels to the new variables as VARIABLE LABELS. In this case you won't need to 'scrub' the variable names with the re.sub business.

Also, as I previously posted:

AUTORECODE VARIABLES=var /INTO new.
VECTOR dummy(maxlength_from_python).
COMPUTE dummy(new)=1.

is MUCH MORE EFFICIENT than looping through all possible values and submitting IF (new=i) dummy(i)=1 multiple times.

HTH, Have a great weekend.
David
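The "a variable named b might already exist" point can be handled before AUTORECODE is submitted. A minimal sketch, assuming an ASCII stem and assuming that names are compared without regard to case; `unused_name` is a hypothetical helper, not part of spss or spssaux.

```python
def unused_name(existing, stem="tmp", limit=64):
    """Return a variable name based on stem that is not in existing.

    existing: iterable of variable names already in the data file.
    limit:    maximum name length (64 bytes, as noted above; ASCII assumed).
    """
    taken = set(name.lower() for name in existing)  # compare case-insensitively
    candidate, counter = stem, 0
    while candidate.lower() in taken or len(candidate) > limit:
        counter += 1
        suffix = "_%d" % counter
        # Trim the stem so that stem + suffix still fits within the limit.
        candidate = stem[: limit - len(suffix)] + suffix
    return candidate


print(unused_name(["a", "B", "b_1"], stem="b"))  # b_2
```

In a real session the list of existing names could be collected with spss.GetVariableName(i) for i in range(spss.GetVariableCount()).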
|
In reply to this post by drfg2008
My instruction said to start with

import re

Otherwise you can't use things in the re module.

There is one additional problem. The hyphen has a special meaning inside square brackets in a regular expression, so you need to put it first:

re.sub(r"[-!§$%&/()=?´´*#'µ<>µ€@öäüÖÄÜ^°]", "_", label)

If you are trying to ensure that the result is a legitimate variable name, though, note that all those characters with umlauts are legal in a variable name, as are some of the other special characters.

Jon Peck
Senior Software Engineer, IBM
(in reply to drfg2008, 06/10/2011 11:53 AM)
|
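The hyphen rule Jon describes can be checked outside SPSS with a small sketch. The character lists here are shortened, ASCII-only stand-ins chosen for illustration, not the full list from the thread.

```python
import re

label = "age 18-25 (years)"

# Hyphen first: it is a literal character, not a range operator.
hyphen_first = re.sub(r"[-() ]", "_", label)
print(hyphen_first)  # age_18_25__years_

# A hyphen between two characters defines a range. Here ">-z" covers every
# character from ">" to "z", which includes all ASCII letters.
as_range = re.sub(r"[>-z]", "_", label)
print(as_range)  # ___ 18-25 (_____)
```

The second call shows why a misplaced hyphen is easy to miss: it raises no error, it simply replaces far more characters than intended.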
In reply to this post by David Marso
David,
after I read your message, I found myself under the table (DING DING DING!!!). Well, this might be a better approach (see below). However, the new version with VECTOR instead of a loop produces a warning in some cases, though it works fine.

Happy Pentecost

~~~~~~~~~~~~~~~~~~~
>Warning # 525
>An attempt was made to store a value into an element of a vector the subscript
>of which was missing or otherwise invalid. The subscript must be a positive
>integer and must not be greater than the length of the vector. No store can
>occur.
>Command line: 53 Current case: 22 Current splitfile group: 1
>Warning # 92
>The limit of MXWARNS warnings in this data pass has been printed. Further
>warnings have been suppressed.
~~~~~~~~~~~~~~~~~~~

input program.
loop a = 1 to 1000 by 1.
end case.
end loop.
end file.
end input program.
COMPUTE v=RV.UNIFORM(1,5).
COMPUTE variable=TRUNC(v).
EXECUTE.
DELETE VARIABLES a v.
EXECUTE.

*--------------------Begin Python (here starts the script)---------------------------------------.
*GET FILE='C:\<path>\S3000.SAV'.

BEGIN PROGRAM.
import spss, spssaux
#variable="v179"
variable="variable"
# Long, unlikely name for the temporary autorecoded variable.
laufvariable="NBDPJLDAEHJFAOCHDGDFKFIL54848184463901c2dd25340dfcdbdf39"
dummy="dummy_" + variable +"_"
spss.Submit(r"""AUTORECODE VARIABLES="""+variable+""" /INTO """+laufvariable+""" .""")
n = len(spssaux.VariableDict([laufvariable])[0].ValueLabels)
# Index into the vector instead of looping over the codes.
spss.Submit(r"""
VECTOR """+dummy+"""("""+str(n)+""").
COMPUTE """+dummy+"""("""+laufvariable+""")=1.
EXECUTE .
""")
spss.Submit(r"""
RECODE """+dummy+str(1)+""" to """+dummy+str(n)+""" (MISSING=0).
EXECUTE.
""")
# Use the value labels as variable labels of the dummies.
mydict = spssaux.VariableDict([laufvariable])[0].ValueLabels
for code, label in mydict.items():
    befehl = "VARIABLE LABELS "+dummy+code+" ' "+label+" '."
    print(befehl)
    spss.Submit(befehl)
spss.Submit(r"""DELETE VARIABLES """+laufvariable+""".""")
END PROGRAM.
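One possible explanation for warning 525, offered as an assumption rather than a diagnosis: for cases where the autorecoded variable is missing there is no valid subscript, so the COMPUTE into the vector cannot store anything. A sketch of a syntax builder that skips such cases with an IF condition follows. The helper name `guarded_dummy_syntax` and the variable names in the example call are hypothetical, and the generated syntax has not been run against SPSS.

```python
def guarded_dummy_syntax(tmp, stem, n):
    """Build syntax that fills n dummies and skips cases with a missing code.

    tmp:  name of the variable created by AUTORECODE (already submitted).
    stem: prefix for the dummy variables.
    n:    number of value labels found on tmp.
    """
    lines = [
        "VECTOR %s(%d)." % (stem, n),
        # Only index into the vector when the subscript is a valid code.
        "IF (NOT MISSING(%s)) %s(%s) = 1." % (tmp, stem, tmp),
        # Cases skipped above end up with 0 on every dummy.
        "RECODE %s1 TO %s%d (MISSING=0)." % (stem, stem, n),
        "EXECUTE.",
    ]
    return "\n".join(lines)


print(guarded_dummy_syntax("tmp_code", "dummy_age_", 4))
```

Whether an all-zero row is the right treatment for missing cases depends on the analysis; leaving the dummies missing for those cases is the alternative, in which case the RECODE line would need a different rule.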
Dr. Frank Gaeth
|