automatic dummy coding

automatic dummy coding

drfg2008
Does anyone have a Python script for automatic dummy coding of a variable?

The problem is that the coding of the variable does not necessarily start with 1, 2, ...

1 = age 18-25
2 = age 26-30
3 = age 31-36

...

but could be, e.g.,

11 = age 18-25
12 = age 26-30
13 = age 31-36

...

Also, the values might not be consecutive: 11, 12, 14, 17, 18, 20, ...
Missing values could be a problem as well.

thanks
Dr. Frank Gaeth


Re: automatic dummy coding

David Marso
Administrator
AUTORECODE followed by VECTOR and COMPUTE.  Will leave it to you to RTFM the details!
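
For readers who think in Python rather than SPSS syntax, the logic of that recipe can be sketched outside SPSS. This is only an illustration in plain Python, not the SPSS API: autorecode and dummy_code are hypothetical helper names, and None stands in for a missing value.

```python
def autorecode(values):
    """Map each distinct non-missing value to a consecutive index 1..n (ascending)."""
    distinct = sorted(set(v for v in values if v is not None))
    return {value: index for index, value in enumerate(distinct, start=1)}


def dummy_code(values):
    """Return the mapping and one row of 0/1 dummies per case."""
    mapping = autorecode(values)
    n = len(mapping)
    rows = []
    for v in values:
        row = [0] * n
        if v is not None:
            # Same idea as COMPUTE dummy(b)=1: index straight into the vector.
            row[mapping[v] - 1] = 1
        rows.append(row)  # a missing case keeps an all-zero row
    return mapping, rows


codes = [11, 12, 14, 11, None, 20]
mapping, rows = dummy_code(codes)
print(mapping)
for row in rows:
    print(row)
```

The codes 11, 12, 14, 20 do not need to start at 1 or be consecutive, because the intermediate mapping takes care of that.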
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: automatic dummy coding

drfg2008
Thanks, Dave

I did RTFM (see below) and I do have a working program. However, the program does not use the value labels as the new variable labels. Also, I thought Python might offer a more elegant method, e.g. putting all value labels into an array and reading the array into the dummies.

Frank

Here is my code:


*------------------- Data  ---------------------------------------------.

input program.
loop a =1 to 1000 by 1.
end case.
end loop.
end file.
end input program.
COMPUTE v=RV.UNIFORM(0,10).
COMPUTE variable=TRUNC(v).
EXECUTE .
DELETE VARIABLES a v.
EXECUTE .

*--------------------Begin Python ---------------------------------------.

AUTORECODE VARIABLES=variable
  /INTO b.

BEGIN PROGRAM.
import spss, spssaux
n = len(spssaux.VariableDict(['b'])[0].ValueLabels)

print(n)

spss.Submit(r"""

VECTOR dummy("""+str(n)+"""). """)

i = 0
while i < n:
    i += 1
    spss.Submit(r"""if (b = """+str(i)+""") dummy"""+str(i)+""" = 1.
    EXECUTE .
    """)

spss.Submit(r"""
RECODE dummy1 to dummy"""+str(n)+""" (MISSING=0).
EXECUTE.
DELETE VARIABLES b.
""")

END PROGRAM.

Dr. Frank Gaeth


Re: automatic dummy coding

Jon K Peck
I'm baffled about exactly what you want to do here.  If you could spell it out, I'm sure that there would be a solution.

Jon Peck
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621





Re: automatic dummy coding

David Marso
Administrator
In reply to this post by drfg2008
Hi Frank,
Observation: You have access to the labels via spssaux.VariableDict(['b'])[0].ValueLabels
but throw them away by only accessing the length of the array (len).
I can't test this but the following idea should work.
<TOTALLY UNTESTED SEAT OF THE PANTS AIRCODE FROM HELL>
AUTORECODE VARIABLES=variable /INTO b.
BEGIN PROGRAM.
import spss, spssaux
# Keep the whole label dictionary instead of only its length
# (haven't touched my Python books in a while).
labels = spssaux.VariableDict(['b'])[0].ValueLabels
n = len(labels)
# DON'T NEED TO LOOP AND COMPARE.  Just slam it in by indexing into the vector.
spss.Submit(["VECTOR dummy(" + str(n) + ").", "COMPUTE dummy(b)=1."])
# A for loop over the dictionary is more elegant than while.
for code, label in labels.items():
    # Double any apostrophes so the label survives SPSS quoting.
    spss.Submit("VARIABLE LABELS dummy" + str(code) + " '" + label.replace("'", "''") + "'.")
END PROGRAM.
</TOTALLY UNTESTED SEAT OF THE PANTS AIRCODE FROM HELL>
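
The quote-resolution step left open in the aircode could look roughly like this. The sketch only builds the syntax text and does not call SPSS; variable_label_syntax is a hypothetical helper, and it assumes the dictionary keys are strings holding whole numbers such as "1", as the later code in this thread suggests.

```python
def variable_label_syntax(prefix, value_labels):
    """Build one VARIABLE LABELS command per dummy variable."""
    commands = []
    for code in sorted(value_labels, key=int):
        # SPSS string literals escape an apostrophe by doubling it.
        label = value_labels[code].replace("'", "''")
        commands.append("VARIABLE LABELS %s%s '%s'." % (prefix, code, label))
    return commands


# Hypothetical stand-in for the dictionary of value labels of the recoded variable.
labels = {"1": "age 18-25", "2": "age 26-30", "3": "don't know"}
commands = variable_label_syntax("dummy", labels)
for command in commands:
    print(command)
```

Each resulting string could then be handed to spss.Submit inside a BEGIN PROGRAM block.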

BTW, you should PURGE your code of the EXECUTE statements.  They are almost always UNNECESSARY!
HTH, David


Re: automatic dummy coding

drfg2008
In reply to this post by Jon K Peck
My question was obviously misleading. Sorry. I just can't find a way to take the value labels of the original variable and use them as variable labels for the dummy variables. I thought there might be a solution in Python using an array (similar to what is described on the SPSS-IBM website, see below).

import spss

array = []
i = 0
FileN = spss.GetVariableCount()  # number of variables in the active dataset (assumed meaning of FileN)
while i < FileN:
    label = spss.GetVariableLabel(i)
    array.append(label)
    i += 1
print(array)


Does something like spss.GetValueLabel(i) exist?


Thanks and sorry again.
Dr. Frank Gaeth


Re: automatic dummy coding

Jon K Peck
vls = spssaux.VariableDict(['x'])[0].ValueLabels
returns a Python dictionary of the value labels for variable x (case sensitive).  You could then iterate through that via
for value, label in vls.items():
and do whatever you want with those labels inside the loop.
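
A minimal sketch of that iteration, runnable outside SPSS: the dictionary vls here is a hypothetical stand-in for what ValueLabels returns, with the values assumed to be stored as strings. Sorting by numeric value is an addition for readability, since dictionary order should not be relied on.

```python
# Hypothetical stand-in for the dictionary returned by ValueLabels.
vls = {"12": "age 26-30", "11": "age 18-25", "14": "age 31-36"}


def labels_in_code_order(value_labels):
    """Return (value, label) pairs sorted by the numeric value of the code."""
    return sorted(value_labels.items(), key=lambda pair: float(pair[0]))


pairs = labels_in_code_order(vls)
for value, label in pairs:
    print(value, label)
```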

You can also use the Dataset class for full dictionary access.

I'm still not sure of the goal, but note that the SPSSINC CREATE DUMMIES will create a vector of dummy variables for a variable and will assign the value labels of the input variable as the variable labels of the output.

HTH,

Jon Peck

Re: automatic dummy coding

drfg2008
In reply to this post by David Marso
Thanks David,

-> Observation: You have access to the labels via spssaux.VariableDict(['b'])[0].ValueLabels
but throw them away by only accessing the length of the array (len).

I didn't even realize that I already had the solution. Thanks for your advice.

Well, here's the result (in case anyone wants to integrate a script for dummy coding and cannot use SPSSINC CREATE DUMMIES). The label replacement is a bit awkward.

Cheers, Frank



*
*
*
*    Dummy-Coding with automatic labelling
*    
*    www.frag-einen-statistiker.de
*
*    Comment: Script starts at: "Begin Python"
*    Change "variable" into the name of the variable that is to be dummy-coded
*    AUTORECODE VARIABLES=variable <-here !
*      /INTO b.
*
*    
*
*------------------- Example Data  ---------------------------------------------.

input program.
loop a =1 to 1000 by 1.
end case.
end loop.
end file.
end input program.
COMPUTE v=RV.UNIFORM(1,5).
COMPUTE variable=TRUNC(v).
EXECUTE .
DELETE VARIABLES a v.
EXECUTE .


*--------------------Begin Python (here starts the script)---------------------------------------.

*GET  FILE='C:\<path>\filename.SAV'.

AUTORECODE VARIABLES=variable
  /INTO b.

BEGIN PROGRAM.
import spss, spssaux
n = len(spssaux.VariableDict(['b'])[0].ValueLabels)
print(n)

spss.Submit(r"""VECTOR dummy("""+str(n)+"""). """)

i = 0
while i < n:
    i += 1
    spss.Submit(r"""if (b = """+str(i)+""") dummy"""+str(i)+""" = 1.
    """)

spss.Submit(r"""
RECODE dummy1 to dummy"""+str(n)+""" (MISSING=0).
EXECUTE.
""")

mydict = spssaux.VariableDict(['b'])[0].ValueLabels

for code, label in mydict.items():
    # print("Code: %s Label: %s" % (code, label))

    label = label.replace(" ", "_")
    label = label.replace("ä", "ae")
    label = label.replace("ö", "oe")
    label = label.replace("ü", "ue")
    label = label.replace("ß", "ss")
    label = label.replace("!", "_")
    label = label.replace("?", "_")
    label = label.replace("µ", "_")
    label = label.replace("$", "S")
    label = label.replace("€", "Eur")
    label = label.replace("@", "a")
    label = label.replace("(", "_")
    label = label.replace(")", "_")
    label = label.replace("[", "_")
    label = label.replace("]", "_")
    label = label.replace("}", "_")
    label = label.replace("{", "_")
    label = label.replace("&", "_")
    label = label.replace("/", "_")
    label = label.replace("%", "_")
    label = label.replace("§", "_")
    label = label.replace("'", "_")
    label = label.replace("*", "_")
    label = label.replace("+", "_")
    label = label.replace("-", "_")
    label = label.replace("#", "_")
    label = label.replace("~", "_")
    label = label.replace(":", "_")
    label = label.replace(",", "_")
    label = label.replace(";", "_")
    label = label.replace("<", "_")
    label = label.replace(">", "_")
    label = label.replace("|", "_")
    label = "v_" + label

    # befehl is German for "command".
    befehl = "RENAME VARIABLES dummy"+code+" = "+label+"."
    print(befehl)

    spss.Submit(befehl)

END PROGRAM.

DELETE VARIABLES b.


Dr. Frank Gaeth


Re: automatic dummy coding

Jon K Peck
Use a regular expression to clean up the invalid characters
import re
label = re.sub(r"[list-all-the-invalid-characters]", "_", label)

Include the square brackets.  That expression means match any occurrence of a character in the list and replace with _.  E.g.,
re.sub(r"[abc]", "_", "aAxxxc")
would return
'_Axxx_'
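
As a rough sketch of how this could replace a long chain of replace calls: the variant below uses a negated character class, so it lists the characters to keep rather than the invalid ones. That is stricter than SPSS naming rules actually require, and label_to_name and its v_ prefix are illustrative choices, not part of any SPSS module.

```python
import re


def label_to_name(label, prefix="v_"):
    """Turn a value label into a conservative variable name."""
    # Replace every run of characters other than A-Z, a-z, 0-9 and _ with one underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_]+", "_", label)
    # Drop leading/trailing underscores, then add a prefix so the name starts with a letter.
    return prefix + cleaned.strip("_")


print(label_to_name("age 18-25"))
print(label_to_name("income (in $)"))
```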

Jon Peck

Re: automatic dummy coding

drfg2008
Thanks, but it doesn't work:

    label = re.sub(r"[!§$%&/()=?´´*#'µ<>-µ€@öäüÖÄÜ^°]", "_", label)


Error message:

Traceback (most recent call last):
  File "<string>", line 25, in <module> 
NameError: name 're' is not defined


Dr. Frank Gaeth


Re: automatic dummy coding

David Marso
Administrator
Frank,
You need the import re (I bet you don't have that in your code).
General observation:  
Variable names can be 64 bytes long.
Value labels can be 120 bytes long.
DING DING DING!!!
Multiple variables may have the same set of value labels.
DING DING DING!!!
A variable named b might already exist in the data file.
DING DING DING!!!
Very long variable names are unwieldy and a total pain in the ass to work with.
DING DING DING!!!
---------
ergo...
Consider scrapping your current RENAME approach and simply apply the VALUE labels to the new variables as VARIABLE LABELS.  
In this case you won't need to 'scrub' the variable names with the re.sub business.
Also, as I previously posted:
AUTORECODE var INTO new.
VECTOR dummy(maxlength_from_python).
COMPUTE dummy(new)=1.
is MUCH MORE EFFICIENT than looping through all possible variables and
submitting IF (new=i) dummy(i)=1 multiple times.
HTH, Have a great weekend.
David
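
Two of the points above (the 64-byte limit on names and the risk that a name such as b already exists) can be handled before any syntax is submitted. The following is a sketch in plain Python under stated assumptions: unique_name and truncate_bytes are hypothetical helpers, the list of existing names would in practice come from the active dataset, and names are compared case-insensitively because SPSS treats them that way.

```python
def truncate_bytes(name, max_bytes):
    """Cut a name to at most max_bytes UTF-8 bytes without splitting a character."""
    return name.encode("utf-8")[:max_bytes].decode("utf-8", "ignore")


def unique_name(base, existing, max_bytes=64):
    """Return a name based on `base` that is not in `existing` and fits the byte limit."""
    taken = set(name.lower() for name in existing)
    candidate = truncate_bytes(base, max_bytes)
    suffix = 0
    while candidate.lower() in taken:
        suffix += 1
        tail = "_%d" % suffix
        # Leave room for the numeric suffix inside the byte limit.
        candidate = truncate_bytes(base, max_bytes - len(tail)) + tail
    return candidate


existing = ["variable", "b", "B_1"]
print(unique_name("b", existing))
print(unique_name("dummy_variable_", existing))
```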


Re: automatic dummy coding

Jon K Peck
In reply to this post by drfg2008
My instruction said to start with
import re
Otherwise you can't use things in the re module.

There is one additional problem.  The hyphen has a special meaning inside square brackets in a regular expression, so you need to put it first
re.sub(r"[-!§$%&/()=?´´*#'µ<>µ€@öäüÖÄÜ^°]", "_", label)

If you are trying to ensure that the result is a legitimate variable name, though, note that all those characters with umlauts are legal in a variable name as are some of the other special characters.
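
The effect of the hyphen position can be checked in plain Python. In this small sketch (Python 3 with Unicode strings assumed; the sample label is made up), the first pattern contains the sequence >-µ, which is read as a range from ">" (U+003E) to "µ" (U+00B5) and therefore covers all ASCII letters.

```python
import re

label = "age 18-25"

# ">-µ" forms a range that includes A-Z and a-z, so the letters are replaced
# while the hyphen itself is left alone.
as_range = re.sub(r"[<>-µ]", "_", label)
print(as_range)

# With the hyphen first it is a literal character and no range is formed.
as_literal = re.sub(r"[-<>µ]", "_", label)
print(as_literal)
```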

Jon Peck

Re: automatic dummy coding

drfg2008
In reply to this post by David Marso
David,
after I read your message, I found myself under the table ( DING DING DING!!! )

Well, this might be a better approach (see below). However, the new version, which uses VECTOR instead of running a loop, produces a warning in some cases, although it works fine.

Happy Pentecost
~~~~~~~~~~~~~~~~~~~

>Warning # 525
>An attempt was made to store a value into an element of a vector the subscript
>of which was missing or otherwise invalid.  The subscript must be a positive
>integer and must not be greater than the length of the vector.  No store can
>occur.
>Command line: 53  Current case: 22  Current splitfile group: 1

>Warning # 92
>The limit of MXWARNS warnings in this data pass has been printed.  Further
>warnings have been suppressed.

~~~~~~~~~~~~~~~~~~~



input program.
loop a =1 to 1000 by 1.
end case.
end loop.
end file.
end input program.
COMPUTE v=RV.UNIFORM(1,5).
COMPUTE variable=TRUNC(v).
EXECUTE .
DELETE VARIABLES a v.
EXECUTE .


*--------------------Begin Python (here starts the script)---------------------------------------.

*GET  FILE='C:\<path>\S3000.SAV'.

BEGIN PROGRAM.
import spss, spssaux

#variable="v179"
variable = "variable"
# laufvariable (German: "loop variable") is a scratch name unlikely to collide with existing variables.
laufvariable = "NBDPJLDAEHJFAOCHDGDFKFIL54848184463901c2dd25340dfcdbdf39"
dummy = "dummy_" + variable + "_"

spss.Submit("AUTORECODE VARIABLES=" + variable + " /INTO " + laufvariable + ".")
# Read the value labels once; they give both the number of dummies and their labels.
mydict = spssaux.VariableDict([laufvariable])[0].ValueLabels
n = len(mydict)

spss.Submit(r"""
VECTOR """ + dummy + """(""" + str(n) + """).
COMPUTE """ + dummy + """(""" + laufvariable + """)=1.
RECODE """ + dummy + """1 to """ + dummy + str(n) + """ (MISSING=0).
EXECUTE.
""")

for code, label in mydict.items():
    # befehl (German: "command"); double apostrophes so the label survives SPSS quoting.
    befehl = "VARIABLE LABELS " + dummy + code + " '" + label.replace("'", "''") + "'."
    print(befehl)
    spss.Submit(befehl)

spss.Submit("DELETE VARIABLES " + laufvariable + ".")

END PROGRAM.
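
Warning 525 quoted above says the vector subscript was missing or otherwise invalid. One possible cause, which is an assumption and not confirmed in this thread, is cases where the recoded variable is missing. A guarded variant of the COMPUTE could be generated as sketched here; guarded_dummy_syntax is a hypothetical helper that only builds the syntax text, and idx is a placeholder for the recoded variable's name.

```python
def guarded_dummy_syntax(dummy, index_var, n):
    """Build syntax that sets dummy(index) only when the index is a valid subscript."""
    return "\n".join([
        "VECTOR %s(%d)." % (dummy, n),
        # RANGE is not true for a missing value, so such cases are skipped.
        "IF (RANGE(%s, 1, %d)) %s(%s) = 1." % (index_var, n, dummy, index_var),
        "RECODE %s1 TO %s%d (MISSING=0)." % (dummy, dummy, n),
    ])


syntax = guarded_dummy_syntax("dummy_variable_", "idx", 4)
print(syntax)
```

The returned text would be passed to spss.Submit in place of the unguarded VECTOR/COMPUTE pair; cases that are skipped still end up with all dummies set to 0 by the RECODE.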


Dr. Frank Gaeth