Using the variable names to select subsamples (Python help?)

Using the variable names to select subsamples (Python help?)

Staffan Lindberg
Dear list!
 
I have a large file in which the variable names signify, among other things,
which cohort the subjects come from. The names were constructed as follows (a
few examples):
 
1. The first character of the variable name (always alphabetical) signifies
which area within a large questionnaire the variable refers to, e.g.
 
A13AB (the first A means background variables/questions)
F130CDEF (the first F means drug abuse variables/questions)
 
2. The 2nd to at most 4th characters (always numerical) are the number of the
question within the area. Using the previous examples:
 
A13AB (13 means the 13th question among the background variables/questions)
F130CDEF (130 means the 130th question among the drug abuse
variables/questions)
 
3. The rest of the variable name consists of alphabetical characters
specifying the cohort years. Using the same examples:
 
A13AB (AB at the end means that the variable/question is found in cohort
A=1985 and cohort B=1989)
F130CDEF (CDEF at the end means that the variable/question is found in cohort
C=1993 and cohort D=1997 and cohort E=2001 and cohort F=2005)
 
Now I want to use the variable names for making different kinds of
selections, for example:
 
1. All variables for cohort F=2005
2. All variables that occur in both cohort A=1985 and cohort F=2005
3. All variables that occur in all cohorts (ABCDEF at the end of the
variable name)
4. All variables within areas A and F
 
I suspect this demands a Python approach. Having just attended Ray's course
in Python I am still a beginner and have some difficulty in seeing the
forest because of all the trees. Perhaps a kind soul could help or point me
in the right direction?
 
best
 
Staffan Lindberg
Sweden
 
 

====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Re: Using the variable names to select subsamples (Python help?)

Peck, Jon
There are two Python technologies that can help a lot with this and let you write very expressive and readable code.

First, the spssaux module VariableDict object accepts a regular expression pattern for selecting variables by name.  So you could construct a (Python) variable dictionary object like this:

F2005 = spssaux.VariableDict(pattern=r"F.*2005$")

That would create an object with all the variables with names starting with "F" and ending with "2005".  Then you could, say, do descriptives by submitting

spss.Submit("DESC " + " ".join(F2005.variables))

since a VariableDict object has a property, variables, that is a list of all the variable names in that dictionary.
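
For the naming scheme described in the original question (area letter, question number of one to three digits, then cohort letters rather than years), the pattern needs adapting. The following is a minimal sketch using only the standard `re` module on a hypothetical list of names. The names and the three patterns are illustrative assumptions, not taken from the actual file. The same pattern strings could presumably be passed as the `pattern` argument of VariableDict, but that is not tested here.

```python
import re

# Hypothetical variable names following the scheme in the original question:
# one area letter, a 1-3 digit question number, then cohort letters A-F.
names = ["A13AB", "F130CDEF", "A7ABCDEF", "B22F", "F5AF"]

# Query 1: variables present in cohort F (an F somewhere in the cohort suffix).
cohort_f = re.compile(r"^[A-Z]\d{1,3}[A-F]*F[A-F]*$", re.IGNORECASE)
# Query 3: variables present in all six cohorts.
all_cohorts = re.compile(r"^[A-Z]\d{1,3}ABCDEF$", re.IGNORECASE)
# Query 4: variables in area A or area F (decided by the first character).
areas_a_f = re.compile(r"^[AF]\d{1,3}[A-F]+$", re.IGNORECASE)

in_cohort_f = [n for n in names if cohort_f.match(n)]
in_all = [n for n in names if all_cohorts.match(n)]
in_areas = [n for n in names if areas_a_f.match(n)]

print(in_cohort_f)
print(in_all)
print(in_areas)
```

Anchoring the digits between the area letter and the cohort suffix keeps an "F" in the first position (area) from being confused with an "F" in the suffix (cohort).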

The second technology, which addresses queries such as "2. All variables that occur in both cohort A=1985 and cohort F=2005", is Python sets.  You could create a VariableDict object for each of the two cohorts, say, A1985 and F2005.  Then use set intersection to get the variables in common.  The VariableDict objects are designed so that they support set operations based on the variable names.

AF19852005 = set(A1985).intersection(set(F2005))

That results in a set.  The members of the set are still variable objects carrying all the properties of those objects, so you could construct a space-separated string of names like this.

varsInCommon = " ".join([v.VariableName for v in AF19852005])

Using the properties of the objects, you could select, say, all the numeric variables in the set like this.

numericVarsInCommon = " ".join([v.VariableName for v in AF19852005 if v.VariableType == 0])

Python sets support a full set of set operations including union, intersection, difference, and symmetric difference, so you can write very natural expressions for these queries.  Just remember that the set members are variable objects, not just names, so you have to extract the names as above if you are constructing syntax.
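
As a rough illustration of those four operations, here is a small sketch on plain sets of names. The two sets are hypothetical stand-ins for the name lists of two per-cohort dictionaries, so no SPSS session is needed to try it.

```python
# Hypothetical name sets, standing in for the variable names of two
# per-cohort dictionaries (cohort A and cohort F).
cohort_a = {"A13AB", "A7ABCDEF", "F5AF"}
cohort_f = {"F130CDEF", "A7ABCDEF", "F5AF", "B22F"}

both = cohort_a & cohort_f          # intersection: in A and in F
either = cohort_a | cohort_f        # union: in A or in F
only_a = cohort_a - cohort_f        # difference: in A but not in F
exactly_one = cohort_a ^ cohort_f   # symmetric difference: in one cohort only

# Sets are unordered, so sort before joining names into syntax.
syntax_names = " ".join(sorted(both))
print(syntax_names)
```

Sorting here gives alphabetical order, not file order. If file order matters, filtering the original name list against the set would preserve it.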

You could, alternatively, use the variables property of each dictionary to return ordinary lists of names instead of variable objects.

Note that the pattern expression for selecting variables is not case sensitive.

Watch out, also, for situations where the query returns the empty set.
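
One way to guard against that, sketched under the assumption that the names have already been collected into a list or set: build the command text in a small helper that returns None when there is nothing to analyze. The helper name `desc_syntax` is hypothetical.

```python
def desc_syntax(names):
    """Return a DESCRIPTIVES command for the names, or None if there are none."""
    names = sorted(names)
    if not names:
        # An empty variable list would produce invalid syntax, so signal it.
        return None
    return "DESCRIPTIVES VARIABLES=" + " ".join(names) + "."

empty_cmd = desc_syntax(set())
real_cmd = desc_syntax({"F5AF", "A7ABCDEF"})

if real_cmd is not None:
    # Inside SPSS this string would be handed to spss.Submit(real_cmd).
    print(real_cmd)
```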

HTH,
Jon Peck


Re: Using the variable names to select subsamples (Python help?)

Albert-Jan Roskam
Hi Staffan!

Here's what I just cooked up. Sets were also the first thing that came to my mind while I was reading your post, but I don't think you need them in your particular case. I considered using the pattern matching feature of the VariableDict class, but decided that my code was easier, as you want to make different combinations of variables.

Cheers!!
Albert-Jan

begin program.
import spssaux

def subSet(selectionList, position):
    """Return sorted names whose character at `position` is in selectionList."""
    # Compare case-insensitively: lowercase both the selection and the name character.
    selection = [s.lower() for s in selectionList]
    sourceList = spssaux.GetVariableNamesList()
    targetList = [v for v in sourceList
                  if len(v) > position and v[position].lower() in selection]
    targetList.sort()
    print(targetList)
    return targetList

subSet(selectionList=["T", "H", "I", "E", "c"], position=0)
end program.
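
A fixed character position works for the area letter, but the cohort letters start at a different position depending on how many digits the question number has. A possible way around that, sketched here on a hypothetical list of names without any SPSS calls, is to split each name into its three parts first. The names `parse_name` and `select` are illustrative, not part of spssaux.

```python
import re

# Area letter, 1-3 digit question number, cohort letters A-F.
NAME_RE = re.compile(r"^([A-Za-z])(\d{1,3})([A-Fa-f]+)$")

def parse_name(name):
    """Split a name into (area, question number, set of cohort letters)."""
    m = NAME_RE.match(name)
    if m is None:
        return None  # name does not follow the scheme
    area, number, cohorts = m.groups()
    return area.upper(), int(number), set(cohorts.upper())

def select(names, areas=None, cohorts=None):
    """Names in one of `areas` whose cohort suffix contains all of `cohorts`."""
    wanted = set(c.upper() for c in (cohorts or ""))
    allowed = set(a.upper() for a in areas) if areas else None
    result = []
    for name in names:
        parsed = parse_name(name)
        if parsed is None:
            continue
        area, _, found = parsed
        if allowed is not None and area not in allowed:
            continue
        if wanted <= found:  # every wanted cohort is present
            result.append(name)
    return result

# Hypothetical names; "id" does not follow the scheme and is skipped.
names = ["A13AB", "F130CDEF", "A7ABCDEF", "B22F", "F5AF", "id"]
print(select(names, cohorts="AF"))      # in both cohort A and cohort F
print(select(names, cohorts="ABCDEF"))  # in all cohorts
print(select(names, areas="AF"))        # in area A or area F
```

Inside SPSS, the list from spssaux.GetVariableNamesList() could take the place of the hypothetical `names` list.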


Re: Using the variable names to select subsamples (Python help?)

Staffan Lindberg
In reply to this post by Peck, Jon
Thank you very much Jon and others for your input. You've given me a lot to
experiment with.

best

Staffan
