|
Dear list!
I have a large file in which the variable names signify, among other things, which cohort the subjects come from. This was done as follows (a few examples):

1. The first character of the variable name (always alphabetical) signifies which area within a large questionnaire the variable refers to, e.g.

A13AB (the first A means background variables/questions)
F130CDEF (the first F means drug abuse variables/questions)

2. The 2nd to at most 4th characters (always numerical) are the number of the question within the area. Using the previous examples:

A13AB (13 means the 13th question among the background variables/questions)
F130CDEF (130 means the 130th question among the drug abuse variables/questions)

3. The rest of the variable name consists of alphabetical characters specifying the cohort year. Using the same examples:

A13AB (AB at the end means that the variable/question is found in cohort A=1985 and cohort B=1989)
F130CDEF (CDEF at the end means that the variable/question is found in cohort C=1993, cohort D=1997, cohort E=2001 and cohort F=2005)

Now I want to use the variable names to make different kinds of selections, for example:

1. All variables for cohort F=2005
2. All variables that occur in both cohort A=1985 and cohort F=2005
3. All variables that occur in all cohorts (ABCDEF at the end of the variable name)
4. All variables within areas A and F

I suspect this demands a Python approach. Having just attended Ray's course in Python, I am still a beginner and have some difficulty seeing the forest for the trees. Perhaps a kind soul could help or point me in the right direction?

best

Staffan Lindberg
Sweden

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
|
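The naming scheme described in the post above can be captured in a small parser. The following is a minimal sketch in plain Python 3, so it can be tried outside SPSS. The names `parse_name`, `ParsedName`, `cohort_years` and `COHORT_YEARS` are made up for illustration, and the regular expression assumes every name is exactly one letter, one to three digits, then one or more cohort letters A to F, as in the examples given.

```python
import re
from collections import namedtuple

# Cohort letters and years as listed in the post.
COHORT_YEARS = {"A": 1985, "B": 1989, "C": 1993, "D": 1997, "E": 2001, "F": 2005}

# Hypothetical container for the three parts of a variable name.
ParsedName = namedtuple("ParsedName", ["area", "question", "cohorts"])

# One area letter, a 1-3 digit question number, then cohort letters A-F.
NAME_RE = re.compile(r"^([A-Za-z])(\d{1,3})([A-Fa-f]+)$")


def parse_name(name):
    """Split a variable name into (area, question number, cohort letters).

    Returns None when the name does not follow the assumed scheme.
    """
    match = NAME_RE.match(name)
    if match is None:
        return None
    area, question, cohorts = match.groups()
    return ParsedName(area.upper(), int(question), cohorts.upper())


def cohort_years(name):
    """Return the list of cohort years encoded at the end of a name."""
    parsed = parse_name(name)
    if parsed is None:
        return []
    return [COHORT_YEARS[letter] for letter in parsed.cohorts]


if __name__ == "__main__":
    for example in ("A13AB", "F130CDEF"):
        print(example, parse_name(example), cohort_years(example))
```

Separating the area letter from the cohort suffix matters here because the same letters (A to F) can appear in both roles, so a plain substring test on the whole name would be ambiguous.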
|
There are two Python technologies that can help a lot with this and let you write very expressive and readable code.
First, the spssaux module VariableDict object accepts regular expression patterns in variable names. So you could construct a (Python) variable dictionary object like this:

    F2005 = spssaux.VariableDict(pattern=r"F.*2005$")

That would create an object with all the variables with names starting with "F" and ending with "2005". Then you could, say, do descriptives by submitting

    spss.Submit("DESC " + " ".join(F2005.variables))

since a VariableDict object has a property, variables, that is a list of all the variable names in that dictionary.

The second technology that addresses queries such as

2. All variables that occur in both cohort A=1985 and cohort F=2005

is Python sets. You could create a VariableDict object for each of the two cohorts, say, A1985 and F2005. Then use set intersection to get the variables in common. The VariableDict objects are designed so that they support set operations based on the variable names.

    AF19852005 = set(A1985).intersection(set(F2005))

That results in a set. The members of the set are still variable objects carrying all the properties of those objects, so you could construct a list of names like this.

    varsInCommon = " ".join([v.VariableName for v in AF19852005])

Using the properties of the objects, you could select, say, all the numeric variables in the set like this.

    numericVarsInCommon = " ".join([v.VariableName for v in AF19852005 if v.VariableType == 0])

Python sets support a full set of set operations including union, intersection, difference, and symmetric difference, so you can write very natural expressions for these queries. Just remember that the set members are variable objects, not just names, so you have to extract the names as above if you are constructing syntax.

You could, alternatively, use the variables property of each dictionary to return ordinary lists of names instead of variable objects.

Note that the pattern expression for selecting variables is not case sensitive.

Watch out, also, for situations where the query returns the empty set.

HTH,
Jon Peck

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Staffan Lindberg
Sent: Tuesday, September 30, 2008 5:11 AM
To: [hidden email]
Subject: [SPSSX-L] Using the variable names to select subsamples (Python help?)
|
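The set idea in the reply above can be tried outside SPSS on a plain list of names. What follows is a minimal sketch in Python 3, assuming the naming scheme from the original question, where the name ends in cohort letters rather than in a year such as "2005", so the pattern differs from the one in the reply. The sample names and the helpers `cohort_set` and `build_descriptives` are hypothetical. Inside SPSS, the list of names would presumably come from the `variables` property of a VariableDict, as described in the reply.

```python
import re

# Hypothetical variable names following the scheme in the question.
names = ["A13AB", "A7ABCDEF", "B2AF", "F130CDEF", "F5F", "C41ABCDE"]

# Area letter, 1-3 digit question number, then the captured cohort letters.
NAME_RE = re.compile(r"^[A-Z]\d{1,3}([A-F]+)$", re.IGNORECASE)


def cohort_set(names, letter):
    """Return the set of names whose cohort suffix contains `letter`."""
    selected = set()
    for name in names:
        match = NAME_RE.match(name)
        if match and letter.upper() in match.group(1).upper():
            selected.add(name)
    return selected


def build_descriptives(selected):
    """Build a DESCRIPTIVES command, or return None for an empty selection."""
    if not selected:
        return None  # nothing to submit; avoids generating invalid syntax
    return "DESCRIPTIVES " + " ".join(sorted(selected)) + "."


cohort_a = cohort_set(names, "A")
cohort_f = cohort_set(names, "F")

# Query 2: variables that occur in both cohort A and cohort F.
in_both = cohort_a & cohort_f
# Variables in cohort F that are not in cohort A.
only_f = cohort_f - cohort_a
# Query 3: variables that occur in all six cohorts.
in_all = set.intersection(*(cohort_set(names, c) for c in "ABCDEF"))

print(build_descriptives(in_both))
```

Because the sets here hold plain names rather than variable objects, they can be joined into syntax directly. The `None` return is one way to handle the empty-set case mentioned in the reply before anything is passed to `spss.Submit`.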
|
Hi Staffan!
Here's what I just cooked up. Sets were also the first thing that came to my mind while I was reading your post, but I don't think you need them in your particular case. I considered using the pattern matching feature of the VariableDict class, but decided that my code was easier, as you want to make different combinations of vars.

Cheers!!
Albert-Jan

begin program.
import spssaux

def subSet(selectionList, position):
    """Print the sorted variable names whose character at `position` is in selectionList."""
    # Lower-case both sides so the comparison is not case sensitive.
    selection = [s.lower() for s in selectionList]
    sourceList = spssaux.GetVariableNamesList()
    # Skip names that are too short to have a character at `position`.
    targetList = sorted(v for v in sourceList
                        if len(v) > position and v[position].lower() in selection)
    print(targetList)

subSet(selectionList=["T", "H", "I", "E", "c"], position=0)
end program.

--- On Tue, 9/30/08, Peck, Jon <[hidden email]> wrote:

> From: Peck, Jon <[hidden email]>
> Subject: Re: Using the variable names to select subsamples (Python help?)
> To: [hidden email]
> Date: Tuesday, September 30, 2008, 3:51 PM
|
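The position-based selection in the program above can also be written as a function that takes the list of names as an argument and returns the result, which makes it easy to try outside SPSS. This is a minimal sketch in Python 3; `select_by_position` and the sample names are hypothetical, and inside SPSS the list would come from `spssaux.GetVariableNamesList()` as in the program above.

```python
def select_by_position(names, selection, position=0):
    """Return the sorted names whose character at `position` is in `selection`.

    The comparison ignores case; names too short to have a character at
    `position` are skipped.
    """
    wanted = {s.lower() for s in selection}
    result = []
    for name in names:
        try:
            char = name[position]
        except IndexError:
            continue
        if char.lower() in wanted:
            result.append(name)
    return sorted(result)


# Hypothetical variable names following the scheme in the question.
names = ["A13AB", "B2AF", "F130CDEF", "C41ABCDE", "F5F"]

# Query 4 from the question: all variables within areas A and F.
print(select_by_position(names, ["A", "F"], position=0))
```

A fixed position works well for the area letter, which is always the first character. Cohort membership is harder to test this way, because the cohort suffix starts at a different index depending on how many digits the question number has, so a single position only ever inspects one letter of it.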
|
Thank you very much Jon and others for your input. You've given me a lot to
experiment with.

best
Staffan
|
