Build variable list of variables containing small number of different values in Python



Ruben Geert van den Berg
Dear all,
 
I want to run some frequencies but I'd like to automatically filter out variables with more than 5 answer categories. I managed to construct Python code to do it but it's rather slow. I think it uses a data pass for every variable in the dataset in order to see how many values each variable contains. Is there any way to evaluate this condition for all variables simultaneously or speed up the code otherwise?
 
The syntax is:

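begin program.
from __future__ import with_statement
import spss, spssdata

varlist = []
# One data pass per variable: read a single column and count its distinct values.
for i in range(spss.GetVariableCount()):
    with spssdata.Spssdata(indexes=[i]) as curs:
        if len(set([j for j in curs])) <= 5:
            varlist.append(spss.GetVariableName(i))
# Only run FREQUENCIES when at least one variable qualified.
if varlist:
    spss.Submit("FREQUENCIES %s." % " ".join(varlist))
end program.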

TIA!

Ruben van den Berg
Consultant Models & Methods
TNS NIPO
Email: [hidden email]
Mobiel: +31 6 24641435
Telefoon: +31 20 522 5738
Internet: www.tns-nipo.com





Re: Build variable list of variables containing small number of different values in Python

Jon K Peck

See below
Jon Peck
SPSS, an IBM Company
[hidden email]
312-651-3435



From: Ruben van den Berg <[hidden email]>
To: [hidden email]
Date: 07/15/2010 08:52 AM
Subject: [SPSSX-L] Build variable list of variables containing small number of different values in Python
Sent by: "SPSSX(r) Discussion" <[hidden email]>






You are doing a separate data pass for each variable. Here is an example that finds the qualifying variables in one data pass and then runs FREQUENCIES in one additional pass. As written, it checks all variables, but you could limit it to a specified set on the vardict = line.

Note that another approach, suitable if the data dictionary is set up appropriately, would be to filter on variable measurement levels and/or on the number of value labels.
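
As an aside on the metadata-based alternative just mentioned (the one-pass data example follows afterwards): the selection logic could look roughly like the sketch below. It is plain Python and never reads the data. The function name candidate_variables and the layout of the metadata dict are hypothetical, chosen for illustration only; in SPSS the measurement levels and value labels would first have to be collected from the dataset dictionary (for example through spssaux.VariableDict), and that step is not shown here.

```python
def candidate_variables(metadata, maxvalues=5, levels=("nominal", "ordinal")):
    """Return names of variables that look categorical from metadata alone.

    metadata maps a variable name to a (measurement_level, value_labels)
    pair, where value_labels is a dict of value -> label. This layout is a
    hypothetical stand-in for what the dataset dictionary would provide.
    """
    selected = []
    for name, (level, labels) in metadata.items():
        # Require at least one label so that unlabeled variables are not
        # mistaken for variables with few categories.
        if level in levels and 0 < len(labels) <= maxvalues:
            selected.append(name)
    return selected


# Made-up metadata for illustration only.
example = {
    "gender": ("nominal", {"f": "Female", "m": "Male"}),
    "jobcat": ("ordinal", {1: "Clerical", 2: "Custodial", 3: "Manager"}),
    "salary": ("scale", {}),
    "id": ("nominal", {}),
}
print(sorted(candidate_variables(example)))
```

This only works as well as the dictionary is maintained: a variable with six categories of which four are labeled would pass the label count check even though it has too many distinct values.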

import spss, spssaux, spssdata

maxvalues = 5  # maximum number of distinct values a variable may have

spssaux.OpenDataFile("c:/spss18/samples/english/employee data.sav")
vardict = spssaux.VariableDict()  # could specify a specific list of variables to process
varcount = len(vardict.variables)
valuesets = [set() for i in range(varcount)]  # one empty set per variable
curs = spssdata.Spssdata(vardict.variables)  # to accommodate specific variables
for case in curs:
    for i in range(varcount):
        if len(valuesets[i]) <= maxvalues:  # stop adding once a set exceeds maxvalues
            valuesets[i].add(case[i])  # accumulate the distinct values seen so far
curs.CClose()

varsforfreq = []
for i in range(varcount):
    if len(valuesets[i]) <= maxvalues:
        varsforfreq.append(vardict.variables[i])
if varsforfreq:
    spss.Submit("FREQ " + " ".join(varsforfreq))
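
To try the one-pass idea outside SPSS, the same logic can be sketched in plain Python over any iterable of cases. The function name low_cardinality_columns and the sample cases are made up for illustration; this is a minimal sketch under those assumptions, not part of the spss, spssaux or spssdata modules.

```python
def low_cardinality_columns(rows, names, maxvalues=5):
    """Return the names whose column holds at most maxvalues distinct values.

    rows is any iterable of equal-length sequences (one per case) and is
    read exactly once.
    """
    valuesets = [set() for _ in names]
    for row in rows:
        for values, value in zip(valuesets, row):
            # Stop collecting once a column has exceeded the limit.
            if len(values) <= maxvalues:
                values.add(value)
    return [name for name, values in zip(names, valuesets)
            if len(values) <= maxvalues]


# Made-up cases for illustration: (gender, jobcat, salary).
cases = [("f", 1, 21000 + 150 * i) for i in range(10)] + [("m", 3, 57000)]
print(low_cardinality_columns(cases, ["gender", "jobcat", "salary"], maxvalues=5))
```

Because each set stops growing once it holds more than maxvalues entries, memory use stays small even for variables with many distinct values, such as identifiers or salaries.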
