Compare variable names between two files


J P
Hello,

Context: Survey data are collected every year over multiple years, and each year minor changes are made to the survey (questions are added or dropped). Most of the variables (~80%) do not change. It would be very helpful to automate the process of comparing variable names between two (or more) data files so that others and I can more easily determine which variables are candidates for longitudinal analysis.

I have looked into the Compare Datasets command, but the focus of that procedure seems to be on comparing records & data for specified variables, not the variables per se. I can always wrangle the data file info into Excel and make side-by-side comparisons, which would be fine, but with over 800 variables per file human error is inevitable.

Any assistance is greatly appreciated.

Thanks,
John
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Re: Compare variable names between two files

David Marso
Administrator
* Build two small example datasets whose variable lists differ.
NEW FILE.
DATASET CLOSE ALL.
DATA LIST FREE/a b c f g h k l m.
BEGIN DATA
1 2 3 4 5 6 7 8 9
END DATA.
DATASET NAME data1.

DATA LIST FREE/a b c d e g h k l.
BEGIN DATA
1 2 3 4 5 6 7 8 9
END DATA.
DATASET NAME data2.

* Keep one case, then FLIP so that each variable becomes a row named in CASE_LBL.
DATASET ACTIVATE data1.
SELECT IF $CASENUM EQ 1.
FLIP.
SORT CASES BY CASE_LBL.
DATASET NAME flipped1.
DATASET ACTIVATE data2.
SELECT IF $CASENUM EQ 1.
FLIP.
SORT CASES BY CASE_LBL.
DATASET NAME flipped2.

* Merge the two name lists. In1 and In2 flag which file each name came from.
MATCH FILES FILE=flipped1/IN=In1/ FILE=flipped2/IN=In2/BY CASE_LBL.
EXECUTE.
DATASET NAME compare.
LIST.
 
CASE_LBL   var001 In1 In2
 
a            1.00  1   1
b            2.00  1   1
c            3.00  1   1
d            4.00  0   1
e            5.00  0   1
f            4.00  1   0
g            5.00  1   1
h            6.00  1   1
k            7.00  1   1
l            8.00  1   1
m            9.00  1   0
 
 
Number of cases read:  11    Number of cases listed:  11
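
The In1/In2 flags in the listing above amount to a membership test per variable name. Here is a minimal sketch of the same table in plain Python 3, assuming the variable names have already been extracted into two lists. The lists below only mirror the toy data, and presence_table is a hypothetical helper, not an SPSS function.

```python
def presence_table(names1, names2):
    """Return sorted (name, in1, in2) rows, mirroring the MATCH FILES /IN flags."""
    set1, set2 = set(names1), set(names2)
    return [(name, int(name in set1), int(name in set2))
            for name in sorted(set1 | set2)]

# Toy variable lists copied from the two DATA LIST commands above.
data1_vars = "a b c f g h k l m".split()
data2_vars = "a b c d e g h k l".split()

rows = presence_table(data1_vars, data2_vars)
for name, in1, in2 in rows:
    print("%-10s %3d %3d" % (name, in1, in2))
```

Rows with both flags set to 1 are the candidates for longitudinal analysis. Unlike the FLIP approach, this sketch carries no data values, only names.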
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Re: Compare variable names between two files

Jon K Peck
In reply to this post by J P
You can sort of get this information by specifying that only one case be compared, but the following Python code compares the variable names in the active dataset with the names in an external sav file. It prints a sorted list of the names found in both files and a second list of the names found in only one or the other.


begin program.
import spss, spssaux, textwrap

def comparenames(externalfile):
   # names in the currently active dataset
   activenames = set(spssaux.VariableDict().variables)
   # opening the external file makes it the active dataset
   spssaux.OpenDataFile(externalfile)
   externalnames = set(spssaux.VariableDict().variables)
   common = " ".join(sorted(activenames.intersection(externalnames)))
   disjoint = " ".join(sorted(activenames.symmetric_difference(externalnames)))
   # single-string print calls work under both Python 2 and Python 3
   print("\nVariables in Common\n" + "\n".join(textwrap.wrap(common, 100)))
   print("\nVariables in only one set\n" + "\n".join(textwrap.wrap(disjoint, 100)))

# invoke function with the name of the external file (using /, not backslash)
comparenames("c:/spss22/samples/english/employee data.sav")
end program.
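
Since the original question mentions two or more files, the same set logic can be extended to several survey years. The following is only a sketch in plain Python 3, run outside SPSS. compare_many, the year labels, and the variable names are hypothetical, and in practice each list could come from spssaux.VariableDict().variables as in the program above.

```python
def compare_many(names_by_file):
    """names_by_file maps a label (e.g. a survey year) to a list of variable names.

    Returns (in_all, partial): names present in every file, and a dict mapping
    each remaining name to the sorted labels of the files that contain it.
    """
    sets = {label: set(names) for label, names in names_by_file.items()}
    all_names = set().union(*sets.values())
    in_all = sorted(n for n in all_names if all(n in s for s in sets.values()))
    partial = {n: sorted(label for label, s in sets.items() if n in s)
               for n in sorted(all_names) if n not in in_all}
    return in_all, partial

# Hypothetical example: three survey years with small changes between them.
waves = {
    "2012": ["id", "age", "q1", "q2"],
    "2013": ["id", "age", "q1", "q3"],
    "2014": ["id", "age", "q3", "q4"],
}
in_all, partial = compare_many(waves)
print("In every file:", " ".join(in_all))
for name, labels in partial.items():
    print(name, "only in", ", ".join(labels))
```

Names in in_all appear in every year. The partial dict shows which years carry each of the other names, which helps when a question was dropped and later reinstated.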

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621




From:        J P <[hidden email]>
To:        [hidden email]
Date:        09/17/2014 09:29 AM
Subject:        [SPSSX-L] Compare variable names between two files
Sent by:        "SPSSX(r) Discussion" <[hidden email]>



