SPSSX Discussion

Comparing two datafiles

Classic

List

Threaded

2 messages Options

Ucal Gyam

Comparing two datafiles

Hi ALL,
I have two data files File1.sav and File2.sav .
These files SHOULD have the same number of variables and data values(ie same records).
Is there a way to check if this condition has been met by this two files through syntax.
All suggestios are welcome.
Thanks,
Ucal

__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com

Richard Ristow

Re: Comparing two datafiles

At 10:59 AM 5/3/2007, Ucal Gyam wrote:

>I have two data files File1.sav and File2.sav. These files SHOULD
>have the same number of variables and data values(ie same records). Is
>there a way to check if this condition has been met through syntax.

The short answer is that, at this point, there isn't, not in base SPSS.
You're not the first person who's wanted it, either, not by a good long
ways.

In the meantime, a few ways have been worked out. Try the AGGREGATE
method first.

All these methods start by using ADD FILES to interleave the files. All
assume that there is a unique key (possibly made up of more than one
variable) for the records in the files; in the following, I'll call it
RcrdKey.

To start with, the ADD FILES. (None of this code is tested.)

ADD FILES
/FILE='file1.sav' /* Add directory path as needed
/IN=File1
/FILE='file2.sav'
/IN=File2
/FIRST = FrstOccr
/LAST = LastOccr
/BY RcrdKey.

Everything below assumes you're working with this interleaved file.

A. AGGREGATE with lots of keys. (I regret that I can't find a citation,
so I don't know whom to credit.)

AGGREGATE
/OUTFILE=*
/BREAK = RcrdKey <all data variables>
/File1 'Occurrences in file 1' = SUM(File1)
/File2 'Occurrences in file 2' = SUM(File2).
FORMATS
File1 File2 (F3).

* If a record is in both files without errors, it .
* occurs exactly once in each file. .
SELECT IF File1 NE 1
OR File2 NE 1.

Now, you have all records from both files that don't match exactly with
exactly one record from the other file.

I don't know whether there's a practical limit on the number of
variables on a BREAK list for AGGREGATE. If there is, it could give you
trouble if your file has enough variables.

This AGGREGATE will have many break groups (at least one for every
record in the input files). If the files are large, say over 500,000
records, this form of AGGREGATE may be faster. (For smaller files, it
will be SLOWER.)

SORT CASES BY RcrdKey <all data variables>

AGGREGATE
/OUTFILE=*
/PRESORTED
/BREAK = RcrdKey <all data variables>
/File1 'Occurrences in file 1' = SUM(File1)
/File2 'Occurrences in file 2' = SUM(File2).
FORMATS
File1 File2 (F3).

Finally,
- If there's a discrepancy, this doesn't tell you WHICH variables
differ between the files.

B. Compare all variables using DO REPEAT. (This will go a lot better if
you have Python look in the data dictionary and generate the syntax.
Or, you write the data dictionary to an external file, and generate the
syntax with SPSS code. Even then, you have to write the code-generating
code.)

STRING PROBLEM (A30).

DO IF FrstOccr and LastOccr.
. IF File1 PROBLEM = 'Only in file 1'.
. IF File2 PROBLEM = 'Only in file 2'.
* You check using LAG, so you don't check the first .
* record with a key value, you check later one(s) .
ELSE IF NOT FrstOccr.
* First, check for keys duplicated within a file .
. IF File1 EQ 1
AND LAG(File1) EQ 1
PROBLEM = 'Duplicated in file 1'.
. IF File2 EQ 1
AND LAG(File2) EQ 1
PROBLEM = 'Duplicated in file 2'.
. NUMERIC #DataDsc (F3).
. COMPUTE #DataDsc = 0.
* Between files, check for discrepancies in data .
. DO IF PROBLEM = ' '.
. DO REPEAT VAR = <all data variables>.
. IF MISSING(VAR)
NE MISSING(LAG(VAR)) #DataDsc = #DataDsc + 1.
. IF VAR
NE LAG(VAR) #DataDsc = #DataDsc + 1.
. END REPEAT.
. IF #DataDsc GET
PROBLEM = CONCAT(STRING(#DataDsc,F3),
' data
discrepancies'.
. END IF.
END IF.

SELECT IF PROBLEM NE ' '.
LIST RcrdKey PROBLEM.

Comments:
- Function MISSING is invalid for strings longer than 8 characters.
You'll get an error message for every one of those in your files. I
think the test will still work, though.
- If there's a discrepancy, this doesn't tell you WHICH variables
differ between the files. (If you generate the code, with Python or
otherwise, you can include a list of the variable names as strings.
Then, you can report which variables are discrepant.)

C. VARSTOCASES. This is very neat, but it only works if all your data
variables are numeric (date variables are numeric), or all are strings.
Not very likely. (You can generate separate MAKE subcommands for
strings and Vnumbers, but only if you have lists of your variables of
each kind.)

Check for records found in only one file, or duplicated within a file,
before you do the VARSTOCASES.

VARSTOCASES
/MAKE Value FROM <all data variables>
/INDEX = VblName(Value).

SORT CASES BY RcrdKey VblName (A)
File1 (D) File2 (A).

NUMERIC TROUBLE (F2).
COMPUTE TROUBLE = 0.
DO IF RcrdKey EQ LAG(RcrdKey)
AND VblName EQ LAG(VblName).
. NUMERIC Value1 Value2 (F10).
. DO IF Value NE LAG(Value).
. COMPUTE TROUBLE = 1.
. COMPUTE Value1 = LAG(Value).
. COMPUTE Value2 = Value.
. END IF.
END IF.

SELECT IF TROUBLE.
LIST.