|
Hi ALL,
I have two data files File1.sav and File2.sav . These files SHOULD have the same number of variables and data values(ie same records). Is there a way to check if this condition has been met by this two files through syntax. All suggestios are welcome. Thanks, Ucal __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com |
|
At 10:59 AM 5/3/2007, Ucal Gyam wrote:
>I have two data files File1.sav and File2.sav. These files SHOULD >have the same number of variables and data values(ie same records). Is >there a way to check if this condition has been met through syntax. The short answer is that, at this point, there isn't, not in base SPSS. You're not the first person who's wanted it, either, not by a good long ways. In the meantime, a few ways have been worked out. Try the AGGREGATE method first. All these methods start by using ADD FILES to interleave the files. All assume that there is a unique key (possibly made up of more than one variable) for the records in the files; in the following, I'll call it RcrdKey. To start with, the ADD FILES. (None of this code is tested.) ADD FILES /FILE='file1.sav' /* Add directory path as needed /IN=File1 /FILE='file2.sav' /IN=File2 /FIRST = FrstOccr /LAST = LastOccr /BY RcrdKey. Everything below assumes you're working with this interleaved file. A. AGGREGATE with lots of keys. (I regret that I can't find a citation, so I don't know whom to credit.) AGGREGATE /OUTFILE=* /BREAK = RcrdKey <all data variables> /File1 'Occurrences in file 1' = SUM(File1) /File2 'Occurrences in file 2' = SUM(File2). FORMATS File1 File2 (F3). * If a record is in both files without errors, it . * occurs exactly once in each file. . SELECT IF File1 NE 1 OR File2 NE 1. Now, you have all records from both files that don't match exactly with exactly one record from the other file. I don't know whether there's a practical limit on the number of variables on a BREAK list for AGGREGATE. If there is, it could give you trouble if your file has enough variables. This AGGREGATE will have many break groups (at least one for every record in the input files). If the files are large, say over 500,000 records, this form of AGGREGATE may be faster. (For smaller files, it will be SLOWER.) SORT CASES BY RcrdKey <all data variables> AGGREGATE /OUTFILE=* /PRESORTED /BREAK = RcrdKey <all data variables> /File1 'Occurrences in file 1' = SUM(File1) /File2 'Occurrences in file 2' = SUM(File2). FORMATS File1 File2 (F3). Finally, - If there's a discrepancy, this doesn't tell you WHICH variables differ between the files. B. Compare all variables using DO REPEAT. (This will go a lot better if you have Python look in the data dictionary and generate the syntax. Or, you write the data dictionary to an external file, and generate the syntax with SPSS code. Even then, you have to write the code-generating code.) STRING PROBLEM (A30). DO IF FrstOccr and LastOccr. . IF File1 PROBLEM = 'Only in file 1'. . IF File2 PROBLEM = 'Only in file 2'. * You check using LAG, so you don't check the first . * record with a key value, you check later one(s) . ELSE IF NOT FrstOccr. * First, check for keys duplicated within a file . . IF File1 EQ 1 AND LAG(File1) EQ 1 PROBLEM = 'Duplicated in file 1'. . IF File2 EQ 1 AND LAG(File2) EQ 1 PROBLEM = 'Duplicated in file 2'. . NUMERIC #DataDsc (F3). . COMPUTE #DataDsc = 0. * Between files, check for discrepancies in data . . DO IF PROBLEM = ' '. . DO REPEAT VAR = <all data variables>. . IF MISSING(VAR) NE MISSING(LAG(VAR)) #DataDsc = #DataDsc + 1. . IF VAR NE LAG(VAR) #DataDsc = #DataDsc + 1. . END REPEAT. . IF #DataDsc GET PROBLEM = CONCAT(STRING(#DataDsc,F3), ' data discrepancies'. . END IF. END IF. SELECT IF PROBLEM NE ' '. LIST RcrdKey PROBLEM. Comments: - Function MISSING is invalid for strings longer than 8 characters. You'll get an error message for every one of those in your files. I think the test will still work, though. - If there's a discrepancy, this doesn't tell you WHICH variables differ between the files. (If you generate the code, with Python or otherwise, you can include a list of the variable names as strings. Then, you can report which variables are discrepant.) C. VARSTOCASES. This is very neat, but it only works if all your data variables are numeric (date variables are numeric), or all are strings. Not very likely. (You can generate separate MAKE subcommands for strings and Vnumbers, but only if you have lists of your variables of each kind.) Check for records found in only one file, or duplicated within a file, before you do the VARSTOCASES. VARSTOCASES /MAKE Value FROM <all data variables> /INDEX = VblName(Value). SORT CASES BY RcrdKey VblName (A) File1 (D) File2 (A). NUMERIC TROUBLE (F2). COMPUTE TROUBLE = 0. DO IF RcrdKey EQ LAG(RcrdKey) AND VblName EQ LAG(VblName). . NUMERIC Value1 Value2 (F10). . DO IF Value NE LAG(Value). . COMPUTE TROUBLE = 1. . COMPUTE Value1 = LAG(Value). . COMPUTE Value2 = Value. . END IF. END IF. SELECT IF TROUBLE. LIST. |
| Free forum by Nabble | Edit this page |
