SPSSX Discussion

COMPARE DATASETS - having a problem

Catherine Kubitschek

COMPARE DATASETS - having a problem

Hi, all.

I'm running SPSS 20.0.0.1 (64 bit) on Windows 7.

I'm trying to compare a series of text files using SPSSINC COMPARE DATASETS, and I'm getting this warning: "Duplicate or out of order ID value found in [dataset name]. Processing Stopped." But when I look at the dataset (and when I ask SPSS to look at it), I can't find any duplicates, and I've sorted the table by the ID. Here's my code.

******************************************************************************** .
* check added for searching for duplicates - never found one in these data .
DEFINE !Check4Dups (!POS !CHAREND('|')
  /!POS !CMDEND) .
title !quote(!1) .
get file = !2 .
aggregate outfile = * mode=addvariables /presorted /break primary_key /cnt = n(primary_key).
temporary .
select if cnt > 1 .
list var = primary_key cnt .
title '' .
!ENDDEFINE .

******************************************************************************** .
* compare two files using the primary key as the ID .
DEFINE !Compare_Datasets (!POS !CHAREND('|')
  /!POS !CHAREND('|')
  /!POS !CMDEND).
!let !a2=!concat(!1,"_",!2,"_") .
!let !a3=!concat(!1,"_",!3,"_") .
!let !b=!concat(!1,"_",!2,"_",!3,"_") .
!let !c=!quote(!concat("U:\Logs\VL\VL001\",!b,"diffs.log")) .
get file=!2 .
!Check4Dups !a2 | !2 .
dataset name !a2 .
get file=!3 .
!Check4Dups !a3 | !3 .
dataset name !a3 .
DATASET ACTIVATE !a2 .
SPSSINC COMPARE DATASETS DS2=!a3
  /DATA ID = primary_key DIFFCOUNT=diff_count
  LOGFILE=!c ROOTNAME=!b
  /DICTIONARY ATTRIBUTES FORMAT INDEX MEASLEVEL MISSINGVALUES TYPE VARLABEL VALUELABELS.
dataset close !a2 .
dataset close !a3 .
!ENDDEFINE .

******************************************************************************** .
* Setup for comparing dev to test, dev to prod, test to prod .
DEFINE !Compare_Datasets_3x (!POS !CMDEND).
show $vars .
!Compare_Datasets !1 | dev | test .
show $vars .
!Compare_Datasets !1 | dev | prod .
show $vars .
!Compare_Datasets !1 | test | prod .
show $vars .
!ENDDEFINE .

******************************************************************************** .
* read in the file, compute the primary key, sort by the primary key,
* save the output, check for dups .
DEFINE !Major (!POS !CHAREND('|')) .
!do !x !in (!1) .
!let !a1=!concat("OR9dVL001_Major_",!x) .
DATA LIST FILE = !a1 /
  MAJOR_CODE 1-4 (A) /* MAJOR_CODE */
  COLLEGE_CODE 5-6 (A) /* COLLEGE_CODE */
  MAJOR_DESC 7-56 (A) /* MAJOR_DESC */
  PRIM_MAJOR_IND 57-57 (A) /* PRIM_MAJOR_IND */
  SUPP_2NDMAJOR_IND 58-58 (A) /* SUPP_2NDMAJOR_IND */
  ACTIVE_IND 59-59 (A) /* ACTIVE_IND */
  CIP_CODE 60-66 (A) /* CIP_CODE */
  SEVIS_CODE 67-73 (A) /* SEVIS_CODE */
  SORT_ORDER 74-76 (0) /* SORT_ORDER */
  ANALYSIS_MAJOR_CODE 77-80 (A) /* ANALYSIS_MAJOR_CODE */
  DEGREE_DEPT_CODE 81-84 (A) /* DEGREE_DEPT_CODE */
  DEGREE_COLLEGE_CODE 85-86 (A) /* DEGREE_COLLEGE_CODE */
  MAJOR_DEPT_CODE 87-90 (A) /* MAJOR_DEPT_CODE */
  MAJOR_COLLEGE_CODE 91-92 (A) /* MAJOR_COLLEGE_CODE */
  MAJOR_DESC_SRC 93-122 (A) /* MAJOR_DESC_SRC */
  .
string primary_key (A100) .
compute primary_key = concat(MAJOR_CODE, COLLEGE_CODE) .
sort cases primary_key .
save outfile = !x .
!Check4Dups !a1 | !x .
!doend .
display names .
!ENDDEFINE .

******************************************************************************** .
!Major dev test prod .
!Compare_Datasets_3x Vl_Major .

This has worked fine for the previous 21 text file "trios" but I can't get it to work for this "trio".

This is the first set of text files where the primary key may have a space embedded in the middle of the string (e.g., 'SOC AL'). Is it possible that Python is having a problem with how SPSS sorts spaces?
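A minimal Python sketch of how such a key arises and why two sort orders can disagree about it. It mirrors the DATA LIST layout above (MAJOR_CODE in columns 1-4, COLLEGE_CODE in columns 5-6). The sample records and the space-ignoring sort key are hypothetical; the latter is only a stand-in for a collation that gives blanks little weight, not a description of how Statistics actually sorts.

```python
def build_key(record):
    """Concatenate the two fixed-width fields, as concat(MAJOR_CODE, COLLEGE_CODE) does."""
    major_code = record[0:4]    # columns 1-4, padded with blanks if shorter
    college_code = record[4:6]  # columns 5-6
    return major_code + college_code

# Hypothetical records; only the first six columns matter for the key.
records = ["SOCAAL", "SOC AL", "SOCBAL"]
keys = [build_key(r) for r in records]

# Python compares strings by code point, and a blank (32) sorts before "A" (65).
by_code_point = sorted(keys)

# Stand-in for a collation that ignores blanks (an assumption for illustration).
ignoring_spaces = sorted(keys, key=lambda k: k.replace(" ", ""))

print(by_code_point)
print(ignoring_spaces)
```

If the two lists differ, a file sorted one way would look out of order to code that compares the other way, even though no key is duplicated.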

Any other suggestions for what could be going wrong, what I've done wrong, or what I should try next?

Is there a way to see the Python code for COMPARE DATASETS?

Thanks.

Catherine

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Jon K Peck

Re: COMPARE DATASETS - having a problem

It is conceivable that the Python and Statistics sort order could differ for a string variable if the strings contain extended characters, possibly depending on the locale and Unicode settings. You might be able to tell from the log file where the problem is. If you want to send me the two datasets that are mysteriously failing, I can look at this. The command is distributed in source form, so you can look at the Python code if you want. The error comes from the cases method of the CompareDatasets class in SPSSINC COMPARE DATASETS.py .
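One way to narrow this down outside the extension command is to export the ID column and test it with Python's own string comparison. The sketch below assumes the check amounts to "each ID must be strictly greater than the previous one"; that is inferred from the wording of the warning, not taken from the extension's source, and the function name is hypothetical.

```python
def first_order_violation(ids):
    """Return (index, kind) for the first ID that is not strictly greater
    than its predecessor under Python's code-point comparison, or None."""
    for i in range(1, len(ids)):
        previous, current = ids[i - 1], ids[i]
        if current == previous:
            return i, "duplicate"
        if current < previous:
            return i, "out of order"
    return None

# Hypothetical ID column, in the order the dataset was saved.
ids = ["SOCAAL", "SOC AL", "SOCBAL"]
print(first_order_violation(ids))
```

A result of None means the IDs are already ascending from Python's point of view; an "out of order" result with no duplicates would point to a sort-order mismatch rather than bad data.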

The SPSSINC COMPARE DATASETS command is now obsolete for most purposes, since - in V21 I think - we introduced a native equivalent, but, of course, that wouldn't work with V20.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621

Albert-Jan Roskam-2

Re: COMPARE DATASETS - having a problem

In reply to this post by Catherine Kubitschek

------------------------------
On Wed, Oct 22, 2014 4:43 PM CEST Catherine Kubitschek wrote:

>Is there a way to see the Python code for COMPARE DATASETS?
>

You can open COMPARE_DATASETS.py with, e.g., IDLE. Or you can run:

Begin program.
import inspect,COMPARE_DATASETS
print(inspect.getsource(COMPARE_DATASETS))
End program.
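The same idea as a standalone sketch that also reports where the file lives, which helps if you would rather open it in an editor. It uses the standard-library module textwrap as a stand-in so it runs anywhere; substitute the extension's module name inside a program block. The helper name locate_and_read is hypothetical.

```python
import importlib
import inspect

def locate_and_read(module_name):
    """Import a module by name and return (path to its source file, source text)."""
    module = importlib.import_module(module_name)
    return inspect.getsourcefile(module), inspect.getsource(module)

# "textwrap" is only a placeholder for the module you want to inspect.
path, source = locate_and_read("textwrap")
print(path)
print(source[:200])  # first few lines only
```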
