COMPARE DATASETS - having a problem

Catherine Kubitschek
Hi, all.

I'm running SPSS 20.0.0.1 (64 bit) on Windows 7.

I'm trying to compare a series of text files using COMPARE DATASETS, and I keep getting this warning: "Duplicate or out of order ID value found in [dataset name]. Processing Stopped." However, when I look at the dataset (and when I ask SPSS to check it), I can't find any duplicates, and I've sorted the table by the ID. Here's my code.

******************************************************************************** .
* check added for searching for duplicates - never found one in these data .
DEFINE !Check4Dups (!POS !CHAREND('|')
                   /!POS !CMDEND) .
title !quote(!1) .
get file = !2 .
aggregate outfile = * mode=addvariables /presorted /break primary_key /cnt = n(primary_key).
temporary .
select if cnt > 1 .
list var = primary_key cnt .
title '' .
!ENDDEFINE .
******************************************************************************** .
* compare two files using the primary key as the ID .
DEFINE !Compare_Datasets (!POS !CHAREND('|')
                         /!POS !CHAREND('|')
                         /!POS !CMDEND).
!let !a2=!concat(!1,"_",!2,"_") .
!let !a3=!concat(!1,"_",!3,"_") .
!let !b=!concat(!1,"_",!2,"_",!3,"_") .
!let !c=!quote(!concat("U:\Logs\VL\VL001\",!b,"diffs.log")) .
get file=!2 .
!Check4Dups !a2 | !2 .
dataset name !a2 .
get file=!3 .
!Check4Dups !a3 | !3 .
dataset name !a3 .
DATASET ACTIVATE !a2 .
SPSSINC COMPARE DATASETS  DS2=!a3 
/DATA ID = primary_key DIFFCOUNT=diff_count 
    LOGFILE=!c ROOTNAME=!b
/DICTIONARY ATTRIBUTES FORMAT INDEX MEASLEVEL MISSINGVALUES TYPE VARLABEL VALUELABELS.
dataset close !a2 .
dataset close !a3 .
!ENDDEFINE .
******************************************************************************** .
* Setup for comparing dev to test, dev to prod, test to prod .
DEFINE !Compare_Datasets_3x (!POS !CMDEND).
show $vars .
!Compare_Datasets !1 | dev  | test .
show $vars .
!Compare_Datasets !1 | dev  | prod .
show $vars .
!Compare_Datasets !1 | test | prod .
show $vars .
!ENDDEFINE .
******************************************************************************** .
* read in the file, compute the primary key, sort by the primary key,
* save the output, check for dups .
DEFINE !Major (!POS !CHAREND('|')) .
!do !x !in (!1) .
!let !a1=!concat("OR9dVL001_Major_",!x) .
DATA LIST FILE = !a1 /
    MAJOR_CODE           1-4 (A) /* MAJOR_CODE */
    COLLEGE_CODE         5-6 (A) /* COLLEGE_CODE */
    MAJOR_DESC           7-56 (A) /* MAJOR_DESC */
    PRIM_MAJOR_IND       57-57 (A) /* PRIM_MAJOR_IND */
    SUPP_2NDMAJOR_IND    58-58 (A) /* SUPP_2NDMAJOR_IND */
    ACTIVE_IND           59-59 (A) /* ACTIVE_IND */
    CIP_CODE             60-66 (A) /* CIP_CODE */
    SEVIS_CODE           67-73 (A) /* SEVIS_CODE */
    SORT_ORDER           74-76 (0) /* SORT_ORDER */
    ANALYSIS_MAJOR_CODE  77-80 (A) /* ANALYSIS_MAJOR_CODE */
    DEGREE_DEPT_CODE     81-84 (A) /* DEGREE_DEPT_CODE */
    DEGREE_COLLEGE_CODE  85-86 (A) /* DEGREE_COLLEGE_CODE */
    MAJOR_DEPT_CODE      87-90 (A) /* MAJOR_DEPT_CODE */
    MAJOR_COLLEGE_CODE   91-92 (A) /* MAJOR_COLLEGE_CODE */
    MAJOR_DESC_SRC       93-122 (A) /* MAJOR_DESC_SRC */
.
string primary_key (A100) .
compute primary_key = concat(MAJOR_CODE, COLLEGE_CODE) .
sort cases primary_key .
save outfile = !x .
!Check4Dups !a1 | !x .
!doend .
display names .
!ENDDEFINE .
******************************************************************************** .
!Major dev test prod .
!Compare_Datasets_3x Vl_Major .

This has worked fine for the previous 21 text file "trios" but I can't get it to work for this "trio".

This is the first set of text files where the primary key may have a space embedded in the middle of the string (e.g., 'SOC AL'). Is it possible that Python is having a problem with how SPSS sorts spaces?

Any other suggestions for what could be going wrong, what I've done wrong, or what I should try next?

Is there a way to see the Python code for COMPARE DATASETS?

Thanks.

Catherine
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Re: COMPARE DATASETS - having a problem

Jon K Peck
It is conceivable that the Python and Statistics sort order could differ for a string variable if the strings contain extended characters, possibly depending on the locale and Unicode settings.  You might be able to tell from the log file where the problem is.  If you want to send me the two datasets that are mysteriously failing, I can look at this.  The command is distributed in source form, so you can look at the Python code if you want.  The error comes from the cases method of the CompareDatasets class in SPSSINC COMPARE DATASETS.py .
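
As a rough illustration of the sort-order point above (this is not the extension's own code), the sketch below checks whether a list of ID strings is strictly ascending under Python's plain code-point comparison, and reports the first pair that is a duplicate or out of order. The function name and the sample keys are hypothetical; only the 'SOC AL' value comes from the original post.

```python
def first_order_violation(keys):
    """Return (index, previous, current) for the first key that is not
    strictly greater than its predecessor, or None if the sequence is
    strictly ascending under Python's code-point string comparison."""
    for i in range(1, len(keys)):
        if keys[i] <= keys[i - 1]:
            return i, keys[i - 1], keys[i]
    return None


# Hypothetical keys. In a code-point comparison a space (U+0020) sorts
# before every letter, so "SOC AL" and "SOC BA" both precede "SOCIAL".
keys_codepoint = sorted(["SOCIAL", "SOC AL", "SOC BA"])
print(keys_codepoint)
print(first_order_violation(keys_codepoint))

# A hypothetical ordering that a collation ignoring spaces could produce.
# Python sees "SOC BA" as smaller than "SOCAM", so it flags position 2.
keys_other = ["SOC AL", "SOCAM", "SOC BA"]
print(first_order_violation(keys_other))
```

If the IDs exported from the sorted dataset fail a check like this, the sort order rather than a true duplicate would be the likely cause of the warning.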

The SPSSINC COMPARE DATASETS command is now obsolete for most purposes, since - in V21 I think - we introduced a native equivalent, but, of course, that wouldn't work with V20.


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621

Re: COMPARE DATASETS - having a problem

Albert-Jan Roskam-2
In reply to this post by Catherine Kubitschek
------------------------------
On Wed, Oct 22, 2014 4:43 PM CEST Catherine Kubitschek wrote:


>Is there a way to see the Python code for COMPARE DATASETS?
>

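You can open SPSSINC_COMPARE_DATASETS.py with, e.g., IDLE. Or you can run this (the module name uses underscores, since a Python module name cannot contain spaces):

Begin program.
import inspect, SPSSINC_COMPARE_DATASETS
print inspect.getsource(SPSSINC_COMPARE_DATASETS)
End program.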
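
The same idea can be sketched so that it also reports where the source file lives, which helps if you would rather open it in an editor. This is a minimal sketch: the helper name is hypothetical, and the standard-library module textwrap stands in for the extension module so the example runs outside SPSS. Inside a program block you would pass the extension's module name instead, assuming it is importable there.

```python
import importlib
import inspect


def module_source_info(module_name):
    """Import a module by name and return a tuple of
    (path to its source file, full source text)."""
    module = importlib.import_module(module_name)
    return inspect.getsourcefile(module), inspect.getsource(module)


# "textwrap" is only a stand-in so the sketch is runnable anywhere.
path, source = module_source_info("textwrap")
print(path)          # location of the .py file on disk
print(source[:200])  # first few lines of the source
```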
