Has anybody already worked this out?
CaseIDs are supposed to be unique. Identify Duplicate Cases shows that is not so. A variable "Match Sequence" has been added to the file. There are dozens of variables, strings of varying widths mixed with numbers and dates. I would like to see which variables are NOT the same.

For the instance of *pairs* of almost-duplicates, something like this:

DO IF MatchSequence GE 2.
  PRINT /CaseID.
  DO REPEAT MyVar = varlist.
    DO IF MyVar NE Lag(MyVar).
      PRINT /VarName MyVar Lag(MyVar).
    END IF.
  END REPEAT.
END IF.

Obviously PRINT would not output VarName or Lag(MyVar). It is also not possible in syntax to put Lag(Var) into a new variable because of the different types and string widths.

Another way to think of it is to pull sets of cases that are a group in Identify Duplicate Cases as a matrix with as many rows as group members and columns for all the variables. Then transpose that matrix, keeping type and string width formatting, and indicate where there are differences.

-----
Art Kendall
Social Research Consultants
--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command.
To leave the list, send the command SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command INFO REFCARD
Is this a similar problem? Years ago, I dealt with data that were keyed in
by two people, instead of the key-punch-style key-and-verify. I needed to
confirm that two records were identical, or point to the differences... which
then required selection or correction.
Printing out "variables" wasn't a solution where there were 80 columns that might
or might not be 1-column each. What I did for an ID match where the data differed
was to print the first line of data as-is, and the second line as blanks except where
there was a difference. That gave me one line for each set of errors for a line.
I adapted that once for some SPSS data set, and here is what I reconstruct as likely,
and how I would fit it for your problem.
Match Files for one file, marking /First and /Last for each ID. Execute.
COMMENT if both /First and /Last, there is not a duplicate.
XSAVE to write out a file where the ID is /First and not /Last;
XSAVE to write out a file where the ID is not /First; RENAME as you save,
to ID and var1 to var25 (or however many). Execute.
Match Files with a File for /first and Table for others (notFirst).
Do Repeat to compare the var1 to var25 to the /First rec; set to Sysmis if matching.
SAVE but only save the ID and var1 to var25 -- use the original data list to RENAME
as you save, restoring the original set of names.
You now will have a file of notFirst, with duplicated VALUES replaced by Sysmis.
Merge (sort) that with the file of First, and you have something like what I had with
lines of data and lines that were mostly blank.
If these are too many vars for LIST CASES to show them all on one line, do several LIST CASES.
If this doesn't fit your problem .... well, maybe the techniques (renaming; printing "." to
show a null field) will be useful to someone, some time.
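
The steps above can be sketched outside SPSS. What follows is a minimal illustration in plain Python, not SPSS syntax, and not necessarily how the original job was coded. The function name `list_duplicate_differences`, the field names, and the sample records are all made up for the example. It keeps the first record of each duplicated ID as-is and, for each later record with the same ID, replaces every value that matches the first record with ".", so only the differences stand out.

```python
from collections import Counter


def list_duplicate_differences(records, id_key, blank="."):
    """List duplicated IDs: first record unchanged, later records blanked
    wherever they agree with the first record for that ID."""
    counts = Counter(rec[id_key] for rec in records)
    first_by_id = {}
    listing = []
    for rec in records:
        key = rec[id_key]
        if counts[key] < 2:
            continue  # both "first" and "last" for its ID: not a duplicate
        if key not in first_by_id:
            first_by_id[key] = rec
            listing.append(dict(rec))  # the "first" record, printed as-is
            continue
        first = first_by_id[key]
        listing.append({
            name: value if name == id_key or value != first[name] else blank
            for name, value in rec.items()
        })
    # Stable sort: each first record stays ahead of its near-duplicates.
    listing.sort(key=lambda row: row[id_key])
    return listing


# Hypothetical sample data, mixing strings, numbers and a date-like string.
cases = [
    {"CaseID": 1, "name": "apple", "qty": 4, "when": "2020-07-20"},
    {"CaseID": 2, "name": "pear", "qty": 7, "when": "2020-07-21"},
    {"CaseID": 1, "name": "apple", "qty": 9, "when": "2020-07-20"},
]
for row in list_duplicate_differences(cases, "CaseID"):
    print(row)
# {'CaseID': 1, 'name': 'apple', 'qty': 4, 'when': '2020-07-20'}
# {'CaseID': 1, 'name': '.', 'qty': 9, 'when': '.'}
```

Because Python values carry their own types, the renaming to var1 to var25 is not needed here; in SPSS that step is what lets the first and later records sit side by side in one case.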
--
Rich Ulrich
From: SPSSX(r) Discussion <[hidden email]> on behalf of Art Kendall <[hidden email]>
Sent: Monday, July 20, 2020 5:33 PM
To: [hidden email] <[hidden email]>
Subject: Print a vertical list of MyVar and lag(MyVar)
Remember the COMPARE DATASETS command. It can compare data and/or metadata in two datasets.
Head slap. Thanks for reminding me.
This works, BUT not for casing (i.e., upper vs. lower case).

* First dataset.
NEW FILE.
DATA LIST LIST /ID (f2) NV1 TO NV4 (4f1) SV1 to SV4 (4a8).
BEGIN DATA
01 1 2 3 4 apple pear Orange Peach
02 4 5 6 7 pear grapes '' ''
03 4 3 2 1 peach lemon apple lime
04 1 1 1 1 '' '' grapes mango
END DATA.
DATASET NAME Set1.
EXECUTE.

* Second dataset, with some values and some casing changed.
DATA LIST LIST /ID (f2) NV1 TO NV4 (4f1) SV1 to SV4 (4a8).
BEGIN DATA
01 1 2 3 9 Apple pear orange Peach
02 4 9 6 7 pear grAPes '' ''
03 4 3 1 2 peach Lemon apple lime
04 1 7 7 1 '' '' lime manGO
END DATA.
DATASET NAME Set2.
EXECUTE.

* Sort both datasets by ID, then compare Set1 (active) against Set2.
DATASET ACTIVATE Set1.
DATASET ACTIVATE Set2.
SORT CASES BY ID .
DATASET ACTIVATE Set1 WINDOW=ASIS.
SORT CASES BY ID .
COMPARE DATASETS
  /COMPDATASET = Set2
  /VARIABLES NV1 NV2 NV3 NV4 SV1 SV2 SV3 SV4
  /CASEID ID
  /SAVE FLAGMISMATCHES=YES VARNAME=CasesCompare MATCHDATASET=NO MISMATCHDATASET=NO
  /OUTPUT VARPROPERTIES=NONE CASETABLE=YES TABLELIMIT=100.

-----
Art Kendall
Social Research Consultants
This is a different problem than the first one. And a messed up dataset.
I would probably start by running a quick edit that converts all upper case
to lower case, and stash the original files in a directory called RAW. Here is
an instance where I would make data changes and keep the original var names.
Or I would change the var names in RAW if there was a risk that someone
might try to use them.
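
As a rough illustration of the "convert to lower case first" idea, here is a small sketch in plain Python (not SPSS syntax). The helper names `fold_case` and `differing_fields` are invented for the example. The two records are the ID 01 rows from Set1 and Set2 in the example data posted earlier in the thread. The originals are left untouched, in the spirit of keeping a RAW copy; only copies are lower-cased.

```python
def fold_case(record):
    """Return a copy of the record with every string value lower-cased."""
    return {name: value.lower() if isinstance(value, str) else value
            for name, value in record.items()}


def differing_fields(a, b):
    """Names of the fields whose values differ between two records."""
    return [name for name in a if a[name] != b[name]]


# ID 01 as it appears in Set1 and in Set2 of the posted example.
set1_id01 = {"ID": 1, "NV1": 1, "NV2": 2, "NV3": 3, "NV4": 4,
             "SV1": "apple", "SV2": "pear", "SV3": "Orange", "SV4": "Peach"}
set2_id01 = {"ID": 1, "NV1": 1, "NV2": 2, "NV3": 3, "NV4": 9,
             "SV1": "Apple", "SV2": "pear", "SV3": "orange", "SV4": "Peach"}

# Raw comparison flags the casing differences as well as the real one.
print(differing_fields(set1_id01, set2_id01))
# ['NV4', 'SV1', 'SV3']

# After folding case on copies, only the genuine difference remains.
print(differing_fields(fold_case(set1_id01), fold_case(set2_id01)))
# ['NV4']
```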
Here's a more messed up example, one dataset I was asked to trouble-shoot.
The data entry had been done by some program that allowed "back-space" as
a legal character (!). Some names and addresses (wide fields) had
{bad char, BS, good char} entered in place of {good char}. When you LOOKED
at a listing already displayed on the screen, it looked fine, but Sorting or matching
went bad. What made it easier to find the problem in 1995 was that writing to
the screen was just slow enough that I could see the blip when a few characters
were written, then back-spaced over, before the correct characters. And I
recognized the symptom, from previous experiences with BS.
A friend once received a file that puzzled him with a half-similar problem. The
first variable was an ASCII id variable. For reasons mysterious, those had all
been created with the actual id started with a NULL
--
Rich Ulrich
From: SPSSX(r) Discussion <[hidden email]> on behalf of Art Kendall <[hidden email]>
Sent: Wednesday, July 22, 2020 12:27 PM
To: [hidden email] <[hidden email]>
Subject: Re: Print a vertical list of MyVar and lag(MyVar)
In general, for problems where there are nonprinting characters in string data, it can be useful to change the format to AHEX and then look for bytes with a value less than 20 (hexadecimal, i.e., anything below the space character).
Good hint for working in SPSS. The hard part is recalling that "non-printing
characters" might be the cause of the problem. We old-timers have an
advantage there, from old, weird experiences when people wrote out files
using Fortran, Cobol and the like.
Tabs can look like blanks. I would not expect other problems from files from
Excel or other ordinary sources of today.
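
For anyone who wants to see what the hex view buys you, here is a small sketch in plain Python (not SPSS syntax). The function names `hex_view` and `control_bytes` and the sample strings are made up for the illustration, and UTF-8 encoding is an assumption. It prints a string as hex bytes, roughly what an AHEX display would show, and reports every byte below hex 20, which covers tab (09), backspace (08), and the line-ending characters CR (0D) and LF (0A).

```python
def hex_view(text, encoding="utf-8"):
    """Show the encoded bytes of a string as two-digit hex values."""
    return " ".join(format(byte, "02X") for byte in text.encode(encoding))


def control_bytes(text, encoding="utf-8"):
    """Return (offset, hex) pairs for every byte below hex 20."""
    return [(offset, format(byte, "02X"))
            for offset, byte in enumerate(text.encode(encoding))
            if byte < 0x20]


# Hypothetical name keyed as {bad char, backspace, good char}.
name = "Smij\x08th"
print(hex_view(name))       # 53 6D 69 6A 08 74 68
print(control_bytes(name))  # [(4, '08')]

# A tab that would look like a blank in a listing.
print(control_bytes("a\tb"))  # [(1, '09')]

# A clean value reports nothing.
print(control_bytes("pear"))  # []
```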
--
Rich Ulrich
From: Jon Peck <[hidden email]>
Sent: Wednesday, July 22, 2020 1:47 PM
To: Rich Ulrich <[hidden email]>
Cc: SPSS List <[hidden email]>
Subject: Re: [SPSSX-L] Print a vertical list of MyVar and lag(MyVar)
Line ending characters such as CR and LF in the data can still cause problems for users.