SPSSX Discussion

Print a vertical list of MyVar and lag(MyVar)

Art Kendall

Print a vertical list of MyVar and lag(MyVar)

Has anybody already worked this out?

CaseIDs are supposed to be unique.
Identify Duplicate Cases shows that they are not.
A variable "Match Sequence" has been added to the file.

There are dozens of variables, strings of varying widths mixed with numbers
and dates.

I would like to see which variables are NOT the same.

For *pairs* of near-duplicates, something like this:

DO IF MatchSequence GE 2.
    PRINT /CaseID.
    DO REPEAT MyVar= varlist.
        DO IF MyVar NE Lag(MyVar).
            PRINT /VarName MyVar Lag(MyVar).
        END IF.
    END REPEAT.
END IF.

Obviously PRINT would not output VarName or Lag(MyVar).
It is also not possible in syntax to put Lag(Var) into a new variable
because of the different types and string widths.

Another way to think of it is to pull sets of cases that are a group in
Identify Duplicate Cases as a matrix with as many rows as group members and
columns for all the variables.
Then transpose that matrix keeping type and string width formatting and
indicate where there are differences.

-----
Art Kendall
Social Research Consultants
--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD


Rich Ulrich

Re: Print a vertical list of MyVar and lag(MyVar)

Is this a similar problem? Years ago, I dealt with data that were keyed in by two people, instead of the key-punch-style key-and-verify. I needed to confirm that two records were identical, or point to the differences... which then required selection or correction.

Printing out "variables" wasn't a solution where there were 80 columns that might or might not be 1-column each. What I did for an ID match where the data differed was to print the first line of data as-is, and the second line as blanks except where there was a difference. That gave me one line for each set of errors for a line.

I adapted that once for some SPSS data set, and here is what I reconstruct as likely, and how I would fit it for your problem.

Match Files for one file, marking /First and /Last for each ID. Execute.

COMMENT if both /First and /Last, there is not a duplicate.

XSAVE to write out a file where the ID is /First and not /Last;

XSAVE to write out a file where the ID is not /First; RENAME as you save, to ID and var1 to var25 (or however many). Execute.

Match Files with a File for /first and Table for others (notFirst).

Do Repeat to compare the var1 to var25 to the /First rec; set to Sysmis if matching.

SAVE but only save the ID and var1 to var25 -- use the original data list to RENAME as you save, restoring the original set of names.

You now will have a file of notFirst, with duplicated VALUES replaced by Sysmis.

Merge (sort) that with the file of First, and you have something like what I had with lines of data and lines that were mostly blank.

If these are too many vars for LIST CASES to show them all on one line, do several LIST CASES.

If this doesn't fit your problem .... well, maybe the techniques (renaming; printing "." to show a null field) will be useful to someone, some time.
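
The first step of this recipe (marking /First and /Last for each ID) could be sketched roughly as below. This is only an illustration, not the syntax Rich actually ran: CaseID is the key named in the original post, while isFirst, isLast and isDup are hypothetical variable names chosen for the example.

```spss
* Sketch only. CaseID comes from the original post; isFirst, isLast and isDup are hypothetical names.
SORT CASES BY CaseID.
MATCH FILES /FILE=*
  /BY CaseID
  /FIRST=isFirst
  /LAST=isLast.
* A case that is both first and last within its ID group has no duplicate.
COMPUTE isDup = NOT (isFirst = 1 AND isLast = 1).
EXECUTE.
* List only the cases that belong to a duplicated ID, with their flags.
TEMPORARY.
SELECT IF isDup = 1.
LIST VARIABLES=CaseID isFirst isLast.
```

The later steps (writing the first and non-first records to separate files, renaming, and blanking matching values) would build on these flags.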

Rich Ulrich

From: SPSSX(r) Discussion <[hidden email]> on behalf of Art Kendall <[hidden email]>
Sent: Monday, July 20, 2020 5:33 PM
To: [hidden email] <[hidden email]>
Subject: Print a vertical list of MyVar and lag(MyVar)

Jon Peck

Re: Print a vertical list of MyVar and lag(MyVar)

Remember the COMPARE DATASETS command. It can compare data and/or metadata in two datasets.

On Wed, Jul 22, 2020 at 12:55 AM Rich Ulrich <[hidden email]> wrote:

Jon K Peck
[hidden email]

Art Kendall

Re: Print a vertical list of MyVar and lag(MyVar)

In reply to this post by Rich Ulrich

Head slap. Thanks for reminding me.

This works, BUT not for casing (i.e., upper vs. lower case).

NEW FILE.
DATA LIST LIST /ID (f2) NV1 TO NV4 (4f1) SV1 to SV4 (4a8).
BEGIN DATA
01 1 2 3 4 apple pear Orange Peach
02 4 5 6 7 pear grapes '' ''
03 4 3 2 1 peach lemon apple lime
04 1 1 1 1 '' '' grapes mango
END DATA.
DATASET NAME Set1.
EXECUTE.

DATA LIST LIST /ID (f2) NV1 TO NV4 (4f1) SV1 to SV4 (4a8).
BEGIN DATA
01 1 2 3 9 Apple pear orange Peach
02 4 9 6 7 pear grAPes '' ''
03 4 3 1 2 peach Lemon apple lime
04 1 7 7 1 '' '' lime manGO
END DATA.
DATASET NAME Set2.
EXECUTE.

* Both datasets must be sorted by the case ID before they are compared.
DATASET ACTIVATE Set2.
SORT CASES BY ID.

DATASET ACTIVATE Set1 WINDOW=ASIS.
SORT CASES BY ID.

* Compare the active dataset (Set1) with Set2, matching cases on ID.
COMPARE DATASETS
  /COMPDATASET = Set2
  /VARIABLES NV1 NV2 NV3 NV4 SV1 SV2 SV3 SV4
  /CASEID ID
  /SAVE FLAGMISMATCHES=YES VARNAME=CasesCompare MATCHDATASET=NO
    MISMATCHDATASET=NO
  /OUTPUT VARPROPERTIES=NONE CASETABLE=YES TABLELIMIT=100.

-----
Art Kendall
Social Research Consultants

Rich Ulrich

Re: Print a vertical list of MyVar and lag(MyVar)

This is a different problem than the first one. And a messed up dataset.

I would probably start by running a quick edit that converts all upper case to lower case, and stash the original files in a directory called RAW. Here is an instance where I would make data changes and keep the original var names. Or I would change the var names in RAW if there was a risk that someone might try to use them.
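
A minimal sketch of that lower-casing edit, applied to the Set1 and Set2 example datasets from the previous message. It works on dataset copies instead of stashing files in a RAW directory, which is a substitution made for the example; Set1Lower and Set2Lower are hypothetical names.

```spss
* Sketch only. Set1 and Set2 are the example datasets; Set1Lower and Set2Lower are hypothetical names.
DATASET ACTIVATE Set1.
DATASET COPY Set1Lower.
DATASET ACTIVATE Set1Lower.
* Fold every string variable to lower case so that case alone no longer distinguishes values.
DO REPEAT s = SV1 TO SV4.
  COMPUTE s = LOWER(s).
END REPEAT.
SORT CASES BY ID.

DATASET ACTIVATE Set2.
DATASET COPY Set2Lower.
DATASET ACTIVATE Set2Lower.
DO REPEAT s = SV1 TO SV4.
  COMPUTE s = LOWER(s).
END REPEAT.
SORT CASES BY ID.
```

The comparison could then be rerun with Set1Lower active and Set2Lower as the comparison dataset, leaving the originals untouched.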

Here's a more messed up example, one dataset I was asked to trouble-shoot. The data entry had been done by some program that allowed "back-space" as a legal character (!). Some names and addresses (wide fields) had {bad char, BS, good char} entered in place of {good char}. When you LOOKED at a listing already displayed on the screen, it looked fine, but Sorting or matching went bad. What made it easier to find the problem in 1995 was that writing to the screen was just slow enough that I could see the blip when a few characters were written, then back-spaced over, before the correct characters. And I recognized the symptom, from previous experiences with BS.

A friend once received a file that puzzled him with a half-similar problem. The first variable was an ASCII id variable. For reasons mysterious, those had all been created with the actual id started with a NULL

Rich Ulrich

From: SPSSX(r) Discussion <[hidden email]> on behalf of Art Kendall <[hidden email]>
Sent: Wednesday, July 22, 2020 12:27 PM
To: [hidden email] <[hidden email]>
Subject: Re: Print a vertical list of MyVar and lag(MyVar)

Jon Peck

Re: Print a vertical list of MyVar and lag(MyVar)

In general, for problems where there are nonprinting characters in string data, it can be useful to change the format to AHEX and then look for bytes with a value less than hex 20 (that is, anything below the space character).
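
One way this might be tried on the Set1 example data is sketched below. It assumes that FORMATS will switch a string variable from A to AHEX when the AHEX width is twice the string width (16 for the A8 variables here); that is an assumption to check against the Command Syntax Reference. HexView is a hypothetical dataset name.

```spss
* Sketch only. Assumes FORMATS accepts AHEX at twice the string width; HexView is a hypothetical name.
DATASET ACTIVATE Set1.
DATASET COPY HexView.
DATASET ACTIVATE HexView.
* Each character of an A8 string is shown as two hex digits, so the display width is 16.
FORMATS SV1 TO SV4 (AHEX16).
LIST VARIABLES=ID SV1 TO SV4.
```

In such a listing, trailing padding shows up as 20 (space), while values such as 00 (NUL), 08 (backspace), 09 (tab), 0A (LF) or 0D (CR) point to nonprinting characters.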

On Wed, Jul 22, 2020 at 11:40 AM Rich Ulrich <[hidden email]> wrote:

Jon K Peck
[hidden email]

Rich Ulrich

Re: Print a vertical list of MyVar and lag(MyVar)

Good hint for working in SPSS. The hard part is recalling that "non-printing characters" might be the cause of the problem. We old-timers have an advantage there, from old, weird experiences when people wrote out files using Fortran, Cobol and the like.

Tabs can look like blanks. I would not expect other problems from files from Excel or other ordinary sources of today.

Rich Ulrich

From: Jon Peck <[hidden email]>
Sent: Wednesday, July 22, 2020 1:47 PM
To: Rich Ulrich <[hidden email]>
Cc: SPSS List <[hidden email]>
Subject: Re: [SPSSX-L] Print a vertical list of MyVar and lag(MyVar)

Jon Peck

Re: Print a vertical list of MyVar and lag(MyVar)

Line ending characters such as CR and LF in the data can still cause problems for users.

On Wed, Jul 22, 2020 at 1:44 PM Rich Ulrich <[hidden email]> wrote:

Jon K Peck
[hidden email]