SPSSX Discussion

Selecting Duplicate Cases (along with their unique pair)

Classic

List

Threaded

9 messages Options

Jignesh Sutar

Selecting Duplicate Cases (along with their unique pair)

Hi,

I use the following syntax to identify duplicate cases:

sort cases by ID.
match files file=* /by ID / first = CheckDedupe.
value labels CheckDedupe 0 'Duplicated record' 1 'Unique record'.
freq CheckDedupe.

What I'd like to do is not only identify the duplicate case with a 1 but also its original (first) record with a 1 also. Even better would be to identify the first pair (or possibly more) of duplicate cases with a 1 then the next set of duplicate cases with 2 and so on (hope that makes sense?)

Any help much appreciated.

Many thanks,

Jignesh

Jon K Peck

Re: Selecting Duplicate Cases (along with their unique pair)

The Identify Duplicate Cases wizard can generate syntax for sequentially numbering the duplicates (MatchSequence).

Jon Peck
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621

From: J Sutar <[hidden email]>
To: [hidden email]
Date: 08/02/2011 07:54 AM
Subject: [SPSSX-L] Selecting Duplicate Cases (along with their unique pair)
Sent by: "SPSSX(r) Discussion" <[hidden email]>

Hi,

I use the following syntax to identify duplicate cases:

sort cases by ID.
match files file=* /by ID / first = CheckDedupe.
value labels CheckDedupe 0 'Duplicated record' 1 'Unique record'.
freq CheckDedupe.

What I'd like to do is not only identify the duplicate case with a 1 but also its original (first) record with a 1 also. Even better would be to identify the first pair (or possibly more) of duplicate cases with a 1 then the next set of duplicate cases with 2 and so on (hope that makes sense?)

Any help much appreciated.

Many thanks,
Jignesh

David Marso

Re: Selecting Duplicate Cases (along with their unique pair)

Administrator

In reply to this post by Jignesh Sutar

DO IF $CASENUM=1 OR ID <> LAG(ID) .
+ COMPUTE YOKEPAIR=1.
ELSE.
+ COMPUTE YOKEPAIR=LAG(YOKEPAIR)+1.
END IF.
COMPUTE YOKEPAIR=TRUNC((YOKEPAIR+1)/2).

J Sutar wrote

Hi,

I use the following syntax to identify duplicate cases:

sort cases by ID.
match files file=* /by ID / first = CheckDedupe.
value labels CheckDedupe 0 'Duplicated record' 1 'Unique record'.
freq CheckDedupe.

What I'd like to do is not only identify the duplicate case with a 1 but
also its original (first) record with a 1 also. Even better would be to
identify the first pair (or possibly more) of duplicate cases with a 1 then
the next set of duplicate cases with 2 and so on (hope that makes sense?)

Any help much appreciated.

Many thanks,
Jignesh

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Bruce Weaver

Re: Selecting Duplicate Cases (along with their unique pair)

Administrator

In reply to this post by Jignesh Sutar

So you really want to create a new ID variable that ranges from 1 to N? How about AUTORECODE?

J Sutar wrote

Hi,

I use the following syntax to identify duplicate cases:

sort cases by ID.
match files file=* /by ID / first = CheckDedupe.
value labels CheckDedupe 0 'Duplicated record' 1 'Unique record'.
freq CheckDedupe.

What I'd like to do is not only identify the duplicate case with a 1 but
also its original (first) record with a 1 also. Even better would be to
identify the first pair (or possibly more) of duplicate cases with a 1 then
the next set of duplicate cases with 2 and so on (hope that makes sense?)

Any help much appreciated.

Many thanks,
Jignesh

--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

David Marso

Re: Selecting Duplicate Cases (along with their unique pair)

Administrator

I read the question as desire to create something like:
ID Yokepair
1 1
1 1
1 2
1 2
2 1
3 1
3 1
3 2
3 2
4 1
4 1
....

Bruce Weaver wrote

So you really want to create a new ID variable that ranges from 1 to N? How about AUTORECODE?

J Sutar wrote

Hi,

I use the following syntax to identify duplicate cases:

sort cases by ID.
match files file=* /by ID / first = CheckDedupe.
value labels CheckDedupe 0 'Duplicated record' 1 'Unique record'.
freq CheckDedupe.

What I'd like to do is not only identify the duplicate case with a 1 but
also its original (first) record with a 1 also. Even better would be to
identify the first pair (or possibly more) of duplicate cases with a 1 then
the next set of duplicate cases with 2 and so on (hope that makes sense?)

Any help much appreciated.

Many thanks,
Jignesh

Bruce Weaver

Re: Selecting Duplicate Cases (along with their unique pair)

Administrator

The inclusion "or possibly more" (see below) made me think it's just a recode of the ID. I think the OP needs to clarify.

"What I'd like to do is not only identify the duplicate case with a 1 but also its original (first) record with a 1 also. Even better would be to identify the first pair (or possibly more) of duplicate cases with a 1 then the next set of duplicate cases with 2 and so on (hope that makes sense?)"

David Marso wrote

I read the question as desire to create something like:
ID Yokepair
1 1
1 1
1 2
1 2
2 1
3 1
3 1
3 2
3 2
4 1
4 1
....

Bruce Weaver wrote

So you really want to create a new ID variable that ranges from 1 to N? How about AUTORECODE?

J Sutar wrote

Hi,

I use the following syntax to identify duplicate cases:

sort cases by ID.
match files file=* /by ID / first = CheckDedupe.
value labels CheckDedupe 0 'Duplicated record' 1 'Unique record'.
freq CheckDedupe.

What I'd like to do is not only identify the duplicate case with a 1 but
also its original (first) record with a 1 also. Even better would be to
identify the first pair (or possibly more) of duplicate cases with a 1 then
the next set of duplicate cases with 2 and so on (hope that makes sense?)

Any help much appreciated.

Many thanks,
Jignesh

Richard Ristow

Re: Selecting Duplicate Cases (along with their unique pair)

In reply to this post by Jignesh Sutar

At 09:42 AM 8/2/2011, J Sutar wrote:

>I use the following syntax to identify duplicate cases:
>
>sort cases by ID.
>match files file=* /by ID / first = CheckDedupe.
>value labels CheckDedupe 0 'Duplicated record' 1 'Unique record'.
>freq CheckDedupe.
>
>What I'd like to do is not only identify the duplicate case with a 1
>but also its original (first) record with a 1 also.

(untested)
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
/BREAK=ID
/CheckDedupe 'Occurrences of this ID in file' = NU.

VALUE LABELS CheckDedupe 1 'ID is unique'.

FREQUENCIES CheckDedupe.

The cumulative count of all CheckDedupe greater than 1 is the number
of *records* belonging to duplicated IDs. (The number of duplicated
IDs, themselves, will be at most half this number, since each
duplicated ID has at least two records.)

>Even better would be to identify the first pair (or possibly more)
>of duplicate cases with a 1 then the next set of duplicate cases
>with 2 and so on (hope that makes sense?)

There are ways of doing this -- some have been posted -- but, why
bother? You already have an identifying variable for the pair (or
triplet, or whatever): variable "ID", itself.

If you want to study the duplicated IDs in your file, as you very
well may, proceed as above and then

SELECT IF CheckDedupe GT 1.

to get all records belonging to duplicated IDs. (Don't do this on the
master copy of your file, of course, as it *loses* the non-duplicated IDs.)

Does this get you closer?

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Rich Ulrich

Re: Selecting Duplicate Cases (along with their unique pair)

In reply to this post by Jon K Peck

I understand the original intention to be that the program
should define multiple pairs from a set of duplicates for one ID.

Jon's result of a sequence 1,2,3,4, ... within an ID
puts you there almost immediately. Just take the pairs
in order: Recode seqvar(1,2=1)(3,4=2)(5,6=3) ... /into seqpairs .

--
Rich Ulrich

Date: Tue, 2 Aug 2011 08:02:38 -0600
From: [hidden email]
Subject: Re: Selecting Duplicate Cases (along with their unique pair)
To: [hidden email]

The Identify Duplicate Cases wizard can generate syntax for sequentially numbering the duplicates (MatchSequence).

Jon Peck
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621

From: J Sutar <[hidden email]>
To: [hidden email]
Date: 08/02/2011 07:54 AM
Subject: [SPSSX-L] Selecting Duplicate Cases (along with their unique pair)
Sent by: "SPSSX(r) Discussion" <[hidden email]>

Jignesh Sutar

Re: Selecting Duplicate Cases (along with their unique pair)

In reply to this post by Richard Ristow

AGGREGATE NU function (and the comments Richard made) captures the answer to the question I posted. Apologies if there was ambiguity.

MatchSequence from Identify Duplicate Cases wizard works just as well with the added bonus in that it provides information on how many duplicate records there are for duplicated IDs.

Many thanks for the responses.

Jignesh

On 2 August 2011 21:12, Richard Ristow <[hidden email]> wrote:

At 09:42 AM 8/2/2011, J Sutar wrote:

I use the following syntax to identify duplicate cases:

sort cases by ID.
match files file=* /by ID / first = CheckDedupe.
value labels CheckDedupe 0 'Duplicated record' 1 'Unique record'.
freq CheckDedupe.

What I'd like to do is not only identify the duplicate case with a 1 but also its original (first) record with a 1 also.

(untested)
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
/BREAK=ID
/CheckDedupe 'Occurrences of this ID in file' = NU.

VALUE LABELS CheckDedupe 1 'ID is unique'.

FREQUENCIES CheckDedupe.

The cumulative count of all CheckDedupe greater than 1 is the number of *records* belonging to duplicated IDs. (The number of duplicated IDs, themselves, will be at most half this number, since each duplicated ID has at least two records.)

Even better would be to identify the first pair (or possibly more) of duplicate cases with a 1 then the next set of duplicate cases with 2 and so on (hope that makes sense?)

There are ways of doing this -- some have been posted -- but, why bother? You already have an identifying variable for the pair (or triplet, or whatever): variable "ID", itself.

If you want to study the duplicated IDs in your file, as you very well may, proceed as above and then

SELECT IF CheckDedupe GT 1.

to get all records belonging to duplicated IDs. (Don't do this on the master copy of your file, of course, as it *loses* the non-duplicated IDs.)

Does this get you closer?