Selecting Duplicate Cases (along with their unique pair)

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

Selecting Duplicate Cases (along with their unique pair)

Jignesh Sutar
Hi,
 
I use the following syntax to identify duplicate cases:
 
sort cases by ID.
match files file=* /by ID / first = CheckDedupe.
value labels CheckDedupe  0 'Duplicated record'  1 'Unique record'.
freq CheckDedupe.
 
What I'd like to do is not only identify the duplicate case with a 1 but also its original (first) record with a 1 also. Even better would be to identify the first pair (or possibly more) of duplicate cases with a 1 then the next set of duplicate cases with 2 and so on (hope that makes sense?)
 
Any help much appreciated.
 
Many thanks,
Jignesh
Reply | Threaded
Open this post in threaded view
|

Re: Selecting Duplicate Cases (along with their unique pair)

Jon K Peck
The Identify Duplicate Cases wizard can generate syntax for sequentially numbering the duplicates (MatchSequence).

Jon Peck
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621




From:        J Sutar <[hidden email]>
To:        [hidden email]
Date:        08/02/2011 07:54 AM
Subject:        [SPSSX-L] Selecting Duplicate Cases (along with their unique pair)
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




Hi,
 
I use the following syntax to identify duplicate cases:
 
sort cases by ID.
match files file=* /by ID / first = CheckDedupe.
value labels CheckDedupe  0 'Duplicated record'  1 'Unique record'.
freq CheckDedupe.

 
What I'd like to do is not only identify the duplicate case with a 1 but also its original (first) record with a 1 also. Even better would be to identify the first pair (or possibly more) of duplicate cases with a 1 then the next set of duplicate cases with 2 and so on (hope that makes sense?)
 
Any help much appreciated.
 
Many thanks,
Jignesh
Reply | Threaded
Open this post in threaded view
|

Re: Selecting Duplicate Cases (along with their unique pair)

David Marso
Administrator
In reply to this post by Jignesh Sutar
DO IF $CASENUM=1 OR ID <> LAG(ID)  .
+  COMPUTE YOKEPAIR=1.
ELSE.
+  COMPUTE YOKEPAIR=LAG(YOKEPAIR)+1.
END IF.
COMPUTE YOKEPAIR=TRUNC((YOKEPAIR+1)/2).

J Sutar wrote
Hi,

I use the following syntax to identify duplicate cases:

sort cases by ID.
match files file=* /by ID / first = CheckDedupe.
value labels CheckDedupe  0 'Duplicated record'  1 'Unique record'.
freq CheckDedupe.

What I'd like to do is not only identify the duplicate case with a 1 but
also its original (first) record with a 1 also. Even better would be to
identify the first pair (or possibly more) of duplicate cases with a 1 then
the next set of duplicate cases with 2 and so on (hope that makes sense?)

Any help much appreciated.

Many thanks,
Jignesh
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Selecting Duplicate Cases (along with their unique pair)

Bruce Weaver
Administrator
In reply to this post by Jignesh Sutar
So you really want to create a new ID variable that ranges from 1 to N?  How about AUTORECODE?


J Sutar wrote
Hi,

I use the following syntax to identify duplicate cases:

sort cases by ID.
match files file=* /by ID / first = CheckDedupe.
value labels CheckDedupe  0 'Duplicated record'  1 'Unique record'.
freq CheckDedupe.

What I'd like to do is not only identify the duplicate case with a 1 but
also its original (first) record with a 1 also. Even better would be to
identify the first pair (or possibly more) of duplicate cases with a 1 then
the next set of duplicate cases with 2 and so on (hope that makes sense?)

Any help much appreciated.

Many thanks,
Jignesh
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
Reply | Threaded
Open this post in threaded view
|

Re: Selecting Duplicate Cases (along with their unique pair)

David Marso
Administrator

I read the question as desire to create something like:
ID Yokepair
1  1
1  1
1  2
1  2
2  1
3  1
3  1
3  2
3  2
4  1
4  1
....
Bruce Weaver wrote
So you really want to create a new ID variable that ranges from 1 to N?  How about AUTORECODE?


J Sutar wrote
Hi,

I use the following syntax to identify duplicate cases:

sort cases by ID.
match files file=* /by ID / first = CheckDedupe.
value labels CheckDedupe  0 'Duplicated record'  1 'Unique record'.
freq CheckDedupe.

What I'd like to do is not only identify the duplicate case with a 1 but
also its original (first) record with a 1 also. Even better would be to
identify the first pair (or possibly more) of duplicate cases with a 1 then
the next set of duplicate cases with 2 and so on (hope that makes sense?)

Any help much appreciated.

Many thanks,
Jignesh
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Selecting Duplicate Cases (along with their unique pair)

Bruce Weaver
Administrator
The inclusion "or possibly more" (see below) made me think it's just a recode of the ID.  I think the OP needs to clarify.

"What I'd like to do is not only identify the duplicate case with a 1 but also its original (first) record with a 1 also. Even better would be to identify the first pair (or possibly more) of duplicate cases with a 1 then the next set of duplicate cases with 2 and so on (hope that makes sense?)"


David Marso wrote
I read the question as desire to create something like:
ID Yokepair
1  1
1  1
1  2
1  2
2  1
3  1
3  1
3  2
3  2
4  1
4  1
....
Bruce Weaver wrote
So you really want to create a new ID variable that ranges from 1 to N?  How about AUTORECODE?


J Sutar wrote
Hi,

I use the following syntax to identify duplicate cases:

sort cases by ID.
match files file=* /by ID / first = CheckDedupe.
value labels CheckDedupe  0 'Duplicated record'  1 'Unique record'.
freq CheckDedupe.

What I'd like to do is not only identify the duplicate case with a 1 but
also its original (first) record with a 1 also. Even better would be to
identify the first pair (or possibly more) of duplicate cases with a 1 then
the next set of duplicate cases with 2 and so on (hope that makes sense?)

Any help much appreciated.

Many thanks,
Jignesh
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
Reply | Threaded
Open this post in threaded view
|

Re: Selecting Duplicate Cases (along with their unique pair)

Richard Ristow
In reply to this post by Jignesh Sutar
At 09:42 AM 8/2/2011, J Sutar wrote:

>I use the following syntax to identify duplicate cases:
>
>sort cases by ID.
>match files file=* /by ID / first = CheckDedupe.
>value labels CheckDedupe  0 'Duplicated record'  1 'Unique record'.
>freq CheckDedupe.
>
>What I'd like to do is not only identify the duplicate case with a 1
>but also its original (first) record with a 1 also.

(untested)
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
    /BREAK=ID
    /CheckDedupe  'Occurrences of this ID in file' = NU.

VALUE LABELS CheckDedupe 1 'ID is unique'.

FREQUENCIES  CheckDedupe.

The cumulative count of all CheckDedupe greater than 1 is the number
of *records* belonging to duplicated IDs. (The number of duplicated
IDs, themselves, will be at most half this number, since each
duplicated ID has at least two records.)

>Even better would be to identify the first pair (or possibly more)
>of duplicate cases with a 1 then the next set of duplicate cases
>with 2 and so on (hope that makes sense?)

There are ways of doing this -- some have been posted -- but, why
bother? You already have an identifying variable for the pair (or
triplet, or whatever): variable "ID", itself.

If you want to study the duplicated IDs in your file, as you very
well may, proceed as above and then

SELECT IF CheckDedupe GT 1.

to get all records belonging to duplicated IDs. (Don't do this on the
master copy of your file, of course, as it *loses* the non-duplicated IDs.)

Does this get you closer?

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Selecting Duplicate Cases (along with their unique pair)

Rich Ulrich
In reply to this post by Jon K Peck
I understand the original intention to be that the program
should define multiple pairs from a set of duplicates for one ID.

Jon's result of a sequence 1,2,3,4, ...   within an ID
puts you there almost immediately.  Just take the pairs
in order:  Recode  seqvar(1,2=1)(3,4=2)(5,6=3) ...   /into seqpairs .

--
Rich Ulrich



Date: Tue, 2 Aug 2011 08:02:38 -0600
From: [hidden email]
Subject: Re: Selecting Duplicate Cases (along with their unique pair)
To: [hidden email]

The Identify Duplicate Cases wizard can generate syntax for sequentially numbering the duplicates (MatchSequence).

Jon Peck
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621




From:        J Sutar <[hidden email]>
To:        [hidden email]
Date:        08/02/2011 07:54 AM
Subject:        [SPSSX-L] Selecting Duplicate Cases (along with their unique pair)
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




Hi,
 
I use the following syntax to identify duplicate cases:
 
sort cases by ID.
match files file=* /by ID / first = CheckDedupe.
value labels CheckDedupe  0 'Duplicated record'  1 'Unique record'.
freq CheckDedupe.

 
What I'd like to do is not only identify the duplicate case with a 1 but also its original (first) record with a 1 also. Even better would be to identify the first pair (or possibly more) of duplicate cases with a 1 then the next set of duplicate cases with 2 and so on (hope that makes sense?)
 
Any help much appreciated.
 
Many thanks,
Jignesh
Reply | Threaded
Open this post in threaded view
|

Re: Selecting Duplicate Cases (along with their unique pair)

Jignesh Sutar
In reply to this post by Richard Ristow
AGGREGATE NU function (and the comments Richard made) captures the answer to the question I posted. Apologies if there was ambiguity.
 
MatchSequence from Identify Duplicate Cases wizard works just as well with the added bonus in that it provides information on how many duplicate records there are for duplicated IDs.
 
Many thanks for the responses.
 
Jignesh

 
On 2 August 2011 21:12, Richard Ristow <[hidden email]> wrote:
At 09:42 AM 8/2/2011, J Sutar wrote:

I use the following syntax to identify duplicate cases:

sort cases by ID.
match files file=* /by ID / first = CheckDedupe.
value labels CheckDedupe  0 'Duplicated record'  1 'Unique record'.
freq CheckDedupe.

What I'd like to do is not only identify the duplicate case with a 1 but also its original (first) record with a 1 also.

(untested)
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK=ID
  /CheckDedupe  'Occurrences of this ID in file' = NU.

VALUE LABELS CheckDedupe 1 'ID is unique'.

FREQUENCIES  CheckDedupe.

The cumulative count of all CheckDedupe greater than 1 is the number of *records* belonging to duplicated IDs. (The number of duplicated IDs, themselves, will be at most half this number, since each duplicated ID has at least two records.)


Even better would be to identify the first pair (or possibly more) of duplicate cases with a 1 then the next set of duplicate cases with 2 and so on (hope that makes sense?)

There are ways of doing this -- some have been posted -- but, why bother? You already have an identifying variable for the pair (or triplet, or whatever): variable "ID", itself.

If you want to study the duplicated IDs in your file, as you very well may, proceed as above and then

SELECT IF CheckDedupe GT 1.

to get all records belonging to duplicated IDs. (Don't do this on the master copy of your file, of course, as it *loses* the non-duplicated IDs.)

Does this get you closer?