Identify duplicate case plus original

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Identify duplicate case plus original

Mike Pritchard
I'm trying to figure out a way to identify both the original plus the
duplicate case.  The next step would be to identify which survey was entered
first, and delete the other, but I think I need to get both cases first.

I've been using LAG, and successfully generated a numeric or string
variable.  Here's the string version.

STRING MultiParent (A20).
SORT CASES BY RecordID (A).
COMPUTE MultiParent = STRING(RecordID,F8.0).
IF LAG(RecordID) = RecordID MultiParent=CONCAT(STRING(RecordID,F8.0),"
2nd").
EXECUTE.

As a temporary workaround, I'm using "SORT CASES BY MultiParent Vdatesub"
which is OK because there are (so far) only a few duplicates.  But is there
a better way?

Thanks!
Mike

_________________________________________________________________________
Mike Pritchard | [hidden email] | 5 Circles Research | 425-444-3410 (c)
| 425-968-3883 (o)

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Identify duplicate case plus original

Zdaniuk, Bozena-3
Hm, and why exactly are you not using the "identify duplicate id" function?
bozena

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Mike Pritchard
Sent: October 28, 2015 11:16 AM
To: [hidden email]
Subject: Identify duplicate case plus original

I'm trying to figure out a way to identify both the original plus the duplicate case.  The next step would be to identify which survey was entered first, and delete the other, but I think I need to get both cases first.

I've been using LAG, and successfully generated a numeric or string variable.  Here's the string version.

STRING MultiParent (A20).
SORT CASES BY RecordID (A).
COMPUTE MultiParent = STRING(RecordID,F8.0).
IF LAG(RecordID) = RecordID MultiParent=CONCAT(STRING(RecordID,F8.0),"
2nd").
EXECUTE.

As a temporary workaround, I'm using "SORT CASES BY MultiParent Vdatesub"
which is OK because there are (so far) only a few duplicates.  But is there a better way?

Thanks!
Mike

_________________________________________________________________________
Mike Pritchard | [hidden email] | 5 Circles Research | 425-444-3410 (c)
| 425-968-3883 (o)

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Identify duplicate case plus original

Art Kendall
In reply to this post by Mike Pritchard
Do you have a date-of-entry variable? Are you using double entry for  quality control?

What do you want to do if the two entries of what is supposed to be identical data in fact differs?

Do you have access to original forms, papers, etc.  that is being entered?

Please describe your situation in more detail.

As another reply said, did you try "identify duplicate cases"?
Art Kendall
Social Research Consultants
Reply | Threaded
Open this post in threaded view
|

Re: Identify duplicate case plus original

Richard Ristow
In reply to this post by Mike Pritchard
At 02:16 PM 10/28/2015, Mike Pritchard wrote:

>I'm trying to figure out a way to identify both the original plus the
>duplicate case.  The next step would be to identify which survey was entered
>first, and delete the other, but I think I need to get both cases first.

Would this (untested) do what you want?  It keeps only the record
with the lowest value of Vdatesub among those with the same
RecordID.  (It assumes that Vdatesub is an SPSS date variable, or is
in some other form that sorts with earliest value first.)  Adding
variable NInstances, to flag records that occur in duplicate, is optional.

AGGREGATE OUTFILE=*
   /BREAK = RecordID
   /NInstances 'Instances of this value of RecordID' = NU
   /Earliest   'Earliest date for this RecordID' = MIN(Vdatesub).

Then, to keep only the earliest (but if there are several records on
the earliest date, it will keep them all),

SELECT IF Vdatesub EQ Earliest.

  STRING MultiParent (A20).
SORT CASES BY RecordID (A).
COMPUTE MultiParent = STRING(RecordID,F8.0).
IF LAG(RecordID) = RecordID MultiParent=CONCAT(STRING(RecordID,F8.0),"
2nd").
EXECUTE.

As a temporary workaround, I'm using "SORT CASES BY MultiParent Vdatesub"
which is OK because there are (so far) only a few duplicates.  But is there
a better way?

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Identify duplicate case plus original

Mike Pritchard
In reply to this post by Art Kendall
Thanks Art.  I didn’t want to overload the question with background, but I
can see now that would have been helpful.

I didn't see the other reply about "identify duplicate cases" - I couldn't
find it in spam folders.  I've dabbled with it in the past, but always been
a little scared because the syntax generated makes it look like cases are
dropped, not just identified.  But I tried it and it seems to do exactly
what I needed.

So, the background is probably moot, but just in case...  The survey is for
satisfaction among soccer club members who are parents/guardians of the
players.  Sampling was complicated because of a couple of factors including
multiple players in a household and two sets of parent/guardian fields for
most of the records.  We decided to send invitations to a single person in
the household.  It turns out this was a mistake for a couple of reasons,
including a scolding email from the president complaining about low response
rate - sent to everyone.  We "fixed" this with a campaign to parent/guardian
2, but now have to deal with multiple responses from different people in the
household.  They aren't all multiples - sometimes parent/guardian 1 didn't
take the survey.  

We do have a recordID that uniquely identifies each player, so that's what I
used as part of the SPSS identify duplicates procedure - along with date
submitted to accept the first survey submitted.  We still have to deal with
other combinations including the multiplayer households who may be
duplicates for parts of the survey.  So there is still some work to do to
get to a clean data set.

Mike

_________________________________________________________________________
Mike Pritchard | [hidden email] | 5 Circles Research | 425-444-3410 (c)
| 425-968-3883 (o)


-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Art
Kendall
Sent: Wednesday, October 28, 2015 12:06 PM
To: [hidden email]
Subject: Re: Identify duplicate case plus original

Do you have a date-of-entry variable? Are you using double entry for
quality control?

What do you want to do if the two entries of what is supposed to be
identical data in fact differs?

Do you have access to original forms, papers, etc.  that is being entered?

Please describe your situation in more detail.

As another reply said, did you try "identify duplicate cases"?



-----
Art Kendall
Social Research Consultants
--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/Identify-duplicate-case-plus-o
riginal-tp5730896p5730898.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command SIGNOFF SPSSX-L For a list of
commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Identify duplicate case plus original

Mike Pritchard
In reply to this post by Zdaniuk, Bozena-3
Hi.  We've probably moved past this, but the original reason for not using
"identify duplicate cases" was because of some problems I thought I had and
misunderstandings from long ago, as noted in another reply.  Then when I
looked up "SPSS identify duplicate cases"  I found an article from the
University of Washington.  Even if it didn't resolve the issue (and the real
SPSS capability did), I learned a bit about LAG.

Thanks
Mike

_________________________________________________________________________
Mike Pritchard | [hidden email] | 425-444-3410 (c) | 425-968-3883 (o)

-----Original Message-----
From: Zdaniuk, Bozena [mailto:[hidden email]]
Sent: Wednesday, October 28, 2015 11:42 AM
To: 'Mike Pritchard' <[hidden email]>; [hidden email]
Subject: RE: Identify duplicate case plus original

Hm, and why exactly are you not using the "identify duplicate id" function?
bozena

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Mike Pritchard
Sent: October 28, 2015 11:16 AM
To: [hidden email]
Subject: Identify duplicate case plus original

I'm trying to figure out a way to identify both the original plus the
duplicate case.  The next step would be to identify which survey was entered
first, and delete the other, but I think I need to get both cases first.

I've been using LAG, and successfully generated a numeric or string
variable.  Here's the string version.

STRING MultiParent (A20).
SORT CASES BY RecordID (A).
COMPUTE MultiParent = STRING(RecordID,F8.0).
IF LAG(RecordID) = RecordID MultiParent=CONCAT(STRING(RecordID,F8.0),"
2nd").
EXECUTE.

As a temporary workaround, I'm using "SORT CASES BY MultiParent Vdatesub"
which is OK because there are (so far) only a few duplicates.  But is there
a better way?

Thanks!
Mike

_________________________________________________________________________
Mike Pritchard | [hidden email] | 5 Circles Research | 425-444-3410 (c)
| 425-968-3883 (o)

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command SIGNOFF SPSSX-L For a list of
commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Identify duplicate case plus original

Mike Pritchard
In reply to this post by Richard Ristow
Thanks Richard.  I may need to use some additional techniques for this
particular project, so thanks for the syntax ideas.
Mike

_________________________________________________________________________
Mike Pritchard | [hidden email] | 425-444-3410 (c) | 425-968-3883 (o)



-----Original Message-----
From: Richard Ristow [mailto:[hidden email]]
Sent: Wednesday, October 28, 2015 2:48 PM
To: Mike Pritchard <[hidden email]>; [hidden email]
Subject: Re: Identify duplicate case plus original

At 02:16 PM 10/28/2015, Mike Pritchard wrote:

>I'm trying to figure out a way to identify both the original plus the
>duplicate case.  The next step would be to identify which survey was
>entered first, and delete the other, but I think I need to get both cases
first.

Would this (untested) do what you want?  It keeps only the record with the
lowest value of Vdatesub among those with the same RecordID.  (It assumes
that Vdatesub is an SPSS date variable, or is in some other form that sorts
with earliest value first.)  Adding variable NInstances, to flag records
that occur in duplicate, is optional.

AGGREGATE OUTFILE=*
   /BREAK = RecordID
   /NInstances 'Instances of this value of RecordID' = NU
   /Earliest   'Earliest date for this RecordID' = MIN(Vdatesub).

Then, to keep only the earliest (but if there are several records on the
earliest date, it will keep them all),

SELECT IF Vdatesub EQ Earliest.

  STRING MultiParent (A20).
SORT CASES BY RecordID (A).
COMPUTE MultiParent = STRING(RecordID,F8.0).
IF LAG(RecordID) = RecordID MultiParent=CONCAT(STRING(RecordID,F8.0),"
2nd").
EXECUTE.

As a temporary workaround, I'm using "SORT CASES BY MultiParent Vdatesub"
which is OK because there are (so far) only a few duplicates.  But is there
a better way?

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD