Identify duplicates from more than one variable

Christine-28
Hello list, I'm hoping someone can assist me with this duplicate item issue.


Record_Date  Record_ID  Person_ID  Person_ID_Alternate  Current dupe check  What I need
25-Oct-05    XGD        341        616                  Primary Case        Duplicate Case
25-Oct-05    XGD        616        341                  Duplicate Case      Duplicate Case
25-Oct-05    XGD        616        341                  Primary Case        Primary Case
25-Oct-05    XGD        104        104                  Duplicate Case      Duplicate Case
25-Oct-05    XGD        104        104                  Duplicate Case      Duplicate Case
25-Oct-05    XGD        104        104                  Primary Case        Primary Case
 
I have a file in which I need to exclude duplicate records but I have some cases where people are known by more than one ID.
 
My current duplicate check is by Record_Date, then Record_ID, then Person_ID but if the person has an alternate ID the duplicate check is incorrect.
 
As you can see above, person 341 is also person 616 and is in the same record as person 104. I need to count only one record for this person on the 25th Oct 05. I would also count a record for person 104 on the same day.
 
Can someone assist with some syntax to provide me with the correct duplicate identification? I can't permanently recode IDs as I will be receiving more data with additional duplicate records and will therefore need to do this check again.

I have SPSS V15

Thanks,
Christine

Re: Identify duplicates from more than one variable

Spousta Jan
Hi Christine,
 
1) Compute the lowest of all IDs of a person
 
compute Person_ID_min = min(Person_ID, Person_ID_Alternate).
 
2) Run the duplicate cases analysis on the new variable Person_ID_min. This ID is the same whichever way round the two IDs are recorded, so the person in the first three rows will have 341 in all three cases.
 
Best regards,
 
Jan 
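The two steps above could be sketched in syntax roughly as follows. This is only a sketch: it assumes both ID variables are numeric, and the flag name PrimaryLast and its value labels are illustrative choices (they mirror the labels in Christine's table), not something stated in the thread.

```spss
* Sketch only, assuming Person_ID and Person_ID_Alternate are numeric.
* Person_ID_min is the lower of the two IDs on each row.
COMPUTE Person_ID_min = MIN(Person_ID, Person_ID_Alternate).
* Bring records for the same date, record and person together.
SORT CASES BY Record_Date (A) Record_ID (A) Person_ID_min (A).
* PrimaryLast (illustrative name) is 1 for the last case in each group, 0 otherwise.
MATCH FILES
  /FILE=*
  /BY Record_Date Record_ID Person_ID_min
  /LAST=PrimaryLast.
VALUE LABELS PrimaryLast 0 'Duplicate Case' 1 'Primary Case'.
EXECUTE.
```

Because Person_ID_min is recomputed on every run, nothing is permanently recoded, so the same syntax can be rerun when new data arrive.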
 


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Christine
Sent: Friday, March 06, 2009 6:12 AM
To: [hidden email]
Subject: Identify duplicates from more than one variable

Re: Identify duplicates from more than one variable

Christine-28
OK, thanks for the replies so far, but the IDs are randomised and alphanumeric, i.e. they look like X-0701-G57P. Neither the numeric nor the alphabetical order follows any sequence. Sorry to confuse the matter further. I didn't want to include the original format for privacy reasons, but I didn't realise that would complicate matters.

The dataset runs into the thousands of cases, so I can't do this manually.

Any further assistance would be much appreciated.

Thank you, Christine
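For alphanumeric IDs, the "lowest ID" idea may still carry over, because the order does not need to mean anything: it only has to be consistent, so that both ways of recording the same pair give the same key. A minimal sketch follows, assuming both ID variables are strings of at most 11 characters, that every row carries both of a person's IDs, and that a blank alternate ID means "no alternate". Person_Key and PrimaryLast are hypothetical names.

```spss
* Sketch only, assuming string IDs no wider than 11 characters.
* Person_Key (hypothetical name) holds whichever ID sorts first.
STRING Person_Key (A11).
* A blank alternate ID is assumed to mean there is no alternate.
DO IF (Person_ID_Alternate EQ ' ' OR Person_ID LE Person_ID_Alternate).
COMPUTE Person_Key = Person_ID.
ELSE.
COMPUTE Person_Key = Person_ID_Alternate.
END IF.
SORT CASES BY Record_Date (A) Record_ID (A) Person_Key (A).
* PrimaryLast is 1 for the last case in each group, 0 otherwise.
MATCH FILES
  /FILE=*
  /BY Record_Date Record_ID Person_Key
  /LAST=PrimaryLast.
VALUE LABELS PrimaryLast 0 'Duplicate Case' 1 'Primary Case'.
EXECUTE.
```

This would not link a person who has three or more IDs spread across different rows, or a row that lists only one of a person's two IDs.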

Re: Identify duplicates from more than one variable

Christine-28
In reply to this post by Christine-28
I was wondering about some kind of third variable. Could you have a look at my previous post about the sequence of the IDs (they're randomised) and tell me whether this could still work?

I might try it to start with.

Thanks,
Christine

2009/3/6 Jason Schoeneberger <[hidden email]>

Could you try to create a 3rd ID variable that is a concatenation of ID and ID_ALT, then run the dup check on the 3rd ID?
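One caveat with a plain concatenation is that it depends on the order of the two IDs: 341 followed by 616 gives a different string from 616 followed by 341. A possible workaround is to put the two IDs in a fixed order before joining them. The sketch below assumes string IDs of at most 11 characters; ID_lo, ID_hi, Pair_Key and the '|' separator are hypothetical choices.

```spss
* Sketch only, assuming string IDs no wider than 11 characters.
* ID_lo, ID_hi and Pair_Key are hypothetical names.
STRING ID_lo ID_hi (A11).
STRING Pair_Key (A23).
* Put the two IDs in a fixed order so both recordings give the same key.
DO IF (Person_ID LE Person_ID_Alternate).
COMPUTE ID_lo = Person_ID.
COMPUTE ID_hi = Person_ID_Alternate.
ELSE.
COMPUTE ID_lo = Person_ID_Alternate.
COMPUTE ID_hi = Person_ID.
END IF.
* RTRIM removes the padding blanks before the two parts are joined.
COMPUTE Pair_Key = CONCAT(RTRIM(ID_lo), '|', RTRIM(ID_hi)).
EXECUTE.
```

The duplicate check can then be run by Record_Date, Record_ID and Pair_Key in place of Person_ID.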

 

