SPSSX Discussion

Identify duplicates from more than one variable

Christine-28

Identify duplicates from more than one variable

Hello list, I'm hoping someone can assist me with this duplicate item issue.

Record_Date	Record_ID	Person_ID	Person_ID_Alternate	Current dupe check	What I need
25-Oct-05	XGD	341	616	Primary Case	Duplicate Case
25-Oct-05	XGD	616	341	Duplicate Case	Duplicate Case
25-Oct-05	XGD	616	341	Primary Case	Primary Case
25-Oct-05	XGD	104	104	Duplicate Case	Duplicate Case
25-Oct-05	XGD	104	104	Duplicate Case	Duplicate Case
25-Oct-05	XGD	104	104	Primary Case	Primary Case

I have a file in which I need to exclude duplicate records but I have some cases where people are known by more than one ID.

My current duplicate check is by Record_Date, then Record_ID, then Person_ID but if the person has an alternate ID the duplicate check is incorrect.

As you can see above, person 341 is also person 616 and is in the same record as person 104. I need to count only one record for this person on the 25th Oct 05. I would also count a record for person 104 on the same day.

Can someone assist with some syntax to provide me with the correct duplicate identification? I can't permanently recode IDs as I will be receiving more data with additional duplicate records and will therefore need to do this check again.

I have SPSS V15

Thanks,
Christine

Spousta Jan

Re: Identify duplicates from more than one variable

Hi Christine,

1) Compute the lowest of all IDs of a person

compute Person_ID_min = min(Person_ID, Person_ID_Alternate).

2) Run the duplicate cases analysis on the new variable Person_ID_min. This ID is the same on every row belonging to a person, so the person in the first three rows will have 341 in all three cases.
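
For reference, the two steps could be sketched in syntax roughly as below. This is only a sketch: it assumes both ID variables are numeric, as in the sample, and the flag name PrimaryLast is an illustrative choice, not something from the original post. The last case in each group is flagged as primary, which matches the "What I need" column.

```spss
* Sketch only. Assumes Person_ID and Person_ID_Alternate are numeric.
COMPUTE Person_ID_min = MIN(Person_ID, Person_ID_Alternate).
SORT CASES BY Record_Date Record_ID Person_ID_min.
* PrimaryLast is a hypothetical flag name. It is 1 for the last case in each group.
MATCH FILES /FILE=* /BY Record_Date Record_ID Person_ID_min /LAST=PrimaryLast.
VALUE LABELS PrimaryLast 0 'Duplicate Case' 1 'Primary Case'.
EXECUTE.
```

Selecting the cases where PrimaryLast equals 1 would then keep one record per person per Record_Date and Record_ID, without recoding the original IDs.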

Best regards,

Jan

Christine-28

Re: Identify duplicates from more than one variable

OK, thanks for the replies so far, but the IDs are randomised and alphanumeric, i.e. they look like X-0701-G57P. Neither the numeric part nor the alphabetic part follows any sequence. Sorry to complicate matters: I left out the original format for privacy reasons and didn't realise that would make a difference.

The dataset runs into the thousands of cases, so I can't do this manually.

Any further assistance would be much appreciated.

Thank you, Christine

Christine-28

Re: Identify duplicates from more than one variable

In reply to this post by Christine-28

I was wondering about some kind of third variable. Could you have a look at my previous post about the sequence of the IDs (they're randomised) and tell me whether this could still work?

I might try it to start with.

Thanks,
Christine

2009/3/6 Jason Schoeneberger <[hidden email]>

Could you try creating a third ID variable that is a concatenation of ID and ID_ALT, then run the duplicate check on the third ID?
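
One caveat with a plain concatenation: the first sample row would give 341616 and the second 616341, so the two rows would still not match. A minimal sketch that puts the two IDs into a fixed order before joining them is shown below. It assumes both IDs are string variables of at most 20 characters; the names Person_Key and PrimaryLast and the '|' separator are illustrative and do not come from the thread.

```spss
* Sketch only. Assumes Person_ID and Person_ID_Alternate are strings of width 20 or less.
STRING Person_Key (A41).
* Put the alphabetically smaller ID first so the key does not depend on column order.
DO IF (Person_ID LE Person_ID_Alternate).
COMPUTE Person_Key = CONCAT(RTRIM(Person_ID), '|', RTRIM(Person_ID_Alternate)).
ELSE.
COMPUTE Person_Key = CONCAT(RTRIM(Person_ID_Alternate), '|', RTRIM(Person_ID)).
END IF.
SORT CASES BY Record_Date Record_ID Person_Key.
* PrimaryLast is a hypothetical flag name. It is 1 for the last case in each group.
MATCH FILES /FILE=* /BY Record_Date Record_ID Person_Key /LAST=PrimaryLast.
VALUE LABELS PrimaryLast 0 'Duplicate Case' 1 'Primary Case'.
EXECUTE.
```

Because the comparison only needs a consistent ordering of the two strings, it should not matter that IDs such as X-0701-G57P are randomised. The sketch does assume that every row carries both IDs, as in the sample data; a row with a blank alternate ID would not match the same person's rows where the alternate is filled in.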
