Identifying Duplicates

Bob Walker-2

Hi,

When flagging records that have duplicates (using a sorted email list), will…

MATCH FILES FILE=* /BY EMAIL /FIRST=FLAG.

... and …

COMPUTE FLAG = 1.
IF EMAIL = LAG(EMAIL) FLAG = 0.

… always identify the same records as dupes? They seem to, but perhaps I am overlooking situations when this wouldn’t be true?

TIA,

Bob Walker
Surveys & Forecasts, LLC
www.safllc.com


Re: Identifying Duplicates

Patrick Kyba
According to the SPSS v15 help file:
  • FIRST creates a variable with the value 1 for the first case of each group and the value 0 for all other cases.
  • If one file has several cases with the same values for the key variables, FIRST or LAST can be used to create a variable that flags the first or last case of the group.
It provides an example:
MATCH FILES  TABLE='c:\data\house.sav'
 /FILE='c:\data\persons.sav'
 /BY=HOUSEID /FIRST=HEAD.
  • The variable HEAD contains the value 1 for the first person in each household and the value 0 for all other persons. Assuming that the persons.sav file is sorted with the head of household as the first case for each household, the variable HEAD identifies the case for the head of household.
As near as I can tell, both versions of the syntax should do the same thing (assuming the sorting is the same in each case).

Patrick

--
Patrick Kyba
Senior Technical Analyst
Advanis


Re: Identifying Duplicates

Jon K Peck

See below.

Jon Peck
SPSS, an IBM Company
[hidden email]
312-651-3435



From: Patrick Kyba <[hidden email]>
To: [hidden email]
Date: 12/09/2009 04:58 PM
Subject: Re: [SPSSX-L] Identifying Duplicates
Sent by: "SPSSX(r) Discussion" <[hidden email]>





>>> The two approaches do give the same result in the absence of missing data. However, system-missing and user-missing values are never equal under the test
email = lag(email)
So if you have several system-missing or user-missing values in a row, MATCH FILES will treat them as one group, but the IF statement will treat each as a new group.

HTH,
Jon Peck
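
A minimal sketch for checking the missing-data difference described above on a small file. It is an illustration only, not syntax posted in the thread: the variable names id, flag_match and flag_lag, the demo values, and the choice of -9 as a user-missing code are all assumptions, and a numeric key stands in for EMAIL so that a missing value is easy to declare.

```spss
* Hypothetical demo data: a numeric key with -9 declared user-missing.
DATA LIST FREE / id.
BEGIN DATA
1 1 2 -9 -9
END DATA.
MISSING VALUES id (-9).
SORT CASES BY id.

* Approach 1: MATCH FILES flags the first case of each BY group.
MATCH FILES FILE=* /BY id /FIRST=flag_match.

* Approach 2: compare each case with the previous one.
COMPUTE flag_lag = 1.
IF id = LAG(id) flag_lag = 0.
EXECUTE.

* List only the cases where the two flags disagree.
TEMPORARY.
SELECT IF flag_match NE flag_lag.
LIST.
```

If the two approaches behave as described above, any case listed should be a repeated missing value: MATCH FILES counts it as a duplicate within its group, while the LAG test leaves it flagged as a first case. An empty listing would mean the two flags agree on that file.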



Re: Identifying Duplicates

Bob Walker-2

Patrick, Jon,

Many thanks… missing data is an important distinction that I hadn’t fully thought through.

Regards,

Bob Walker
Surveys & Forecasts, LLC
www.safllc.com

 

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Jon K Peck
Sent: Wednesday, December 09, 2009 10:07 PM
To: [hidden email]
Subject: Re: Identifying Duplicates

 

