SPSSX Discussion

Identifying cases that almost match

Snider-Lotz, Tom

Identifying cases that almost match

I'm trying to identify cases that may belong to the same individuals, even though their name might be entered slightly differently in the different records (e.g., Ben Jones and Benjamin Jones). Unless I'm missing something, I don't think the Duplicate Cases utility can do this.

I can mostly accomplish this with the following syntax, where Fname and Lname are the first and last names:

SORT CASES BY Lname (A) Fname (A).
String ShortWholeName (a30).
Compute ShortWholeName = Concat (RTRIM(Lname), ", ", SUBSTR(Fname,1,3)).
Compute DuplicateName=0.
If ShortWholeName=Lag(ShortWholeName) DuplicateName=1.

The syntax flags the second & subsequent cases that are matches, but not the first one. I.e., if Fname and Lname for the first three cases are:

Ben           Jones
Benjamin      Jones
Benjamin F    Jones

the syntax reduces them all to "Jones, Ben" and gives DuplicateName a value of 1 for the 2nd & 3rd cases, but not for the first case.

My question is, how can I make DuplicateName =1 for the first matching case (Ben Jones) as well? Or, is there a better way to accomplish this?

Thanks, all.

___________________________________

Thomas G. Snider-Lotz, Ph.D.

Principal Scientist

PreVisor

1805 Old Alabama Road

Suite 150

Roswell, GA 30076

Ph: 678-832-0555

Ph: 800-281-9713 x555

Fax: 770-642-6115

http://www.previsor.com

Snider-Lotz, Tom

FW: Identifying cases that almost match

It just occurred to me that I can easily solve my problem by using the Duplicate Cases utility to find duplicates for the variable ShortWholeName that I've created via the syntax. However, if anyone sees an even easier method, I'd like to hear about it.

Thanks, all.

-- Tom Snider-Lotz

Re: Identifying cases that almost match

In reply to this post by Snider-Lotz, Tom

Tom,

To the syntax you posted, add the two commands marked "add" below. The second sort puts the cases that are already flagged ahead of the unflagged first case in each group, so the second IF can flag that one too.

SORT CASES BY Lname (A) Fname (A).
String ShortWholeName (a30).
Compute ShortWholeName = Concat (RTRIM(Lname), ", ", SUBSTR(Fname,1,3)).
Compute DuplicateName=0.
If ShortWholeName=Lag(ShortWholeName) DuplicateName=1.

SORT CASES BY ShortWholeName (A) DuplicateName (D). /* add.
If ShortWholeName=Lag(ShortWholeName) DuplicateName=1. /* add.
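
For anyone who wants to check the same idea outside SPSS, here is a minimal Python sketch, not SPSS syntax. It builds the same loose key (last name plus the first three letters of the first name) and flags every record whose key occurs more than once, so the first case in a group is flagged along with the rest. The function names `short_key` and `flag_duplicates` and the sample records are made up for illustration.

```python
from collections import Counter

def short_key(fname: str, lname: str) -> str:
    """Loose key from the thread: last name, comma, first 3 letters of first name."""
    return f"{lname.strip()}, {fname.strip()[:3]}"

def flag_duplicates(records):
    """Return one 0/1 flag per (fname, lname) record.

    A record gets 1 when its loose key occurs more than once in the list,
    so every member of a group is flagged, including the first.
    """
    keys = [short_key(fname, lname) for fname, lname in records]
    counts = Counter(keys)
    return [1 if counts[key] > 1 else 0 for key in keys]

# Hypothetical sample data, matching the example in the question.
records = [
    ("Ben", "Jones"),
    ("Benjamin", "Jones"),
    ("Benjamin F", "Jones"),
    ("Ann", "Smith"),
]
print(flag_duplicates(records))  # prints [1, 1, 1, 0]
```

Counting group sizes avoids the sort-and-lag step altogether, which is why no record order is assumed here. Like the SPSS comparison, the key is case-sensitive.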

Re: FW: Identifying cases that almost match

In reply to this post by Snider-Lotz, Tom

At 06:23 PM 7/16/2006, Snider-Lotz, Tom wrote:

>I'm trying to identify cases that may belong to the same individuals,
>even though their name might be entered slightly differently in the
>different records (e.g., Ben Jones and Benjamin Jones). It just
>occurred to me that I can easily solve my problem by using the
>Duplicate Cases utility to find duplicates for the variable
>ShortWholeName that I've created via the syntax.
>
>String ShortWholeName (a30).
>Compute ShortWholeName = Concat (RTRIM(Lname), ", ",
>SUBSTR(Fname,1,3)).

That's more or less how you do it: create a key that's broader - more
permissive about matching - than is the one you're having trouble with.

There's no magic. You risk false matches, though you're using a pretty
strict key that won't get many. "Robert" will match "Robin", and "Samuel"
will match "Samantha". But requiring a strict match on the last name will
eliminate most of those. (Worst likely case is siblings in families
that like to use similar names for

You also risk false negatives, continuing to miss true matches. In your
case, I'd worry more about that: "William" won't match "Bill",
"Elizabeth" won't match "Betty", and any variation in spelling of the
last name will spoil the match. (You may also find ambiguity about what
name is the first. I'm "Walter Richard Ristow." You know me as "Richard
Ristow", but occasional lists have me as "Walter.")

Strategy depends on how big your file is, how much work it's worth
investing, and how many keys you have; for example, you can look for
people who match on address but not on name, if you have address.

That can be a long story, though, since you then need criteria for
evaluating the quality - likelihood of being correct - of matches that
meet various combinations of criteria. I did one of these, in SAS, some
years back, and all I can say is that it was a lot of work.
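
The address idea mentioned above (people who match on address but not on name) can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout of `(name, address)` pairs, the function name `same_address_different_name`, and the very crude address normalisation (lower-casing and collapsing whitespace) are all hypothetical, and real address data would need far more cleaning.

```python
from collections import defaultdict

def same_address_different_name(records):
    """Group (name, address) records by a crudely normalised address.

    Returns {normalised address: sorted list of distinct names} for every
    address shared by more than one distinct name. These are candidates
    for manual review, not confirmed duplicates.
    """
    names_by_address = defaultdict(set)
    for name, address in records:
        normalised = " ".join(address.lower().split())
        names_by_address[normalised].add(name.strip())
    return {
        address: sorted(names)
        for address, names in names_by_address.items()
        if len(names) > 1
    }

# Hypothetical sample data.
people = [
    ("William Smith", "12 Oak St"),
    ("Bill Smith", "12  oak st"),
    ("Ann Lee", "4 Elm Rd"),
]
print(same_address_different_name(people))
```

A shared address will also pick up genuine housemates and family members, so the output still needs the kind of match-quality criteria described above.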

Barnett, Adrian (HEALTH)

Re: FW: Identifying cases that almost match

In reply to this post by Snider-Lotz, Tom

Hi,
As Richard points out, this task is hard in SPSS. What you really need
is a program which performs probabilistic record linkage. These can
perform matches based on specifiable degrees of similarity between
identifying variables. A good one will also make use of 'prior
knowledge' such as abbreviations and nicknames such as
Robert=Rob=Bob=Bobby etc. De-duplication is one of the tasks that
programs which perform record linkage are designed to carry out.
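
To give a feel for "degrees of similarity" plus nickname knowledge, here is a small Python sketch. It is not probabilistic record linkage and not how LinkageWiz or FEBRL work internally. It only looks up a nickname in a tiny hand-made table and then scores two first names with `difflib.SequenceMatcher` from the standard library. The table contents, the function names, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {
    "rob": "robert",
    "bob": "robert",
    "bobby": "robert",
    "bill": "william",
    "betty": "elizabeth",
}

def canonical(first_name: str) -> str:
    """Lower-case the name and replace a known nickname by its full form."""
    name = first_name.strip().lower()
    return NICKNAMES.get(name, name)

def name_similarity(a: str, b: str) -> float:
    """Similarity between 0 and 1 for two first names, after nickname lookup."""
    return SequenceMatcher(None, canonical(a), canonical(b)).ratio()

def likely_same(a: str, b: str, threshold: float = 0.85) -> bool:
    """True when the similarity reaches the (arbitrary) threshold."""
    return name_similarity(a, b) >= threshold

print(likely_same("Bob", "Robert"))    # nickname table makes these identical
print(likely_same("Robert", "Robin"))  # similar prefix, but below the threshold
```

Note that "Ben" and "Benjamin" score well below this threshold, so a similarity score does not replace the prefix key discussed earlier; the two catch different kinds of variation.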

I've used one such program called LinkageWiz, which works very well. It
is a commercial product though and if your organization's budget does
not run to such things, you can download a free version which will work
on up to 750 records at a time.
There is a demonstration on the LinkageWiz website of a de-duplication
run in progress (and a number of other tasks the program can perform)
here: www.linkagewiz.com

If the 750 record limit is too low for your task, you could consider
sorting the file and then chopping it into groups of records of up to
750 each and doing separate runs on each.
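
One caution with chopping: a cut in the wrong place separates two records that should have been compared. A rough Python sketch of a chunking step that sorts by some key and prefers to cut where the key changes is shown below. The function name `chunk_sorted` and the choice of key are hypothetical; the 750 default comes from the limit mentioned above. Records that are near-duplicates but do not share the key can still land in different chunks.

```python
from itertools import groupby

def chunk_sorted(records, key, max_size=750):
    """Sort records by key and split them into chunks of at most max_size.

    Records with the same key stay in the same chunk whenever the group
    fits; a group larger than max_size has to be split.
    """
    chunks, current = [], []
    for _, group in groupby(sorted(records, key=key), key=key):
        group = list(group)
        # Start a new chunk rather than split a group across two chunks.
        if current and len(current) + len(group) > max_size:
            chunks.append(current)
            current = []
        current.extend(group)
        # A single group bigger than the limit is cut at the limit.
        while len(current) > max_size:
            chunks.append(current[:max_size])
            current = current[max_size:]
    if current:
        chunks.append(current)
    return chunks

# Tiny demonstration with a chunk size of 3 instead of 750.
print(chunk_sorted(["b", "a", "b", "c", "a", "b"], key=lambda r: r, max_size=3))
# prints [['a', 'a'], ['b', 'b', 'b'], ['c']]
```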

Another possibility is a free, open-source application called FEBRL,
which is produced by the Data Mining Group at the Australian National
University. It uses some very sophisticated algorithms and will
certainly be able to do your de-duplication job. Its main drawbacks are
its manual, which does not explain things very well in places and the
fact that it is still a work in progress (it is currently at version
0.3). The whole thing is written in Python and so you have to obtain and
install Python yourself. FEBRL is described here:
http://datamining.anu.edu.au/linkage.html. FEBRL is a very clever bit
of software and can do some amazing things, but due to its manual, it's
something only its mother could love ;-) However, it is not fair to be
hard on a version 0.3, so early in its development cycle.
Besides, it's free.

There is a very good overview of the underlying concepts of record
linkage in a free publication you can obtain here:
http://www.statistics.gov.uk/StatBase/Product.asp?vlnk=9224 and there
are more references of a much more technical nature on the ANU Data
Mining website.

Hope this helps

Regards

Adrian

--
Adrian Barnett
Research & Information Officer       Ph: +61 8 82266615
Research, Analysis and Evaluation    Fax: +61 8 82267088
Strategic Planning and Research Branch
Strategic Planning and Population Health Division
SA Department of Health

Re: FW: Identifying cases that almost match

In reply to this post by Snider-Lotz, Tom

Hello All

Very good answers to this problem so far, but I wonder if someone could go one better. We use a commercial piece of deduplication software outside of SPSS to check both names and addresses. Looking about on Google, there seem to be a couple of deduplication programs that can link to SAS, but I cannot locate any that integrate with SPSS.

What we would like to achieve is automated checking of names and addresses from the Syntax window or some other automation method. The problem with the suggested solutions seems to be mistypes: if someone called Ben Jones, for example, was accidentally entered as Ben Jines (a common input error due to neighbouring keys on the keyboard), this would not be flagged as a potential problem. Catching this seems to rely on some form of algorithm that notices potential mistypes and flags them. I read the FEBRL project site with great interest. Does anyone have a link to the manual for the application? I cannot seem to locate the document on their site.

Any suggestions would be appreciated.

Thanks
Ian Maddrell PgDipM
Campaign Planner

Barnett, Adrian (HEALTH)

Re: FW: Identifying cases that almost match

In reply to this post by Snider-Lotz, Tom

Hi Ian

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Ian Maddrell
Sent: Tuesday, 18 July 2006 10:16
To: [hidden email]
Subject: Re: FW: Identifying cases that almost match

>What we would like to achieve is automated checking of names and
>addresses from the Syntax windows or some other automation method.
>The problem with the suggested solutions seem to be mistypes,
>if someone for example called Ben Jones was accidently inputted
>as Ben Jines (common input error due to neighbouring keys on the
>keyboard) this would not be flagged as a potential problem.
>This seems to rely on some form of algorithm to notice potential
>mistypes and then flag them.

What you could try is converting the names via the SOUNDEX algorithm,
and checking for any that match on their SOUNDEX codes but not on an
exact string comparison. It's not foolproof, but it may get you close
enough.
There is an implementation of the SOUNDEX algorithm in SPSS on
Raynauld's site www.spsstools.net.
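
As a rough illustration of the idea (this is not the SPSS implementation from spsstools.net), here is a Python sketch of the common American Soundex rules, followed by a helper that lists names sharing a code without being spelled identically. The function names `soundex` and `soundex_only_matches` are made up here, and Soundex variants differ in small details such as the handling of H and W, so treat the codes as approximate.

```python
from collections import defaultdict

# Letter-to-digit table of American Soundex; vowels, h, w and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(name: str) -> str:
    """Return a 4-character Soundex code, or '' if the name has no a-z letters."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(letter, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeat after a vowel counts
        if len(result) == 4:
            break
    return result.ljust(4, "0")

def soundex_only_matches(names):
    """Map each Soundex code to the distinct spellings that share it."""
    spellings = defaultdict(set)
    for name in names:
        spellings[soundex(name)].add(name)
    return {code: sorted(s) for code, s in spellings.items() if len(s) > 1}

print(soundex("Jones"), soundex("Jines"))  # prints J520 J520
print(soundex_only_matches(["Jones", "Jines", "Smith", "Jones"]))
```

Because the first letter is kept as-is, a typo in the first character (say "Kones") would not be caught this way.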

There is a handy web-based implementation of SOUNDEX here:
http://www.dropby.com/

(Look under Phonetic Encoders)

You could try it out on some sample mis-spellings to see if it does what
you want.

The same site also has a NYSIIS calculator, which apparently has certain
advantages over SOUNDEX, but you would need to write your own
implementation.

These are just two of a big variety of string-comparison methods. There
is a list of other methods available here:
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

There are lots of other sources which might have something useful if you
Google "string similarity"
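
One of the simplest of these measures is the Levenshtein (edit) distance, which counts the single-character insertions, deletions and substitutions needed to turn one string into another. A minimal Python sketch follows; the function name is my own, and any cut-off for "close enough" (for example, a distance of 1 on the last name) is an assumption you would have to tune on your own data.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute cost 1 each)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from '' to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1,      # delete from a
                               current[j - 1] + 1,   # insert into a
                               substitution))
        previous = current
    return previous[-1]

print(levenshtein("Jones", "Jines"))   # prints 1
print(levenshtein("Ben", "Benjamin"))  # prints 5
```

The second result shows why edit distance suits typos better than abbreviations: "Ben" and "Benjamin" are five edits apart even though they are an obvious match to a human reader.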

>I read the below site about the FEBRL project with great interest,
>does anyone have a link to the manual for the application
>as I cannot seem to locate the document on their site.

The documentation is with the software itself on the Sourceforge site:
http://sourceforge.net/project/showfiles.php?group_id=62161

(it's down low on the page, below the link to the software)

Hope there is something useful for you amongst this lot

Regards

Adrian
