Dear SPSS Listserv recipients,
Does SPSS have any modules that do probability matching? This task, often referred to as "fuzzy matching," determines approximate matches based upon patterns in string variables when an "exact" match is not possible. I am seeking to match two databases that will include the following string variables for people: names, address, gender, ethnicity, date of birth, and phone. If SPSS does not have a module for this, or not a great one, can you recommend another product?

Ideally, this software could do two things:
-- Standardize fields intended for matching
-- Produce probability matches (most important)

PATRICK
Los Angeles, California
Patrick Burns, Senior Researcher
Economic Roundtable
315 W. 9th Street, Suite 502
Los Angeles, CA, 90015-4200
http://www.economicrt.org
SPSS does not have a module dedicated to this process (there is a whole industry that specializes in this area), but there are a number of features in the programmability supplemental modules available on SPSS Developer Central (www.spss.com/devcentral) that can help if you want to build your own solution. These require SPSS 15.0.1 with the Python programmability plug-in.
- The extendedTransforms module provides a number of functions useful for standardizing names and other strings. The soundex and nysiis functions compute name transforms that help to abstract from spelling and similar differences. The levenshteindistance function computes a distance metric between two strings. The regular expression capabilities can be useful in extracting features from strings where you can define them via patterns. So, for example, you could write a regular expression that would do a reasonably good job of extracting the core of a street name from an address.
- Once you have done this sort of preparation, you would merge the two databases and look for duplicates using Identify Duplicate Cases and other facilities. IDC is pretty flexible in how you define duplicates. It doesn't do "fuzzy", so you would handle that with the preliminary transforms.

HTH,
Jon Peck
SPSS
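To illustrate the two string-preparation ideas Jon mentions, here is a minimal plain-Python sketch of a Levenshtein edit distance and a crude street-name "core" extractor. This is an illustration of the techniques, not the extendedTransforms module's actual API; the regex pattern and sample values are assumptions.

import re

def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ca != cb)))    # substitution
        previous = current
    return previous[-1]

# Crude street-name core: drop the house number and a trailing street-type word.
STREET_TYPES = r"(STREET|ST|AVENUE|AVE|BOULEVARD|BLVD|ROAD|RD|DRIVE|DR|LANE|LN)"

def street_core(address):
    s = address.upper().strip()
    s = re.sub(r"^\d+\s+", "", s)                    # leading house number
    s = re.sub(r"\s+%s\.?$" % STREET_TYPES, "", s)   # trailing street type
    return s

print(levenshtein("CATHERINE", "KATHRYN"))   # 4
print(street_core("315 W. 9th Street"))      # W. 9TH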
In reply to this post by Patrick Burns
Hi Patrick

What you describe is also known as probabilistic data linkage. I know of nothing in SPSS which can help, although no doubt something could in principle be written in Python to do it. This may have been what Jon Peck was telling you in his post, but as you probably saw, his post got turned into word-salad.

I understand the major commercial record-linkage packages are quite expensive. There is a very basic freebie on the CDC website here:
http://ftp.cdc.gov/pub/Software/RegistryPlus/Link_Plus/Link%20Plus.htm
I've not used it, and despite it being very basic, it may do what you need.

Someone a year or so ago published a reference here to a freebie plugin to SAS which did record linkage. I'm sorry, but I didn't keep a reference to it. You may find it in the list archives.

There is an open-source record-linkage project, based on Python, called FEBRL, run out of the Computer Science Department at the Australian National University. I have played with it, but it was version 0.3 and has a lot of work to be done yet. It is very clever, though. Sadly it has a manual only a computer scientist could love, which I suspect will be its greatest obstacle to adoption unless it is improved later. At present a great deal is left for the reader to interpolate or deduce for themselves. However, the project is very much deserving of success. The group's website is here:
http://datamining.anu.edu.au/linkage.html
and the link to FEBRL is low down on the page.

There is a review below of some commercial products that are not priced at 'corporate data centre' levels:
http://www.chcf.org/topics/view.cfm?itemID=104595
Of the products reviewed on the above link, I've used LinkageWiz and found it very good and well supported (possibly better than most products, as I know the author and he worked in my building!). On more than one occasion (this was with an earlier version) he had bugfixes to me the next morning after my emailing the problem to him! Performance is good - I got 400,000 records linked in 5 minutes on a 3.2 GHz P4. The author has run linkage/de-duplication runs involving over a million records on his old laptop.

There is another set of commercial products here:
http://www.intelligentsearch.com/
I know nothing about them, however.

If there's one thing I learnt from my use of record-linkage software, it is that the tools used after the linkage has been run are just as important as the intelligence and customisability of the linkage tool itself. In data from operational systems - especially from the health and human services arena - you tend to get lots of false positives resulting from twins, other siblings and parent-child links.
You need to review the data for such links and have a tool for unlinking one or more records from a group of 2 or more that scored above the 'hit' criterion and appear to be the same person. To do this properly, the review tool needs to allow you to search for likely false positives such as 'matches on everything but sex' (to pick up twin brother-sister pairs) or 'matches on everything but date of birth' (to pick up a child named after a parent). These links tend to score very highly, and so it is a mistake to assume that just because a group has been formed from records with very high scores, they must all be good matches. You need to check the high scores, and to do so you need something which will narrow the search to a manageable size.

Similarly, you need to be able to review cases with linkage scores down near the cutoff level, because the closer you get to the cutoff, the more likely you are to find purely spurious matches, and you need to be able to nominate each of these as a non-match. Once this is done - and this applies equally to the previous lot of false positives - they need to get assigned a new unique case-id (to save you having to keep track of what numbers are presently allocated).

Of course you can get by without the review-and-reassign tools, but it would be very hard and error-prone! I think this is likely to be where the commercial products may have the edge over the freebies. This may not matter if your data set is small, but, depending on the data, if it is large, you could wind up with a very time-consuming job. The costs of false positives in your data are another factor which will determine how much weight to give the task of clerical review.

Hope something in here may be of use to you.

Regards
Adrian Barnett
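Adrian's 'matches on everything but one field' checks are straightforward to express in code. Here is a minimal Python sketch; the record layout (dictionaries with these particular field names) is a hypothetical assumption, not the format of any of the products mentioned.

# Flag linked pairs that agree on everything except one suspect field,
# e.g. sex (twin brother-sister) or dob (child named after a parent).
FIELDS = ["surname", "given_name", "dob", "sex", "address"]

def mismatched_fields(rec_a, rec_b):
    """Return the fields on which two linked records disagree."""
    return [f for f in FIELDS if rec_a[f] != rec_b[f]]

def likely_false_positives(linked_pairs, suspect_field):
    """Pairs whose only disagreement is the one suspect field."""
    return [(a, b) for a, b in linked_pairs
            if mismatched_fields(a, b) == [suspect_field]]

pairs = [
    ({"surname": "SMITH", "given_name": "ALEX", "dob": "1990-03-01",
      "sex": "M", "address": "12 HIGH ST"},
     {"surname": "SMITH", "given_name": "ALEX", "dob": "1990-03-01",
      "sex": "F", "address": "12 HIGH ST"}),
]
print(likely_false_positives(pairs, "sex"))  # flags the possible twin pair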
In reply to this post by Patrick Burns
Hi Patrick
I overlooked part of your query - standardisation of the variables used in linking. This is definitely an important task, and in my experience having clean and standardised variables makes a huge difference. As an experiment I calculated linkage scores for the same data set, with and without data cleaning. The right-hand peak got higher and further to the right, and the trough between the peaks got deeper. It proved there was a very worthwhile reduction in the number of dubious matches needing checking if the data are cleaned first.

LinkageWiz (LW) definitely does this. With sex, it will standardise coding, but will also attempt to guess sex from the given names where sex is missing. It does very well on this, although I did find that Asian and Middle Eastern names threw it a bit. Similarly, unusual made-up names can be tricky for it. It does standardise addresses - certainly Australian and French formats - but I'm not sure about North American formats. FEBRL does attempt address parsing and standardising, and is very clever at recognising and removing embedded names of houses from anywhere in the address string (these are a major pain and I really wish people would not include them when giving their address!)

LW will clean phone numbers and social security/Medicare numbers by removing non-digits. If memory serves, in LW names are cleaned by the removal of characters outside the range A-Z and casting into upper case.

Hope this helps

Regards
Adrian Barnett
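The cleaning rules Adrian recalls for LinkageWiz (names reduced to upper-case A-Z, phone and ID numbers reduced to digits) are simple to sketch. This is plain Python for illustration, not LinkageWiz itself:

import re

def clean_name(name):
    """Cast to upper case and drop everything outside A-Z."""
    return re.sub(r"[^A-Z]", "", name.upper())

def clean_phone(phone):
    """Keep digits only."""
    return re.sub(r"\D", "", phone)

print(clean_name("O'Brien-Smyth "))   # OBRIENSMYTH
print(clean_phone("(213) 555-0147"))  # 2135550147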
In reply to this post by Patrick Burns
The SAS-based package you mention is likely Link King
(http://the-link-king.com/). It is a slick package developed by Dr. Kevin Campbell (Division of Alcohol and Substance Abuse, Washington state's Dept of Social and Health Services). It integrates both probabilistic and deterministic algorithms, with better results than either alone (we have an article in press with a comparison to Link Plus). It has been widely used on large statewide administrative databases (Medicaid eligibility, arrests, substance abuse and mental health treatment, mortality, etc.).

Link King requires SAS but has a user-friendly front end that minimizes what you need to know about SAS. However, it does assume a fixed list of identifiers, one that does not include address or phone, which might be useful in some projects. A much earlier version can be found on the federal (US) Substance Abuse and Mental Health Services Administration (SAMHSA) web site, but Link King has a far superior interface and many enhancements.

Dennis Deck, PhD
RMC Research Corporation
[hidden email]
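The combination Dennis describes - a deterministic pass backed by probabilistic scoring - can be outlined as below. This is a plain-Python sketch of the general technique; the field names, weights and threshold are hypothetical, not Link King's actual algorithm or defaults.

def deterministic_match(a, b):
    """Pass 1: accept only exact agreement on strong identifiers."""
    return all(a[f] == b[f] for f in ("surname", "given_name", "dob", "sex"))

# Hypothetical agreement weights: more discriminating fields score higher.
WEIGHTS = {"surname": 4.0, "given_name": 3.0, "dob": 5.0, "sex": 1.0}
THRESHOLD = 8.0

def probabilistic_score(a, b):
    """Pass 2: sum agreement weights over fields that match."""
    return sum(w for f, w in WEIGHTS.items() if a[f] == b[f])

def link(a, b):
    if deterministic_match(a, b):
        return "definite link"
    return "probable link" if probabilistic_score(a, b) >= THRESHOLD else "non-link"

a = {"surname": "SMITH", "given_name": "ALEX", "dob": "1990-03-01", "sex": "M"}
b = {"surname": "SMITH", "given_name": "ALEX", "dob": "1990-03-01", "sex": "F"}
print(link(a, b))  # "probable link": score 12.0 passes, though sex disagrees

Pairs like the one above - high-scoring but disagreeing on a single field - are exactly the candidates Adrian's clerical review step is meant to catch.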