fuzzy data matching

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

fuzzy data matching

Jason Almerigi
Hi all,
I am often in the situation of matching school records with program
attendance records with self-identified student surveys.  Across each, there
are no unique identifiers beyond first and last name, birthday, and gender.
SPSS does not appear to be tolerant of errors, misspellings, name variants
(Ed vs. Edward), etc..   On a good day, I can match 80% of records, which
usually leaves me with a few thousand unmatched cases.

Do any of you have any syntax, software, or best practice recommendations
for fuzzy data matching/deduplication? I've checked out a software package
called MatchIt, which is pricey, but I see that there are a few programs out
there.

Any help would be appreciated.
Jason

--
Jason B. Almerigi, Ph.D.
Community Evaluation and Research Center, University Outreach & Engagement;
Michigan State University
36 Kellogg Center, Garden Level
East Lansing, MI 48824-1022
Office: 517-355-0135
Fax: 517-432-9541

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re:fuzzy data matching

Jerabek Jindrich
Hi Jason,

matching poorly identified cases must be a hard work. I do not have a solution,  only an idea:

First of all, misspells and typing errors are much more likely in names than in gender and birth date. You might try to use only the first one/two letter (s) of first and last name plus gender plus birth date. There will probably remain some mismatched and duplicity, hopefuly it could reduce the number of mismatched cases.

Do you have a correct list of your students, with names, gender etc. ? Because it could be easier to match the poorly self identified data to a correct list, than trying to match two wrongly identified datasets.

Of course not only unmatched cases are problem, all files should be checked for duplicity in identifiers and for missing values /blanks in ID variables.

good luck
Jindra


> ------------ Původní zpráva ------------
> Od: Jason Almerigi <[hidden email]>
> Předmět: fuzzy data matching
> Datum: 07.7.2008 21:11:07
> ----------------------------------------
> Hi all,
> I am often in the situation of matching school records with program
> attendance records with self-identified student surveys.  Across each, there
> are no unique identifiers beyond first and last name, birthday, and gender.
> SPSS does not appear to be tolerant of errors, misspellings, name variants
> (Ed vs. Edward), etc..   On a good day, I can match 80% of records, which
> usually leaves me with a few thousand unmatched cases.
>
> Do any of you have any syntax, software, or best practice recommendations
> for fuzzy data matching/deduplication? I've checked out a software package
> called MatchIt, which is pricey, but I see that there are a few programs out
> there.
>
> Any help would be appreciated.
> Jason
>
> --
> Jason B. Almerigi, Ph.D.
> Community Evaluation and Research Center, University Outreach & Engagement;
> Michigan State University
> 36 Kellogg Center, Garden Level
> East Lansing, MI 48824-1022
> Office: 517-355-0135
> Fax: 517-432-9541
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD
>
>
>

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: fuzzy data matching

Peck, Jon
In reply to this post by Jason Almerigi
There is a whole fuzzy matching industry with specialized tools for this, but you might be able to cut down the size of the job a lot with a little preprocessing.

For names, you might try matching on the soundex code or the nysiis code.  These codes are designed to reduce the impact of spelling errors and name variations.  Functions to compute both of these are available in the programmabily module extendedTransforms that can be downloaded from SPSS Developer Central (www.spss.com/devcentral).

If you have SPSS 16, you might find the new extension command CASECTRL useful.  Although primarily designed for case-control exact matching on a list of characteristics, it could be used here (perhaps on the transformed names) to figure out the correspondence between the two datasets.

HTH,
Jon Peck

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Jason Almerigi
Sent: Monday, July 07, 2008 12:57 PM
To: [hidden email]
Subject: [SPSSX-L] fuzzy data matching

Hi all,
I am often in the situation of matching school records with program
attendance records with self-identified student surveys.  Across each, there
are no unique identifiers beyond first and last name, birthday, and gender.
SPSS does not appear to be tolerant of errors, misspellings, name variants
(Ed vs. Edward), etc..   On a good day, I can match 80% of records, which
usually leaves me with a few thousand unmatched cases.

Do any of you have any syntax, software, or best practice recommendations
for fuzzy data matching/deduplication? I've checked out a software package
called MatchIt, which is pricey, but I see that there are a few programs out
there.

Any help would be appreciated.
Jason

--
Jason B. Almerigi, Ph.D.
Community Evaluation and Research Center, University Outreach & Engagement;
Michigan State University
36 Kellogg Center, Garden Level
East Lansing, MI 48824-1022
Office: 517-355-0135
Fax: 517-432-9541

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: fuzzy data matching

Barnett, Adrian (DECS)
In reply to this post by Jason Almerigi
Hi Jason
There are a few free options for doing what you want. You may first want
to read up on the theory of probabilistic record linkage. I don't claim
the following is the best introduction and overview there is, but it is
clear and comprehensive and may help you evaluate your options:

http://www.statistics.gov.uk/downloads/theme_other/GSSMethodology_No_25_
v2.pdf

There is a freebie called LinkPlusfor doing record linkage within a
health context available from the Centers for Disease Control here:

http://www.cdc.gov/cancer/npcr/tools/registryplus/lp_tech_info.htm

There is also a free open-source record linkage application under
development by the Data Mining Group at the Australian National
University. It's written in Python and is called FEBRL. It's got a
manual only a computer scientist could love but the software is very
clever.  They have recently added a GUI to it but I've not used that
yet.  It's still only at version 0.4, so there's lots more than plan to
do to it.  However it does a pretty respectable job if you persevere
with it.

http://datamining.anu.edu.au/linkage.html

(The link to FEBRL is at the bottom of the above page)

Someone wrote a free plugin for SAS which does probabilistic record
linkage, and there is a reference to it somewhere in the archives for
this list, but I don't know anything else about it.

You can use a limited version of a payware application called LinkageWiz
on datasets of < 2000 records.  Look under the Download link on
www.linkagewiz.com.  You may well have more than 2,000 records, but it
may be feasible to chop your data into chunks of <2,000 based on the
surname.

Most payware record linkage software is very expensive - the Health Dept
where I used to work paid north of AUD$60,000 for a copy of Automatch.
There is cheaper stuff around. There's a comparison of cheaper
alternatives (dated August 2004) here:
http://www.chcf.org/topics/view.cfm?itemid=104595

If your task is a one-off, and your time is limited, it may make sense
to contract out the linkage/de-duplication task.  I have no
comprehensive knowledge of what is available, but one I know of is
LinkaeWiz, which charges US$175 setup and US$100 per 100,000 records for
linking/de-duplicating plus a comprehensive clerical review. Details
here:

http://www.linkagewiz.com/services.htm

One thing I've learnt is that the tools available for the 'clerical
review' phase of a record linkage run can make a big difference to how
onerous the task is.  (Clerical review is where you go back over the
linked records and look for false positives, such as twins, other
siblings, spouses, parents linked to children etc which you can easily
get in some sorts of data,).

You need some means of homing in on the best candidates to cut down the
number of links to check, and an easy mechanism for unlinking.  A system
which will allow you to look for sets of twins by specifying a search
for (1)same surname (2) same address (3)same date of birth (4) different
first name would help.  Ditto for parent-child matches where you could
search for same, first name, surname, address, gender but different DOB.
This is much easier than having to check all the high-scoring matches,
and much safer than blindly accepting everything which scores above a
certain match-score level.

Once you've identified the false positives, a neat and easy way of
telling the software to unlink one or more records from one or more
others is an important, if trivial-sounding, feature.

The clerical review phase is very important, but mind-numbingly tedious
if you have a big data set, so anything the software can do to keep the
process (and you) focussed, and make the task easier, is valuable.

In my experience one other thing that makes a big difference to the
degree of matching you get, and to the number of potential matches you
have to check in the 'grey' area around the cutoff criterion you have
set for matching scores, is how clean you have got your data before the
matching process.  Although clever fuzzy matching algorithms help a lot,
having your data clean beforehand makes the whole process work better.
So, if one of the tasks that is part of your project is geocoding your
addresses, doing your geocoding first and making use of the cleaned (and
standardized) resulting addresses in the record linkage run will pay
dividends.  Patching up missing gender, getting non-alphabetic
characters out of names, standardizing phone numbers and anything else
you intend using for matching will help.  You will get more high-scoring
matches and fewer borderline matches.

I hope there's something in the above that is of some help to you.

Regards

Adrian Barnett

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Jason Almerigi
Sent: Tuesday, 8 July 2008 4:27 AM
To: [hidden email]
Subject: fuzzy data matching

Hi all,
I am often in the situation of matching school records with program
attendance records with self-identified student surveys.  Across each,
there
are no unique identifiers beyond first and last name, birthday, and
gender.
SPSS does not appear to be tolerant of errors, misspellings, name
variants
(Ed vs. Edward), etc..   On a good day, I can match 80% of records,
which
usually leaves me with a few thousand unmatched cases.

Do any of you have any syntax, software, or best practice
recommendations
for fuzzy data matching/deduplication? I've checked out a software
package
called MatchIt, which is pricey, but I see that there are a few programs
out
there.

Any help would be appreciated.
Jason

--
Jason B. Almerigi, Ph.D.
Community Evaluation and Research Center, University Outreach &
Engagement;
Michigan State University
36 Kellogg Center, Garden Level
East Lansing, MI 48824-1022
Office: 517-355-0135
Fax: 517-432-9541

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: fuzzy data matching

Albert-Jan Roskam
> It's got a
> manual only a computer scientist could love

==> That's not true. I've read the manual of both Febrl 0.3 and 0.4, and both are very readable. And I'm not a computer scientist. ;-) Contrary to v0.3, version 0.4  is GUI-based. However, it's still work in progress. After configuring the GUI, one cannot re-open a session and retain the settings. The user has to go back to the source code. I've mostly been using v0.3, and I am happy with it. Depending on the size of the datasets and the hardware used, however, you might run into constraints (i.e., memory errors). In v0.4, several improvements have been implemented, among which the BigMatch algoritm. That might solve things.

Febrl also has a listserv, but it's kinda silent. However, Peter Christen (the author of the program) often replies to any questions I ask on that list.

Cheers!!
Albert-Jan


--- On Tue, 7/8/08, Barnett, Adrian (DECS) <[hidden email]> wrote:

> From: Barnett, Adrian (DECS) <[hidden email]>
> Subject: Re: fuzzy data matching
> To: [hidden email]
> Date: Tuesday, July 8, 2008, 3:14 AM
> Hi Jason
> There are a few free options for doing what you want. You
> may first want
> to read up on the theory of probabilistic record linkage. I
> don't claim
> the following is the best introduction and overview there
> is, but it is
> clear and comprehensive and may help you evaluate your
> options:
>
> http://www.statistics.gov.uk/downloads/theme_other/GSSMethodology_No_25_
> v2.pdf
>
> There is a freebie called LinkPlusfor doing record linkage
> within a
> health context available from the Centers for Disease
> Control here:
>
> http://www.cdc.gov/cancer/npcr/tools/registryplus/lp_tech_info.htm
>
> There is also a free open-source record linkage application
> under
> development by the Data Mining Group at the Australian
> National
> University. It's written in Python and is called FEBRL.
> It's got a
> manual only a computer scientist could love but the
> software is very
> clever.  They have recently added a GUI to it but I've
> not used that
> yet.  It's still only at version 0.4, so there's
> lots more than plan to
> do to it.  However it does a pretty respectable job if you
> persevere
> with it.
>
> http://datamining.anu.edu.au/linkage.html
>
> (The link to FEBRL is at the bottom of the above page)
>
> Someone wrote a free plugin for SAS which does
> probabilistic record
> linkage, and there is a reference to it somewhere in the
> archives for
> this list, but I don't know anything else about it.
>
> You can use a limited version of a payware application
> called LinkageWiz
> on datasets of < 2000 records.  Look under the Download
> link on
> www.linkagewiz.com.  You may well have more than 2,000
> records, but it
> may be feasible to chop your data into chunks of <2,000
> based on the
> surname.
>
> Most payware record linkage software is very expensive -
> the Health Dept
> where I used to work paid north of AUD$60,000 for a copy of
> Automatch.
> There is cheaper stuff around. There's a comparison of
> cheaper
> alternatives (dated August 2004) here:
> http://www.chcf.org/topics/view.cfm?itemid=104595
>
> If your task is a one-off, and your time is limited, it may
> make sense
> to contract out the linkage/de-duplication task.  I have no
> comprehensive knowledge of what is available, but one I
> know of is
> LinkaeWiz, which charges US$175 setup and US$100 per
> 100,000 records for
> linking/de-duplicating plus a comprehensive clerical
> review. Details
> here:
>
> http://www.linkagewiz.com/services.htm
>
> One thing I've learnt is that the tools available for
> the 'clerical
> review' phase of a record linkage run can make a big
> difference to how
> onerous the task is.  (Clerical review is where you go back
> over the
> linked records and look for false positives, such as twins,
> other
> siblings, spouses, parents linked to children etc which you
> can easily
> get in some sorts of data,).
>
> You need some means of homing in on the best candidates to
> cut down the
> number of links to check, and an easy mechanism for
> unlinking.  A system
> which will allow you to look for sets of twins by
> specifying a search
> for (1)same surname (2) same address (3)same date of birth
> (4) different
> first name would help.  Ditto for parent-child matches
> where you could
> search for same, first name, surname, address, gender but
> different DOB.
> This is much easier than having to check all the
> high-scoring matches,
> and much safer than blindly accepting everything which
> scores above a
> certain match-score level.
>
> Once you've identified the false positives, a neat and
> easy way of
> telling the software to unlink one or more records from one
> or more
> others is an important, if trivial-sounding, feature.
>
> The clerical review phase is very important, but
> mind-numbingly tedious
> if you have a big data set, so anything the software can do
> to keep the
> process (and you) focussed, and make the task easier, is
> valuable.
>
> In my experience one other thing that makes a big
> difference to the
> degree of matching you get, and to the number of potential
> matches you
> have to check in the 'grey' area around the cutoff
> criterion you have
> set for matching scores, is how clean you have got your
> data before the
> matching process.  Although clever fuzzy matching
> algorithms help a lot,
> having your data clean beforehand makes the whole process
> work better.
> So, if one of the tasks that is part of your project is
> geocoding your
> addresses, doing your geocoding first and making use of the
> cleaned (and
> standardized) resulting addresses in the record linkage run
> will pay
> dividends.  Patching up missing gender, getting
> non-alphabetic
> characters out of names, standardizing phone numbers and
> anything else
> you intend using for matching will help.  You will get more
> high-scoring
> matches and fewer borderline matches.
>
> I hope there's something in the above that is of some
> help to you.
>
> Regards
>
> Adrian Barnett
>
> -----Original Message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]]
> On Behalf Of
> Jason Almerigi
> Sent: Tuesday, 8 July 2008 4:27 AM
> To: [hidden email]
> Subject: fuzzy data matching
>
> Hi all,
> I am often in the situation of matching school records with
> program
> attendance records with self-identified student surveys.
> Across each,
> there
> are no unique identifiers beyond first and last name,
> birthday, and
> gender.
> SPSS does not appear to be tolerant of errors,
> misspellings, name
> variants
> (Ed vs. Edward), etc..   On a good day, I can match 80% of
> records,
> which
> usually leaves me with a few thousand unmatched cases.
>
> Do any of you have any syntax, software, or best practice
> recommendations
> for fuzzy data matching/deduplication? I've checked out
> a software
> package
> called MatchIt, which is pricey, but I see that there are a
> few programs
> out
> there.
>
> Any help would be appreciated.
> Jason
>
> --
> Jason B. Almerigi, Ph.D.
> Community Evaluation and Research Center, University
> Outreach &
> Engagement;
> Michigan State University
> 36 Kellogg Center, Garden Level
> East Lansing, MI 48824-1022
> Office: 517-355-0135
> Fax: 517-432-9541
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body
> text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the
> command
> INFO REFCARD
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body
> text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the
> command
> INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: fuzzy data matching (2)

Albert-Jan Roskam
In reply to this post by Jerabek Jindrich
In Febrl one can use several phonetical mechanisms (Soundex, NYSIIS,metaphone...) to designed to match fields with 'speling erors'. Also, one can use reference lists (lookup tables) with correct names.

Cheers!!
Albert-Jan


--- On Tue, 7/8/08, Jerabek Jindrich <[hidden email]> wrote:

> From: Jerabek Jindrich <[hidden email]>
> Subject: Re:fuzzy data matching
> To: [hidden email]
> Date: Tuesday, July 8, 2008, 12:34 AM
> Hi Jason,
>
> matching poorly identified cases must be a hard work. I do
> not have a solution,  only an idea:
>
> First of all, misspells and typing errors are much more
> likely in names than in gender and birth date. You might
> try to use only the first one/two letter (s) of first and
> last name plus gender plus birth date. There will probably
> remain some mismatched and duplicity, hopefuly it could
> reduce the number of mismatched cases.
>
> Do you have a correct list of your students, with names,
> gender etc. ? Because it could be easier to match the
> poorly self identified data to a correct list, than trying
> to match two wrongly identified datasets.
>
> Of course not only unmatched cases are problem, all files
> should be checked for duplicity in identifiers and for
> missing values /blanks in ID variables.
>
> good luck
> Jindra
>
>
> > ------------ Původní zpráva ------------
> > Od: Jason Almerigi <[hidden email]>
> > Předmět: fuzzy data matching
> > Datum: 07.7.2008 21:11:07
> > ----------------------------------------
> > Hi all,
> > I am often in the situation of matching school records
> with program
> > attendance records with self-identified student
> surveys.  Across each, there
> > are no unique identifiers beyond first and last name,
> birthday, and gender.
> > SPSS does not appear to be tolerant of errors,
> misspellings, name variants
> > (Ed vs. Edward), etc..   On a good day, I can match
> 80% of records, which
> > usually leaves me with a few thousand unmatched cases.
> >
> > Do any of you have any syntax, software, or best
> practice recommendations
> > for fuzzy data matching/deduplication? I've
> checked out a software package
> > called MatchIt, which is pricey, but I see that there
> are a few programs out
> > there.
> >
> > Any help would be appreciated.
> > Jason
> >
> > --
> > Jason B. Almerigi, Ph.D.
> > Community Evaluation and Research Center, University
> Outreach & Engagement;
> > Michigan State University
> > 36 Kellogg Center, Garden Level
> > East Lansing, MI 48824-1022
> > Office: 517-355-0135
> > Fax: 517-432-9541
> >
> > =====================
> > To manage your subscription to SPSSX-L, send a message
> to
> > [hidden email] (not to SPSSX-L), with no
> body text except the
> > command. To leave the list, send the command
> > SIGNOFF SPSSX-L
> > For a list of commands to manage subscriptions, send
> the command
> > INFO REFCARD
> >
> >
> >
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body
> text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the
> command
> INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: fuzzy data matching

Dennis Deck
In reply to this post by Jason Almerigi
Note that the Link Plus package offered by CDC that Adrian Barnett mentioned is a) free, b) readily available for download, c) easy to use, d) flexible, and e) does an excellent job.  Odds are good that this will fit your needs.  And it will handle the problem of matching names.

A colleague compared it to a package he developed in SAS which incorporated both deterministic and probabilistic methods and found that Link Plus held up very well.  See:  Campbell, Deck, Krupski (2008) Record linkage software in the public domain: a comparison of Link Plus, The Link King, and a `basic' deterministic algorithm. Health Informatics Journal, Vol. 14, No. 1, 5-15.  He has made the Link King package available (http://www.the-link-king.com/) free of charge as well - but note that it requires SAS.  (I considered writing an SPSS version but quickly realized it would require links to database software to work and would likely be slow.)  One advantage of Link King is some added attention to linking names in the deterministic routines.

We have just started using Link Plus for a particular application.  We were impressed at its flexibility.  We have found it easy to exchange data with SPSS (recommend writing files to be linked as comma or tab separated value format with variable names in first row).  My only complaint is that the package would benefit from a separate manual that went into a bit more depth on the linkage options (there is decent online help but no manual) .  

Any linkage effort will only be as good as the variables you have in common between the files and you only list 4.  Regardless of the package selected, you will need to judge where to draw the line - any probabilistic software package will provide a continuum of matches from very good to questionable and will represent this with a score.   You will need to decide where to set the cut off for your particular data set, balancing the trade off between increasing the number of matches against the risk of incorrect matches.  Seems like a low cutoff will work in your situation as one file is a subset of the other..

Dennis Deck, PhD
RMC Research Corporation
111 SW Columbia Street, Suite 1200
Portland, Oregon 97201-5843
voice: 503-223-8248 x715
voice: 800-788-1887 x715
fax:  503-223-8248
[hidden email]

====================To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD