Dear SPSS Listserv recipients,
Does SPSS have any modules that do probability matching? This task, often referred to as "fuzzy matching," determines approximate matches based upon patterns in string variables when an "exact" match is not possible. I am seeking to match two databases that will include the following string variables for people: names, address, gender, ethnicity, date of birth, and phone. If SPSS does not have a module for this, or not a great one, can you recommend another product?

Ideally, this software could do two things:
-- Standardize fields intended for matching
-- Produce probability matches (most important)

PATRICK
Los Angeles, California
Patrick Burns, Senior Researcher
Economic Roundtable
315 W. 9th Street, Suite 502
Los Angeles, CA, 90015-4200
http://www.economicrt.org
SPSS does not have a module dedicated to this process (there is a whole industry that specializes in this area), but there are a number of features in the programmability supplemental modules available on SPSS Developer Central (www.spss.com/devcentral) that can help if you want to build your own solution. These require SPSS 15.0.1 with the Python programmability plug-in.
- The extendedTransforms module provides a number of functions useful for standardizing names and other strings. The soundex and nysiis functions compute name transforms that help to abstract from spelling and similar differences. The levenshteindistance function computes a distance metric between two strings. The regular expression capabilities can be useful in extracting features from strings where you can define them via patterns. So, for example, you could write a regular expression that would do a reasonably good job of extracting the core of a street name from an address.
- Once you have done this sort of preparation, you would merge the two databases and look for duplicates using Identify Duplicate Cases and other facilities. IDC is pretty flexible in how you define duplicates. It doesn't do "fuzzy", so you would handle that with the preliminary transforms.

HTH,
Jon Peck
SPSS
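To illustrate the two string-preparation ideas Jon mentions, here is a minimal plain-Python sketch of a Levenshtein edit distance and a crude street-name "core" extractor. This is an illustration of the techniques, not the extendedTransforms module's actual API; the regex pattern and sample values are assumptions.

import re

def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ca != cb)))    # substitution
        previous = current
    return previous[-1]

# Crude street-name core: drop the house number and a trailing street-type word.
STREET_TYPES = r"(STREET|ST|AVENUE|AVE|BOULEVARD|BLVD|ROAD|RD|DRIVE|DR|LANE|LN)"

def street_core(address):
    s = address.upper().strip()
    s = re.sub(r"^\d+\s+", "", s)                    # leading house number
    s = re.sub(r"\s+%s\.?$" % STREET_TYPES, "", s)   # trailing street type
    return s

print(levenshtein("CATHERINE", "KATHRYN"))   # 4
print(street_core("315 W. 9th Street"))      # W. 9TH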
In reply to this post by Patrick Burns
Hi Patrick

What you describe is also known as probabilistic data linkage. I know of nothing in SPSS which can help, although no doubt something could in principle be written in Python to do it. This may have been what Jon Peck was telling you in his post, but as you probably saw, his post got turned into word-salad.

I understand the major commercial record-linkage packages are quite expensive. There is a very basic freebie on the CDC website here:
http://ftp.cdc.gov/pub/Software/RegistryPlus/Link_Plus/Link%20Plus.htm
I've not used it, and despite it being very basic, it may do what you need.

Someone a year or so ago published a reference here to a freebie plugin to SAS which did record linkage. I'm sorry, but I didn't keep a reference to it. You may find it in the list archives.

There is an open-source record-linkage project, based on Python, called FEBRL, run out of the Computer Science Department at the Australian National University. I have played with it, but it was version 0.3 and has a lot of work to be done yet. It is very clever, though. Sadly it has a manual only a computer scientist could love, which I suspect will be its greatest obstacle to adoption unless it is improved later. At present a great deal is left for the reader to interpolate or deduce for themselves. However, the project is very much deserving of success. The group's website is here:
http://datamining.anu.edu.au/linkage.html
and the link to FEBRL is low down on the page.

There is a review below of some commercial products that are not priced at 'corporate data centre' levels:
http://www.chcf.org/topics/view.cfm?itemID=104595
Of the products reviewed on the above link, I've used LinkageWiz and found it very good and well supported (possibly better than most products, as I know the author and he worked in my building!). On more than one occasion (this was with an earlier version) he had bugfixes to me the next morning after my emailing the problem to him! Performance is good - I got 400,000 records linked in 5 minutes on a 3.2 GHz P4. The author has run linkage/de-duplication runs involving over a million records on his old laptop.

There is another set of commercial products here:
http://www.intelligentsearch.com/
I know nothing about them, however.

If there's one thing I learnt from my use of record-linkage software, it is that the tools used after the linkage has been run are just as important as the intelligence and customisability of the linkage tool itself. In data from operational systems - especially from the health and human services arena - you tend to get lots of false positives resulting from twins, other siblings and parent-child links.
You need to review the data for such links and have a tool for unlinking one or more records from a group of 2 or more that scored above the 'hit' criterion and appear to be the same person. To do this properly, the review tool needs to allow you to search for likely false positives such as 'matches on everything but sex' (to pick up twin brother-sister pairs) or 'matches on everything but date of birth' (to pick up a child named after a parent). These links tend to score very highly, and so it is a mistake to assume that just because a group has been formed from records with very high scores, they must all be good matches. You need to check the high scores, and to do so you need something which will narrow the search to a manageable size.

Similarly, you need to be able to review cases with linkage scores down near the cutoff level, because the closer you get to the cutoff, the more likely you are to find purely spurious matches, and you need to be able to nominate each of these as a non-match. Once this is done - and this applies equally to the previous lot of false positives - they need to get assigned a new unique case-id (to save you having to keep track of what numbers are presently allocated).

Of course you can get by without the review-and-reassign tools, but it would be very hard and error-prone! I think this is likely to be where the commercial products may have the edge over the freebies. This may not matter if your data set is small, but, depending on the data, if it is large, you could wind up with a very time-consuming job. The costs of false positives in your data are another factor which will determine how much weight to give the task of clerical review.

Hope something in here may be of use to you.

Regards
Adrian Barnett
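Adrian's 'matches on everything but one field' checks are straightforward to express in code. Here is a minimal Python sketch; the record layout (dictionaries with these particular field names) is a hypothetical assumption, not the format of any of the products mentioned.

# Flag linked pairs that agree on everything except one suspect field,
# e.g. sex (twin brother-sister) or dob (child named after a parent).
FIELDS = ["surname", "given_name", "dob", "sex", "address"]

def mismatched_fields(rec_a, rec_b):
    """Return the fields on which two linked records disagree."""
    return [f for f in FIELDS if rec_a[f] != rec_b[f]]

def likely_false_positives(linked_pairs, suspect_field):
    """Pairs whose only disagreement is the one suspect field."""
    return [(a, b) for a, b in linked_pairs
            if mismatched_fields(a, b) == [suspect_field]]

pairs = [
    ({"surname": "SMITH", "given_name": "ALEX", "dob": "1990-03-01",
      "sex": "M", "address": "12 HIGH ST"},
     {"surname": "SMITH", "given_name": "ALEX", "dob": "1990-03-01",
      "sex": "F", "address": "12 HIGH ST"}),
]
print(likely_false_positives(pairs, "sex"))  # flags the possible twin pair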
In reply to this post by Patrick Burns
Hi Patrick
I overlooked part of your query - standardisation of the variables used in linking. This is definitely an important task, and in my experience having clean and standardised variables makes a huge difference. As an experiment I calculated linkage scores for the same data set, with and without data cleaning. The right-hand peak got higher and further to the right, and the trough between the peaks got deeper. It proved there was a very worthwhile reduction in the number of dubious matches needing checking if the data are cleaned first.

LinkageWiz (LW) definitely does this. With sex, it will standardise coding, but will also attempt to guess sex from the given names where sex is missing. It does very well on this, although I did find that Asian and Middle Eastern names threw it a bit. Similarly, unusual made-up names can be tricky for it. It does standardise addresses - certainly Australian and French formats - but I'm not sure about North American formats. FEBRL does attempt address parsing and standardising, and is very clever at recognising and removing embedded names of houses from anywhere in the address string (these are a major pain and I really wish people would not include them when giving their address!)

LW will clean phone numbers and social security/Medicare numbers by removing non-digits. If memory serves, in LW names are cleaned by the removal of characters outside the range A-Z and casting into upper case.

Hope this helps

Regards
Adrian Barnett
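The cleaning rules Adrian recalls for LinkageWiz (names reduced to upper-case A-Z, phone and ID numbers reduced to digits) are simple to sketch. This is plain Python for illustration, not LinkageWiz itself:

import re

def clean_name(name):
    """Cast to upper case and drop everything outside A-Z."""
    return re.sub(r"[^A-Z]", "", name.upper())

def clean_phone(phone):
    """Keep digits only."""
    return re.sub(r"\D", "", phone)

print(clean_name("O'Brien-Smyth "))   # OBRIENSMYTH
print(clean_phone("(213) 555-0147"))  # 2135550147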
In reply to this post by Patrick Burns
The SAS-based package you mention is likely Link King
(http://the-link-king.com/). It is a slick package developed by Dr. Kevin Campbell (Division of Alcohol and Substance Abuse, Washington state's Dept of Social and Health Services). It integrates both probabilistic and deterministic algorithms, with better results than either alone (we have an article in press with a comparison to Link Plus). It has been widely used on large statewide administrative databases (Medicaid eligibility, arrests, substance abuse and mental health treatment, mortality, etc.).

Link King requires SAS but has a user-friendly front end that minimizes what you need to know about SAS. However, it does assume a fixed list of identifiers, one that does not include address or phone, which might be useful in some projects. A much earlier version can be found on the federal (US) Substance Abuse and Mental Health Services Administration (SAMHSA) web site, but Link King has a far superior interface and many enhancements.

Dennis Deck, PhD
RMC Research Corporation
[hidden email]
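The combination Dennis describes - a deterministic pass backed by probabilistic scoring - can be outlined as below. This is a plain-Python sketch of the general technique; the field names, weights and threshold are hypothetical, not Link King's actual algorithm or defaults.

def deterministic_match(a, b):
    """Pass 1: accept only exact agreement on strong identifiers."""
    return all(a[f] == b[f] for f in ("surname", "given_name", "dob", "sex"))

# Hypothetical agreement weights: more discriminating fields score higher.
WEIGHTS = {"surname": 4.0, "given_name": 3.0, "dob": 5.0, "sex": 1.0}
THRESHOLD = 8.0

def probabilistic_score(a, b):
    """Pass 2: sum agreement weights over fields that match."""
    return sum(w for f, w in WEIGHTS.items() if a[f] == b[f])

def link(a, b):
    if deterministic_match(a, b):
        return "definite link"
    return "probable link" if probabilistic_score(a, b) >= THRESHOLD else "non-link"

a = {"surname": "SMITH", "given_name": "ALEX", "dob": "1990-03-01", "sex": "M"}
b = {"surname": "SMITH", "given_name": "ALEX", "dob": "1990-03-01", "sex": "F"}
print(link(a, b))  # "probable link": score 12.0 passes, though sex disagrees

Pairs like the one above - high-scoring but disagreeing on a single field - are exactly the candidates Adrian's clerical review step is meant to catch.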