Matching Problem

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Matching Problem

drfg2008
Matching Problem

From time to time we have records that are to be connected by means of a key variable (two data sets each time). The key variable is in the STRING format. Unfortunately, the key variables in both data sets are not written exactly the same, there are slight differences in the spelling. How can the records be matched?

Example:

Record ID 1:

PeterMueller21.06.2012
AndreasMaier12.08.1983
LiseMeitner03.10.1961
MaxMeier01.10.1956
SusanneSchneider1981

Record ID 2:

PeterMuller21.06.2012
AndreasMayer12.08.1983
LiseMeitner3.101961
M.Meier01.10.1956

Obviously, there are three persons who are respectively detected in data set 1 and data set 2. But how can I assign the records correctly despite the slight differences in spelling.

For example:

PeterMueller21.06.2012 = PeterMuller21.06.2012
AndreasMaier12.08.1983 = AndreasMayer12.08.1983
LiseMeitner03.10.1961 = LiseMeitner3.101961
MaxMeier01.10.1956 = M.Meier01.10.1956
SusanneSchneider1981 -> no match

Is there a way to solve the problem in SPSS?
Dr. Frank Gaeth

Reply | Threaded
Open this post in threaded view
|

Re: Matching Problem

Rich Ulrich
You illustrate a mixture of at least three problems, and these want at
least a couple of different solutions. 

"LiseMeitner" shows a (computer) correctable date record in #2. 
That is a little bit of tedious coding, but you want to pick out the
dates and correct them.... last 4, next 2, next 2 (or pad). 

What do you intend to do with non-matched records, anyway?
Is there supposed to be a match for every record? - If in that case, you
*might* re-write the keys for non-matches so that you can use date-only,
and inspect those date-only matches for fidelity. 

Depending on volume and frequency of problems, you might feel
okay with correcting all the others by sorting out all non-matched records,
listing these non-matches as one file or two, and dealing with them by hand. 
 - SusanneSchneider has an uncorrectable date error.  Do you set that
aside until you have info?
 - M.Meier will never reliably match MaxMeier.  Do you want to define a
key with surname+date?  and try that?

--
Rich Ulrich

> Date: Wed, 5 Jun 2013 14:56:26 -0700

> From: [hidden email]
> Subject: Matching Problem
> To: [hidden email]
>
> Matching Problem
>
> From time to time we have records that are to be connected by means of a key
> variable (two data sets each time). The key variable is in the STRING
> format. Unfortunately, the key variables in both data sets are not written
> exactly the same, there are slight differences in the spelling. How can the
> records be matched?
>
> Example:
>
> Record ID 1:
>
> PeterMueller21.06.2012
> AndreasMaier12.08.1983
> LiseMeitner03.10.1961
> MaxMeier01.10.1956
> SusanneSchneider1981
>
> Record ID 2:
>
> PeterMuller21.06.2012
> AndreasMayer12.08.1983
> LiseMeitner3.101961
> M.Meier01.10.1956
>
> Obviously, there are three persons who are respectively detected in data set
> 1 and data set 2. But how can I assign the records correctly despite the
> slight differences in spelling.
>
> For example:
>
> PeterMueller21.06.2012 = PeterMuller21.06.2012
> AndreasMaier12.08.1983 = AndreasMayer12.08.1983
> LiseMeitner03.10.1961 = LiseMeitner3.101961
> MaxMeier01.10.1956 = M.Meier01.10.1956
> SusanneSchneider1981 -> no match
>
> Is there a way to solve the problem in SPSS?
> ...
Reply | Threaded
Open this post in threaded view
|

Re: Matching Problem

Jon K Peck
This is a classic problem.  If you have SPSS Modeler and the Entity Analytics nodes, it can do a very sophisticated match up using multiple variables.  There is other specialized software for name matching, of course.

In Statistics, one approach would be to use the soundex or nysiis algorithm to create a canonical form of the name and then match on that.  nysiis is better than soundex, although soundex is better known.  soundex was designed for English names and may not work well on others.  I don't know how well nysiis will work on German names.  You would want to strip off the dates for computing the nysiis coding.

Of course, the coding can make two names that are actually different turn into the same thing.

You can compute the nysiis encoding using the SPSSINC TRANS extension command along with the extendedTransforms module from the SPSS Community site.

Here is an example.
data list fixed /name(a20).
begin data
PeterMueller
AndreasMaier
LiseMeitner
MaxMeier
SusanneSchneider
end data
dataset name names.
spssinc trans result = nysiisname type=20
/formula "extendedTransforms.nysiis(name)".
list.


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621




From:        Rich Ulrich <[hidden email]>
To:        [hidden email],
Date:        06/05/2013 04:25 PM
Subject:        Re: [SPSSX-L] Matching Problem
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




You illustrate a mixture of at least three problems, and these want at
least a couple of different solutions.  

"LiseMeitner" shows a (computer) correctable date record in #2.  
That is a little bit of tedious coding, but you want to pick out the
dates and correct them.... last 4, next 2, next 2 (or pad).  

What do you intend to do with non-matched records, anyway?
Is there supposed to be a match for every record? - If in that case, you
*might* re-write the keys for non-matches so that you can use date-only,
and inspect those date-only matches for fidelity.  

Depending on volume and frequency of problems, you might feel
okay with correcting all the others by sorting out all non-matched records,
listing these non-matches as one file or two, and dealing with them by hand.  
- SusanneSchneider has an uncorrectable date error.  Do you set that
aside until you have info?
- M.Meier will never reliably match MaxMeier.  Do you want to define a
key with surname+date?  and try that?

--
Rich Ulrich

> Date: Wed, 5 Jun 2013 14:56:26 -0700
> From: [hidden email]
> Subject: Matching Problem
> To: [hidden email]
>
> Matching Problem
>
> From time to time we have records that are to be connected by means of a key
> variable (two data sets each time). The key variable is in the STRING
> format. Unfortunately, the key variables in both data sets are not written
> exactly the same, there are slight differences in the spelling. How can the
> records be matched?
>
> Example:
>
> Record ID 1:
>
> PeterMueller21.06.2012
> AndreasMaier12.08.1983
> LiseMeitner03.10.1961
> MaxMeier01.10.1956
> SusanneSchneider1981
>
> Record ID 2:
>
> PeterMuller21.06.2012
> AndreasMayer12.08.1983
> LiseMeitner3.101961
> M.Meier01.10.1956
>
> Obviously, there are three persons who are respectively detected in data set
> 1 and data set 2. But how can I assign the records correctly despite the
> slight differences in spelling.
>
> For example:
>
> PeterMueller21.06.2012 = PeterMuller21.06.2012
> AndreasMaier12.08.1983 = AndreasMayer12.08.1983
> LiseMeitner03.10.1961 = LiseMeitner3.101961
> MaxMeier01.10.1956 = M.Meier01.10.1956
> SusanneSchneider1981 -> no match
>
> Is there a way to solve the problem in SPSS?
> ...

Reply | Threaded
Open this post in threaded view
|

Re: Matching Problem

Barnett, Adrian (DECD)
In reply to this post by drfg2008
Hi Frank,
Your problem is essentially the same as that faced in probabilistic record linkage.

I would be inclined to treat it in two steps. In the first, I'd do what Rich suggests and make use of the fact that the date portion of the string is correctable using the tools in SPSS. So first I'd create a new variable just for the date string and clean and standardize that. I think there's a date wizard you can use for that.

I'd then create a new variable for the name portion of the original string, splitting that into two fields of given name and surname/family name.

Next I'd use one of the free record linkage tools to do probabilistic record linkage on the three fields you now have, date, surname and given name. You should wind up with three types of matches - exact, partial and unmatched. You can comb through the partial matches and assign them to one of the other two groups. The partial matches will be of varying degree of closeness of match, so you might decide to assign everything above a certain level to the 'exact' group without further inspection, and only "eyeball" those below that level. You could choose the level above which you will accept a match without inspection based on an initial pass over the data.

Suggestions for free probabilistic matching:
- FEBRL (requires Python. Available from http://sourceforge.net/projects/febrl/ )
- LinkageWiz (free for datasets < 10,000 cases). Available from http://www.linkagewiz.net/
- LinkPlus , from the CDC. Available here: http://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm



Regards,

Adrian


Adrian Barnett
Project Officer
Educational Measurement and Analysis
Data and Information Systems
Department for Education and Child Development

"Children and young people are at the centre of everything we do"

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of drfg2008
Sent: Thursday, 6 June 2013 7:26 AM
To: [hidden email]
Subject: Matching Problem

Matching Problem

From time to time we have records that are to be connected by means of a key variable (two data sets each time). The key variable is in the STRING format. Unfortunately, the key variables in both data sets are not written exactly the same, there are slight differences in the spelling. How can the records be matched?

Example:

Record ID 1:

PeterMueller21.06.2012
AndreasMaier12.08.1983
LiseMeitner03.10.1961
MaxMeier01.10.1956
SusanneSchneider1981

Record ID 2:

PeterMuller21.06.2012
AndreasMayer12.08.1983
LiseMeitner3.101961
M.Meier01.10.1956

Obviously, there are three persons who are respectively detected in data set
1 and data set 2. But how can I assign the records correctly despite the slight differences in spelling.

For example:

PeterMueller21.06.2012 = PeterMuller21.06.2012
AndreasMaier12.08.1983 = AndreasMayer12.08.1983
LiseMeitner03.10.1961 = LiseMeitner3.101961
MaxMeier01.10.1956 = M.Meier01.10.1956
SusanneSchneider1981 -> no match

Is there a way to solve the problem in SPSS?




-----
Dr. Frank Gaeth
FU-Berlin

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Matching-Problem-tp5720600.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD