Re: Matching records

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: Matching records

Nicholls, Natalie
On Wed, 14 Jun 2006 00:06:06 -0700, Dennis Deck <[hidden email]> wrote:

>However, it was developed entirely in SAS and makes extensive use of
>features not available in SPSS.  It does not match on address
>information.

Further to this use of LinkKing, I was wondering if there was a similar way
at all of matching on 'fuzzy' variables in SPSS.

I have a number of clients in our database, many of which present more then
once, and I am aiming to determine the duplicates within the database.
Currently I am matching on gender, dob, surname and forename OR gender, dob,
surname and postcode sector.

Is there a way within SPSS that I can match on gender, dob, surname and
forename or postcode? So that if, for example, the forename was given as
Bill (B) in one instance, and William (W), but the postcode was the same,
SPSS would recognise these as the same client. Likewise if two identical
forenames were present, but the postcodes were different, again they would
be recognised as belonging to the same client.

Thanks in advance,

Natalie Nicholls
Reply | Threaded
Open this post in threaded view
|

Re: Matching records

Bob Schacht-3
At 12:29 AM 12/13/2006, Natalie Nicholls wrote:

>On Wed, 14 Jun 2006 00:06:06 -0700, Dennis Deck <[hidden email]> wrote:
>
> >However, it was developed entirely in SAS and makes extensive use of
> >features not available in SPSS.  It does not match on address
> >information.
>
>Further to this use of LinkKing, I was wondering if there was a similar way
>at all of matching on 'fuzzy' variables in SPSS.
>
>I have a number of clients in our database, many of which present more then
>once, and I am aiming to determine the duplicates within the database.
>Currently I am matching on gender, dob, surname and forename OR gender, dob,
>surname and postcode sector.
>
>Is there a way within SPSS that I can match on gender, dob, surname and
>forename or postcode? So that if, for example, the forename was given as
>Bill (B) in one instance, and William (W), but the postcode was the same,
>SPSS would recognise these as the same client. Likewise if two identical
>forenames were present, but the postcodes were different, again they would
>be recognised as belonging to the same client.

Natalie,
Your task is similar to what the SPSS function from the menu DATA/Identify
Duplicate Cases does, except that your list includes an "or". Just a
cautionary tale from this corner.

Using the Duplicate Cases function, I tried some months ago to identify
matching cases after merging two files that had a key variable present in
about 95% of cases. I was trying to match up the last 5%.  I used a list of
demographic variables such as yours, but with AND-- in other words, all
conditions had to match. I was surprised at how many "matches" (that were
not really matches) emerged. This was a more stringent match standard than
you are proposing. So just because it seems reasonable to suppose that
cases matching on gender, dob, surname, forename and postcode should be the
same person, don't count on it.

Second, How fuzzy are you willing to let your logic get? If you get
duplicates with all 5 such demographic variables, with fuzzier logic you'll
wind up with more matches. I suggest you first "Identify duplicate cases"
with all of your demographic variables named, and see how that works.
Settle all matches that you can by that method first.

Then for the remainder you will have to decide how much of a match is good
enough? My guess is that you can't entirely automate the process, and will
still need to examine the potentially paired cases "by hand" as it were.

In your case, since you anticipate variability in forename,  you can
"Identify duplicate cases" specifying gender, dob, surname,  and postcode,
but not forename. You'll then have to "eyeball" the "duplicates" and make a
judgment about whether similar forenames are, or are not, equivalent.  Some
equivalents are relatively obvious, such as your cases of "Bill" vs.
"William", but others are less so. What about Billie, Billy, or Willie? It
will often be a judgment call that you cannot make hard and fast rules about.

Or more generally, with a fuzzy logic list involving "or," how many do you
need to make a match? 4 of 5 criteria? 3 of 5? So I come back to the
question of how much fuzziness you are willing to tolerate.

Bob in HI

Robert M. Schacht, Ph.D. <[hidden email]>
Pacific Basin Rehabilitation Research & Training Center
1268 Young Street, Suite #204
Research Center, University of Hawaii
Honolulu, HI 96814
Reply | Threaded
Open this post in threaded view
|

Re: Matching records

Richard Ristow
In reply to this post by Nicholls, Natalie
At 05:29 AM 12/13/2006, Natalie Nicholls wrote:

>I was wondering if there was a similar way at all of matching on
>'fuzzy' variables in SPSS.
>
>I have a number of clients in our database, many of which present more
>than once, and I am aiming to determine the duplicates. Currently I am
>matching on gender, dob, surname and forename OR gender, dob, surname
>and postcode sector.
>
>Is there a way within SPSS that I can match on gender, dob, surname
>and forename or postcode? So that if, for example, the forename was
>given as Bill (B) in one instance, and William (W), but the postcode
>was the same, SPSS would recognise these as the same client?

This is difficult, I think inherently difficult. Years back I did a
fairly complicated one in SAS. The problems in SPSS would be similar,
though in SAS,

a. PROC SQL supports many-to-many merges, which I think are necessary
for this, and which are awkward in SPSS

b. PROC SQL, with the script-like features of SAS's 'macro' facility,
also makes transitive closure easier than it would be in SPSS, though a
Python loop in SPSS would probably help. ("Transitive closure": If A is
the same person as B, and B is the same person as C, then A is the same
person as C.)

There are two inherent problems.

A. For exact matching on a key, there's a trick we're so used to we
forget how important it is: sort the file by the key, and matching
records will be adjacent in the file. There's no equivalent for fuzzy
merging, no approximate keys on which to sort.

B. There are no firm rules for when a fuzzy merge is 'correct'. It
sounds like you have a set of rules you like, though you have to be
careful: matching forename (first name) but not postcode will give you
many, many false matches unless the two also match on several other key
items.

The program I wrote did it in three steps:

1. Identify pairs of entities (mine were nursing homes) that MIGHT be
the same, because of exact match on some set of keys. In your case, I
might identify such 'candidate' pairs as a match on any of
. DOB
. DOB, month and day only?
. Surname
. Postcode? That might give you many false matches to check. Postcode
and forename?

This selection of 'candidate' matches is inherently many-to-many:
everybody with one DOB has to be checked against everybody else with
the same DOB. And the logic has to be broad; the candidates will
include many, many false matches.

2. Evaluate the candidate matches; more or less, score points for how
closely they match on each characteristic you have. Match on DOB will
score a good deal; match on postcode, moderately; match on gender, very
little. This is where Soundex techniques and the like (I think some are
available in Python) could help: score points for surnames that aren't
identical but are similar.

3. Given these evaluations, discard some pairs as very improbable;
accept some, as almost certain; and make a list that will need human
review, and acceptance or rejection. The human review, by the way, will
give you a better sense what key matches are useful, and what are not.
And maybe some special cases: 'Bill' as a forename matches 'William'.

4. Transitive closure, as described above. That may also cause you to
reject some matches: if A doesn't look at all like C, than either
identifying A with B, or B with C, is probably wrong.

That's a lot, but I don't see a quicker way. Does anybody?
Reply | Threaded
Open this post in threaded view
|

Re: Matching records

Dennis Deck
In reply to this post by Nicholls, Natalie
There are two main approaches:
- Apply a deterministic algorithm that allows a certain number of
matches and allows a mismatch or two
- Apply a probabilistic linkage algorithm that sums weights for each
match and mismatch according to the information value provided (eg.,
'Smith' is a common name so provides less information than a less common
name like 'Deck' is uncommon.  Gender has only two possible values so
helps mainly as a tie breaker.)  See literature on Probabilistic Linkage
for the formula.

In SPSS the comparison algorithm is not hard to implement but other
parts of the process like parsing the names, handling nicknames,
controlling the comparison of all possible (or probable) pairs would
probably require Python or some shortcuts.

You might want to check out existing linkage software (some are free
like LinkPlus from CDC) to do the linkages.

Dennis Deck, PhD
RMC Research Corporation
[hidden email]

-----Original Message-----
From: Natalie Nicholls [mailto:[hidden email]]
Sent: Wednesday, December 13, 2006 2:30 AM
Subject: Re: Matching records

On Wed, 14 Jun 2006 00:06:06 -0700, Dennis Deck <[hidden email]>
wrote:

>However, it was developed entirely in SAS and makes extensive use of
>features not available in SPSS.  It does not match on address
>information.

Further to this use of LinkKing, I was wondering if there was a similar
way
at all of matching on 'fuzzy' variables in SPSS.

I have a number of clients in our database, many of which present more
then
once, and I am aiming to determine the duplicates within the database.
Currently I am matching on gender, dob, surname and forename OR gender,
dob,
surname and postcode sector.

Is there a way within SPSS that I can match on gender, dob, surname and
forename or postcode? So that if, for example, the forename was given as
Bill (B) in one instance, and William (W), but the postcode was the
same,
SPSS would recognise these as the same client. Likewise if two identical
forenames were present, but the postcodes were different, again they
would
be recognised as belonging to the same client.

Thanks in advance,

Natalie Nicholls
Reply | Threaded
Open this post in threaded view
|

Re: Matching records

Nicholls, Natalie
In reply to this post by Nicholls, Natalie
Thank you all for your suggestions, I will persevere this end with them, and
see how well it goes, and update the list when (if) we have a final
solution.

Regards,

Natalie

_________________________________________________________________
NHS National Services Scotland Disclaimer

The information contained in this message may be confidential or
legally privileged and is intended for the addressee only. If you have
received this message in error or there are any problems please notify the
originator immediately. The unauthorised use, disclosure, copying or
alteration of this message is strictly forbidden.
_________________________________________________________________
Reply | Threaded
Open this post in threaded view
|

Re: Matching records

Hal 9000
Natalie,

I think if I had to do something like this TODAY, I'd do it like this
(....& taking notes on what's already been said....):

new file.
data list free /id(f4) name(a20) sex(a1) city(a20) state(a2).
begin data
1234 'Smith, Sally' 'F' 'Santa Rosa' 'CA'
1233 'Smith, Sally S.' 'F' 'Novato' 'CA'
1232 'Smith ,John' 'M' 'Novato' 'CA'
1231 'Smith, John' 'M' 'San Rafael' 'CA'
1230 'Brown,Terry' 'F' 'Mill Valley' 'CA'
1229 'Brown, Terry' 'M' 'Mill Valley' 'CA'
1228 'Brownie, Terry' 'M' 'Mill Valley' 'CA'
1227 'Watson, Jim' 'M' 'San Francisco' 'CA'
end data.
dataset name A.

* compute indicator for same last name & same sex .
string LN (a20).
compute LN = rtrim(substr(name,1, index(name, ',') - 1)).
sort cases by LN (a) sex(a) state(a) city(a) id(a).
if (ln = lag(ln) & sex = lag(sex)) i1 = 1.
sort cases by LN(d) sex(d) state(d) city(d) id(d).
if (ln = lag(ln) & sex = lag(sex)) i1 = 1.
exe.

* (don't run the syntax beyond this point) .

* compute the other indicator vars - i2, i3.....to i10 or beyond .
compute Tot = sum (i1 to i10).

* 'Tot' could serve as the score that Richard suggested .

* and then .
sort cases by tot(d) name(a) state(a).

All the likely matches are grouped at the top of the file, then maybe you
can break out that subset with:
dataset copy B.
dataset activate B.
select if Tot > 0.
exe.

This solution is not sophisticated, but the logic is transparent and it's
easy to modify - add or subtract indicators or adjust the logic for a single
one. I'm sure other list members can suggest improvements or altogether
better methods (please do!).

-Gary