On Wed, 14 Jun 2006 00:06:06 -0700, Dennis Deck <[hidden email]> wrote:
>However, it was developed entirely in SAS and makes extensive use of >features not available in SPSS. It does not match on address >information. Further to this use of LinkKing, I was wondering if there was a similar way at all of matching on 'fuzzy' variables in SPSS. I have a number of clients in our database, many of which present more then once, and I am aiming to determine the duplicates within the database. Currently I am matching on gender, dob, surname and forename OR gender, dob, surname and postcode sector. Is there a way within SPSS that I can match on gender, dob, surname and forename or postcode? So that if, for example, the forename was given as Bill (B) in one instance, and William (W), but the postcode was the same, SPSS would recognise these as the same client. Likewise if two identical forenames were present, but the postcodes were different, again they would be recognised as belonging to the same client. Thanks in advance, Natalie Nicholls |
At 12:29 AM 12/13/2006, Natalie Nicholls wrote:
>On Wed, 14 Jun 2006 00:06:06 -0700, Dennis Deck <[hidden email]> wrote: > > >However, it was developed entirely in SAS and makes extensive use of > >features not available in SPSS. It does not match on address > >information. > >Further to this use of LinkKing, I was wondering if there was a similar way >at all of matching on 'fuzzy' variables in SPSS. > >I have a number of clients in our database, many of which present more then >once, and I am aiming to determine the duplicates within the database. >Currently I am matching on gender, dob, surname and forename OR gender, dob, >surname and postcode sector. > >Is there a way within SPSS that I can match on gender, dob, surname and >forename or postcode? So that if, for example, the forename was given as >Bill (B) in one instance, and William (W), but the postcode was the same, >SPSS would recognise these as the same client. Likewise if two identical >forenames were present, but the postcodes were different, again they would >be recognised as belonging to the same client. Natalie, Your task is similar to what the SPSS function from the menu DATA/Identify Duplicate Cases does, except that your list includes an "or". Just a cautionary tale from this corner. Using the Duplicate Cases function, I tried some months ago to identify matching cases after merging two files that had a key variable present in about 95% of cases. I was trying to match up the last 5%. I used a list of demographic variables such as yours, but with AND-- in other words, all conditions had to match. I was surprised at how many "matches" (that were not really matches) emerged. This was a more stringent match standard than you are proposing. So just because it seems reasonable to suppose that cases matching on gender, dob, surname, forename and postcode should be the same person, don't count on it. Second, How fuzzy are you willing to let your logic get? If you get duplicates with all 5 such demographic variables, with fuzzier logic you'll wind up with more matches. I suggest you first "Identify duplicate cases" with all of your demographic variables named, and see how that works. Settle all matches that you can by that method first. Then for the remainder you will have to decide how much of a match is good enough? My guess is that you can't entirely automate the process, and will still need to examine the potentially paired cases "by hand" as it were. In your case, since you anticipate variability in forename, you can "Identify duplicate cases" specifying gender, dob, surname, and postcode, but not forename. You'll then have to "eyeball" the "duplicates" and make a judgment about whether similar forenames are, or are not, equivalent. Some equivalents are relatively obvious, such as your cases of "Bill" vs. "William", but others are less so. What about Billie, Billy, or Willie? It will often be a judgment call that you cannot make hard and fast rules about. Or more generally, with a fuzzy logic list involving "or," how many do you need to make a match? 4 of 5 criteria? 3 of 5? So I come back to the question of how much fuzziness you are willing to tolerate. Bob in HI Robert M. Schacht, Ph.D. <[hidden email]> Pacific Basin Rehabilitation Research & Training Center 1268 Young Street, Suite #204 Research Center, University of Hawaii Honolulu, HI 96814 |
In reply to this post by Nicholls, Natalie
At 05:29 AM 12/13/2006, Natalie Nicholls wrote:
>I was wondering if there was a similar way at all of matching on >'fuzzy' variables in SPSS. > >I have a number of clients in our database, many of which present more >than once, and I am aiming to determine the duplicates. Currently I am >matching on gender, dob, surname and forename OR gender, dob, surname >and postcode sector. > >Is there a way within SPSS that I can match on gender, dob, surname >and forename or postcode? So that if, for example, the forename was >given as Bill (B) in one instance, and William (W), but the postcode >was the same, SPSS would recognise these as the same client? This is difficult, I think inherently difficult. Years back I did a fairly complicated one in SAS. The problems in SPSS would be similar, though in SAS, a. PROC SQL supports many-to-many merges, which I think are necessary for this, and which are awkward in SPSS b. PROC SQL, with the script-like features of SAS's 'macro' facility, also makes transitive closure easier than it would be in SPSS, though a Python loop in SPSS would probably help. ("Transitive closure": If A is the same person as B, and B is the same person as C, then A is the same person as C.) There are two inherent problems. A. For exact matching on a key, there's a trick we're so used to we forget how important it is: sort the file by the key, and matching records will be adjacent in the file. There's no equivalent for fuzzy merging, no approximate keys on which to sort. B. There are no firm rules for when a fuzzy merge is 'correct'. It sounds like you have a set of rules you like, though you have to be careful: matching forename (first name) but not postcode will give you many, many false matches unless the two also match on several other key items. The program I wrote did it in three steps: 1. Identify pairs of entities (mine were nursing homes) that MIGHT be the same, because of exact match on some set of keys. In your case, I might identify such 'candidate' pairs as a match on any of . DOB . DOB, month and day only? . Surname . Postcode? That might give you many false matches to check. Postcode and forename? This selection of 'candidate' matches is inherently many-to-many: everybody with one DOB has to be checked against everybody else with the same DOB. And the logic has to be broad; the candidates will include many, many false matches. 2. Evaluate the candidate matches; more or less, score points for how closely they match on each characteristic you have. Match on DOB will score a good deal; match on postcode, moderately; match on gender, very little. This is where Soundex techniques and the like (I think some are available in Python) could help: score points for surnames that aren't identical but are similar. 3. Given these evaluations, discard some pairs as very improbable; accept some, as almost certain; and make a list that will need human review, and acceptance or rejection. The human review, by the way, will give you a better sense what key matches are useful, and what are not. And maybe some special cases: 'Bill' as a forename matches 'William'. 4. Transitive closure, as described above. That may also cause you to reject some matches: if A doesn't look at all like C, than either identifying A with B, or B with C, is probably wrong. That's a lot, but I don't see a quicker way. Does anybody? |
In reply to this post by Nicholls, Natalie
There are two main approaches:
- Apply a deterministic algorithm that allows a certain number of matches and allows a mismatch or two - Apply a probabilistic linkage algorithm that sums weights for each match and mismatch according to the information value provided (eg., 'Smith' is a common name so provides less information than a less common name like 'Deck' is uncommon. Gender has only two possible values so helps mainly as a tie breaker.) See literature on Probabilistic Linkage for the formula. In SPSS the comparison algorithm is not hard to implement but other parts of the process like parsing the names, handling nicknames, controlling the comparison of all possible (or probable) pairs would probably require Python or some shortcuts. You might want to check out existing linkage software (some are free like LinkPlus from CDC) to do the linkages. Dennis Deck, PhD RMC Research Corporation [hidden email] -----Original Message----- From: Natalie Nicholls [mailto:[hidden email]] Sent: Wednesday, December 13, 2006 2:30 AM Subject: Re: Matching records On Wed, 14 Jun 2006 00:06:06 -0700, Dennis Deck <[hidden email]> wrote: >However, it was developed entirely in SAS and makes extensive use of >features not available in SPSS. It does not match on address >information. Further to this use of LinkKing, I was wondering if there was a similar way at all of matching on 'fuzzy' variables in SPSS. I have a number of clients in our database, many of which present more then once, and I am aiming to determine the duplicates within the database. Currently I am matching on gender, dob, surname and forename OR gender, dob, surname and postcode sector. Is there a way within SPSS that I can match on gender, dob, surname and forename or postcode? So that if, for example, the forename was given as Bill (B) in one instance, and William (W), but the postcode was the same, SPSS would recognise these as the same client. Likewise if two identical forenames were present, but the postcodes were different, again they would be recognised as belonging to the same client. Thanks in advance, Natalie Nicholls |
In reply to this post by Nicholls, Natalie
Thank you all for your suggestions, I will persevere this end with them, and
see how well it goes, and update the list when (if) we have a final solution. Regards, Natalie _________________________________________________________________ NHS National Services Scotland Disclaimer The information contained in this message may be confidential or legally privileged and is intended for the addressee only. If you have received this message in error or there are any problems please notify the originator immediately. The unauthorised use, disclosure, copying or alteration of this message is strictly forbidden. _________________________________________________________________ |
Natalie,
I think if I had to do something like this TODAY, I'd do it like this (....& taking notes on what's already been said....): new file. data list free /id(f4) name(a20) sex(a1) city(a20) state(a2). begin data 1234 'Smith, Sally' 'F' 'Santa Rosa' 'CA' 1233 'Smith, Sally S.' 'F' 'Novato' 'CA' 1232 'Smith ,John' 'M' 'Novato' 'CA' 1231 'Smith, John' 'M' 'San Rafael' 'CA' 1230 'Brown,Terry' 'F' 'Mill Valley' 'CA' 1229 'Brown, Terry' 'M' 'Mill Valley' 'CA' 1228 'Brownie, Terry' 'M' 'Mill Valley' 'CA' 1227 'Watson, Jim' 'M' 'San Francisco' 'CA' end data. dataset name A. * compute indicator for same last name & same sex . string LN (a20). compute LN = rtrim(substr(name,1, index(name, ',') - 1)). sort cases by LN (a) sex(a) state(a) city(a) id(a). if (ln = lag(ln) & sex = lag(sex)) i1 = 1. sort cases by LN(d) sex(d) state(d) city(d) id(d). if (ln = lag(ln) & sex = lag(sex)) i1 = 1. exe. * (don't run the syntax beyond this point) . * compute the other indicator vars - i2, i3.....to i10 or beyond . compute Tot = sum (i1 to i10). * 'Tot' could serve as the score that Richard suggested . * and then . sort cases by tot(d) name(a) state(a). All the likely matches are grouped at the top of the file, then maybe you can break out that subset with: dataset copy B. dataset activate B. select if Tot > 0. exe. This solution is not sophisticated, but the logic is transparent and it's easy to modify - add or subtract indicators or adjust the logic for a single one. I'm sure other list members can suggest improvements or altogether better methods (please do!). -Gary |
Free forum by Nabble | Edit this page |