I'm trying to identify cases that may belong to the same individual, even though the name might be entered slightly differently in different records (e.g., Ben Jones and Benjamin Jones). Unless I'm missing something, I don't think the Duplicate Cases utility can do this.
I can mostly accomplish this with the following syntax, where Fname and Lname are the first and last names:
SORT CASES BY Lname (A) Fname (A).
String ShortWholeName (a30).
Compute ShortWholeName = Concat (RTRIM(Lname), ", ", SUBSTR(Fname,1,3)).
Compute DuplicateName=0.
If ShortWholeName=Lag(ShortWholeName) DuplicateName=1.

The syntax flags the second & subsequent cases that are matches, but not the first one.
I.e., if Fname and Lname for the first three cases are:
Ben Jones
Benjamin Jones
Benjamin F Jones
the syntax reduces them all to "Jones, Ben" and gives DuplicateName a value of 1 for the 2nd & 3rd cases, but not for the first case.
My question is, how can I make DuplicateName = 1 for the first matching case (Ben Jones) as well? Or is there a better way to accomplish this?
Thanks, all.
___________________________________
Thomas G. Snider-Lotz, Ph.D.
Principal Scientist
PreVisor
1805 Old Alabama Road
Suite 150
Roswell, GA 30076
Ph: 678-832-0555
Ph: 800-281-9713 x555
Fax: 770-642-6115
|
It just occurred to me that I can easily solve my problem by using the Duplicate Cases utility to find duplicates for the variable ShortWholeName that I've created via the syntax. However, if anyone sees an even easier method, I'd like to hear about it.
Thanks, all.
-- Tom Snider-Lotz
From: Snider-Lotz, Tom
Sent: Sun 16-Jul-06 4:38 PM
To: [hidden email]
Subject: Identifying cases that almost match
|
In reply to this post by Snider-Lotz, Tom
Tom,
To the syntax you posted, add the two lines marked below:

SORT CASES BY Lname (A) Fname (A).
String ShortWholeName (a30).
Compute ShortWholeName = Concat (RTRIM(Lname), ", ", SUBSTR(Fname,1,3)).
Compute DuplicateName=0.
If ShortWholeName=Lag(ShortWholeName) DuplicateName=1.
SORT CASES BY ShortWholeName (A) DuplicateName (D). /* add.
If ShortWholeName=Lag(ShortWholeName) and Lag(DuplicateName) eq 1 DuplicateName=1. /* add.

The second sort puts the cases already flagged ahead of the unflagged first case within each ShortWholeName group, so the second IF can flag that case too.
|
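For readers working outside SPSS, the group-level idea behind this fix can be sketched in a few lines of Python. This is only an illustration, not SPSS syntax: the function names `short_whole_name` and `flag_duplicates` and the sample record for Alice Smith are made up here, and the key mirrors the ShortWholeName construction from the thread. Counting how often each key occurs sidesteps the lag problem, because every member of a group with two or more cases is flagged, including the first.

```python
from collections import Counter


def short_whole_name(fname, lname):
    """Mirror of the thread's key: trimmed last name + first 3 letters of first name."""
    return f"{lname.rstrip()}, {fname[:3]}"


def flag_duplicates(records):
    """Return one 0/1 flag per (fname, lname) record; 1 if its key occurs more than once."""
    keys = [short_whole_name(fname, lname) for fname, lname in records]
    counts = Counter(keys)
    return [1 if counts[key] > 1 else 0 for key in keys]


# Hypothetical sample data: the three Jones cases from the thread plus one non-match.
records = [
    ("Ben", "Jones"),
    ("Benjamin", "Jones"),
    ("Benjamin F", "Jones"),
    ("Alice", "Smith"),
]

print(flag_duplicates(records))  # [1, 1, 1, 0]
```

Unlike the lag-based approach, this does not depend on sort order, so the records can be left in their original sequence.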
In reply to this post by Snider-Lotz, Tom
At 06:23 PM 7/16/2006, Snider-Lotz, Tom wrote:
>I'm trying to identify cases that may belong to the same individuals,
>even though their name might be entered slightly differently in the
>different records (e.g., Ben Jones and Benjamin Jones). It just
>occurred to me that I can easily solve my problem by using the
>Duplicate Cases utility to find duplicates for the variable
>ShortWholeName that I've created via the syntax.
>
>String ShortWholeName (a30).
>Compute ShortWholeName = Concat (RTRIM(Lname), ", ",
>SUBSTR(Fname,1,3)).

That's more or less how you do it: create a key that's broader - more permissive about matching - than is the one you're having trouble with. There's no magic.

You risk false matches, though you're using a pretty strict key that won't get many. "Robert" will match "Robin", "Samuel" match "Samantha". But requiring a strict match on the last name will eliminate most of those. (Worst likely case is siblings in families that like to use similar names for

You also risk false negatives, continuing to miss true matches. In your case, I'd worry more about that: "William" won't match "Bill", "Elizabeth" won't match "Betty", and any variation in spelling of the last name will spoil the match. (You may also find ambiguity about what name is the first. I'm "Walter Richard Ristow." You know me as "Richard Ristow", but occasional lists have me as "Walter.")

Strategy depends on how big your file is, how much work it's worth investing, and how many keys you have; for example, you can look for people who match on address but not on name, if you have address. That can be a long story, though, since you then need criteria for evaluating the quality - likelihood of being correct - of matches that meet various combinations of criteria. I did one of these, in SAS, some years back, and all I can say is that it was a lot of work.
|
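The two kinds of error described above can be checked directly against the three-letter key. The following Python sketch (not SPSS syntax) applies the same key to the first-name pairs mentioned in the post; the surnames Smith and Jones and the helper names `name_key` and `keys_match` are assumptions added purely for illustration.

```python
def name_key(fname, lname):
    """Key from the thread: trimmed last name plus first three letters of first name."""
    return f"{lname.rstrip()}, {fname[:3]}"


def keys_match(rec_a, rec_b):
    """True when two (fname, lname) records collapse to the same key."""
    return name_key(*rec_a) == name_key(*rec_b)


# (label, record A, record B); the surnames are hypothetical.
examples = [
    ("false match", ("Robert", "Smith"), ("Robin", "Smith")),
    ("false match", ("Samuel", "Smith"), ("Samantha", "Smith")),
    ("missed match", ("William", "Smith"), ("Bill", "Smith")),
    ("missed match", ("Elizabeth", "Smith"), ("Betty", "Smith")),
    ("blocked by last name", ("Robert", "Smith"), ("Robin", "Jones")),
]

for label, rec_a, rec_b in examples:
    verdict = keys_match(rec_a, rec_b)
    print(f"{' '.join(rec_a)} vs {' '.join(rec_b)}: same key = {verdict} ({label})")
```

Running a handful of known name variants through the key like this gives a rough feel for how many true matches a given key would miss before it is applied to the full file.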
In reply to this post by Snider-Lotz, Tom
Hi,
As Richard points out, this task is hard in SPSS. What you really need is a program which performs probabilistic record linkage. These can perform matches based on specifiable degrees of similarity between identifying variables. A good one will also make use of 'prior knowledge' such as abbreviations and nicknames (Robert=Rob=Bob=Bobby, etc.). De-duplication is one of the tasks that programs which perform record linkage are designed to carry out.

I've used one such program called LinkageWiz, which works very well. It is a commercial product, though, and if your organization's budget does not run to such things, you can download a free version which will work on up to 750 records at a time. There is a demonstration on the LinkageWiz website of a de-duplication run in progress (and a number of other tasks the program can perform) here: www.linkagewiz.com

If the 750 record limit is too low for your task, you could consider sorting the file and then chopping it into groups of records of up to 750 each and doing separate runs on each.

Another possibility is a free, open-source application called FEBRL, which is produced by the Data Mining Group at the Australian National University. It uses some very sophisticated algorithms and will certainly be able to do your de-duplication job. Its main drawbacks are its manual, which does not explain things very well in places, and the fact that it is still a work in progress (it is currently at version 0.3). The whole thing is written in Python and so you have to obtain and install Python yourself. FEBRL is described here: http://datamining.anu.edu.au/linkage.html.

FEBRL is a very clever bit of software and can do some amazing things, but due to its manual, it's something only its mother could love ;-) However, for a version 0.3, it is not fair to be hard on something so early in its development cycle. Besides, it's free.
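To make "degrees of similarity" and "prior knowledge" a little more concrete, here is a small Python sketch. It is not probabilistic record linkage and does not reproduce how LinkageWiz or FEBRL work; it only scores names with the standard library's `difflib.SequenceMatcher` and consults a tiny nickname table built from the Robert=Rob=Bob=Bobby example. The 0.75 threshold, the function names, and the sample surnames are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; a real tool would ship a far larger one.
NICKNAMES = {"rob": "robert", "bob": "robert", "bobby": "robert"}


def canonical_first(fname):
    """Lower-case the first name and map known nicknames to one canonical form."""
    cleaned = fname.strip().lower()
    return NICKNAMES.get(cleaned, cleaned)


def similarity(a, b):
    """Similarity in [0, 1] from difflib; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_same(rec_a, rec_b, threshold=0.75):
    """Compare two (fname, lname) records; both name parts must clear the threshold."""
    (first_a, last_a), (first_b, last_b) = rec_a, rec_b
    first_score = similarity(canonical_first(first_a), canonical_first(first_b))
    last_score = similarity(last_a.strip(), last_b.strip())
    return first_score >= threshold and last_score >= threshold


print(likely_same(("Bob", "Jones"), ("Robert", "Jones")))     # nickname table applies
print(likely_same(("Robert", "Jones"), ("Robert", "Jonse")))  # transposed letters
print(likely_same(("Robert", "Jones"), ("Alice", "Jones")))   # different first name
```

A real linkage program would weight each field by how informative agreement on it is, rather than applying one fixed cutoff to every field as this sketch does.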
There is a very good overview of the underlying concepts of record linkage in a free publication you can obtain here: http://www.statistics.gov.uk/StatBase/Product.asp?vlnk=9224 and there are more references of a much more technical nature on the ANU Data Mining website.

Hope this helps

Regards
Adrian
--
Adrian Barnett
Research & Information Officer          Ph: +61 8 82266615
Research, Analysis and Evaluation       Fax: +61 8 82267088
Strategic Planning and Research Branch
Strategic Planning and Population Health Division
SA Department of Health

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Richard Ristow
Sent: Tuesday, 18 July 2006 3:58
To: [hidden email]
Subject: Re: FW: Identifying cases that almost match
|
In reply to this post by Snider-Lotz, Tom
Hello All
Very good answers to this problem so far, but I wonder if someone could go one better.

We use a commercial deduplication piece of software outside of SPSS for checking both names and addresses. Looking about on Google, there seem to be a couple of deduplication programs that can link to SAS, but I cannot locate any programs that can integrate with SPSS. What we would like to achieve is automated checking of names and addresses from the Syntax window or some other automation method.

The problem with the suggested solutions seems to be mistypes: if someone called Ben Jones, for example, was accidentally entered as Ben Jines (a common input error due to neighbouring keys on the keyboard), this would not be flagged as a potential problem. Catching this seems to rely on some form of algorithm to notice potential mistypes and then flag them.

I read the site about the FEBRL project that Adrian mentioned with great interest. Does anyone have a link to the manual for the application? I cannot seem to locate the document on their site.

Any suggestions would be appreciated.

Thanks
Ian Maddrell PgDipM
Campaign Planner

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Barnett, Adrian (HEALTH)
Sent: 18 July 2006 02:25
To: [hidden email]
Subject: Re: FW: Identifying cases that almost match
|
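One common way to notice mistypes such as Jones/Jines is edit distance, which counts the single-character insertions, deletions and substitutions needed to turn one string into another. The Python sketch below (not SPSS syntax, and not taken from any of the tools mentioned in this thread) uses the classic Levenshtein dynamic-programming recurrence; the function names and the one-edit cutoff are assumptions.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def possible_mistype(a, b, max_edits=1):
    """True when the names differ, but by no more than max_edits edits."""
    return a != b and edit_distance(a, b) <= max_edits


def mistype_candidates(names, max_edits=1):
    """All pairs of distinct names that are within max_edits of each other."""
    unique = sorted(set(names))
    return [(a, b) for a, b in combinations(unique, 2)
            if possible_mistype(a, b, max_edits)]


print(edit_distance("Jones", "Jines"))
print(mistype_candidates(["Jones", "Jines", "Smith", "Jones"]))
```

Comparing every pair grows quadratically with the number of distinct names, so on a large file it would be sensible to compare only within blocks, for example among records that already share an address or a first initial.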
In reply to this post by Snider-Lotz, Tom
Hi Ian
-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Ian Maddrell
Sent: Tuesday, 18 July 2006 10:16
To: [hidden email]
Subject: Re: FW: Identifying cases that almost match

>What we would like to achieve is automated checking of names and addresses
>from the Syntax windows or some other automation method.
>The problem with the suggested solutions seem to be mistypes,
>if someone for example called Ben Jones was accidently inputted
>as Ben Jines (common input error due to neighbouring keys on the
>keyboard) this would not be flagged as a potential problem.
>This seems to rely on some form of algorithm to notice potential
>mistypes and then flag them.

What you could try is converting the names via the SOUNDEX algorithm, and checking to see any which match on their SOUNDEX codes but do not match with an exact string match. Not foolproof, but may get you close enough.

There is an implementation of the SOUNDEX algorithm in SPSS on Raynauld's site www.spsstools.net. There is a handy web-based implementation of SOUNDEX here: http://www.dropby.com/ (look under Phonetic Encoders). You could try it out on some sample mis-spellings to see if it does what you want. The same site also has a NYSIIS calculator, which apparently has certain advantages over SOUNDEX, but you would need to write your own implementation.

These are just two of a big variety of string-comparison methods. There is a list of other methods available here: http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
There are lots of other sources which might have something useful if you Google "string similarity".

>I read the below site about the FEBRL project with great interest,
>does anyone have a link to the manual for the application
>as I cannot seem to locate the document on their site.
The documentation is with the software itself on the Sourceforge site: http://sourceforge.net/project/showfiles.php?group_id=62161 (it's down low on the page, below the link to the software)

Hope there is something useful for you amongst this lot

Regards
Adrian
|
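The SOUNDEX check suggested in this reply (same code, different spelling) can be sketched in Python as follows. This is a plain implementation of the standard American Soundex rules written for this illustration, not the SPSS code from spsstools.net; the function names `soundex` and `soundex_only_matches` are assumptions.

```python
from itertools import combinations

# Standard Soundex letter groups; vowels, Y, H and W carry no digit.
_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
           ("L", "4"), ("MN", "5"), ("R", "6"))
_CODES = {letter: digit for letters, digit in _GROUPS for letter in letters}


def soundex(name):
    """Return the 4-character American Soundex code, or '' if name has no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters that share a code
        code = _CODES.get(letter, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


def soundex_only_matches(names):
    """Pairs of differently spelled names that share a Soundex code."""
    unique = sorted(set(names))
    return [(a, b) for a, b in combinations(unique, 2)
            if soundex(a) == soundex(b)]


print(soundex("Jones"), soundex("Jines"))  # J520 J520
print(soundex_only_matches(["Jones", "Jines", "Jones", "Smith"]))
```

Because Soundex always keeps the first letter as typed, a mistyped initial (for example Kones for Jones) would still slip through, which is one reason the reply describes the method as not foolproof.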