Hi SPSS Users:
I have hundreds of addresses stored in a string variable. Many of these are the same address, but have slight spelling differences. For example: 6300 Buist Avenue 6300 Buist Ave. 53 Vermont Road, 53 Vermont Road 53 Vermnt Road 4400 North Reese Street 4400 N. Reese St What method can be used to identify these near matches as the same address? Once identified as a likely match, how to best assign a unique key to the record? Many thanks... ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Ben,
I would look into Python for a solution for you on this. They have robust libraries built for this type of thing (look into NLTK, Natural Language Tool Kit). I don't think SPSS has any builtins for handling fuzzy context analysis. And to script it all in SPSS syntax, especially if it's unvetted, might be way more work than you are willing to commit to. However, Python is very well suited for this and is well integrated into SPSS. If Python is REAL option for you, then let me know and I'll look into providing you more of a solution.
In terms of "how you would assign unique values", well, I'd eliminate overlapping addresses first; assigning unique ID's is cakewalk. -J On Thu, May 23, 2013 at 11:24 AM, Ben <[hidden email]> wrote: Hi SPSS Users: |
I might be wrong but I think this is a regular expression (regex) problem and I think Jon posted some messages, either this week or last week, about this and its relationship to python. You might look in the archives. Gene Maguin From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of J. R. Carroll Ben, I would look into Python for a solution for you on this. They have robust libraries built for this type of thing (look into NLTK, Natural Language Tool Kit). I don't think SPSS has any builtins for handling fuzzy context analysis. And to script it all in SPSS syntax, especially if it's unvetted, might be way more work than you are willing to commit to. However, Python is very well suited for this and is well integrated into SPSS. If Python is REAL option for you, then let me know and I'll look into providing you more of a solution. In terms of "how you would assign unique values", well, I'd eliminate overlapping addresses first; assigning unique ID's is cakewalk. -J On Thu, May 23, 2013 at 11:24 AM, Ben <[hidden email]> wrote: Hi SPSS Users: |
In reply to this post by Ben Cohen
These is specialized software for address
matching. I don't have any experience with it, so I won't recommend
anything in particular. However, if you have access to SPSS Modeler
with Entity Analytics, it would probably do a good job with this.
If you want to do it with Statistics, here is how I would approach it. 1. Expand common abbreviations such as N or N. -->North and St --> Street using Python regular expressions and the SPSSINC TRANS extension command. 2. Perhaps some other cleanup such as eliminating trailing periods in abbreviations. 3. Use the function levenshteindistance in the extendedTransforms.py module or jaroWinker in that same module to calculate all the pairwise differences in addresses. They calculate a distance between pairs of strings. Then looking at nonzero but small differences, transform the text into exact matches. A short Python program could go directly from the distance matrix to the implied transformations. Jon Peck (no "h") aka Kim Senior Software Engineer, IBM [hidden email] phone: 720-342-5621 From: Ben <[hidden email]> To: [hidden email], Date: 05/23/2013 10:38 AM Subject: [SPSSX-L] fuzzy data analysis Sent by: "SPSSX(r) Discussion" <[hidden email]> Hi SPSS Users: I have hundreds of addresses stored in a string variable. Many of these are the same address, but have slight spelling differences. For example: 6300 Buist Avenue 6300 Buist Ave. 53 Vermont Road, 53 Vermont Road 53 Vermnt Road 4400 North Reese Street 4400 N. Reese St What method can be used to identify these near matches as the same address? Once identified as a likely match, how to best assign a unique key to the record? Many thanks... ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
In reply to this post by Ben Cohen
Geocoding (finding long/lat) often includes
address "normalization" (making the address match USPO
standards), checking
to see if such an address exists, as well at
attaching long/lat.
Here is one place that does batch address correction for free for up to 1000 addresses at a time. http://www.melissadata.com/lookups/batchaddresscheck.asp I have used this in the past. http://www.findlatitudeandlongitude.com/batch-geocode/ It gives feedback on accuracy of the address check Art Kendall Social Research ConsultantsOn 5/23/2013 12:36 PM, Ben Cohen [via SPSSX Discussion] wrote: Hi SPSS Users:
Art Kendall
Social Research Consultants |
In reply to this post by Ben Cohen
Hello,
I have used this a couple of years ago: http://cs.anu.edu.au/~Peter.Christen/Febrl/febrl-0.3/febrldoc-0.3/node10.html 6.1 Name and Address Cleaning and Standardisation
Febrl data cleaning and standardisation for the name and
address components primarily employs a supervised machine learning
approach implemented through a novel
application of hidden Markov models (HMMs). A brief
introduction to HMMs and their use for data standardisation is given
in Chapter 7. Before data standardisation can
be performed with a given data set, the user needs to train
HMMs using training data
from the same or similar data sets. Two HMMs need to be trained, one
for names and one for addresses. The process of
creating training data is described in
Chapter 8. Once HMMs are available for a
given data set (or class of data sets), the data cleaning and
standardisation process becomes easy and efficient.
The data cleaning
and standardisation process for the name and address components in
Febrl consists of the three steps described in the following
sections.
Febrl also contains the functionality to check word
spilling, i.e. if words in a field are cut off because of
limited field length (for example in fixed width input fields) and
continue in the next field. Section 6.1.4
explains how word spilling is dealt with. Regards, Albert-Jan ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
In reply to this post by Ben Cohen
At 11:24 AM 5/23/2013, Ben wrote:
>I have hundreds of addresses stored in a string variable. Many of these are >the same address, but have slight spelling differences. > >For example: >6300 Buist Avenue >6300 Buist Ave. >53 Vermont Road, >53 Vermont Road >53 Vermnt Road >4400 North Reese Street >4400 N. Reese St > >What method can be used to identify these near matches as the same address? It depends on whether you have any information for these entities other than the addresses. If you have the addresses and no other identifying values, I concur that available address-matching software is probably the best bet. You don't give ZIP codes with these addresses. If you have the ZIP codes, and trust them (and your locations are somewhat spread out geographically), you can break your large problem into a set of much smaller ones by matching only within ZIP codes. If you have other identifying information, even if it's also fuzzy, you can go through the more complex, but more powerful, procedure, of looking at pairs that exact-match on any of a set of partial keys (say, house number only) and evaluating how similar the members of the pair are comparing all available information. I did a job like this some years back, matching records for nursing homes where we had a lot of fuzzy identifying information. I was working in SAS, and it was a lot easier than SPSS would have been -- PROC SQL for many-to-many matches, and macro loops for transitive closure. (If A is the same as B and B is the same as C, then A is the same as C.) >Once identified as a likely match, how to best assign a unique key to the >record? Start by assigning a unique key, maybe just a sequence number, to each separate address *form*. I'll call this the 'address form key'. If you have data coming in from time to time, do this in such a way that you can continue adding unique keys as new records come in, but always in increasing sequence. Then, when you have a judgement that two or more address forms actually represent the same address, assign that 'address' a unique key: the lowest-numbered address-form key in the group. Add this 'address key' to every record. When the address form has not been judged the same as any other form, the 'address key' will be the same as the record's 'address form key'; otherwise, it will be the lowest 'address form key' found in the group of matched records. After you've done the matching, check carefully for instances where A is matched to B, B is matched to C, but A is not matched to C. (This can occur with some matching methods.) In every such instance you'll have to override the matching algorithm, either by adding the A-C match, or by breaking one of A-B and B-C. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
How about using Excel or any other spreadsheet software. You can open the SPSS file in Excel and use "sort" which will give you all of the alike addresses, like the examples you gave in your original e-mail. You have to go to that column and change the addresses manually.
Ismail K. Noor, Ph.D., Data Group Consultant, LLC 1335 Culver Avenue Dearborn, Michigan 48124 [hidden email] +1 (313) 690-0755 +1 (313) 240-5259 From: Richard Ristow <[hidden email]> To: [hidden email] Sent: Tuesday, May 28, 2013 8:02 PM Subject: Re: fuzzy data analysis At 11:24 AM 5/23/2013, Ben wrote: >I have hundreds of addresses stored in a string variable. Many of these are >the same address, but have slight spelling differences. > >For example: >6300 Buist Avenue >6300 Buist Ave. >53 Vermont Road, >53 Vermont Road >53 Vermnt Road >4400 North Reese Street >4400 N. Reese St > >What method can be used to identify these near matches as the same address? It depends on whether you have any information for these entities other than the addresses. If you have the addresses and no other identifying values, I concur that available address-matching software is probably the best bet. You don't give ZIP codes with these addresses. If you have the ZIP codes, and trust them (and your locations are somewhat spread out geographically), you can break your large problem into a set of much smaller ones by matching only within ZIP codes. If you have other identifying information, even if it's also fuzzy, you can go through the more complex, but more powerful, procedure, of looking at pairs that exact-match on any of a set of partial keys (say, house number only) and evaluating how similar the members of the pair are comparing all available information. I did a job like this some years back, matching records for nursing homes where we had a lot of fuzzy identifying information. I was working in SAS, and it was a lot easier than SPSS would have been -- PROC SQL for many-to-many matches, and macro loops for transitive closure. (If A is the same as B and B is the same as C, then A is the same as C.) >Once identified as a likely match, how to best assign a unique key to the >record? Start by assigning a unique key, maybe just a sequence number, to each separate address *form*. I'll call this the 'address form key'. If you have data coming in from time to time, do this in such a way that you can continue adding unique keys as new records come in, but always in increasing sequence. Then, when you have a judgement that two or more address forms actually represent the same address, assign that 'address' a unique key: the lowest-numbered address-form key in the group. Add this 'address key' to every record. When the address form has not been judged the same as any other form, the 'address key' will be the same as the record's 'address form key'; otherwise, it will be the lowest 'address form key' found in the group of matched records. After you've done the matching, check carefully for instances where A is matched to B, B is matched to C, but A is not matched to C. (This can occur with some matching methods.) In every such instance you'll have to override the matching algorithm, either by adding the A-C match, or by breaking one of A-B and B-C. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
In reply to this post by Ben Cohen
If you do not want to cut-and-paste
your addresses into one of the free address checking websites,
you could do some clean up using AUTORECODE then recode. Although I use the address correction websites mostly for obtaining long/lat they do help with ZIP'S, variations on spelling,etc. They also help assure addresses are in accordance with USPO addressing standards. Art Kendall Social Research ConsultantsOn 5/23/2013 12:36 PM, Ben Cohen [via SPSSX Discussion] wrote: Hi SPSS Users:
Art Kendall
Social Research Consultants |
Hi Ben,
I think Art's suggestion is the most pragmatic if, as you say, you have only "hundreds of addresses". Automated approaches are available, though. I have used a package called Trillium, published by Harte Hanks. Whilst it is effective, it is expensive, and time-consuming to set up; it would not be worthwhile for a one-off job of hundreds of records. Although my experience is out of date now, I do recall there were many other packages, including some free ones, of varying levels of quality. I'm sure Google would find them for you.
For interest, there is a lot published on the subject of identifying "near matches". It is referred to in the literature as "record linkage". Newcombe published the seminal work, and Fellegi and Sunter later formalised his methodology. This Wikipedia article contains the references and an overview: https://en.wikipedia.org/wiki/Record_linkage If anyone is aware of an SPSS implementation of Fellegi and Sunter's methodology, I would be very interested to see it. On 29 May 2013 23:17, Art Kendall <[hidden email]> wrote:
|
Free forum by Nabble | Edit this page |