fuzzy data analysis

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

fuzzy data analysis

Ben Cohen
Hi SPSS Users:
I have hundreds of addresses stored in a string variable. Many of these are
the same address, but have slight spelling differences.

For example:
6300 Buist Avenue
6300 Buist Ave.
53 Vermont Road,
53 Vermont Road
53 Vermnt Road
4400 North Reese Street
4400 N. Reese St

What method can be used to identify these near matches as the same address?
Once identified as a likely match, how to best assign a unique key to the
record?

Many thanks...

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: fuzzy data analysis

J. R. Carroll
Ben,

I would look into Python for a solution for you on this.  They have robust libraries built for this type of thing (look into NLTK, Natural Language Tool Kit).  I don't think SPSS has any builtins for handling fuzzy context analysis.  And to script it all in SPSS syntax, especially if it's unvetted, might be way more work than you are willing to commit to.  However, Python is very well suited for this and is well integrated into SPSS.  If Python is REAL option for you, then let me know and I'll look into providing you more of a solution.

In terms of "how you would assign unique values", well, I'd eliminate overlapping addresses first; assigning unique ID's is cakewalk.  

-J

 
----


J. R. Carroll
Cell:  (650) 776-6613
          [hidden email]
          [hidden email]



On Thu, May 23, 2013 at 11:24 AM, Ben <[hidden email]> wrote:
Hi SPSS Users:
I have hundreds of addresses stored in a string variable. Many of these are
the same address, but have slight spelling differences.

For example:
6300 Buist Avenue
6300 Buist Ave.
53 Vermont Road,
53 Vermont Road
53 Vermnt Road
4400 North Reese Street
4400 N. Reese St

What method can be used to identify these near matches as the same address?
Once identified as a likely match, how to best assign a unique key to the
record?

Many thanks...

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Reply | Threaded
Open this post in threaded view
|

Re: fuzzy data analysis

Maguin, Eugene

I might be wrong but I think this is a regular expression (regex) problem and I think Jon posted some messages, either this week or last week, about this and its relationship to python. You might look in the archives. Gene Maguin

 

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of J. R. Carroll
Sent: Thursday, May 23, 2013 1:08 PM
To: [hidden email]
Subject: Re: fuzzy data analysis

 

Ben,

 

I would look into Python for a solution for you on this.  They have robust libraries built for this type of thing (look into NLTK, Natural Language Tool Kit).  I don't think SPSS has any builtins for handling fuzzy context analysis.  And to script it all in SPSS syntax, especially if it's unvetted, might be way more work than you are willing to commit to.  However, Python is very well suited for this and is well integrated into SPSS.  If Python is REAL option for you, then let me know and I'll look into providing you more of a solution.

 

In terms of "how you would assign unique values", well, I'd eliminate overlapping addresses first; assigning unique ID's is cakewalk.  

 

-J

 

 

----


J. R. Carroll

Cell:  (650) 776-6613

          [hidden email]

          [hidden email]

 

 

On Thu, May 23, 2013 at 11:24 AM, Ben <[hidden email]> wrote:

Hi SPSS Users:
I have hundreds of addresses stored in a string variable. Many of these are
the same address, but have slight spelling differences.

For example:
6300 Buist Avenue
6300 Buist Ave.
53 Vermont Road,
53 Vermont Road
53 Vermnt Road
4400 North Reese Street
4400 N. Reese St

What method can be used to identify these near matches as the same address?
Once identified as a likely match, how to best assign a unique key to the
record?

Many thanks...

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

 

Reply | Threaded
Open this post in threaded view
|

Re: fuzzy data analysis

Jon K Peck
In reply to this post by Ben Cohen
These is specialized software for address matching.  I don't have any experience with it, so I won't recommend anything in particular.  However, if you have access to SPSS Modeler with Entity Analytics, it would probably do a good job with this.

If you want to do it with Statistics, here is how I would approach it.
1. Expand common abbreviations such as N or N. -->North and St --> Street using Python regular expressions and the SPSSINC TRANS extension command.
2. Perhaps some other cleanup such as eliminating trailing periods in abbreviations.
3. Use the function levenshteindistance in the extendedTransforms.py module or jaroWinker in that same module to calculate all the pairwise differences in addresses.  They calculate a distance between pairs of strings.  Then looking at nonzero but small differences, transform the text into exact matches.  A short Python program could go directly from the distance matrix to the implied transformations.


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621




From:        Ben <[hidden email]>
To:        [hidden email],
Date:        05/23/2013 10:38 AM
Subject:        [SPSSX-L] fuzzy data analysis
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




Hi SPSS Users:
I have hundreds of addresses stored in a string variable. Many of these are
the same address, but have slight spelling differences.

For example:
6300 Buist Avenue
6300 Buist Ave.
53 Vermont Road,
53 Vermont Road
53 Vermnt Road
4400 North Reese Street
4400 N. Reese St

What method can be used to identify these near matches as the same address?
Once identified as a likely match, how to best assign a unique key to the
record?

Many thanks...

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD


Reply | Threaded
Open this post in threaded view
|

Re: fuzzy data analysis

Art Kendall
In reply to this post by Ben Cohen
Geocoding (finding long/lat) often includes address "normalization" (making the address match USPO standards), checking to see if such an address exists, as well at attaching  long/lat.

Here is one place that does batch address correction for free for up to 1000 addresses
at a time.

http://www.melissadata.com/lookups/batchaddresscheck.asp

I have used this in the past.
http://www.findlatitudeandlongitude.com/batch-geocode/
It gives feedback on accuracy of the address check
Art Kendall
Social Research Consultants
On 5/23/2013 12:36 PM, Ben Cohen [via SPSSX Discussion] wrote:
Hi SPSS Users:
I have hundreds of addresses stored in a string variable. Many of these are
the same address, but have slight spelling differences.

For example:
6300 Buist Avenue
6300 Buist Ave.
53 Vermont Road,
53 Vermont Road
53 Vermnt Road
4400 North Reese Street
4400 N. Reese St

What method can be used to identify these near matches as the same address?
Once identified as a likely match, how to best assign a unique key to the
record?

Many thanks...

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD



If you reply to this email, your message will be added to the discussion below:
http://spssx-discussion.1045642.n5.nabble.com/fuzzy-data-analysis-tp5720378.html
To start a new topic under SPSSX Discussion, email [hidden email]
To unsubscribe from SPSSX Discussion, click here.
NAML

Art Kendall
Social Research Consultants
Reply | Threaded
Open this post in threaded view
|

Re: fuzzy data analysis

Albert-Jan Roskam
In reply to this post by Ben Cohen
Hello,

I have used this a couple of years ago: http://cs.anu.edu.au/~Peter.Christen/Febrl/febrl-0.3/febrldoc-0.3/node10.html


6.1 Name and Address Cleaning and Standardisation

Febrl data cleaning and standardisation for the name and address components primarily employs a supervised machine learning approach implemented through a novel application of hidden Markov models (HMMs). A brief introduction to HMMs and their use for data standardisation is given in Chapter 7. Before data standardisation can be performed with a given data set, the user needs to train HMMs using training data from the same or similar data sets. Two HMMs need to be trained, one for names and one for addresses. The process of creating training data is described in Chapter 8. Once HMMs are available for a given data set (or class of data sets), the data cleaning and standardisation process becomes easy and efficient.
The data cleaning and standardisation process for the name and address components in Febrl consists of the three steps described in the following sections.
Febrl also contains the functionality to check word spilling, i.e. if words in a field are cut off because of limited field length (for example in fixed width input fields) and continue in the next field. Section 6.1.4 explains how word spilling is dealt with.
 
Regards,
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a
fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

From: Ben <[hidden email]>
To: [hidden email]
Sent: Thursday, May 23, 2013 5:24 PM
Subject: [SPSSX-L] fuzzy data analysis

Hi SPSS Users:
I have hundreds of addresses stored in a string variable. Many of these are
the same address, but have slight spelling differences.

For example:
6300 Buist Avenue
6300 Buist Ave.
53 Vermont Road,
53 Vermont Road
53 Vermnt Road
4400 North Reese Street
4400 N. Reese St

What method can be used to identify these near matches as the same address?
Once identified as a likely match, how to best assign a unique key to the
record?

Many thanks...

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD


Reply | Threaded
Open this post in threaded view
|

Re: fuzzy data analysis

Richard Ristow
In reply to this post by Ben Cohen
At 11:24 AM 5/23/2013, Ben wrote:

>I have hundreds of addresses stored in a string variable. Many of these are
>the same address, but have slight spelling differences.
>
>For example:
>6300 Buist Avenue
>6300 Buist Ave.
>53 Vermont Road,
>53 Vermont Road
>53 Vermnt Road
>4400 North Reese Street
>4400 N. Reese St
>
>What method can be used to identify these near matches as the same address?

It depends on whether you have any information for these entities
other than the addresses. If you have the addresses and no other
identifying values, I concur that available address-matching software
is probably the best bet.

You don't give ZIP codes with these addresses. If you have the ZIP
codes, and trust them (and your locations are somewhat spread out
geographically), you can break your large problem into a set of much
smaller ones by matching only within ZIP codes.

If you have other identifying information, even if it's also fuzzy,
you can go through the more complex, but more powerful, procedure, of
looking at pairs that exact-match on any of a set of partial keys
(say, house number only) and evaluating how similar the members of
the pair are comparing all available information. I did a job like
this some years back, matching records for nursing homes where we had
a lot of fuzzy identifying information. I was working in SAS, and it
was a lot easier than SPSS would have been -- PROC SQL for
many-to-many matches, and macro loops for transitive closure. (If A
is the same as B and B  is the same as C, then A is the same as C.)

>Once identified as a likely match, how to best assign a unique key to the
>record?

Start by assigning a unique key, maybe just a sequence number, to
each separate address *form*. I'll call this the 'address form key'.
If you have data coming in from time to time, do this in such a way
that you can continue adding unique keys as new records come in, but
always in increasing sequence.

Then, when you have a judgement that two or more address forms
actually represent the same address, assign that 'address' a unique
key: the lowest-numbered address-form key in the group. Add this
'address key' to every record. When the address form has not been
judged the same as any other form, the 'address key' will be the same
as the record's 'address form key'; otherwise, it will be the lowest
'address form key' found in the group of matched records.

After you've done the matching, check carefully for instances where A
is matched to B, B is matched to C, but A is not matched to C. (This
can occur with some matching methods.) In every such instance you'll
have to override the matching algorithm, either by adding the A-C
match, or by breaking one of A-B and B-C.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: fuzzy data analysis

Ismail Noor
How about using Excel or any other spreadsheet software. You can open the SPSS file in Excel and use "sort" which will give you all of the alike addresses, like the examples you gave in your original e-mail. You have to go to that column and change the addresses manually.

 
 
Ismail K. Noor, Ph.D.,
Data Group Consultant, LLC
1335 Culver Avenue
Dearborn, Michigan   48124
[hidden email]
+1 (313) 690-0755
+1 (313) 240-5259



From: Richard Ristow <[hidden email]>
To: [hidden email]
Sent: Tuesday, May 28, 2013 8:02 PM
Subject: Re: fuzzy data analysis

At 11:24 AM 5/23/2013, Ben wrote:

>I have hundreds of addresses stored in a string variable. Many of these are
>the same address, but have slight spelling differences.
>
>For example:
>6300 Buist Avenue
>6300 Buist Ave.
>53 Vermont Road,
>53 Vermont Road
>53 Vermnt Road
>4400 North Reese Street
>4400 N. Reese St
>
>What method can be used to identify these near matches as the same address?

It depends on whether you have any information for these entities
other than the addresses. If you have the addresses and no other
identifying values, I concur that available address-matching software
is probably the best bet.

You don't give ZIP codes with these addresses. If you have the ZIP
codes, and trust them (and your locations are somewhat spread out
geographically), you can break your large problem into a set of much
smaller ones by matching only within ZIP codes.

If you have other identifying information, even if it's also fuzzy,
you can go through the more complex, but more powerful, procedure, of
looking at pairs that exact-match on any of a set of partial keys
(say, house number only) and evaluating how similar the members of
the pair are comparing all available information. I did a job like
this some years back, matching records for nursing homes where we had
a lot of fuzzy identifying information. I was working in SAS, and it
was a lot easier than SPSS would have been -- PROC SQL for
many-to-many matches, and macro loops for transitive closure. (If A
is the same as B and B  is the same as C, then A is the same as C.)

>Once identified as a likely match, how to best assign a unique key to the
>record?

Start by assigning a unique key, maybe just a sequence number, to
each separate address *form*. I'll call this the 'address form key'.
If you have data coming in from time to time, do this in such a way
that you can continue adding unique keys as new records come in, but
always in increasing sequence.

Then, when you have a judgement that two or more address forms
actually represent the same address, assign that 'address' a unique
key: the lowest-numbered address-form key in the group. Add this
'address key' to every record. When the address form has not been
judged the same as any other form, the 'address key' will be the same
as the record's 'address form key'; otherwise, it will be the lowest
'address form key' found in the group of matched records.

After you've done the matching, check carefully for instances where A
is matched to B, B is matched to C, but A is not matched to C. (This
can occur with some matching methods.) In every such instance you'll
have to override the matching algorithm, either by adding the A-C
match, or by breaking one of A-B and B-C.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD


Reply | Threaded
Open this post in threaded view
|

Re: fuzzy data analysis

Art Kendall
In reply to this post by Ben Cohen
If you do not want to cut-and-paste your addresses into one of the free address checking websites,
you could do some clean up using AUTORECODE then recode. 

Although I use the address correction websites mostly for obtaining long/lat they do help with ZIP'S, variations on spelling,etc.
They also help assure addresses are in accordance with USPO addressing standards.
Art Kendall
Social Research Consultants
On 5/23/2013 12:36 PM, Ben Cohen [via SPSSX Discussion] wrote:
Hi SPSS Users:
I have hundreds of addresses stored in a string variable. Many of these are
the same address, but have slight spelling differences.

For example:
6300 Buist Avenue
6300 Buist Ave.
53 Vermont Road,
53 Vermont Road
53 Vermnt Road
4400 North Reese Street
4400 N. Reese St

What method can be used to identify these near matches as the same address?
Once identified as a likely match, how to best assign a unique key to the
record?

Many thanks...

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD



If you reply to this email, your message will be added to the discussion below:
http://spssx-discussion.1045642.n5.nabble.com/fuzzy-data-analysis-tp5720378.html
To start a new topic under SPSSX Discussion, email [hidden email]
To unsubscribe from SPSSX Discussion, click here.
NAML

Art Kendall
Social Research Consultants
Reply | Threaded
Open this post in threaded view
|

Re: fuzzy data analysis

Paul Cook
Hi Ben,

I think Art's suggestion is the most pragmatic if, as you say, you have only "hundreds of addresses".

Automated approaches are available, though. I have used a package called Trillium, published by Harte Hanks. Whilst it is effective, it is expensive, and time-consuming to set up; it would not be worthwhile for a one-off job of hundreds of records. Although my experience is out of date now, I do recall there were many other packages, including some free ones, of varying levels of quality. I'm sure Google would find them for you.

For interest, there is a lot published on the subject of identifying "near matches". It is referred to in the literature as "record linkage". Newcombe published the seminal work, and Fellegi and Sunter later formalised his methodology. This Wikipedia article contains the references and an overview: https://en.wikipedia.org/wiki/Record_linkage 

If anyone is aware of an SPSS implementation of Fellegi and Sunter's methodology, I would be very interested to see it.

Kind regards,

Paul Cook

Dangerous Enterprises Ltd


On 29 May 2013 23:17, Art Kendall <[hidden email]> wrote:
If you do not want to cut-and-paste your addresses into one of the free address checking websites,
you could do some clean up using AUTORECODE then recode. 

Although I use the address correction websites mostly for obtaining long/lat they do help with ZIP'S, variations on spelling,etc.
They also help assure addresses are in accordance with USPO addressing standards.
Art Kendall
Social Research Consultants
On 5/23/2013 12:36 PM, Ben Cohen [via SPSSX Discussion] wrote:
Hi SPSS Users:
I have hundreds of addresses stored in a string variable. Many of these are
the same address, but have slight spelling differences.

For example:
6300 Buist Avenue
6300 Buist Ave.
53 Vermont Road,
53 Vermont Road
53 Vermnt Road
4400 North Reese Street
4400 N. Reese St

What method can be used to identify these near matches as the same address?
Once identified as a likely match, how to best assign a unique key to the
record?

Many thanks...

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD



If you reply to this email, your message will be added to the discussion below:
http://spssx-discussion.1045642.n5.nabble.com/fuzzy-data-analysis-tp5720378.html
To start a new topic under SPSSX Discussion, email [hidden email]
To unsubscribe from SPSSX Discussion, click here.
NAML

Art Kendall
Social Research Consultants


View this message in context: Re: fuzzy data analysis
Sent from the SPSSX Discussion mailing list archive at Nabble.com.