Fuzzy matching

Fuzzy matching

vijayanti
CONTENTS DELETED
The author has deleted this message.

Re: Fuzzy matching

David Marso
Administrator
See SORT, ADD FILES, LAG, XSAVE and do some research on flavors of SOUNDEX.

--
vijayanti wrote
I have two data sets that I would like to match using fuzzy matching
in SPSS. Is SPSS able to do this? I have read about the Python
function "Fuzzy" but am unsure of how to make this work with string
variables.

If I can't find an exact match by last name and first name, I want to
do a fuzzy match using date of birth, last name and first name within
geographic region.

Cases that are an exact match on date of birth and geographic region
and are a highly probable match on last name and first name would be
matched together.

Does anyone have an example of a syntax that would accomplish this?

Vijayanti

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: Fuzzy matching

Jon K Peck
The FUZZY extension command can do matches that are, well, fuzzy, on numeric variables, but strings have to match exactly, since there is no obvious metric for differences between strings. For names, though, there are some functions available that provide such metrics. You can get these in the extendedTransforms programmability module. I can provide details if you want to go that route.

Soundex is a primitive way to code names into a 4-character code that roughly approximates the sound, so you could encode the names and match on the codes.
nysiis is a more sophisticated name matching function.
And, if you are mainly concerned about things like spelling errors, Levenshtein distance can be used, but it is a lot more complicated to set up.
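
To make the Levenshtein idea concrete, here is a minimal standalone Python sketch. It is not the extendedTransforms implementation; the function names and the `max_distance` threshold are assumptions chosen for illustration. The distance counts the single-character insertions, deletions, and substitutions needed to turn one string into the other, so a typical spelling error yields a small number.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # Keep the shorter string in the inner loop so the row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def names_close(a, b, max_distance=1):
    """Hypothetical helper: treat two names as a probable match when
    their case-insensitive edit distance is at most max_distance."""
    return levenshtein(a.strip().lower(), b.strip().lower()) <= max_distance
```

Unlike a phonetic code, a distance cannot serve as a match key: each candidate pair has to be compared, which is part of why it takes more work to set up.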

So you could do the matching as a two-step process. In step 1, use FUZZY to do exact matches including the names. Remove the matched cases, and then, in step 2, do an exact match on the encoded names using one of the functions above. Or just do a one-step process using, say, nysiis.
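
The two-step logic can be sketched outside SPSS in plain Python. This is only an illustration under stated assumptions: the record layout (dicts with `last`, `first`, `dob`, `region`), all function names, and the one-to-one pairing rule are hypothetical, and the `soundex` function is a standalone version of American Soundex, not the one in extendedTransforms, so its codes may differ from that module's.

```python
from collections import defaultdict

# Letter-to-digit table for American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return a 4-character Soundex code, or "" if name has no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # vowels reset prev, so a repeated code is kept
    return (code + "000")[:4]


def exact_key(rec):
    return (rec["region"], rec["dob"],
            rec["last"].strip().lower(), rec["first"].strip().lower())


def coded_key(rec):
    return (rec["region"], rec["dob"],
            soundex(rec["last"]), soundex(rec["first"]))


def match_pass(left, right, key):
    """Pair each left record with at most one right record sharing its key."""
    index = defaultdict(list)
    for rec in right:
        index[key(rec)].append(rec)
    pairs, left_rest = [], []
    for rec in left:
        candidates = index.get(key(rec))
        if candidates:
            pairs.append((rec, candidates.pop(0)))
        else:
            left_rest.append(rec)
    right_rest = [r for bucket in index.values() for r in bucket]
    return pairs, left_rest, right_rest


def two_step_match(left, right):
    """Step 1: exact names. Step 2: encoded names on the leftovers."""
    exact, left_rest, right_rest = match_pass(left, right, exact_key)
    coded, left_rest, right_rest = match_pass(left_rest, right_rest, coded_key)
    return exact, coded, left_rest, right_rest
```

Both passes require date of birth and region to agree exactly; only the name comparison is relaxed in the second pass. Pairs found in the second pass are probable matches rather than certain ones, so they are returned separately for review.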

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621




From:        David Marso <[hidden email]>
To:        [hidden email]
Date:        02/16/2012 10:01 PM
Subject:        Re: [SPSSX-L] Fuzzy matching
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




See SORT, ADD FILES, LAG, XSAVE and do some research on flavors of SOUNDEX.

--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/Fuzzy-matching-tp5491229p5491485.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

Re: Fuzzy matching

Veena Nambiar

We collect student data, and our main issue is that our students often use a variation of their name that doesn’t match their official name of record but is close to it.

I didn’t realize that you could use soundex or nysiis within SPSS. A one-step process like nysiis would be nice. Where can I get more information on this?

Veena

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Jon K Peck
Sent: Friday, February 17, 2012 6:15 AM
To: [hidden email]
Subject: Re: Fuzzy matching

 

The FUZZY extension command can do matches that are, well, fuzzy, on numeric variables, but strings have to match exactly, since there is no obvious metric for differences between strings. For names, though, there are some functions available that provide such metrics. You can get these in the extendedTransforms programmability module. I can provide details if you want to go that route.

Soundex is a primitive way to code names into a 4-character code that roughly approximates the sound, so you could encode the names and match on the codes.
nysiis is a more sophisticated name matching function.
And, if you are mainly concerned about things like spelling errors, Levenshtein distance can be used, but it is a lot more complicated to set up.

So you could do the matching as a two-step process. In step 1, use FUZZY to do exact matches including the names. Remove the matched cases, and then, in step 2, do an exact match on the encoded names using one of the functions above. Or just do a one-step process using, say, nysiis.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621





Re: Fuzzy matching

Jon K Peck
These functions are implemented using Python programmability, so you would need to
- install the Python Essentials from the SPSS Community website
- download the extendedTransforms.py module from the Utilities Collection on that site and save it in the extensions subdirectory of your Statistics installation (or elsewhere that Python can find it)
- download and install the SPSSINC TRANS extension command from the Extension Commands Collection
- to use FUZZY, download and install that extension command from the Extension Commands Collection if it isn't included in your Essentials module

That's the hard part.  Then this syntax would generate the nysiis value in a variable named code, assuming an input variable called name.

spssinc trans result=code type=30
/formula "extendedTransforms.nysiis(name)".

For soundex, it would be:
spssinc trans result=code type=30
/formula "extendedTransforms.soundex(name)".

Note that the letter case of the part in quotation marks matters.

HTH,

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621




From:        Veena Nambiar <[hidden email]>
To:        [hidden email]
Date:        02/17/2012 10:52 AM
Subject:        Re: [SPSSX-L] Fuzzy matching
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




We collect student data, and our main issue is that our students often use a variation of their name that doesn’t match their official name of record but is close to it.

I didn’t realize that you could use soundex or nysiis within SPSS. A one-step process like nysiis would be nice. Where can I get more information on this?

Veena