testing string similarity using extendedTransforms

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

testing string similarity using extendedTransforms

Albert-Jan Roskam
Hi,

I am testing two strings to get an impression how� similar they are. I am using Jaro Winkler and edit distance (that other name is such a tongue twister!), but neither measure seems to be precisely what I am looking for.
--Jaro Winkler gives a big penalty if the beginning of s1 and s2 differ ("a" vs "-a" gives a Jaro Winkler distance of 0).
--The edit distance seems better, but what it does not take into account is the relative number of edits (relative to the length of the string). For instance "a"�  vs "-a" requires just one edit, but� HALF of the string was wrong. I created a "relativeEditDistance" to compensate for this, but is there some "out of the box" string similarity measure?

dataset close all.
data list free /s1 (a4) s2 (a4).
begin data
"a" "a"
"a" "a."
"a" ""
"a" "b"
"a"�  "-a"
"a"�  "xxxx"
end data.
spssinc trans result = jaroWinkler type = 0 /formula "extendedTransforms.jaroWinkler(s1, s2, 0)".
spssinc trans result = editDistance type = 0 /formula "extendedTransforms.levenshteindistance(s1, s2)".
compute #len_s1 = char.length(rtrim(s1)).
compute #len_s2 = char.length(rtrim(s2)).
compute #maximum = max(#len_s1, #len_s2).
do if ( #len_s1 eq 0 or #len_s2 eq 0 or sysmis(#maximum) ).
+compute relativeEditDistance = $sysmis.
else.
+compute relativeEditDistance = editDistance / #maximum.
end if.
formats relativeEditDistance (f5.3).
list.
 
 
*output.
s1 s2 jaroWinkler editDistance relativeEditDistance
a a 1,00 ,00 ,000
a a. ,83 1,00 ,500
a . 1,00 .
a b ,00 1,00 1,000
a -a ,00 1,00 ,500
a xxxx ,00 4,00 1,000
 
Number of cases read: 6 Number of cases listed: 6 

Regards,
Albert-Jan


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a
fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~�

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: testing string similarity using extendedTransforms

Jon K Peck
The best string similarity measure really depends on the context.  For names, consideration of how errors might creep in is important.  For example, were they spoken or submitted in writing.  These would tend to lead to different criteria - phonetic errors vs, e.g., transposition errors.  Handwritten vs entered from a keyboard, etc.

These considerations would be applied differently for a spell checker.  Language considerations would also matter.

Besides soundex, nysiis, Jaro-Winkler and Levenshtein, which are in extendedTransforms, you might be interested in Dice similarity. See
http://www.catalysoft.com/articles/StrikeAMatch.html
The Dice metric attempts to give credit for the same words in a string when they occur in different order.
In most contexts, I think, Dice wouldn't be a good choice for names.


Here is some Python code for Dice.  I will add this to extendedTransforms along with a class wrapper for finding the best choice set up for use with SPSSINC TRANS.  (I hope the listserv does not mangle the indentation.)

# An implementation of the Dice string similarity measure
def bigrams(s, casesensitive=False, splitwhite=True):
    """Return a set of bigrams
   
    s is a string
    casesensitive indicates whether to ignore case or not
    splitwhite indicates whether to consider white space as a boundary
    in computing the bigrams.  Only affects multiword strings.
    """
   
    if not casesensitive:
        s = s.lower()
    if splitwhite:
        s = s.split()
    else:
        s = [s]
    result = set()
    for token in s:
        result.update([token[i:i+2] for i in xrange(len(token) - 1)])
    return result

def DiceStringSimilarity(s1, s2, casesensitive=False, splitwhite=True):
    """Return a Dice match statistic for two strings
    The value is 1 if the strings are identical and 0 if they have no
    bigrams in common.
   
    s1 and s2 are the strings to compare.
    casesensitive indicates whether bigrams are case sensitive.
    splitwhite indicates whether to treat each word separately
    in computing bigrams where a word is defined by whitespace.
    Preprocess the text with a regular expression if you want a
    different definition of a word."""
   
    a = bigrams(s1, False, splitwhite)
    b = bigrams(s2, False, splitwhite)
   
    matches = len(a.intersection(b))
    return (matches * 2.) / max(len(a) + len(b), 1)





Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621




From:        Albert-Jan Roskam <[hidden email]>
To:        [hidden email],
Date:        08/07/2013 03:38 AM
Subject:        [SPSSX-L] testing string similarity using extendedTransforms
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




Hi,

I am testing two strings to get an impression how similar they are. I am using Jaro Winkler and edit distance (that other name is such a tongue twister!), but neither measure seems to be precisely what I am looking for.
--Jaro Winkler gives a big penalty if the beginning of s1 and s2 differ ("a" vs "-a" gives a Jaro Winkler distance of 0).
--The edit distance seems better, but what it does not take into account is the relative number of edits (relative to the length of the string). For instance "a"  vs "-a" requires just one edit, but HALF of the string was wrong. I created a "relativeEditDistance" to compensate for this, but is there some "out of the box" string similarity measure?

dataset close all.
data list free /s1 (a4) s2 (a4).
begin data
"a" "a"
"a" "a."
"a" ""
"a" "b"
"a"  "-a"
"a"  "xxxx"
end data.
spssinc trans result = jaroWinkler type = 0 /formula "extendedTransforms.jaroWinkler(s1, s2, 0)".
spssinc trans result = editDistance type = 0 /formula "extendedTransforms.levenshteindistance(s1, s2)".
compute #len_s1 = char.length(rtrim(s1)).
compute #len_s2 = char.length(rtrim(s2)).
compute #maximum = max(#len_s1, #len_s2).
do if ( #len_s1 eq 0 or #len_s2 eq 0 or sysmis(#maximum) ).
+compute relativeEditDistance = $sysmis.
else.
+compute relativeEditDistance = editDistance / #maximum.
end if.
formats relativeEditDistance (f5.3).
list.
$B!! (B
$B!! (B
*output.
s1 s2 jaroWinkler editDistance relativeEditDistance
a a 1,00 ,00 ,000
a a. ,83 1,00 ,500
a . 1,00 .
a b ,00 1,00 1,000
a -a ,00 1,00 ,500
a xxxx ,00 4,00 1,000
$B!! (B
Number of cases read: 6 Number of cases listed: 6 $B!! (B

Regards,
Albert-Jan


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a
fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD