Hi,
� I am testing two strings to get an impression how� similar they are. I am using Jaro Winkler and edit distance (that other name is such a tongue twister!), but neither measure seems to be precisely what I am looking for. --Jaro Winkler gives a big penalty if the beginning of s1 and s2 differ ("a" vs "-a" gives a Jaro Winkler distance of 0). --The edit distance seems better, but what it does not take into account is the relative number of edits (relative to the length of the string). For instance "a"� vs "-a" requires just one edit, but� HALF of the string was wrong. I created a "relativeEditDistance" to compensate for this, but is there some "out of the box" string similarity measure? � dataset close all. data list free /s1 (a4) s2 (a4). begin data "a" "a" "a" "a." "a" "" "a" "b" "a"� "-a" "a"� "xxxx" end data. spssinc trans result = jaroWinkler type = 0 /formula "extendedTransforms.jaroWinkler(s1, s2, 0)". spssinc trans result = editDistance type = 0 /formula "extendedTransforms.levenshteindistance(s1, s2)". compute #len_s1 = char.length(rtrim(s1)). compute #len_s2 = char.length(rtrim(s2)). compute #maximum = max(#len_s1, #len_s2). do if ( #len_s1 eq 0 or #len_s2 eq 0 or sysmis(#maximum) ). +compute relativeEditDistance = $sysmis. else. +compute relativeEditDistance = editDistance / #maximum. end if. formats relativeEditDistance (f5.3). list. *output. s1 s2 jaroWinkler editDistance relativeEditDistance a a 1,00 ,00 ,000 a a. ,83 1,00 ,500 a . 1,00 . a b ,00 1,00 1,000 a -a ,00 1,00 ,500 a xxxx ,00 4,00 1,000 Number of cases read: 6 Number of cases listed: 6 Regards, Albert-Jan ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~� ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
The best string similarity measure really
depends on the context. For names, consideration of how errors might
creep in is important. For example, were they spoken or submitted
in writing. These would tend to lead to different criteria - phonetic
errors vs, e.g., transposition errors. Handwritten vs entered from
a keyboard, etc.
These considerations would be applied differently for a spell checker. Language considerations would also matter. Besides soundex, nysiis, Jaro-Winkler and Levenshtein, which are in extendedTransforms, you might be interested in Dice similarity. See http://www.catalysoft.com/articles/StrikeAMatch.html The Dice metric attempts to give credit for the same words in a string when they occur in different order. In most contexts, I think, Dice wouldn't be a good choice for names. Here is some Python code for Dice. I will add this to extendedTransforms along with a class wrapper for finding the best choice set up for use with SPSSINC TRANS. (I hope the listserv does not mangle the indentation.) # An implementation of the Dice string similarity measure def bigrams(s, casesensitive=False, splitwhite=True): """Return a set of bigrams s is a string casesensitive indicates whether to ignore case or not splitwhite indicates whether to consider white space as a boundary in computing the bigrams. Only affects multiword strings. """ if not casesensitive: s = s.lower() if splitwhite: s = s.split() else: s = [s] result = set() for token in s: result.update([token[i:i+2] for i in xrange(len(token) - 1)]) return result def DiceStringSimilarity(s1, s2, casesensitive=False, splitwhite=True): """Return a Dice match statistic for two strings The value is 1 if the strings are identical and 0 if they have no bigrams in common. s1 and s2 are the strings to compare. casesensitive indicates whether bigrams are case sensitive. splitwhite indicates whether to treat each word separately in computing bigrams where a word is defined by whitespace. Preprocess the text with a regular expression if you want a different definition of a word.""" a = bigrams(s1, False, splitwhite) b = bigrams(s2, False, splitwhite) matches = len(a.intersection(b)) return (matches * 2.) / max(len(a) + len(b), 1) Jon Peck (no "h") aka Kim Senior Software Engineer, IBM [hidden email] phone: 720-342-5621 From: Albert-Jan Roskam <[hidden email]> To: [hidden email], Date: 08/07/2013 03:38 AM Subject: [SPSSX-L] testing string similarity using extendedTransforms Sent by: "SPSSX(r) Discussion" <[hidden email]> Hi, I am testing two strings to get an impression how similar they are. I am using Jaro Winkler and edit distance (that other name is such a tongue twister!), but neither measure seems to be precisely what I am looking for. --Jaro Winkler gives a big penalty if the beginning of s1 and s2 differ ("a" vs "-a" gives a Jaro Winkler distance of 0). --The edit distance seems better, but what it does not take into account is the relative number of edits (relative to the length of the string). For instance "a" vs "-a" requires just one edit, but HALF of the string was wrong. I created a "relativeEditDistance" to compensate for this, but is there some "out of the box" string similarity measure? dataset close all. data list free /s1 (a4) s2 (a4). begin data "a" "a" "a" "a." "a" "" "a" "b" "a" "-a" "a" "xxxx" end data. spssinc trans result = jaroWinkler type = 0 /formula "extendedTransforms.jaroWinkler(s1, s2, 0)". spssinc trans result = editDistance type = 0 /formula "extendedTransforms.levenshteindistance(s1, s2)". compute #len_s1 = char.length(rtrim(s1)). compute #len_s2 = char.length(rtrim(s2)). compute #maximum = max(#len_s1, #len_s2). do if ( #len_s1 eq 0 or #len_s2 eq 0 or sysmis(#maximum) ). +compute relativeEditDistance = $sysmis. else. +compute relativeEditDistance = editDistance / #maximum. end if. formats relativeEditDistance (f5.3). list. $B!! (B $B!! (B *output. s1 s2 jaroWinkler editDistance relativeEditDistance a a 1,00 ,00 ,000 a a. ,83 1,00 ,500 a . 1,00 . a b ,00 1,00 1,000 a -a ,00 1,00 ,500 a xxxx ,00 4,00 1,000 $B!! (B Number of cases read: 6 Number of cases listed: 6 $B!! (B Regards, Albert-Jan ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Free forum by Nabble | Edit this page |