SPSSX Discussion

testing string similarity using extendedTransforms

Classic

List

Threaded

2 messages Options

Albert-Jan Roskam

testing string similarity using extendedTransforms

Hi,
�
I am testing two strings to get an impression how� similar they are. I am using Jaro Winkler and edit distance (that other name is such a tongue twister!), but neither measure seems to be precisely what I am looking for.
--Jaro Winkler gives a big penalty if the beginning of s1 and s2 differ ("a" vs "-a" gives a Jaro Winkler distance of 0).
--The edit distance seems better, but what it does not take into account is the relative number of edits (relative to the length of the string). For instance "a"� vs "-a" requires just one edit, but� HALF of the string was wrong. I created a "relativeEditDistance" to compensate for this, but is there some "out of the box" string similarity measure?
�
dataset close all.
data list free /s1 (a4) s2 (a4).
begin data
"a" "a"
"a" "a."
"a" ""
"a" "b"
"a"� "-a"
"a"� "xxxx"
end data.
spssinc trans result = jaroWinkler type = 0 /formula "extendedTransforms.jaroWinkler(s1, s2, 0)".
spssinc trans result = editDistance type = 0 /formula "extendedTransforms.levenshteindistance(s1, s2)".
compute #len_s1 = char.length(rtrim(s1)).
compute #len_s2 = char.length(rtrim(s2)).
compute #maximum = max(#len_s1, #len_s2).
do if ( #len_s1 eq 0 or #len_s2 eq 0 or sysmis(#maximum) ).
+compute relativeEditDistance = $sysmis.
else.
+compute relativeEditDistance = editDistance / #maximum.
end if.
formats relativeEditDistance (f5.3).
list.
　
　
*output.
s1 s2 jaroWinkler editDistance relativeEditDistance
a a 1,00 ,00 ,000
a a. ,83 1,00 ,500
a . 1,00 .
a b ,00 1,00 1,000
a -a ,00 1,00 ,500
a xxxx ,00 4,00 1,000
　
Number of cases read: 6 Number of cases listed: 6　

Regards,
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a
fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~�

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Jon K Peck

Re: testing string similarity using extendedTransforms

The best string similarity measure really depends on the context. For names, consideration of how errors might creep in is important. For example, were they spoken or submitted in writing. These would tend to lead to different criteria - phonetic errors vs, e.g., transposition errors. Handwritten vs entered from a keyboard, etc.

These considerations would be applied differently for a spell checker. Language considerations would also matter.

Besides soundex, nysiis, Jaro-Winkler and Levenshtein, which are in extendedTransforms, you might be interested in Dice similarity. See
http://www.catalysoft.com/articles/StrikeAMatch.html
The Dice metric attempts to give credit for the same words in a string when they occur in different order.
In most contexts, I think, Dice wouldn't be a good choice for names.

Here is some Python code for Dice. I will add this to extendedTransforms along with a class wrapper for finding the best choice set up for use with SPSSINC TRANS. (I hope the listserv does not mangle the indentation.)

# An implementation of the Dice string similarity measure
def bigrams(s, casesensitive=False, splitwhite=True):
"""Return a set of bigrams

s is a string
casesensitive indicates whether to ignore case or not
splitwhite indicates whether to consider white space as a boundary
in computing the bigrams. Only affects multiword strings.
"""

if not casesensitive:
s = s.lower()
if splitwhite:
s = s.split()
else:
s = [s]
result = set()
for token in s:
result.update([token[i:i+2] for i in xrange(len(token) - 1)])
return result

def DiceStringSimilarity(s1, s2, casesensitive=False, splitwhite=True):
"""Return a Dice match statistic for two strings
The value is 1 if the strings are identical and 0 if they have no
bigrams in common.

s1 and s2 are the strings to compare.
casesensitive indicates whether bigrams are case sensitive.
splitwhite indicates whether to treat each word separately
in computing bigrams where a word is defined by whitespace.
Preprocess the text with a regular expression if you want a
different definition of a word."""

a = bigrams(s1, False, splitwhite)
b = bigrams(s2, False, splitwhite)

matches = len(a.intersection(b))
return (matches * 2.) / max(len(a) + len(b), 1)

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621

From: Albert-Jan Roskam <[hidden email]>
To: [hidden email],
Date: 08/07/2013 03:38 AM
Subject: [SPSSX-L] testing string similarity using extendedTransforms
Sent by: "SPSSX(r) Discussion" <[hidden email]>

Hi, I am testing two strings to get an impression how similar they are. I am using Jaro Winkler and edit distance (that other name is such a tongue twister!), but neither measure seems to be precisely what I am looking for. --Jaro Winkler gives a big penalty if the beginning of s1 and s2 differ ("a" vs "-a" gives a Jaro Winkler distance of 0). --The edit distance seems better, but what it does not take into account is the relative number of edits (relative to the length of the string). For instance "a" vs "-a" requires just one edit, but HALF of the string was wrong. I created a "relativeEditDistance" to compensate for this, but is there some "out of the box" string similarity measure? dataset close all. data list free /s1 (a4) s2 (a4). begin data "a" "a" "a" "a." "a" "" "a" "b" "a" "-a" "a" "xxxx" end data. spssinc trans result = jaroWinkler type = 0 /formula "extendedTransforms.jaroWinkler(s1, s2, 0)". spssinc trans result = editDistance type = 0 /formula "extendedTransforms.levenshteindistance(s1, s2)". compute #len_s1 = char.length(rtrim(s1)). compute #len_s2 = char.length(rtrim(s2)). compute #maximum = max(#len_s1, #len_s2). do if ( #len_s1 eq 0 or #len_s2 eq 0 or sysmis(#maximum) ). +compute relativeEditDistance = $sysmis. else. +compute relativeEditDistance = editDistance / #maximum. end if. formats relativeEditDistance (f5.3). list. $B!! (B $B!! (B *output. s1 s2 jaroWinkler editDistance relativeEditDistance a a 1,00 ,00 ,000 a a. ,83 1,00 ,500 a . 1,00 . a b ,00 1,00 1,000 a -a ,00 1,00 ,500 a xxxx ,00 4,00 1,000 $B!! (B Number of cases read: 6 Number of cases listed: 6 $B!! (B Regards, Albert-Jan ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD