Posted by Andy W
URL: http://spssx-discussion.165.s1.nabble.com/How-to-look-for-possible-name-matches-in-restricted-subsets-tp5740478p5740482.html
How about this, Art: calculate the Levenshtein distance between the name
strings, and use a simple binary same/different flag for the sex variable.
This can be done for the whole set, not just a small sample.
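As a rough illustration of those two measures, here is a minimal Python sketch (not SPSS syntax). The function names are made up for this example, and dividing by the longer string's length is only one common way to normalize the distance; treat that as an assumption, not a fixed rule.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Distance scaled to 0..1 by the longer string's length (an assumed choice)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def sex_differs(sex_a: str, sex_b: str) -> int:
    """Binary same/different flag: 1 if the recorded sex differs, else 0."""
    return int(sex_a != sex_b)
```

You would probably want to trim and case-fold the names before comparing them, so that differences in capitalization or padding do not count as edits.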
Then you can combine those two differences somehow (e.g. a flag for
normalized Levenshtein over 0.2 + a flag for different sex), and then
aggregate up to the city level. Or do a scatterplot/boxplot with City on X
and the distances on Y to see whether any city has unusual distances.
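One possible reading of that combination, sketched in Python: add a 0/1 flag for "normalized distance over 0.2" to the 0/1 sex flag, giving a score from 0 to 2 per pair, then average the score within each city. The tuple layout, the function names, and the sample rows are all hypothetical; the 0.2 cutoff is just the example value from above.

```python
from collections import defaultdict
from statistics import mean


def mismatch_score(norm_dist: float, sex_differs: bool, threshold: float = 0.2) -> int:
    """0, 1, or 2: one point for a large name distance, one for a different sex."""
    return int(norm_dist > threshold) + int(sex_differs)


def mean_score_by_city(pairs):
    """pairs: iterable of (city, normalized_distance, sex_differs) tuples."""
    by_city = defaultdict(list)
    for city, norm_dist, sex_diff in pairs:
        by_city[city].append(mismatch_score(norm_dist, sex_diff))
    return {city: mean(scores) for city, scores in by_city.items()}


# Made-up rows purely to show the shape of the input.
pairs = [
    ("A", 0.0, False),
    ("A", 0.1, False),
    ("B", 0.5, True),
    ("B", 0.3, False),
]
result = mean_score_by_city(pairs)
```

The same grouping could carry the raw distances instead of the score if you would rather look at their spread per city.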
To get an estimate of what counts as a reasonable distance, you might
randomly match pairs you know are different (see
https://andrewpwheeler.com/2015/07/01/some-ad-hoc-fuzzy-name-matching-within-police-databases/).
But I bet that even without going to all that trouble, if your assertion is
correct, some cities will show a much larger variance in the distances.
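A small Python sketch of the random-pairing idea, under the assumption that every name in the input belongs to a different person: shuffle the names, pair each with its neighbor in the shuffled order, and look at the resulting distances as a baseline for "known different" pairs. This is only an illustration, not the procedure from the linked post; the function names and the seed parameter are made up here.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def random_pair_distances(names, seed=0):
    """Normalized distances for randomly formed pairs of different records.

    Each name is paired with the next one in a shuffled copy (wrapping
    around), so no record is ever paired with itself when len(names) >= 2.
    """
    rng = random.Random(seed)  # seeded so the baseline is reproducible
    shuffled = list(names)
    rng.shuffle(shuffled)
    distances = []
    for left, right in zip(shuffled, shuffled[1:] + shuffled[:1]):
        longest = max(len(left), len(right))
        distances.append(levenshtein(left, right) / longest if longest else 0.0)
    return distances
```

The low end of that distribution gives a feel for how close two genuinely different names can get by chance, which is one way to pick a cutoff.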
-----
Andy W
[hidden email]
http://andrewpwheeler.wordpress.com/