Re: How to look for possible name matches in restricted subsets?

Posted by Andy W on
URL: http://spssx-discussion.165.s1.nabble.com/How-to-look-for-possible-name-matches-in-restricted-subsets-tp5740478p5740482.html

How about this, Art: calculate the Levenshtein distance between the name
strings, and use a simple binary same/different indicator for the sex
variable. This can be done for the whole set, not just a small sample.
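
As a rough illustration of the two per-pair measures described above, here is a minimal Python sketch. The function names (levenshtein, normalized_levenshtein, sex_differs) are my own and hypothetical, and dividing by the longer string's length is just one common way to normalize; it is an assumption, not something stated in the post.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: fewest single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Assumed normalization: divide by the longer length, giving 0..1."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def sex_differs(sex_a: str, sex_b: str) -> int:
    """Binary indicator: 1 if the two sex codes differ, else 0."""
    return int(sex_a != sex_b)
```

For example, normalized_levenshtein("smith", "smyth") is one edit over five characters.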

Then you can combine those two differences somehow (e.g. a normalized
Levenshtein distance over 0.2 plus a different sex), and then aggregate up
to the city level. Or do a scatterplot/boxplot with City on X and distances
on Y to see if any city has unusual distances.
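
One possible reading of the combine-then-aggregate step above, sketched in Python. It assumes each pair already has a normalized distance and a 0/1 sex-difference indicator, and it treats "over 0.2 + different sex" as "flag the pair when both hold", which is my interpretation. The field names (city, norm_dist, sex_diff) and the toy rows are made up for illustration only.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Made-up toy rows, one per candidate pair (not real data).
example_pairs = [
    {"city": "A", "norm_dist": 0.0, "sex_diff": 0},
    {"city": "A", "norm_dist": 0.1, "sex_diff": 0},
    {"city": "B", "norm_dist": 0.5, "sex_diff": 1},
    {"city": "B", "norm_dist": 0.1, "sex_diff": 1},
    {"city": "B", "norm_dist": 0.9, "sex_diff": 0},
]


def flag_pair(norm_dist, sex_diff, threshold=0.2):
    """Assumed rule: suspicious if distance > threshold AND sex differs."""
    return int(norm_dist > threshold and sex_diff == 1)


def summarize_by_city(pairs, threshold=0.2):
    """Aggregate pair-level measures up to the city level."""
    grouped = defaultdict(list)
    for row in pairs:
        grouped[row["city"]].append(row)
    summary = {}
    for city, rows in grouped.items():
        dists = [r["norm_dist"] for r in rows]
        flags = [flag_pair(r["norm_dist"], r["sex_diff"], threshold)
                 for r in rows]
        summary[city] = {
            "n_pairs": len(rows),
            "mean_dist": mean(dists),
            "var_dist": pvariance(dists),  # spread of distances in the city
            "share_flagged": mean(flags),  # fraction of flagged pairs
        }
    return summary
```

Calling summarize_by_city(example_pairs) gives one row of summary statistics per city, which is also the shape you would want before drawing the boxplot by city.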

To get an estimate of what is a reasonable distance, you might randomly
match pairs you know are different (see
https://andrewpwheeler.com/2015/07/01/some-ad-hoc-fuzzy-name-matching-within-police-databases/).
But I bet that even without going to all that trouble, if your assertion is
correct, some cities will show a much larger variance in the distances.
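
A hedged sketch of the random-pairs idea above: draw pairs of records known to belong to different people, compute their normalized distances, and look at the low end of that distribution as a reference for "reasonable". This is not the code from the linked post; the record layout (person_id, name), the function names, and the 5% default quantile are all assumptions for illustration.

```python
import random


def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def null_distances(records, n_pairs=1000, seed=0):
    """records: list of (person_id, name) tuples. Pairs with different
    ids are treated as known non-matches (an assumption)."""
    if len({pid for pid, _ in records}) < 2:
        raise ValueError("need at least two distinct person ids")
    rng = random.Random(seed)  # seeded so the draw is reproducible
    distances = []
    while len(distances) < n_pairs:
        (id1, name1), (id2, name2) = rng.sample(records, 2)
        if id1 == id2:
            continue  # same person, not a known-different pair
        longest = max(len(name1), len(name2))
        dist = levenshtein(name1, name2) / longest if longest else 0.0
        distances.append(dist)
    return distances


def lower_quantile(values, q=0.05):
    """Crude empirical quantile: the value at position q in sorted order."""
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]
```

Observed pairs whose distance falls well below lower_quantile(null_distances(records)) are closer than most known-different pairs, so they are the ones worth a second look.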



-----
Andy W
[hidden email]
http://andrewpwheeler.wordpress.com/