Context: I suspect that field operations may not have been consistent in ensuring that Pre and Post cases were correctly matched. There are a few hundred HouseIDs in each city. Each case has Country, City, HouseID, PrePost, Name, sex, v1, v2, v3. PrePost has values 1 'Pre' and 2 'Post'.

As a quality check on whether a HouseID plausibly refers to roughly the same people, I am looking for a way to get data to eyeball for Name, sex, v1, v2, v3.

1) Take the first (say) 5 or 10 HouseIDs with 'Pre' on PrePost in a city.
2) See if there is overlap between the set of names in that HouseID and those in any HouseID in the 'Post' set for that city.
3) Output needed: 1 'HouseID with some overlap', 2 'No overlap with any HouseID'.

Overlap means that at least 1 Name from a 'Pre' HouseID has a match. Match means the last word in the set of words is the same AND at least 1 of the other words matches.

-----
Art Kendall
Social Research Consultants
--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
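A minimal Python sketch of the match rule and overlap check described above, in case it helps to pin down the logic before writing SPSS syntax. It is only one reading of the rule: names are split on whitespace and compared case-insensitively, "first 5 or 10 HouseIDs" is taken to mean the lowest HouseIDs in sorted order, and the function names (`names_match`, `overlap_report`) and the dict-per-case layout are hypothetical, not anything from the actual data file.

```python
from collections import defaultdict


def name_words(name):
    """Split a name into lowercase words (assumption: whitespace-delimited)."""
    return name.lower().split()


def names_match(a, b):
    """True if the last words are equal AND at least one other word is shared."""
    wa, wb = name_words(a), name_words(b)
    if not wa or not wb or wa[-1] != wb[-1]:
        return False
    return bool(set(wa[:-1]) & set(wb[:-1]))


def overlap_report(cases, n_pre=5):
    """Check the first n_pre 'Pre' HouseIDs per city against all 'Post' HouseIDs.

    cases: iterable of dicts with keys City, HouseID, PrePost (1=Pre, 2=Post), Name.
    Returns {(city, pre_house_id): sorted list of Post HouseIDs with some overlap}.
    An empty list means 'No overlap with any HouseID'.
    """
    pre = defaultdict(lambda: defaultdict(list))   # city -> HouseID -> names
    post = defaultdict(lambda: defaultdict(list))
    for case in cases:
        target = pre if case["PrePost"] == 1 else post
        target[case["City"]][case["HouseID"]].append(case["Name"])

    report = {}
    for city, houses in pre.items():
        # Assumption: "first" means lowest HouseIDs in sorted order.
        for house_id in sorted(houses)[:n_pre]:
            report[(city, house_id)] = sorted(
                post_id
                for post_id, post_names in post[city].items()
                if any(names_match(p, q)
                       for p in houses[house_id] for q in post_names)
            )
    return report
```

The report keeps the matching Post HouseIDs rather than only a yes/no flag, so the corresponding sex, v1, v2, v3 values can be pulled up side by side for eyeballing.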
I think I understand the problem, but some elements are confusing. It sounds like you're doing a household census. Is Name the FN+LN of one person for a one-person household and FN1+LN1, ..., FN(j)+LN(j) for a j-person household? Or is Name the FN+LN of a specific person in the household?

In your definition of a match, does 'set of words' refer only to the contents of Name, i.e., the FNs and LNs, or does it refer to the contents of Name and sex, v1, v2, v3? Maybe an example?

Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion <[hidden email]> On Behalf Of Art Kendall
Sent: Saturday, April 24, 2021 11:28 AM
To: [hidden email]
Subject: How to look for possible name matches in restricted subsets?
In reply to this post by Art Kendall
How about this, Art: calculate the Levenshtein distance between the name strings, and use just a binary same/different for the sex variable. This can be done for the whole set, not just a small sample. Then you can combine those two differences somehow (e.g. normalized Levenshtein over 0.2 + different sex) and aggregate up to the city level. Or do a scatterplot/boxplot with City on X and distances on Y to see if any city has weird distances.

To get an estimate of what is a reasonable distance, you might randomly match pairs you know are different (https://andrewpwheeler.com/2015/07/01/some-ad-hoc-fuzzy-name-matching-within-police-databases/), but I bet that even without going to all that trouble, if your assertion is correct, some cities will show a much larger variance in the distances.

-----
Andy W
[hidden email]
http://andrewpwheeler.wordpress.com/
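For anyone who wants to try the distance idea outside SPSS, here is a rough Python sketch. It assumes one particular reading of "normalized Levenshtein over 0.2 + different sex": the edit distance is divided by the length of the longer name, a name flag is raised when that ratio exceeds 0.2, and the sex flag is added to it. The helper names and the 0.2 default are illustrative only.

```python
def levenshtein(a, b):
    """Edit distance between two strings (standard two-row dynamic programme)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_levenshtein(a, b):
    """Distance scaled to 0..1 by the longer string's length (an assumption)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def mismatch_flags(pre, post, threshold=0.2):
    """Count of raised flags (0, 1 or 2) for one Pre/Post pair.

    pre, post: dicts with keys Name and sex (hypothetical record layout).
    """
    dist = normalized_levenshtein(pre["Name"].lower(), post["Name"].lower())
    name_flag = dist > threshold
    sex_flag = pre["sex"] != post["sex"]
    return int(name_flag) + int(sex_flag)
```

Averaging `mismatch_flags` (or the raw normalized distances) within each City would give the city-level summary suggested above.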
In reply to this post by Maguin, Eugene
The NGO meant well. They just picked houses here and there before evacuation. After return, they went back but had no way to confirm they went to the same house. <head slap>

I just need a few things to point out that would support my argument that *as is* the data can only be a way to point out problems to avoid in the future.

-----
Art Kendall
Social Research Consultants