This topic has come up a number of times on this list recently. You might want to check the archives.
There are no built-in tools in SPSS for fuzzy matching, but if you can use SPSS programmability (SPSS 14 or later, 15 for some things), you could go about this in two parts.
Use regular expressions to "canonicalize" names as much as possible. For example, delete "Inc", standardize capitalization, standardize whitespace, and eliminate periods and commas. Do whatever you can think of that draws common variations together.
Apply an algorithm like nysiis to the names to produce an approximate phonetic encoding, on the assumption that spelling variations may be relevant. Note that nysiis was really designed for people's names, so results on company names may be rough.
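As a rough illustration of the canonicalization step, here is a minimal sketch in plain Python using only the standard re module. It does not use the SPSS trans or extendedTransforms modules; the function name canonicalize and the suffix list are hypothetical and would need tuning for real data.

```python
import re

# Hypothetical list of legal-form suffixes to strip; extend it for your data.
SUFFIX_PATTERN = re.compile(r"(?:\s(?:inc|incorporated|corp|corporation|co|ltd|llc))+$")

def canonicalize(name):
    """Return a simplified form of a company name for exact matching."""
    s = name.lower()                    # standardize capitalization
    s = s.replace(",", " ")             # commas become spaces
    s = s.replace(".", "")              # periods are dropped
    s = re.sub(r"\s+", " ", s).strip()  # standardize whitespace
    s = SUFFIX_PATTERN.sub("", s)       # drop trailing "inc", "corp", etc.
    return s

print(canonicalize("Time Warner Inc"))
print(canonicalize("Time  Warner, Inc."))
```

Both calls should yield the same string, so the two spellings would match exactly in a later duplicate check. Suffixes are only stripped at the end of the name, so a word like "co" in the middle of a name is left alone.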
Then use SPSS Identify Duplicate Cases to find exact matches based on the regularized names. You might try it with and without the nysiis transformation.
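Outside SPSS, the same idea of flagging (not deleting) exact matches on a regularized key can be sketched in plain Python. This is only an illustration, not what Identify Duplicate Cases does internally; simple_key and find_duplicates are hypothetical names, and simple_key is a placeholder for whatever canonicalization or phonetic encoding you settle on.

```python
from collections import defaultdict

def simple_key(name):
    # Placeholder key: lowercase and collapse whitespace only.
    return " ".join(name.lower().split())

def find_duplicates(names, key=simple_key):
    """Map each key shared by 2+ records to the list of record positions."""
    groups = defaultdict(list)
    for index, name in enumerate(names):
        groups[key(name)].append(index)
    # Keep only keys that occur more than once; nothing is removed from names.
    return {k: idx for k, idx in groups.items() if len(idx) > 1}

names = ["Time Warner", "time  warner", "Acme Ltd", "TIME WARNER"]
dupes = find_duplicates(names)
print(dupes)
```

Passing a different key function lets you compare the results with and without a phonetic transformation.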
The regular expression and nysiis work can be done through programmability, using the trans and extendedTransforms modules from SPSS Developer Central (www.spss.com/devcentral) and some straightforward Python code.
HTH,
Jon Peck
-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Eugenio Grant
Sent: Wednesday, June 06, 2007 1:06 PM
To: [hidden email]
Subject: [SPSSX-L] Finding Duplicate Names
Importance: High
Hi:
I have a BIG database (45,000 records). It has a variable called "name" that
holds the name of the company. I'm trying to do 2 things with it.
1. Identify (not eliminate) duplicate names.
2. Identify similar names, meaning not the same but similar. For example,
Time Warner Inc and Time Warner might be the same.
Any ideas?
Regards,