Finding Duplicate Names


Eugenio Grant
Hi:

I have a big database (45,000 records). It has a variable called "name" that
holds the name of the company. I'm trying to do two things with it:

1. Identify (not eliminate) duplicate names.

2. Identify similar names, meaning names that are not identical but similar.
For example, "Time Warner Inc" and "Time Warner" might be the same company.

Any ideas?

Regards,

Re: Finding Duplicate Names

ViAnn Beadle
I guess a quick start would be to AUTORECODE the name variable and then run
frequencies on the resulting variable. Then scan down the list to look for
close matches.
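
For readers working outside SPSS, the same idea (tabulate every distinct name, then scan the sorted list) can be sketched in plain Python. This is only an analogue of AUTORECODE plus FREQUENCIES, not SPSS syntax; the function name `name_frequencies` and the sample names are hypothetical.

```python
from collections import Counter

def name_frequencies(names):
    """Count how often each exact name occurs, sorted alphabetically."""
    counts = Counter(names)
    # Alphabetical order places near-identical spellings next to each other,
    # which makes close matches easier to spot by eye.
    return sorted(counts.items(), key=lambda item: item[0].lower())

# Hypothetical sample data, not taken from the original database.
sample = ["Time Warner Inc", "Acme Corp", "Time Warner", "Acme Corp"]
for name, n in name_frequencies(sample):
    flag = "  <-- duplicate" if n > 1 else ""
    print(f"{n:3d}  {name}{flag}")
```

Any name with a count above 1 is an exact duplicate; similar names such as "Time Warner" and "Time Warner Inc" end up on adjacent rows.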

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Eugenio Grant
Sent: Wednesday, June 06, 2007 12:06 PM
To: [hidden email]
Subject: Finding Duplicate Names
Importance: High


Re: Finding Duplicate Names

Peck, Jon
This topic has come up a number of times on this list recently.  You might want to check the archives.

There are no built-in tools in SPSS for fuzzy matching, but if you can use SPSS programmability (SPSS 14 or later, 15 for some things), you could go about this in two parts.
Use regular expressions to "canonicalize" names as much as possible. For example, delete "Inc", standardize capitalization, standardize whitespace, and eliminate periods and commas: whatever you can think of that draws common variations together.
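
A minimal sketch of that canonicalization step, using only Python's standard `re` module. It does not use the trans or extendedTransforms modules; the function name `canonicalize` and the suffix list are assumptions for illustration and would need extending for real data.

```python
import re

# Hypothetical list of legal suffixes to drop; extend it for your own data.
SUFFIXES = re.compile(r"\b(inc|incorporated|corp|corporation|ltd|llc)\b")

def canonicalize(name):
    """Reduce a company name to a rough canonical form."""
    s = name.lower()                     # standardize capitalization
    s = s.replace(".", "")               # eliminate periods ("Inc." -> "inc")
    s = s.replace(",", " ")              # eliminate commas
    s = SUFFIXES.sub(" ", s)             # delete suffixes such as "Inc"
    s = re.sub(r"\s+", " ", s).strip()   # standardize whitespace
    return s

print(canonicalize("Time Warner Inc."))   # time warner
print(canonicalize("  TIME  Warner, Inc"))  # time warner
```

Both sample spellings reduce to the same string, so an exact-match duplicate check on the canonical form would treat them as one company.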

Apply an algorithm like nysiis (although it was really designed for people's names) to the names to produce an approximate phonetic encoding, assuming that spelling variations may be relevant.
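
nysiis is not part of the Python standard library, so the sketch below swaps in a different technique: plain string similarity from the standard `difflib` module. It is not a phonetic encoding and not what the extendedTransforms module does; it only illustrates how spelling variations might be flagged. The function name `similar_pairs` and the 0.85 threshold are assumptions.

```python
from difflib import SequenceMatcher

def similar_pairs(names, threshold=0.85):
    """Return (name_a, name_b, ratio) for each pair at or above threshold."""
    unique = sorted(set(names))
    pairs = []
    # Compare every distinct name with every later one (quadratic cost).
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if ratio >= threshold:
                pairs.append((a, b, round(ratio, 2)))
    return pairs

# Hypothetical sample data with one spelling variation.
for a, b, ratio in similar_pairs(["Time Warner", "Time Warnor", "Acme Corp"]):
    print(a, "~", b, ratio)
```

Because every name is compared with every other, this approach is slow on 45,000 distinct names; in practice it would probably be run only within smaller groups, for example names sharing the same first few letters.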

Then use SPSS Identify Duplicate Cases to find exact matches based on the regularized names.  You might try it with and without the nysiis transformation.
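
Outside SPSS, the exact-match step amounts to grouping cases by their regularized name and reporting every group with more than one member. The sketch below is a plain-Python analogue, not the Identify Duplicate Cases procedure itself; `flag_duplicates` and `simple_key` are hypothetical names, and `simple_key` stands in for whatever regularization is applied.

```python
from collections import defaultdict

def simple_key(name):
    """Hypothetical regularization: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())

def flag_duplicates(names, key=simple_key):
    """Group names by key; keep only groups with more than one member."""
    groups = defaultdict(list)
    for index, name in enumerate(names):
        # Keep the case index so duplicates are identified, not eliminated.
        groups[key(name)].append((index, name))
    return {k: members for k, members in groups.items() if len(members) > 1}

# Hypothetical sample data.
dups = flag_duplicates(["Time Warner", "time  warner", "Acme"])
for k, members in dups.items():
    print(k, "->", members)
```

Passing a different `key` function makes it easy to compare results with and without an extra transformation.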

The regular expression and nysiis work can be done through programmability, using the trans and extendedTransforms modules on SPSS Developer Central (www.spss.com/devcentral) and some straightforward Python code.

HTH,
Jon Peck