SPSSX Discussion

Finding Duplicate Names


Eugenio Grant

Finding Duplicate Names

Hi:

I have a BIG database (45,000 records). It has a variable called "name" that
holds the name of the company. I'm trying to do 2 things with it.

1. Identify (not eliminate) duplicate names.

2. Identify similar names, meaning names that are not identical but close. For
example, "Time Warner Inc" and "Time Warner" might be the same company.

Any ideas?

Regards,

ViAnn Beadle

Re: Finding Duplicate Names

I guess a quick start would be to AUTORECODE the name variable and then run
frequencies on the resulting variable. Then scan down the list to look for
close matches.
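
For readers working outside SPSS, the same idea (build a sorted frequency table of the names, then eyeball neighbours) can be sketched in plain Python. This is only an illustration of the approach, not SPSS syntax; the function name and the sample names are made up for the example.

```python
from collections import Counter

def name_frequencies(names):
    """Return (name, count) pairs sorted alphabetically, like a frequency table."""
    counts = Counter(names)
    return sorted(counts.items())

# Hypothetical sample data standing in for the "name" variable.
names = ["Time Warner Inc", "Acme Corp", "Time Warner", "Acme Corp"]
table = name_frequencies(names)

for name, count in table:
    # Exact duplicates show up as count > 1; near matches sort next to each other.
    flag = "  <-- duplicate" if count > 1 else ""
    print(f"{name:<20}{count:>3}{flag}")
```

Because the table is sorted, "Time Warner" and "Time Warner Inc" land on adjacent rows, which is what makes the manual scan workable.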

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Eugenio Grant
Sent: Wednesday, June 06, 2007 12:06 PM
To: [hidden email]
Subject: Finding Duplicate Names
Importance: High


Peck, Jon

Re: Finding Duplicate Names

In reply to this post by Eugenio Grant

This topic has come up a number of times on this list recently. You might want to check the archives.

There are no built-in tools in SPSS for fuzzy matching, but if you can use SPSS programmability (SPSS 14 or later, 15 for some things), you could go about this in two parts.

1. Use regular expressions to "canonicalize" names as much as possible. For example, delete "Inc", standardize capitalization, standardize whitespace, and eliminate periods and commas. Do whatever you can think of that draws common variations together.

2. Apply an algorithm like nysiis to the names to produce an approximate phonetic encoding, assuming that spelling variations may be relevant. (nysiis was really designed for people's names.)
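
The two parts above might look roughly like this in plain Python. This is a sketch under assumptions: the suffix list is hypothetical and should be extended for real data, and classic Soundex is swapped in for nysiis purely because it is short enough to show in full. It does not use the trans or extendedTransforms modules.

```python
import re

# Hypothetical list of legal-form suffixes to drop; extend for your data.
SUFFIXES = r"\b(inc|incorporated|corp|corporation|co|ltd|llc)\b"

def canonicalize(name):
    """Lowercase, drop periods/commas and common suffixes, collapse whitespace."""
    s = name.lower()
    s = re.sub(r"[.,]", " ", s)
    s = re.sub(SUFFIXES, " ", s)
    return re.sub(r"\s+", " ", s).strip()

# Soundex digit for each consonant (stand-in for nysiis, not the same algorithm).
CODES = {c: d
         for d, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                            "4": "l", "5": "mn", "6": "r"}.items()
         for c in letters}

def soundex(word):
    """Classic four-character Soundex code for a single word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    out = []
    prev = CODES.get(first, "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (first.upper() + "".join(out) + "000")[:4]

def phonetic_key(name):
    """Canonicalize, then encode each remaining word phonetically."""
    codes = (soundex(w) for w in canonicalize(name).split())
    return " ".join(c for c in codes if c)

print(canonicalize("Time Warner, Inc."))
print(phonetic_key("Tyme Warnar Inc"), "|", phonetic_key("Time Warner"))
```

Names that share a canonical form or a phonetic key are only candidates for being the same company; they still need a manual check.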

Then use SPSS Identify Duplicate Cases to find exact matches based on the regularized names. You might try it with and without the nysiis transformation.
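
As a rough stand-in for that step outside SPSS, the following sketch groups names by a key and marks the first case in each group as primary, loosely analogous to the indicator Identify Duplicate Cases produces. The function name, the default key (lowercase plus whitespace collapse), and the sample data are assumptions for illustration; in practice you would pass in whatever regularized or phonetic key you built.

```python
from collections import defaultdict

def flag_duplicates(names, key=lambda s: " ".join(s.lower().split())):
    """Return one (name, group_size, is_primary) tuple per input name.

    Names are grouped by key(name); the first name seen in each group
    is marked primary, the rest are marked as duplicates of it.
    """
    groups = defaultdict(list)
    for i, name in enumerate(names):
        groups[key(name)].append(i)

    flags = [None] * len(names)
    for idxs in groups.values():
        for pos, i in enumerate(idxs):
            flags[i] = (names[i], len(idxs), pos == 0)
    return flags

# Hypothetical sample data.
names = ["Time Warner", "time  warner", "Acme"]
flags = flag_duplicates(names)
for row in flags:
    print(row)
```

Nothing is deleted: every input row keeps its position, so the flags can be attached back to the original data as new variables. Running it once with a plain key and once with a phonetic key shows how much the looser match adds.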

The regular expression and nysiis work can be done through programmability, using the trans and extendedTransforms modules on SPSS Developer Central (www.spss.com/devcentral) and some straightforward Python code.

HTH,
Jon Peck
