Merging files, add variables

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

Merging files, add variables

Osterlind, Steven J.

A perplexing problem:

 

Two data sets: Set 1 & Set 2. Set 1 is records for individual, and many individuals have two or more records. Set 2 is addresses, one for each individual. So, I want to merge the addresses of Set 2 with the into Set 1 (Mergeàadd variables).  SSN appears in both data sets and is my keyed variable. However,  when I merge the data sets, the address of Set 1 appears only for the first occurrence of a given SSN.  I want to addresses to appear at every  occurrence of a given SSN.  I’ve tried all combinations I can think of with Key table/variable.  

Thanks for assistance,

 

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Steven J. Osterlind, Ph.D., Professor

Measurement & Statistics

University of Missouri

(573) 268-7514 - cell

[hidden email]

 

Reply | Threaded
Open this post in threaded view
|

Re: Merging files, add variables

Maguin, Eugene
Steven,

Is this the syntax that you are using


Match files file=set1_individuals/table=set2_addresses/by ssn.

AND both filesa are sorted already by SSN?

Gene Maguin

________________________________

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Osterlind, Steven J.
Sent: Thursday, September 02, 2010 2:07 PM
To: [hidden email]
Subject: Merging files, add variables

A perplexing problem:

Two data sets: Set 1 & Set 2. Set 1 is records for individual, and many
individuals have two or more records. Set 2 is addresses, one for each
individual. So, I want to merge the addresses of Set 2 with the into Set 1
(Mergeàadd variables).  SSN appears in both data sets and is my keyed
variable. However,  when I merge the data sets, the address of Set 1 appears
only for the first occurrence of a given SSN.  I want to addresses to appear
at every  occurrence of a given SSN.  I’ve tried all combinations I can
think of with Key table/variable.

Thanks for assistance,
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Steven J. Osterlind, Ph.D., Professor
Measurement & Statistics
University of Missouri
(573) 268-7514 - cell
[hidden email]

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Merging files, add variables

Zdaniuk, Bozena-3
hm, interesting... I dealt with this problem using LAG function to fill in the values of the merged in variable. But will try Gene's solution next time.
Steven, please let us know if it works, will you? If not, i will send you the LAG syntax that i use.
bozena

----- Original Message -----
From: "Gene Maguin" <[hidden email]>
To: [hidden email]
Sent: Thursday, September 2, 2010 11:16:17 AM GMT -08:00 US/Canada Pacific
Subject: Re: Merging files, add variables

Steven,

Is this the syntax that you are using


Match files file=set1_individuals/table=set2_addresses/by ssn.

AND both filesa are sorted already by SSN?

Gene Maguin

________________________________

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Osterlind, Steven J.
Sent: Thursday, September 02, 2010 2:07 PM
To: [hidden email]
Subject: Merging files, add variables

A perplexing problem:

Two data sets: Set 1 & Set 2. Set 1 is records for individual, and many
individuals have two or more records. Set 2 is addresses, one for each
individual. So, I want to merge the addresses of Set 2 with the into Set 1
(Merge� add variables).  SSN appears in both data sets and is my keyed
variable. However,  when I merge the data sets, the address of Set 1 appears
only for the first occurrence of a given SSN.  I want to addresses to appear
at every  occurrence of a given SSN.  I’ve tried all combinations I can
think of with Key table/variable.

Thanks for assistance,
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Steven J. Osterlind, Ph.D., Professor
Measurement & Statistics
University of Missouri
(573) 268-7514 - cell
[hidden email]

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Merging files, add variables

J P-6
For what it's worth. if you can move these files into a database (e.g., Access) a match like this is a quick and easy many-to-one using SQL. One of my wishes is to be able to run SQL within SPSS without jumping through a bunch of hoops.
John

From: Bozena Zdaniuk <[hidden email]>
To: [hidden email]
Sent: Thu, September 2, 2010 2:27:18 PM
Subject: Re: Merging files, add variables

hm, interesting... I dealt with this problem using LAG function to fill in the values of the merged in variable. But will try Gene's solution next time.
Steven, please let us know if it works, will you? If not, i will send you the LAG syntax that i use.
bozena

----- Original Message -----
From: "Gene Maguin" <[hidden email]>
To: [hidden email]
Sent: Thursday, September 2, 2010 11:16:17 AM GMT -08:00 US/Canada Pacific
Subject: Re: Merging files, add variables

Steven,

Is this the syntax that you are using


Match files file=set1_individuals/table=set2_addresses/by ssn.

AND both filesa are sorted already by SSN?

Gene Maguin

________________________________

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Osterlind, Steven J.
Sent: Thursday, September 02, 2010 2:07 PM
To: [hidden email]
Subject: Merging files, add variables

A perplexing problem:

Two data sets: Set 1 & Set 2. Set 1 is records for individual, and many
individuals have two or more records. Set 2 is addresses, one for each
individual. So, I want to merge the addresses of Set 2 with the into Set 1
(Mergeà add variables).  SSN appears in both data sets and is my keyed
variable. However,  when I merge the data sets, the address of Set 1 appears
only for the first occurrence of a given SSN.  I want to addresses to appear
at every  occurrence of a given SSN.  I’ve tried all combinations I can
think of with Key table/variable.

Thanks for assistance,
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Steven J. Osterlind, Ph.D., Professor
Measurement & Statistics
University of Missouri
(573) 268-7514 - cell
[hidden email]

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Reply | Threaded
Open this post in threaded view
|

Name Normalization

Kevan Edwards (MDH)

Hello all...

 

Is anyone aware of a process, or a data file that can be use to normalize first names?

 

My goal is to be able to de-duplicate a data file that was put together from several sources of data by converting all instances or Bill, Billy, Willy, William,  to William and all instances of Rob, Bob, Bobby, Robby, Robbie, Robert to Robert.

 

I envision using “IF” “THEN” syntax structures pointing to a data file with two variables, first the specific instance of the first name and second the normalized (standardized) format of that name.

 

However, I need to find the data file with common variations and a normalized version of first names and I haven’t been able to find one to assist the automation of this process..

 

Thanks.

 

Kevan

 

----------------------------

Kevan Edwards Ph.D.

Research Scientist III
Health Economics Program, DHP/MDH

651-201-3551

Reply | Threaded
Open this post in threaded view
|

Re: Name Normalization

Bruce Weaver
Administrator
Kevan Edwards (MDH) wrote
Hello all...

Is anyone aware of a process, or a data file that can be use to normalize first names?

My goal is to be able to de-duplicate a data file that was put together from several sources of data by converting all instances or Bill, Billy, Willy, William,  to William and all instances of Rob, Bob, Bobby, Robby, Robbie, Robert to Robert.

I envision using "IF" "THEN" syntax structures pointing to a data file with two variables, first the specific instance of the first name and second the normalized (standardized) format of that name.

However, I need to find the data file with common variations and a normalized version of first names and I haven't been able to find one to assist the automation of this process..

Thanks.

Kevan
I'd start by converting everything to uppercase (see the UPCASE function).

Are William and Robert (and their variations) the only names you need to worry about?  Or were they just given as examples to illustrate the problem?  If they are the only ones, or even if the number of names is relatively small, I'd probably just use RECODE, like this:

RECODE name
 ("BILL" "BILLY" "WILLY" "WILLIAM" = "WILLIAM")
 ("ROB" "BOB" "BOBBY" "ROBBY" "ROBBIE" "ROBERT" = "ROBERT) .
EXE.

Another option you could explore is creating a separate file to serve as a look-up table.  E.g.,

Name    Name2
"BILL"  "WILLIAM"
"BILLY"  "WILLIAM"
"WILLY"  "WILLIAM"
"WILLIAM" "WILLIAM"
"ROB"  "ROBERT"
"BOB"  "ROBERT"
"BOBBY"  "ROBERT"
"ROBBY"  "ROBERT"
"ROBBIE"  "ROBERT"
"ROBERT" "ROBERT"

Use MATCH FILES to bring variable NAME2 into the original file (matching on variable NAME), and where NAME2 does not equal NAME (in the original file), replace NAME with NAME2.  I.e.,

match files
 file = 'original' /
 table = 'lookuptable' /
 by NAME .
exe.

if (name NE name2) name = name2.

This second method would probably be neater if there are a lot of names to worry about.  And note that both data sets would have to be sorted on NAME before doing the match.

HTH.
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
Reply | Threaded
Open this post in threaded view
|

Re: Name Normalization

Albert-Jan Roskam
Hi,

You can do that using a part of the Febrl package: http://datamining.anu.edu.au/software/febrl/febrldoc/node8.html
I used Febrl before, but not for name standardization, only for its main purpose, probabilistic data linkage.
But it's possible to use only pyStandard.py for the standardization and not use other program features.
 
Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



From: Bruce Weaver <[hidden email]>
To: [hidden email]
Sent: Thu, September 2, 2010 9:59:49 PM
Subject: Re: [SPSSX-L] Name Normalization

Kevan Edwards (MDH) wrote:

>
> Hello all...
>
> Is anyone aware of a process, or a data file that can be use to normalize
> first names?
>
> My goal is to be able to de-duplicate a data file that was put together
> from several sources of data by converting all instances or Bill, Billy,
> Willy, William,  to William and all instances of Rob, Bob, Bobby, Robby,
> Robbie, Robert to Robert.
>
> I envision using "IF" "THEN" syntax structures pointing to a data file
> with two variables, first the specific instance of the first name and
> second the normalized (standardized) format of that name.
>
> However, I need to find the data file with common variations and a
> normalized version of first names and I haven't been able to find one to
> assist the automation of this process..
>
> Thanks.
>
> Kevan
>
>

I'd start by converting everything to uppercase (see the UPCASE function).

Are William and Robert (and their variations) the only names you need to
worry about?  Or were they just given as examples to illustrate the problem?
If they are the only ones, or even if the number of names is relatively
small, I'd probably just use RECODE, like this:

RECODE name
("BILL" "BILLY" "WILLY" "WILLIAM" = "WILLIAM")
("ROB" "BOB" "BOBBY" "ROBBY" "ROBBIE" "ROBERT" = "ROBERT) .
EXE.

Another option you could explore is creating a separate file to serve as a
look-up table.  E.g.,

Name    Name2
"BILL"  "WILLIAM"
"BILLY"  "WILLIAM"
"WILLY"  "WILLIAM"
"WILLIAM" "WILLIAM"
"ROB"  "ROBERT"
"BOB"  "ROBERT"
"BOBBY"  "ROBERT"
"ROBBY"  "ROBERT"
"ROBBIE"  "ROBERT"
"ROBERT" "ROBERT"

Use MATCH FILES to bring variable NAME2 into the original file (matching on
variable NAME), and where NAME2 does not equal NAME (in the original file),
replace NAME with NAME2.  I.e.,

match files
file = 'original' /
table = 'lookuptable' /
by NAME .
exe.

if (name NE name2) name = name2.

This second method would probably be neater if there are a lot of names to
worry about.  And note that both data sets would have to be sorted on NAME
before doing the match.

HTH.


-----
--
Bruce Weaver
[hidden email]
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

NOTE: My Hotmail account is not monitored regularly.
To send me an e-mail, please use the address shown above.

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Merging-files-add-variables-tp2801151p2801372.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Reply | Threaded
Open this post in threaded view
|

Re: Merging files, add variables

Jon K Peck
In reply to this post by Osterlind, Steven J.

The following syntax does what I think you are asking.  I don't know why you are not getting the expected result.

* table lookup dataset.
data list free /key value.
begin data.
1 100
2 200
4 400
end data.
dataset name lookup.

* main dataset with duplicate key.
data list free/ key data.
begin data.
1 123
1 456
2 789
3 012
4 1234
end data.
dataset name main.

DATASET ACTIVATE main.
MATCH FILES /FILE=*
  /TABLE='lookup'
  /BY key.
list.
* output:
     key     data    value

    1.00   123.00   100.00
    1.00   456.00   100.00
    2.00   789.00   200.00
    3.00    12.00      .
    4.00  1234.00   400.00

 
Regards,
Jon Peck

Jon Peck
SPSS, an IBM Company
[hidden email]
312-651-3435



From: "Osterlind, Steven J." <[hidden email]>
To: [hidden email]
Date: 09/02/2010 12:17 PM
Subject: [SPSSX-L] Merging files, add variables
Sent by: "SPSSX(r) Discussion" <[hidden email]>





A perplexing problem:
 
Two data sets: Set 1 & Set 2. Set 1 is records for individual, and many individuals have two or more records. Set 2 is addresses, one for each individual. So, I want to merge the addresses of Set 2 with the into Set 1 (Mergeàadd variables).  SSN appears in both data sets and is my keyed variable. However,  when I merge the data sets, the address of Set 1 appears only for the first occurrence of a given SSN.  I want to addresses to appear at every  occurrence of a given SSN.  I’ve tried all combinations I can think of with Key table/variable.  
Thanks for assistance,
 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Steven J. Osterlind, Ph.D., Professor
Measurement & Statistics
University of Missouri
(573) 268-7514 - cell
[hidden email]
 

Reply | Threaded
Open this post in threaded view
|

Re: Merging files, add variables

Jennifer S. Randolph
In reply to this post by Osterlind, Steven J.
Steven-

I run into a similar situation for a database I work with.  I often have to merge genetic data onto every research visit for a given person in our main database.  The genetic information is static, so I want it to appear on every one of (often) multiple visits.  In my experience, the syntax people have offered does not work if say, in your case, the address variable already exists in the dataset you are seeking to add info onto.  So, for example, as we run genetics tests and I have to update this field on our main database, I cannot use that syntax since the genetics fields already exist, and in some cases already have data.  Also, I do not know how to make the "update" command work for a one to many situation...  the only workaround I have found is to keep a cumulative file of all genetics results, and every time I have to update, i strip off all the genetics varaibles on the main dataset, and then run the merge with the syntax people have offered here.

I am not an adept SPSS user, having dome my graduate work on SAS, so I would love any thoughts experienced SPSS users have on this-- and I am not sure this is your problem, but wondering if it is...
Jennifer


"Osterlind, Steven J." <[hidden email]>
To:
[hidden email]
Date:
09/02/2010 12:17 PM
Subject:
[SPSSX-L] Merging files, add variables
Sent by:
"SPSSX(r) Discussion" <[hidden email]>



A perplexing problem:

Two data sets: Set 1 & Set 2. Set 1 is records for individual, and many
individuals have two or more records. Set 2 is addresses, one for each
individual. So, I want to merge the addresses of Set 2 with the into Set 1
(Mergeàadd variables).  SSN appears in both data sets and is my keyed
variable. However,  when I merge the data sets, the address of Set 1
appears only for the first occurrence of a given SSN.  I want to addresses
to appear at every  occurrence of a given SSN.  I?ve tried all
combinations I can think of with Key table/variable.
Thanks for assistance,
--- end of quote ---



Jennifer Stone Randolph, M.Sc.
Brain Imaging Laboratory
Department of Psychiatry

Confidentiality Notice: This e-mail transmission is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution violates confidentiality and privacy laws and is prohibited. If you are not the intended recipient, please contact the sender immediately and destroy all copies of the message. Thank you for your cooperation.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Name Normalization

Barnett, Adrian (DECD)
In reply to this post by Kevan Edwards (MDH)

Hi Kevan

If your file has <10,000 records you can use the free version of LinkageWiz to do the de-duplication for you.

It has a table built in which has all the variations on each name (it refers to these as “nicknames”) which it uses as part of its fuzzy matching routine (it also uses a NYSIIS or a Soundex match, and optionally will use a string similarity measure as well). Exact matches, nickname matches and phonetic matches all get different weights in computing a match score.

 

You can find out more at www.linkagewiz.com

 

If you have more than 10,000 cases, I’d recommend looking at FEBRL, which is free. You can find out more about FEBRL here:

http://datamining.anu.edu.au/software/febrl/febrldoc/

 

What you are trying to do can in principle be done in SPSS (sort of), but it would be very hard to do it well, and would probably take more time than you have.

 

Adrian Barnett

Project Officer

Educational Measurement and Analysis

Data and Educational Measurement

DECS

ph 82261080

 


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Edwards, Kevan (MDH)
Sent: Friday, 3 September 2010 4:40 AM
To: [hidden email]
Subject: Name Normalization

 

Hello all...

 

Is anyone aware of a process, or a data file that can be use to normalize first names?

 

My goal is to be able to de-duplicate a data file that was put together from several sources of data by converting all instances or Bill, Billy, Willy, William,  to William and all instances of Rob, Bob, Bobby, Robby, Robbie, Robert to Robert.

 

I envision using “IF” “THEN” syntax structures pointing to a data file with two variables, first the specific instance of the first name and second the normalized (standardized) format of that name.

 

However, I need to find the data file with common variations and a normalized version of first names and I haven’t been able to find one to assist the automation of this process..

 

Thanks.

 

Kevan

 

----------------------------

Kevan Edwards Ph.D.

Research Scientist III
Health Economics Program, DHP/MDH

651-201-3551

Reply | Threaded
Open this post in threaded view
|

Re: Name Normalization

Raffe, Sydelle, SSA

How about address normalization software?

 


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Barnett, Adrian (DECS)
Sent: Sunday, September 05, 2010 9:43 PM
To: [hidden email]
Subject: Re: Name Normalization

 

Hi Kevan

If your file has <10,000 records you can use the free version of LinkageWiz to do the de-duplication for you.

It has a table built in which has all the variations on each name (it refers to these as “nicknames”) which it uses as part of its fuzzy matching routine (it also uses a NYSIIS or a Soundex match, and optionally will use a string similarity measure as well). Exact matches, nickname matches and phonetic matches all get different weights in computing a match score.

 

You can find out more at www.linkagewiz.com

 

If you have more than 10,000 cases, I’d recommend looking at FEBRL, which is free. You can find out more about FEBRL here:

http://datamining.anu.edu.au/software/febrl/febrldoc/

 

What you are trying to do can in principle be done in SPSS (sort of), but it would be very hard to do it well, and would probably take more time than you have.

 

Adrian Barnett

Project Officer

Educational Measurement and Analysis

Data and Educational Measurement

DECS

ph 82261080

 


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Edwards, Kevan (MDH)
Sent: Friday, 3 September 2010 4:40 AM
To: [hidden email]
Subject: Name Normalization

 

Hello all...

 

Is anyone aware of a process, or a data file that can be use to normalize first names?

 

My goal is to be able to de-duplicate a data file that was put together from several sources of data by converting all instances or Bill, Billy, Willy, William,  to William and all instances of Rob, Bob, Bobby, Robby, Robbie, Robert to Robert.

 

I envision using “IF” “THEN” syntax structures pointing to a data file with two variables, first the specific instance of the first name and second the normalized (standardized) format of that name.

 

However, I need to find the data file with common variations and a normalized version of first names and I haven’t been able to find one to assist the automation of this process..

 

Thanks.

 

Kevan

 

----------------------------

Kevan Edwards Ph.D.

Research Scientist III
Health Economics Program, DHP/MDH

651-201-3551

Reply | Threaded
Open this post in threaded view
|

Re: Name Normalization

Heidi Green

My company uses several Address Standardization Software packages.

I have experience with Melissa Data’s “Mailers+4” software, which I am quite happy with. It standardizes the addresses to postal specifications (at least for the U.S. & Canada), adds other geographic information such as a latitude / longitude or census information. It can also identify duplicates based on addresses or name + address.

You can learn more about them at:

http://www.melissadata.com/

Hope that helps,

-Heidi


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Raffe, Sydelle, SSA
Sent: Wednesday, September 08, 2010 1:28 PM
To: [hidden email]
Subject: Re: Name Normalization

 

How about address normalization software?

 


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Barnett, Adrian (DECS)
Sent: Sunday, September 05, 2010 9:43 PM
To: [hidden email]
Subject: Re: Name Normalization

 

Hi Kevan

If your file has <10,000 records you can use the free version of LinkageWiz to do the de-duplication for you.

It has a table built in which has all the variations on each name (it refers to these as “nicknames”) which it uses as part of its fuzzy matching routine (it also uses a NYSIIS or a Soundex match, and optionally will use a string similarity measure as well). Exact matches, nickname matches and phonetic matches all get different weights in computing a match score.

 

You can find out more at www.linkagewiz.com

 

If you have more than 10,000 cases, I’d recommend looking at FEBRL, which is free. You can find out more about FEBRL here:

http://datamining.anu.edu.au/software/febrl/febrldoc/

 

What you are trying to do can in principle be done in SPSS (sort of), but it would be very hard to do it well, and would probably take more time than you have.

 

Adrian Barnett

Project Officer

Educational Measurement and Analysis

Data and Educational Measurement

DECS

ph 82261080

 


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Edwards, Kevan (MDH)
Sent: Friday, 3 September 2010 4:40 AM
To: [hidden email]
Subject: Name Normalization

 

Hello all...

 

Is anyone aware of a process, or a data file that can be use to normalize first names?

 

My goal is to be able to de-duplicate a data file that was put together from several sources of data by converting all instances or Bill, Billy, Willy, William,  to William and all instances of Rob, Bob, Bobby, Robby, Robbie, Robert to Robert.

 

I envision using “IF” “THEN” syntax structures pointing to a data file with two variables, first the specific instance of the first name and second the normalized (standardized) format of that name.

 

However, I need to find the data file with common variations and a normalized version of first names and I haven’t been able to find one to assist the automation of this process..

 

Thanks.

 

Kevan

 

----------------------------

Kevan Edwards Ph.D.

Research Scientist III
Health Economics Program, DHP/MDH

651-201-3551



____________
DefenderMX2.
Reply | Threaded
Open this post in threaded view
|

Re: Name Normalization

Barnett, Adrian (DECD)
In reply to this post by Raffe, Sydelle, SSA

Hi

Both the packages I mentioned do address standardization, and, if you have a geocoded address reference file, will also geocode addresses against that file.

 

Adrian Barnett

Project Officer

Educational Measurement and Analysis

Data and Educational Measurement

DECS

ph 82261080

 


From: Raffe, Sydelle, SSA [mailto:[hidden email]]
Sent: Thursday, 9 September 2010 5:58 AM
To: Barnett, Adrian (DECS); [hidden email]
Subject: RE: Name Normalization

 

How about address normalization software?

 


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Barnett, Adrian (DECS)
Sent: Sunday, September 05, 2010 9:43 PM
To: [hidden email]
Subject: Re: Name Normalization

 

Hi Kevan

If your file has <10,000 records you can use the free version of LinkageWiz to do the de-duplication for you.

It has a table built in which has all the variations on each name (it refers to these as “nicknames”) which it uses as part of its fuzzy matching routine (it also uses a NYSIIS or a Soundex match, and optionally will use a string similarity measure as well). Exact matches, nickname matches and phonetic matches all get different weights in computing a match score.

 

You can find out more at www.linkagewiz.com

 

If you have more than 10,000 cases, I’d recommend looking at FEBRL, which is free. You can find out more about FEBRL here:

http://datamining.anu.edu.au/software/febrl/febrldoc/

 

What you are trying to do can in principle be done in SPSS (sort of), but it would be very hard to do it well, and would probably take more time than you have.

 

Adrian Barnett

Project Officer

Educational Measurement and Analysis

Data and Educational Measurement

DECS

ph 82261080

 


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Edwards, Kevan (MDH)
Sent: Friday, 3 September 2010 4:40 AM
To: [hidden email]
Subject: Name Normalization

 

Hello all...

 

Is anyone aware of a process, or a data file that can be use to normalize first names?

 

My goal is to be able to de-duplicate a data file that was put together from several sources of data by converting all instances or Bill, Billy, Willy, William,  to William and all instances of Rob, Bob, Bobby, Robby, Robbie, Robert to Robert.

 

I envision using “IF” “THEN” syntax structures pointing to a data file with two variables, first the specific instance of the first name and second the normalized (standardized) format of that name.

 

However, I need to find the data file with common variations and a normalized version of first names and I haven’t been able to find one to assist the automation of this process..

 

Thanks.

 

Kevan

 

----------------------------

Kevan Edwards Ph.D.

Research Scientist III
Health Economics Program, DHP/MDH

651-201-3551