SPSSX Discussion

Merging files, add variables

Classic

List

Threaded

13 messages Options

Osterlind, Steven J.

Merging files, add variables

A perplexing problem:

Two data sets: Set 1 & Set 2. Set 1 is records for individual, and many individuals have two or more records. Set 2 is addresses, one for each individual. So, I want to merge the addresses of Set 2 with the into Set 1 (Mergeàadd variables). SSN appears in both data sets and is my keyed variable. However, when I merge the data sets, the address of Set 1 appears only for the first occurrence of a given SSN. I want to addresses to appear at every occurrence of a given SSN. I’ve tried all combinations I can think of with Key table/variable.

Thanks for assistance,

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Steven J. Osterlind, Ph.D., Professor

Measurement & Statistics

University of Missouri

(573) 268-7514 - cell

[hidden email]

Maguin, Eugene

Re: Merging files, add variables

Steven,

Is this the syntax that you are using

Match files file=set1_individuals/table=set2_addresses/by ssn.

AND both filesa are sorted already by SSN?

Gene Maguin

________________________________

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Osterlind, Steven J.
Sent: Thursday, September 02, 2010 2:07 PM
To: [hidden email]
Subject: Merging files, add variables

A perplexing problem:

Two data sets: Set 1 & Set 2. Set 1 is records for individual, and many
individuals have two or more records. Set 2 is addresses, one for each
individual. So, I want to merge the addresses of Set 2 with the into Set 1
(Mergeàadd variables). SSN appears in both data sets and is my keyed
variable. However, when I merge the data sets, the address of Set 1 appears
only for the first occurrence of a given SSN. I want to addresses to appear
at every occurrence of a given SSN. Ive tried all combinations I can
think of with Key table/variable.

Thanks for assistance,
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Steven J. Osterlind, Ph.D., Professor
Measurement & Statistics
University of Missouri
(573) 268-7514 - cell
[hidden email]

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Zdaniuk, Bozena-3

Re: Merging files, add variables

hm, interesting... I dealt with this problem using LAG function to fill in the values of the merged in variable. But will try Gene's solution next time.
Steven, please let us know if it works, will you? If not, i will send you the LAG syntax that i use.
bozena

----- Original Message -----
From: "Gene Maguin" <[hidden email]>
To: [hidden email]
Sent: Thursday, September 2, 2010 11:16:17 AM GMT -08:00 US/Canada Pacific
Subject: Re: Merging files, add variables

Steven,

Is this the syntax that you are using

Match files file=set1_individuals/table=set2_addresses/by ssn.

AND both filesa are sorted already by SSN?

Gene Maguin

________________________________

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Osterlind, Steven J.
Sent: Thursday, September 02, 2010 2:07 PM
To: [hidden email]
Subject: Merging files, add variables

A perplexing problem:

Two data sets: Set 1 & Set 2. Set 1 is records for individual, and many
individuals have two or more records. Set 2 is addresses, one for each
individual. So, I want to merge the addresses of Set 2 with the into Set 1
(Merge� add variables). SSN appears in both data sets and is my keyed
variable. However, when I merge the data sets, the address of Set 1 appears
only for the first occurrence of a given SSN. I want to addresses to appear
at every occurrence of a given SSN. I’ve tried all combinations I can
think of with Key table/variable.

Thanks for assistance,
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Steven J. Osterlind, Ph.D., Professor
Measurement & Statistics
University of Missouri
(573) 268-7514 - cell
[hidden email]

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

J P-6

Re: Merging files, add variables

For what it's worth. if you can move these files into a database (e.g., Access) a match like this is a quick and easy many-to-one using SQL. One of my wishes is to be able to run SQL within SPSS without jumping through a bunch of hoops.

John

From: Bozena Zdaniuk <[hidden email]>
To: [hidden email]
Sent: Thu, September 2, 2010 2:27:18 PM
Subject: Re: Merging files, add variables

hm, interesting... I dealt with this problem using LAG function to fill in the values of the merged in variable. But will try Gene's solution next time.
Steven, please let us know if it works, will you? If not, i will send you the LAG syntax that i use.
bozena

----- Original Message -----
From: "Gene Maguin" <[hidden email]>
To: [hidden email]
Sent: Thursday, September 2, 2010 11:16:17 AM GMT -08:00 US/Canada Pacific
Subject: Re: Merging files, add variables

Steven,

Is this the syntax that you are using

Match files file=set1_individuals/table=set2_addresses/by ssn.

AND both filesa are sorted already by SSN?

Gene Maguin

________________________________

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Osterlind, Steven J.
Sent: Thursday, September 02, 2010 2:07 PM
To: [hidden email]
Subject: Merging files, add variables

A perplexing problem:

Two data sets: Set 1 & Set 2. Set 1 is records for individual, and many
individuals have two or more records. Set 2 is addresses, one for each
individual. So, I want to merge the addresses of Set 2 with the into Set 1
(MergeÃ add variables). SSN appears in both data sets and is my keyed
variable. However, when I merge the data sets, the address of Set 1 appears
only for the first occurrence of a given SSN. I want to addresses to appear
at every occurrence of a given SSN. Iâ€™ve tried all combinations I can
think of with Key table/variable.

Thanks for assistance,
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Steven J. Osterlind, Ph.D., Professor
Measurement & Statistics
University of Missouri
(573) 268-7514 - cell
[hidden email]

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Kevan Edwards (MDH)

Name Normalization

Hello all...

Is anyone aware of a process, or a data file that can be use to normalize first names?

My goal is to be able to de-duplicate a data file that was put together from several sources of data by converting all instances or Bill, Billy, Willy, William, to William and all instances of Rob, Bob, Bobby, Robby, Robbie, Robert to Robert.

I envision using “IF” “THEN” syntax structures pointing to a data file with two variables, first the specific instance of the first name and second the normalized (standardized) format of that name.

However, I need to find the data file with common variations and a normalized version of first names and I haven’t been able to find one to assist the automation of this process..

Thanks.

Kevan

----------------------------

Kevan Edwards Ph.D.

Research Scientist III
Health Economics Program, DHP/MDH

651-201-3551

Bruce Weaver

Re: Name Normalization

Administrator

Kevan Edwards (MDH) wrote

Hello all...

Is anyone aware of a process, or a data file that can be use to normalize first names?

My goal is to be able to de-duplicate a data file that was put together from several sources of data by converting all instances or Bill, Billy, Willy, William, to William and all instances of Rob, Bob, Bobby, Robby, Robbie, Robert to Robert.

I envision using "IF" "THEN" syntax structures pointing to a data file with two variables, first the specific instance of the first name and second the normalized (standardized) format of that name.

However, I need to find the data file with common variations and a normalized version of first names and I haven't been able to find one to assist the automation of this process..

Thanks.

Kevan

--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Albert-Jan Roskam

Re: Name Normalization

Hi,

You can do that using a part of the Febrl package: http://datamining.anu.edu.au/software/febrl/febrldoc/node8.html
I used Febrl before, but not for name standardization, only for its main purpose, probabilistic data linkage.
But it's possible to use only pyStandard.py for the standardization and not use other program features.

Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

From: Bruce Weaver <[hidden email]>
To: [hidden email]
Sent: Thu, September 2, 2010 9:59:49 PM
Subject: Re: [SPSSX-L] Name Normalization

Kevan Edwards (MDH) wrote:

>
> Hello all...
>
> Is anyone aware of a process, or a data file that can be use to normalize
> first names?
>
> My goal is to be able to de-duplicate a data file that was put together
> from several sources of data by converting all instances or Bill, Billy,
> Willy, William, to William and all instances of Rob, Bob, Bobby, Robby,
> Robbie, Robert to Robert.
>
> I envision using "IF" "THEN" syntax structures pointing to a data file
> with two variables, first the specific instance of the first name and
> second the normalized (standardized) format of that name.
>
> However, I need to find the data file with common variations and a
> normalized version of first names and I haven't been able to find one to
> assist the automation of this process..
>
> Thanks.
>
> Kevan
>
>

I'd start by converting everything to uppercase (see the UPCASE function).

Are William and Robert (and their variations) the only names you need to
worry about? Or were they just given as examples to illustrate the problem?
If they are the only ones, or even if the number of names is relatively
small, I'd probably just use RECODE, like this:

RECODE name
("BILL" "BILLY" "WILLY" "WILLIAM" = "WILLIAM")
("ROB" "BOB" "BOBBY" "ROBBY" "ROBBIE" "ROBERT" = "ROBERT) .
EXE.

Another option you could explore is creating a separate file to serve as a
look-up table. E.g.,

Name Name2
"BILL" "WILLIAM"
"BILLY" "WILLIAM"
"WILLY" "WILLIAM"
"WILLIAM" "WILLIAM"
"ROB" "ROBERT"
"BOB" "ROBERT"
"BOBBY" "ROBERT"
"ROBBY" "ROBERT"
"ROBBIE" "ROBERT"
"ROBERT" "ROBERT"

Use MATCH FILES to bring variable NAME2 into the original file (matching on
variable NAME), and where NAME2 does not equal NAME (in the original file),
replace NAME with NAME2. I.e.,

match files
file = 'original' /
table = 'lookuptable' /
by NAME .
exe.

if (name NE name2) name = name2.

This second method would probably be neater if there are a lot of names to
worry about. And note that both data sets would have to be sorted on NAME
before doing the match.

HTH.

-----
--
Bruce Weaver
[hidden email]
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

NOTE: My Hotmail account is not monitored regularly.
To send me an e-mail, please use the address shown above.

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Merging-files-add-variables-tp2801151p2801372.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Jon K Peck

Re: Merging files, add variables

In reply to this post by Osterlind, Steven J.

The following syntax does what I think you are asking. I don't know why you are not getting the expected result.

* table lookup dataset.
data list free /key value.
begin data.
1 100
2 200
4 400
end data.
dataset name lookup.

* main dataset with duplicate key.
data list free/ key data.
begin data.
1 123
1 456
2 789
3 012
4 1234
end data.
dataset name main.

DATASET ACTIVATE main.
MATCH FILES /FILE=*
/TABLE='lookup'
/BY key.
list.
* output:
key data value

1.00 123.00 100.00
1.00 456.00 100.00
2.00 789.00 200.00
3.00 12.00 .
4.00 1234.00 400.00

Regards,
Jon Peck

Jon Peck
SPSS, an IBM Company
[hidden email]
312-651-3435

From:	"Osterlind, Steven J." <[hidden email]>
To:	[hidden email]
Date:	09/02/2010 12:17 PM
Subject:	[SPSSX-L] Merging files, add variables
Sent by:	"SPSSX(r) Discussion" <[hidden email]>

A perplexing problem:

Two data sets: Set 1 & Set 2. Set 1 is records for individual, and many individuals have two or more records. Set 2 is addresses, one for each individual. So, I want to merge the addresses of Set 2 with the into Set 1 (Mergeàadd variables). SSN appears in both data sets and is my keyed variable. However, when I merge the data sets, the address of Set 1 appears only for the first occurrence of a given SSN. I want to addresses to appear at every occurrence of a given SSN. I’ve tried all combinations I can think of with Key table/variable.
Thanks for assistance,

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Steven J. Osterlind, Ph.D., Professor
Measurement & Statistics
University of Missouri
(573) 268-7514 - cell
[hidden email]

Jennifer S. Randolph

Re: Merging files, add variables

In reply to this post by Osterlind, Steven J.

Steven-

I run into a similar situation for a database I work with. I often have to merge genetic data onto every research visit for a given person in our main database. The genetic information is static, so I want it to appear on every one of (often) multiple visits. In my experience, the syntax people have offered does not work if say, in your case, the address variable already exists in the dataset you are seeking to add info onto. So, for example, as we run genetics tests and I have to update this field on our main database, I cannot use that syntax since the genetics fields already exist, and in some cases already have data. Also, I do not know how to make the "update" command work for a one to many situation... the only workaround I have found is to keep a cumulative file of all genetics results, and every time I have to update, i strip off all the genetics varaibles on the main dataset, and then run the merge with the syntax people have offered here.

I am not an adept SPSS user, having dome my graduate work on SAS, so I would love any thoughts experienced SPSS users have on this-- and I am not sure this is your problem, but wondering if it is...
Jennifer

"Osterlind, Steven J." <[hidden email]>
To:
[hidden email]
Date:
09/02/2010 12:17 PM
Subject:
[SPSSX-L] Merging files, add variables
Sent by:
"SPSSX(r) Discussion" <[hidden email]>

A perplexing problem:

Two data sets: Set 1 & Set 2. Set 1 is records for individual, and many
individuals have two or more records. Set 2 is addresses, one for each
individual. So, I want to merge the addresses of Set 2 with the into Set 1
(Mergeàadd variables). SSN appears in both data sets and is my keyed
variable. However, when I merge the data sets, the address of Set 1
appears only for the first occurrence of a given SSN. I want to addresses
to appear at every occurrence of a given SSN. I?ve tried all
combinations I can think of with Key table/variable.
Thanks for assistance,
--- end of quote ---

Jennifer Stone Randolph, M.Sc.
Brain Imaging Laboratory
Department of Psychiatry

Confidentiality Notice: This e-mail transmission is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution violates confidentiality and privacy laws and is prohibited. If you are not the intended recipient, please contact the sender immediately and destroy all copies of the message. Thank you for your cooperation.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Barnett, Adrian (DECD)

Re: Name Normalization

In reply to this post by Kevan Edwards (MDH)

Hi Kevan

If your file has <10,000 records you can use the free version of LinkageWiz to do the de-duplication for you.

It has a table built in which has all the variations on each name (it refers to these as “nicknames”) which it uses as part of its fuzzy matching routine (it also uses a NYSIIS or a Soundex match, and optionally will use a string similarity measure as well). Exact matches, nickname matches and phonetic matches all get different weights in computing a match score.

You can find out more at www.linkagewiz.com

If you have more than 10,000 cases, I’d recommend looking at FEBRL, which is free. You can find out more about FEBRL here:

http://datamining.anu.edu.au/software/febrl/febrldoc/

What you are trying to do can in principle be done in SPSS (sort of), but it would be very hard to do it well, and would probably take more time than you have.

Adrian Barnett

Project Officer

Educational Measurement and Analysis

Data and Educational Measurement

DECS

ph 82261080

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Edwards, Kevan (MDH)
Sent: Friday, 3 September 2010 4:40 AM
To: [hidden email]
Subject: Name Normalization

Hello all...

Is anyone aware of a process, or a data file that can be use to normalize first names?

However, I need to find the data file with common variations and a normalized version of first names and I haven’t been able to find one to assist the automation of this process..

Thanks.

Kevan

----------------------------

Kevan Edwards Ph.D.

Research Scientist III
Health Economics Program, DHP/MDH

651-201-3551

Raffe, Sydelle, SSA

Re: Name Normalization

How about address normalization software?

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Barnett, Adrian (DECS)
Sent: Sunday, September 05, 2010 9:43 PM
To: [hidden email]
Subject: Re: Name Normalization

Hi Kevan

If your file has <10,000 records you can use the free version of LinkageWiz to do the de-duplication for you.

You can find out more at www.linkagewiz.com

If you have more than 10,000 cases, I’d recommend looking at FEBRL, which is free. You can find out more about FEBRL here:

http://datamining.anu.edu.au/software/febrl/febrldoc/

What you are trying to do can in principle be done in SPSS (sort of), but it would be very hard to do it well, and would probably take more time than you have.

Adrian Barnett

Project Officer

Educational Measurement and Analysis

Data and Educational Measurement

DECS

ph 82261080

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Edwards, Kevan (MDH)
Sent: Friday, 3 September 2010 4:40 AM
To: [hidden email]
Subject: Name Normalization

Hello all...

Is anyone aware of a process, or a data file that can be use to normalize first names?

However, I need to find the data file with common variations and a normalized version of first names and I haven’t been able to find one to assist the automation of this process..

Thanks.

Kevan

----------------------------

Kevan Edwards Ph.D.

Research Scientist III
Health Economics Program, DHP/MDH

651-201-3551

Heidi Green

Re: Name Normalization

My company uses several Address Standardization Software packages.

I have experience with Melissa Data’s “Mailers+4” software, which I am quite happy with. It standardizes the addresses to postal specifications (at least for the U.S. & Canada), adds other geographic information such as a latitude / longitude or census information. It can also identify duplicates based on addresses or name + address.

You can learn more about them at:

http://www.melissadata.com/

Hope that helps,

-Heidi

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Raffe, Sydelle, SSA
Sent: Wednesday, September 08, 2010 1:28 PM
To: [hidden email]
Subject: Re: Name Normalization

How about address normalization software?

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Barnett, Adrian (DECS)
Sent: Sunday, September 05, 2010 9:43 PM
To: [hidden email]
Subject: Re: Name Normalization

Hi Kevan

If your file has <10,000 records you can use the free version of LinkageWiz to do the de-duplication for you.

You can find out more at www.linkagewiz.com

If you have more than 10,000 cases, I’d recommend looking at FEBRL, which is free. You can find out more about FEBRL here:

http://datamining.anu.edu.au/software/febrl/febrldoc/

What you are trying to do can in principle be done in SPSS (sort of), but it would be very hard to do it well, and would probably take more time than you have.

Adrian Barnett

Project Officer

Educational Measurement and Analysis

Data and Educational Measurement

DECS

ph 82261080

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Edwards, Kevan (MDH)
Sent: Friday, 3 September 2010 4:40 AM
To: [hidden email]
Subject: Name Normalization

Hello all...

Is anyone aware of a process, or a data file that can be use to normalize first names?

However, I need to find the data file with common variations and a normalized version of first names and I haven’t been able to find one to assist the automation of this process..

Thanks.

Kevan

----------------------------

Kevan Edwards Ph.D.

Research Scientist III
Health Economics Program, DHP/MDH

651-201-3551

____________
DefenderMX2.

Barnett, Adrian (DECD)

Re: Name Normalization

In reply to this post by Raffe, Sydelle, SSA

Both the packages I mentioned do address standardization, and, if you have a geocoded address reference file, will also geocode addresses against that file.

Adrian Barnett

Project Officer

Educational Measurement and Analysis

Data and Educational Measurement

DECS

ph 82261080

From: Raffe, Sydelle, SSA [mailto:[hidden email]]
Sent: Thursday, 9 September 2010 5:58 AM
To: Barnett, Adrian (DECS); [hidden email]
Subject: RE: Name Normalization

How about address normalization software?

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Barnett, Adrian (DECS)
Sent: Sunday, September 05, 2010 9:43 PM
To: [hidden email]
Subject: Re: Name Normalization

Hi Kevan

If your file has <10,000 records you can use the free version of LinkageWiz to do the de-duplication for you.

You can find out more at www.linkagewiz.com

If you have more than 10,000 cases, I’d recommend looking at FEBRL, which is free. You can find out more about FEBRL here:

http://datamining.anu.edu.au/software/febrl/febrldoc/

What you are trying to do can in principle be done in SPSS (sort of), but it would be very hard to do it well, and would probably take more time than you have.

Adrian Barnett

Project Officer

Educational Measurement and Analysis

Data and Educational Measurement

DECS

ph 82261080

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Edwards, Kevan (MDH)
Sent: Friday, 3 September 2010 4:40 AM
To: [hidden email]
Subject: Name Normalization

Hello all...

Is anyone aware of a process, or a data file that can be use to normalize first names?

However, I need to find the data file with common variations and a normalized version of first names and I haven’t been able to find one to assist the automation of this process..

Thanks.

Kevan

----------------------------

Kevan Edwards Ph.D.

Research Scientist III
Health Economics Program, DHP/MDH

651-201-3551