|
A perplexing problem: Two data sets: Set 1 & Set 2. Set 1 is records for individual, and many individuals have two or more records. Set 2 is addresses, one for each individual. So, I want to merge the addresses of Set 2 with the into Set 1 (Mergeàadd variables). SSN appears in both data sets and is my keyed variable. However, when I merge the data sets, the address of Set 1 appears only for the first occurrence of a given SSN. I want to addresses to appear at every occurrence of a given SSN. I’ve tried all combinations I can think of with Key table/variable. Thanks for assistance, . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Steven J. Osterlind, Ph.D., Professor Measurement & Statistics University of Missouri (573) 268-7514 - cell |
|
Steven,
Is this the syntax that you are using Match files file=set1_individuals/table=set2_addresses/by ssn. AND both filesa are sorted already by SSN? Gene Maguin ________________________________ From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Osterlind, Steven J. Sent: Thursday, September 02, 2010 2:07 PM To: [hidden email] Subject: Merging files, add variables A perplexing problem: Two data sets: Set 1 & Set 2. Set 1 is records for individual, and many individuals have two or more records. Set 2 is addresses, one for each individual. So, I want to merge the addresses of Set 2 with the into Set 1 (Mergeàadd variables). SSN appears in both data sets and is my keyed variable. However, when I merge the data sets, the address of Set 1 appears only for the first occurrence of a given SSN. I want to addresses to appear at every occurrence of a given SSN. Ive tried all combinations I can think of with Key table/variable. Thanks for assistance, . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Steven J. Osterlind, Ph.D., Professor Measurement & Statistics University of Missouri (573) 268-7514 - cell [hidden email] ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
hm, interesting... I dealt with this problem using LAG function to fill in the values of the merged in variable. But will try Gene's solution next time.
Steven, please let us know if it works, will you? If not, i will send you the LAG syntax that i use. bozena ----- Original Message ----- From: "Gene Maguin" <[hidden email]> To: [hidden email] Sent: Thursday, September 2, 2010 11:16:17 AM GMT -08:00 US/Canada Pacific Subject: Re: Merging files, add variables Steven, Is this the syntax that you are using Match files file=set1_individuals/table=set2_addresses/by ssn. AND both filesa are sorted already by SSN? Gene Maguin ________________________________ From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Osterlind, Steven J. Sent: Thursday, September 02, 2010 2:07 PM To: [hidden email] Subject: Merging files, add variables A perplexing problem: Two data sets: Set 1 & Set 2. Set 1 is records for individual, and many individuals have two or more records. Set 2 is addresses, one for each individual. So, I want to merge the addresses of Set 2 with the into Set 1 (Merge� add variables). SSN appears in both data sets and is my keyed variable. However, when I merge the data sets, the address of Set 1 appears only for the first occurrence of a given SSN. I want to addresses to appear at every occurrence of a given SSN. I’ve tried all combinations I can think of with Key table/variable. Thanks for assistance, . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Steven J. Osterlind, Ph.D., Professor Measurement & Statistics University of Missouri (573) 268-7514 - cell [hidden email] ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
For what it's worth. if you can move these files into a database (e.g., Access) a match like this is a quick and easy many-to-one using SQL. One of my wishes is to be able to run SQL within SPSS without jumping through a bunch of hoops.
John From: Bozena Zdaniuk <[hidden email]> To: [hidden email] Sent: Thu, September 2, 2010 2:27:18 PM Subject: Re: Merging files, add variables hm, interesting... I dealt with this problem using LAG function to fill in the values of the merged in variable. But will try Gene's solution next time. Steven, please let us know if it works, will you? If not, i will send you the LAG syntax that i use. bozena ----- Original Message ----- From: "Gene Maguin" <[hidden email]> To: [hidden email] Sent: Thursday, September 2, 2010 11:16:17 AM GMT -08:00 US/Canada Pacific Subject: Re: Merging files, add variables Steven, Is this the syntax that you are using Match files file=set1_individuals/table=set2_addresses/by ssn. AND both filesa are sorted already by SSN? Gene Maguin ________________________________ From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Osterlind, Steven J. Sent: Thursday, September 02, 2010 2:07 PM To: [hidden email] Subject: Merging files, add variables A perplexing problem: Two data sets: Set 1 & Set 2. Set 1 is records for individual, and many individuals have two or more records. Set 2 is addresses, one for each individual. So, I want to merge the addresses of Set 2 with the into Set 1 (Mergeà add variables). SSN appears in both data sets and is my keyed variable. However, when I merge the data sets, the address of Set 1 appears only for the first occurrence of a given SSN. I want to addresses to appear at every occurrence of a given SSN. I’ve tried all combinations I can think of with Key table/variable. Thanks for assistance, . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Steven J. Osterlind, Ph.D., Professor Measurement & Statistics University of Missouri (573) 268-7514 - cell [hidden email] ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
Hello all... Is anyone aware of a process, or a data file that can be use to
normalize first names? My goal is to be able to de-duplicate a data file that was put together
from several sources of data by converting all instances or Bill, Billy, Willy,
William, to William and all instances of Rob, Bob, Bobby, Robby, Robbie,
Robert to Robert. I envision using “IF” “THEN” syntax structures pointing to a data file
with two variables, first the specific instance of the first name and second the
normalized (standardized) format of that name. However, I need to find the data file with common variations and a
normalized version of first names and I haven’t been able to find one to assist
the automation of this process.. Thanks. Kevan ---------------------------- Kevan Edwards Ph.D. Research Scientist III 651-201-3551 |
|
Administrator
|
I'd start by converting everything to uppercase (see the UPCASE function). Are William and Robert (and their variations) the only names you need to worry about? Or were they just given as examples to illustrate the problem? If they are the only ones, or even if the number of names is relatively small, I'd probably just use RECODE, like this: RECODE name ("BILL" "BILLY" "WILLY" "WILLIAM" = "WILLIAM") ("ROB" "BOB" "BOBBY" "ROBBY" "ROBBIE" "ROBERT" = "ROBERT) . EXE. Another option you could explore is creating a separate file to serve as a look-up table. E.g., Name Name2 "BILL" "WILLIAM" "BILLY" "WILLIAM" "WILLY" "WILLIAM" "WILLIAM" "WILLIAM" "ROB" "ROBERT" "BOB" "ROBERT" "BOBBY" "ROBERT" "ROBBY" "ROBERT" "ROBBIE" "ROBERT" "ROBERT" "ROBERT" Use MATCH FILES to bring variable NAME2 into the original file (matching on variable NAME), and where NAME2 does not equal NAME (in the original file), replace NAME with NAME2. I.e., match files file = 'original' / table = 'lookuptable' / by NAME . exe. if (name NE name2) name = name2. This second method would probably be neater if there are a lot of names to worry about. And note that both data sets would have to be sorted on NAME before doing the match. HTH.
--
Bruce Weaver bweaver@lakeheadu.ca http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." PLEASE NOTE THE FOLLOWING: 1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. 2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
|
Hi, You can do that using a part of the Febrl package: http://datamining.anu.edu.au/software/febrl/febrldoc/node8.html I used Febrl before, but not for name standardization, only for its main purpose, probabilistic data linkage. But it's possible to use only pyStandard.py for the standardization and not use other program features. Albert-Jan ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From: Bruce Weaver <[hidden email]> To: [hidden email] Sent: Thu, September 2, 2010 9:59:49 PM Subject: Re: [SPSSX-L] Name Normalization Kevan Edwards (MDH) wrote: > > Hello all... > > Is anyone aware of a process, or a data file that can be use to normalize > first names? > > My goal is to be able to de-duplicate a data file that was put together > from several sources of data by converting all instances or Bill, Billy, > Willy, William, to William and all instances of Rob, Bob, Bobby, Robby, > Robbie, Robert to Robert. > > I envision using "IF" "THEN" syntax structures pointing to a data file > with two variables, first the specific instance of the first name and > second the normalized (standardized) format of that name. > > However, I need to find the data file with common variations and a > normalized version of first names and I haven't been able to find one to > assist the automation of this process.. > > Thanks. > > Kevan > > I'd start by converting everything to uppercase (see the UPCASE function). Are William and Robert (and their variations) the only names you need to worry about? Or were they just given as examples to illustrate the problem? If they are the only ones, or even if the number of names is relatively small, I'd probably just use RECODE, like this: RECODE name ("BILL" "BILLY" "WILLY" "WILLIAM" = "WILLIAM") ("ROB" "BOB" "BOBBY" "ROBBY" "ROBBIE" "ROBERT" = "ROBERT) . EXE. Another option you could explore is creating a separate file to serve as a look-up table. E.g., Name Name2 "BILL" "WILLIAM" "BILLY" "WILLIAM" "WILLY" "WILLIAM" "WILLIAM" "WILLIAM" "ROB" "ROBERT" "BOB" "ROBERT" "BOBBY" "ROBERT" "ROBBY" "ROBERT" "ROBBIE" "ROBERT" "ROBERT" "ROBERT" Use MATCH FILES to bring variable NAME2 into the original file (matching on variable NAME), and where NAME2 does not equal NAME (in the original file), replace NAME with NAME2. I.e., match files file = 'original' / table = 'lookuptable' / by NAME . exe. if (name NE name2) name = name2. This second method would probably be neater if there are a lot of names to worry about. And note that both data sets would have to be sorted on NAME before doing the match. HTH. ----- -- Bruce Weaver [hidden email] http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." NOTE: My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. -- View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Merging-files-add-variables-tp2801151p2801372.html Sent from the SPSSX Discussion mailing list archive at Nabble.com. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
In reply to this post by Osterlind, Steven J.
The following syntax does what I think you are asking. I don't know why you are not getting the expected result. * table lookup dataset. data list free /key value. begin data. 1 100 2 200 4 400 end data. dataset name lookup. * main dataset with duplicate key. data list free/ key data. begin data. 1 123 1 456 2 789 3 012 4 1234 end data. dataset name main. DATASET ACTIVATE main. MATCH FILES /FILE=* /TABLE='lookup' /BY key. list. * output: key data value 1.00 123.00 100.00 1.00 456.00 100.00 2.00 789.00 200.00 3.00 12.00 . 4.00 1234.00 400.00 Regards, Jon Peck Jon Peck SPSS, an IBM Company [hidden email] 312-651-3435
A perplexing problem: Two data sets: Set 1 & Set 2. Set 1 is records for individual, and many individuals have two or more records. Set 2 is addresses, one for each individual. So, I want to merge the addresses of Set 2 with the into Set 1 (Mergeàadd variables). SSN appears in both data sets and is my keyed variable. However, when I merge the data sets, the address of Set 1 appears only for the first occurrence of a given SSN. I want to addresses to appear at every occurrence of a given SSN. I’ve tried all combinations I can think of with Key table/variable. Thanks for assistance, . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Steven J. Osterlind, Ph.D., Professor Measurement & Statistics University of Missouri (573) 268-7514 - cell [hidden email] |
|
In reply to this post by Osterlind, Steven J.
Steven-
I run into a similar situation for a database I work with. I often have to merge genetic data onto every research visit for a given person in our main database. The genetic information is static, so I want it to appear on every one of (often) multiple visits. In my experience, the syntax people have offered does not work if say, in your case, the address variable already exists in the dataset you are seeking to add info onto. So, for example, as we run genetics tests and I have to update this field on our main database, I cannot use that syntax since the genetics fields already exist, and in some cases already have data. Also, I do not know how to make the "update" command work for a one to many situation... the only workaround I have found is to keep a cumulative file of all genetics results, and every time I have to update, i strip off all the genetics varaibles on the main dataset, and then run the merge with the syntax people have offered here. I am not an adept SPSS user, having dome my graduate work on SAS, so I would love any thoughts experienced SPSS users have on this-- and I am not sure this is your problem, but wondering if it is... Jennifer "Osterlind, Steven J." <[hidden email]> To: [hidden email] Date: 09/02/2010 12:17 PM Subject: [SPSSX-L] Merging files, add variables Sent by: "SPSSX(r) Discussion" <[hidden email]> A perplexing problem: Two data sets: Set 1 & Set 2. Set 1 is records for individual, and many individuals have two or more records. Set 2 is addresses, one for each individual. So, I want to merge the addresses of Set 2 with the into Set 1 (Mergeàadd variables). SSN appears in both data sets and is my keyed variable. However, when I merge the data sets, the address of Set 1 appears only for the first occurrence of a given SSN. I want to addresses to appear at every occurrence of a given SSN. I?ve tried all combinations I can think of with Key table/variable. Thanks for assistance, --- end of quote --- Jennifer Stone Randolph, M.Sc. Brain Imaging Laboratory Department of Psychiatry Confidentiality Notice: This e-mail transmission is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution violates confidentiality and privacy laws and is prohibited. If you are not the intended recipient, please contact the sender immediately and destroy all copies of the message. Thank you for your cooperation. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
In reply to this post by Kevan Edwards (MDH)
Hi Kevan If your file has <10,000 records you
can use the free version of LinkageWiz to do the de-duplication for you. It has a table built in which has all the
variations on each name (it refers to these as “nicknames”) which
it uses as part of its fuzzy matching routine (it also uses a NYSIIS or a
Soundex match, and optionally will use a string similarity measure as well).
Exact matches, nickname matches and phonetic matches all get different weights
in computing a match score. You can find out more at www.linkagewiz.com If you have more than 10,000 cases, I’d
recommend looking at FEBRL, which is free. You can find out more about FEBRL
here: http://datamining.anu.edu.au/software/febrl/febrldoc/ What you are trying to do can in principle
be done in SPSS (sort of), but it would be very hard to do it well, and would
probably take more time than you have. Adrian
Barnett Project Officer Educational
Measurement and Analysis Data and
Educational Measurement DECS ph 82261080 From:
SPSSX(r) Discussion [mailto: Hello all... Is anyone aware of a process, or a data file that can
be use to normalize first names? My goal is to be able to de-duplicate a data file that
was put together from several sources of data by converting all instances or
Bill, Billy, Willy, William, to William and all instances of Rob, Bob,
Bobby, Robby, Robbie, Robert to Robert. I envision using “IF” “THEN”
syntax structures pointing to a data file with two variables, first the
specific instance of the first name and second the normalized (standardized)
format of that name. However, I need to find the data file with common
variations and a normalized version of first names and I haven’t been
able to find one to assist the automation of this process.. Thanks. Kevan ---------------------------- Kevan Edwards Ph.D. Research Scientist III 651-201-3551 |
|
How about address normalization software? From: SPSSX(r)
Discussion [mailto:[hidden email]] On
Behalf Of Barnett, Adrian (DECS) Hi Kevan If your file has
<10,000 records you can use the free version of LinkageWiz to do the
de-duplication for you. It has a table built in
which has all the variations on each name (it refers to these as “nicknames”)
which it uses as part of its fuzzy matching routine (it also uses a NYSIIS or a
Soundex match, and optionally will use a string similarity measure as well).
Exact matches, nickname matches and phonetic matches all get different weights
in computing a match score. You can find out more at www.linkagewiz.com If you have more than
10,000 cases, I’d recommend looking at FEBRL, which is free. You can find out
more about FEBRL here: http://datamining.anu.edu.au/software/febrl/febrldoc/ What you are trying to do
can in principle be done in SPSS (sort of), but it would be very hard to do it
well, and would probably take more time than you have. Adrian Barnett Project
Officer Educational
Measurement and Analysis Data
and Educational Measurement DECS ph
82261080 From: SPSSX(r)
Discussion [mailto: Hello all... Is anyone aware of a process, or a data file that can be use to
normalize first names? My goal is to be able to de-duplicate a data file that was put together
from several sources of data by converting all instances or Bill, Billy, Willy,
William, to William and all instances of Rob, Bob, Bobby, Robby, Robbie,
Robert to Robert. I envision using “IF” “THEN” syntax structures pointing to a data file
with two variables, first the specific instance of the first name and second
the normalized (standardized) format of that name. However, I need to find the data file with common variations and a normalized
version of first names and I haven’t been able to find one to assist the
automation of this process.. Thanks. Kevan ---------------------------- Kevan Edwards Ph.D. Research Scientist III 651-201-3551 |
|
My company uses several Address Standardization Software packages.
I have experience with Melissa Data’s “Mailers+4”
software, which I am quite happy with. It standardizes the addresses to postal
specifications (at least for the You can learn more about them at: http://www.melissadata.com/ Hope that helps, -Heidi From: SPSSX(r)
Discussion [mailto:[hidden email]] On
Behalf Of Raffe, Sydelle, SSA How about address normalization software? From: SPSSX(r)
Discussion [mailto:[hidden email]] On
Behalf Of Barnett, Adrian (DECS) Hi Kevan If your file has
<10,000 records you can use the free version of LinkageWiz to do the
de-duplication for you. It has a table built in
which has all the variations on each name (it refers to these as
“nicknames”) which it uses as part of its fuzzy matching routine
(it also uses a NYSIIS or a Soundex match, and optionally will use a string
similarity measure as well). Exact matches, nickname matches and phonetic
matches all get different weights in computing a match score. You can find out more at www.linkagewiz.com If you have more than
10,000 cases, I’d recommend looking at FEBRL, which is free. You can find
out more about FEBRL here: http://datamining.anu.edu.au/software/febrl/febrldoc/ What you are trying to do
can in principle be done in SPSS (sort of), but it would be very hard to do it
well, and would probably take more time than you have. Adrian Barnett Project
Officer Educational
Measurement and Analysis Data
and Educational Measurement DECS ph
82261080 From: SPSSX(r)
Discussion [mailto: Hello all... Is anyone aware of a process, or a data file that can be use to
normalize first names? My goal is to be able to de-duplicate a data file that was put together
from several sources of data by converting all instances or Bill, Billy, Willy,
William, to William and all instances of Rob, Bob, Bobby, Robby, Robbie,
Robert to Robert. I envision using “IF” “THEN” syntax structures
pointing to a data file with two variables, first the specific instance of the
first name and second the normalized (standardized) format of that name. However, I need to find the data file with common variations and a
normalized version of first names and I haven’t been able to find one to
assist the automation of this process.. Thanks. Kevan ---------------------------- Kevan Edwards Ph.D. Research Scientist III 651-201-3551 ____________ DefenderMX2. |
|
In reply to this post by Raffe, Sydelle, SSA
Hi Both the packages I mentioned do address standardization,
and, if you have a geocoded address reference file, will also geocode addresses
against that file. Adrian
Barnett Project Officer Educational
Measurement and Analysis Data and
Educational Measurement DECS ph 82261080 From:
How about address
normalization software? From:
SPSSX(r) Discussion [mailto: Hi Kevan If your file has <10,000 records you
can use the free version of LinkageWiz to do the de-duplication for you. It has a table built in which has all the
variations on each name (it refers to these as “nicknames”) which it uses as
part of its fuzzy matching routine (it also uses a NYSIIS or a Soundex match,
and optionally will use a string similarity measure as well). Exact matches,
nickname matches and phonetic matches all get different weights in computing a
match score. You can find out more at www.linkagewiz.com If you have more than 10,000 cases, I’d
recommend looking at FEBRL, which is free. You can find out more about FEBRL
here: http://datamining.anu.edu.au/software/febrl/febrldoc/ What you are trying to do can in principle
be done in SPSS (sort of), but it would be very hard to do it well, and would
probably take more time than you have. Adrian
Barnett Project Officer Educational
Measurement and Analysis Data and
Educational Measurement DECS ph 82261080 From:
SPSSX(r) Discussion [mailto: Hello all... Is anyone aware of a process, or a data file that can
be use to normalize first names? My goal is to be able to de-duplicate a data file that
was put together from several sources of data by converting all instances or Bill,
Billy, Willy, William, to William and all instances of Rob, Bob, Bobby,
Robby, Robbie, Robert to Robert. I envision using “IF” “THEN” syntax structures
pointing to a data file with two variables, first the specific instance of the
first name and second the normalized (standardized) format of that name. However, I need to find the data file with common
variations and a normalized version of first names and I haven’t been able to
find one to assist the automation of this process.. Thanks. Kevan ---------------------------- Kevan Edwards Ph.D. Research Scientist III 651-201-3551 |
| Free forum by Nabble | Edit this page |
