new file.
data list free /id(f4) name_line(a35).
begin data
1234 'Sally S Smith, CPA'
1233 'Sally Smith'
1232 'John Q Public'
1231 'John Smith CPA'
1230 'John W Jones, CPA, CFP '
end data.
dataset name A.

Goal is to:

1 - Parse the name line into V1 to Vx in order presented (using any combination of space or comma as a single delimiter)

2 - Put reasonable tests in place to identify the likely last name field. For example, for each Vx:
a - count the number of characters in each
b - return a variable indicating case (all upcase, all lower case, mix)
c - include an indicator for whether a space or a comma was used to create the following delimited field

Thanks for any suggestions.

Regards,
Brian

Brian Moore
Market Research Manager
WorldatWork
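The goals above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name `parse_name_line` and the dictionary keys are made up for the example and are not part of any SPSS module, and the case test relies on Python's own `str.isupper` / `str.islower`, which ignore periods and digits.

```python
import re

def parse_name_line(name_line):
    """Split a name line into fields and describe each field.

    Any run of spaces and/or commas counts as a single delimiter.
    Returns one dict per field, in the order presented.
    """
    fields = []
    # Each match is one field followed by the delimiter run after it.
    for match in re.finditer(r"([^ ,]+)([ ,]*)", name_line):
        word, delim = match.group(1), match.group(2)
        if word.isupper():
            case = "upper"
        elif word.islower():
            case = "lower"
        else:
            case = "mixed"
        fields.append({
            "text": word,
            "length": len(word),           # goal 2a
            "case": case,                  # goal 2b
            "comma_follows": "," in delim  # goal 2c
        })
    return fields

for field in parse_name_line("John W Jones, CPA, CFP "):
    print(field)
```

Here `comma_follows` is true when the delimiter after a field contains a comma, which is one reading of goal 2c; a field such as `Jones` followed by `, CPA` would be flagged, which fits the idea that the last name often sits just before the first comma.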
If you have the programmability installed, I'd take a look at using the NYSIIS function, as it looks specifically at surname identification. I can't remember which module it is in (I'm sure Jon Peck does though; after all, he wrote the module ;-)

Mike

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Brian Moore
Sent: 09 February 2007 16:36
To: [hidden email]
Subject: Name Parsing Problem
NYSIIS or Soundex will give you a coded version of a name that removes common spelling variations. Functions for these can be found in the extendedTransforms module on SPSS Developer Central (www.spss.com/devcentral) for use with programmability and SPSS 15. For this problem, though, the best tool short of special-purpose software is regular expressions. That module also contains a regular expression search-and-replace function, but here is a little function tuned for this particular purpose.
Here is the function definition followed by explanatory comments. (Since text sometimes gets inopportune wrapping, I can email this as an attachment to anyone who wants it.)

--- start of function
import re

def guessSurname(name):
    """apply heuristic extraction methods to guess last name from a name field.
    name is the string to test."""

    # keep only the text before the first comma or digit
    step1 = re.match(r"[^,0-9]*", name).group()
    # convert periods to blanks
    step2 = re.sub(r"\.", " ", step1)
    # split into blank-separated parts
    parts = re.split(" +", step2)
    # scan the parts from last to first
    for i in range(len(parts)):
        guess = parts[-1-i]
        if (len(guess) > 1 and
                not (guess == guess.upper() or guess.upper() in ["JR", "III"])):
            return guess
    else:
        # the loop finished without finding a qualifying part
        return ""
--- end of function

This function accepts a string and attempts to extract the surname. First, it uses a regular expression to keep only the text before the first comma (or digit). Next, it converts periods to blanks. Then it splits the blank-separated parts into a list. Finally, it goes through the list backwards from the end and returns the first part it finds that is not in all uppercase, is at least two characters long, and does not match JR or III. If nothing qualifies, the result is empty. It gives the right answer for all of the examples in the original post, but it could certainly be made more sophisticated by adding other tests.

To use this, assuming you have SPSS 15 with programmability, the spssaux, spssdata, and trans modules downloaded from Developer Central, and this function saved in surname.py, you could do this, where "name" is the variable you are extracting from.

BEGIN PROGRAM.
import spss, trans, surname
t = trans.Tfunction()
t.append(surname.guessSurname, "surname", "A20", ["name"])
t.execute()
END PROGRAM.

This creates a new string variable named surname containing the guess based on the string in name, looping over all the cases in the dataset.
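For anyone who wants to try the heuristic outside SPSS first, a plain-Python check might look like the sketch below. The function body repeats the logic described above; the `samples` list (the five names from the original post) and the print loop are additions for illustration only and do not use the trans module.

```python
import re

def guessSurname(name):
    """Guess the last name from a name field (same heuristic as above)."""
    step1 = re.match(r"[^,0-9]*", name).group()   # text before first comma/digit
    step2 = re.sub(r"\.", " ", step1)              # periods become blanks
    parts = re.split(" +", step2)                  # blank-separated parts
    for guess in reversed(parts):                  # scan from the end
        if len(guess) > 1 and not (guess == guess.upper()
                                   or guess.upper() in ["JR", "III"]):
            return guess
    return ""

# Sample names taken from the original post.
samples = [
    "Sally S Smith, CPA",
    "Sally Smith",
    "John Q Public",
    "John Smith CPA",
    "John W Jones, CPA, CFP ",
]

for name in samples:
    print(repr(name), "->", repr(guessSurname(name)))
```

One limitation worth knowing: because all-uppercase parts are skipped, a name entered entirely in capitals (for example `JOHN SMITH`) produces an empty result, so data in that form would need a different test.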
-Jon Peck
SPSS

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Michael Pearmain
Sent: Friday, February 09, 2007 10:45 AM
To: [hidden email]
Subject: Re: [SPSSX-L] Name Parsing Problem
In reply to this post by Brian Moore-3
Granting the importance of Python facilities, a lot of your problems are with your algorithm - how do you DEFINE your answer - rather than with tools. Here are some thoughts, in native SPSS. Test data is from your post; I'm not repeating it.

At 11:35 AM 2/9/2007, Brian Moore wrote:

>1 - Parse the name line into V1 to Vx in order presented (using any
>combination of space or comma as a single delimiter)

SPSS 15 draft output:

STRING  V01 TO V06 (A10).
STRING  #WkgName (A35) /* Text being parsed, from which     */
                       /* text is removed as it is parsed.  */ .
NUMERIC #Where   (F03) /* Index of some place in the string */ .
STRING  #Word    (A10) /* One parsed-out word               */ .

*  The following takes any number of spaces plus up to one comma .
*  as a delimiter. .

COMPUTE #WkgName = name_line.
VECTOR  WORDS    = V01 TO V06.
LOOP    #WORDNUM = 1 TO 6.
.  COMPUTE #Where          = INDEX (#WkgName, ', ', 1).
.  COMPUTE #Word           = SUBSTR(#WkgName,1,#Where-1).
.  COMPUTE WORDS(#WORDNUM) = #Word.
.  COMPUTE #WkgName        = LTRIM(SUBSTR(#WkgName,#Where+1)).
END LOOP IF #WkgName EQ ' '.

TEMPORARY.
STRING SP_1 (A20).
LIST ID NAME_LINE SP_1 V01 TO V06.

List
|-----------------------------|---------------------------|
|Output Created               |10-FEB-2007 11:41:17       |
|-----------------------------|---------------------------|
[Parsed1]

The variables are listed in the following order:

LINE   1: id name_line SP_1
LINE   2: V01 V02 V03 V04 V05 V06

      id: 1234 Sally S Smith, CPA
     V01: Sally      S          Smith      CPA

      id: 1233 Sally Smith
     V01: Sally      Smith

      id: 1232 John Q Public
     V01: John       Q          Public

      id: 1231 John Smith CPA
     V01: John       Smith      CPA

      id: 1230 John W Jones, CPA, CFP
     V01: John       W          Jones      CPA        CFP

      id: 1299 Alfred E. Neuman CPA, CFP
     V01: Alfred     E.         Neuman     CPA        CFP

Number of cases read:  6    Number of cases listed:  6

>2- Put reasonable tests in place to identify likely last name field.
>For example for each Vx to
>a- count number of characters in each
>b- return a variable indicating case (all upcase, all lower case, mix)
>
>c- include an indicator for if space or comma was used to create the
>following delimited field

Here we get into algorithms. Jon Peck suggests identifying the last name as the last word in the string which is
(a) Not all uppercase
(b) Not 'Jr' or 'III', case not important.

I've implemented this in native SPSS, too. It does rather illustrate Jon's point about the advantages of Python, but I'll post it if you don't have SPSS 15.

Other useful heuristics could be
d- Anything after a comma is NOT part of the last name
e- Recognized titles (CPA, CFP) are not part of the last name

It's not much harder in native SPSS than in Python. Python's main advantage is that it can parse into words very easily:

step1 = re.match(r"[^,0-9]*", name).group()
step2 = re.sub(r"\.", " ", step1)
parts = re.split(" +", step2)

-Good luck,
 Richard
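Heuristics d and e could be combined in Python along these lines. This is a sketch only: the function name `guess_surname_with_titles` and the `KNOWN_TITLES` set are hypothetical, and the set would need to be extended to match the credentials and suffixes that actually occur in the data. Unlike the all-uppercase test, this variant accepts a surname typed in capitals, at the cost of missing any title that is not in the list.

```python
# Hypothetical list of credentials and suffixes; extend it to suit the data.
KNOWN_TITLES = {"CPA", "CFP", "JR", "SR", "II", "III"}

def guess_surname_with_titles(name_line):
    """Guess the last name using a comma rule and a title list."""
    # Heuristic d: anything after the first comma is not part of the last name.
    before_comma = name_line.split(",", 1)[0]
    # Treat periods as blanks, then split on runs of whitespace.
    words = before_comma.replace(".", " ").split()
    # Heuristic e: scan backwards, skipping recognized titles and initials.
    for word in reversed(words):
        if word.upper() in KNOWN_TITLES or len(word) < 2:
            continue
        return word
    return ""

for name in ["John Smith CPA", "Alfred E. Neuman CPA, CFP", "SALLY SMITH, CPA"]:
    print(repr(name), "->", repr(guess_surname_with_titles(name)))
```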