SPSSX Discussion

Name Parsing Problem

Brian Moore-3

Name Parsing Problem

new file.
data list free /id(f4) name_line(a35).
begin data
1234 'Sally S Smith, CPA'
1233 'Sally Smith'
1232 'John Q Public'
1231 'John Smith CPA'
1230 'John W Jones, CPA, CFP '
end data.
dataset name A.

Goal is to:
1 - Parse the name line into V1 to Vx in order presented (using any
combination of space or comma as a single delimiter)
2- Put reasonable tests in place to identify likely last name field. For
example for each Vx to
a- count number of characters in each
b- return a variable indicating case (all upcase, all lower case, mix)
c- include an indicator for if space or comma was used to create the
following delimited field

Thanks for any suggestions.

Regards,
Brian

Brian Moore
Market Research Manager
WorldatWork
14040 N. Northsight Blvd.
Scottsdale, AZ 85260
Direct Line: (480-348-7232)
E-mail: ([hidden email])

WorldatWork(r)
The Total Rewards Association(tm)

UPCOMING CONFERENCES:
Work-Life Conference & Exhibition <http://www.worldatwork.org/worklife2007>
Presented by Alliance for Work-Life Progress and WorldatWork
Feb. 21-23, 2007 - Phoenix, AZ
WorldatWork Total Rewards Conference & Exhibition <http://www.worldatwork.org/orlando2007>
May 6-9, 2007 - Orlando, FL

Mike P-5

Re: Name Parsing Problem

If you have programmability installed, I'd take a look at the
NYSIIS function, as it looks specifically at surname identification.

I can't remember which module it is in (I'm sure Jon Peck does, though;
after all, he wrote the module ;-)

Mike

Peck, Jon

Re: Name Parsing Problem

NYSIIS or Soundex will give you versions of a name coding that removes common spelling variations etc. Functions for these can be found in the extendedTransforms module on SPSS Developer Central (www.spss.com/devcentral) for use with programmability and SPSS 15, but for this problem, the best tool short of special-purpose software is regular expressions. There is a regular expression search and replace function to be found also in that module, but here is a little function tuned for this particular purpose.

Here is the function definition followed by explanatory comments. (since text sometimes gets inopportune wrapping, I can email this as an attachment to anyone who wants it.)

--- start of function

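import re

def guessSurname(name):
    """Apply heuristic extraction methods to guess the last name from a name field.

    name is the string to test."""

    # keep only the text before the first comma or digit
    step1 = re.match(r"[^,0-9]*", name).group()
    # convert periods to blanks
    step2 = re.sub(r"\.", " ", step1)
    # split on runs of blanks
    parts = re.split(" +", step2)

    # scan the parts backwards for the first plausible surname
    for i in range(len(parts)):
        guess = parts[-1-i]
        if len(guess) > 1 and not (guess == guess.upper() or guess.upper() in ["JR", "III"]):
            return guess
    else:
        # reached only when no part qualified
        return ""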

--- end of function

This function accepts a string and attempts to extract the surname.
First, it uses a regular expression to keep only the text before the first comma or digit.
Next, it converts periods to blanks.
Then it splits the blank-separated parts into a list.
Finally it goes through the list backwards from the end and returns the first part it finds that is not in all uppercase, is at least two characters long, and does not match JR or III. If nothing qualifies, the result is empty.

It gives the right answer for all of the examples in the original post, but it could certainly be made more sophisticated by adding other tests.

To use this, assuming you have SPSS 15 with programmability, and the spssaux, spssdata, and trans modules downloaded from Developer Central and this function saved in surname.py, you could do this, where "name" is the variable you are extracting from.

BEGIN PROGRAM.
import spss, trans, surname
t = trans.Tfunction()
t.append(surname.guessSurname, "surname", "A20", ["name"])
t.execute()
END PROGRAM.

This creates a new string variable named surname containing the guess based on the string in name, looping over all the cases in the dataset.

-Jon Peck
SPSS

Richard Ristow

Re: Name Parsing Problem

In reply to this post by Brian Moore-3

Granting the importance of Python facilities, a lot of your problems
are with your algorithm - how do you DEFINE your answer - rather than
with tools. Here are some thoughts, in native SPSS. Test data is from
your post; I'm not repeating it.

At 11:35 AM 2/9/2007, Brian Moore wrote:

>1 - Parse the name line into V1 to Vx in order presented (using any
>combination of space or comma as a single delimiter)

SPSS 15 draft output:

STRING V01 TO V06 (A10).

STRING #WkgName (A35) /* Text being parsed, from which */
/* text is removed as it is parsed. */ .

NUMERIC #Where (F03) /* Index of some place in the string */ .
STRING #Word (A10) /* One parsed-out word */ .

* The following takes any number of spaces plus up to one comma .
* as a delimiter. .

COMPUTE #WkgName = name_line.

VECTOR WORDS = V01 TO V06.
LOOP #WORDNUM = 1 TO 6.
. COMPUTE #Where = INDEX (#WkgName, ', ', 1).
. COMPUTE #Word = SUBSTR(#WkgName,1,#Where-1).
. COMPUTE WORDS(#WORDNUM) = #Word.
. COMPUTE #WkgName = LTRIM(SUBSTR(#WkgName,#Where+1)).
END LOOP IF #WkgName EQ ' '.

TEMPORARY.
STRING SP_1 (A20).
LIST ID NAME_LINE SP_1
V01 TO V06.

List
|-----------------------------|---------------------------|
|Output Created |10-FEB-2007 11:41:17 |
|-----------------------------|---------------------------|
[Parsed1]

The variables are listed in the following order:

LINE 1: id name_line SP_1
LINE 2: V01 V02 V03 V04 V05 V06

id: 1234 Sally S Smith, CPA
V01: Sally S Smith CPA

id: 1233 Sally Smith
V01: Sally Smith

id: 1232 John Q Public
V01: John Q Public

id: 1231 John Smith CPA
V01: John Smith CPA

id: 1230 John W Jones, CPA, CFP
V01: John W Jones CPA CFP

id: 1299 Alfred E. Neuman CPA, CFP
V01: Alfred E. Neuman CPA CFP

Number of cases read: 6 Number of cases listed: 6
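
For comparison, here is a minimal Python sketch of the same word-by-word parse that also records the per-word attributes requested in item 2 (length, case, and whether a comma was part of the delimiter that follows). The function name parse_name_line and the dictionary keys are hypothetical, chosen only for illustration, and any run of spaces and commas is treated as a single delimiter. It is not a translation of the SPSS loop above, just one way the idea might look.

```python
import re

def parse_name_line(name_line):
    """Split a name line into words and describe each word.

    Hypothetical helper: any run of spaces and/or commas counts as one
    delimiter, as requested in the original post.
    """
    tokens = []
    # Each match is one word followed by the (possibly empty) delimiter run.
    for m in re.finditer(r"([^ ,]+)([ ,]*)", name_line.strip()):
        word, delim = m.group(1), m.group(2)
        if word == word.upper():
            case = "upper"
        elif word == word.lower():
            case = "lower"
        else:
            case = "mixed"
        tokens.append({
            "word": word,
            "length": len(word),          # item 2a: number of characters
            "case": case,                 # item 2b: upper, lower or mixed
            "comma_follows": "," in delim,  # item 2c: comma in next delimiter
        })
    return tokens

for tok in parse_name_line("John W Jones, CPA, CFP "):
    print(tok)
```

A list of dictionaries is used here only because the number of words varies from case to case; writing the results back to fixed variables such as V01 to V06 would need an upper limit on the word count, as in the SPSS code.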

>2- Put reasonable tests in place to identify likely last name field.
>For example for each Vx to
>a- count number of characters in each
>b- return a variable indicating case (all upcase, all lower case, mix)
>
>c- include an indicator for if space or comma was used to create the
>following delimited field

Here we get into algorithms. Jon Peck suggests identifying last name as
the last word in the string which is
(a) Not all uppercase
(b) Not 'Jr' or 'III', case not important.

I've implemented this in native SPSS, too. It does rather illustrate
Jon's point about the advantages of Python. But I'll post it, if you
don't have SPSS 15.

Other useful heuristics could be
d- Anything after a comma is NOT part of the last name
e- Recognized titles (CPA, CFP) are not part of the last name

It's not much harder in native SPSS than in Python. Python's main
advantage is that it can parse into words very easily:

    step1 = re.match("[^,0-9]*", name).group()
    step2 = re.sub("\.", " ", step1)
    parts = re.split(" +", step2)
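
As a rough illustration of heuristics d and e in Python, the sketch below skips recognized titles and suffixes by lookup instead of relying on an all-uppercase test, so a name keyed entirely in capitals can still yield a surname. The function name guess_surname_ext and the contents of SUFFIXES and TITLES are assumptions made for this example (only JR, III, CPA and CFP come from this thread); it is a variation on Jon's approach, not his function.

```python
# Hypothetical lookup sets; extend them to suit the data.
SUFFIXES = {"JR", "SR", "II", "III", "IV"}
TITLES = {"CPA", "CFP"}

def guess_surname_ext(name):
    """Guess a surname using a comma cut-off plus title and suffix lists."""
    # Heuristic d: anything after the first comma is not part of the name.
    before_comma = name.split(",", 1)[0]
    # Treat periods as blanks, then split on whitespace.
    parts = before_comma.replace(".", " ").split()
    # Walk backwards, skipping initials, suffixes and recognized titles
    # (heuristic e).
    for part in reversed(parts):
        key = part.upper()
        if len(part) > 1 and key not in SUFFIXES and key not in TITLES:
            return part
    return ""

for line in ["Sally S Smith, CPA", "John Smith CPA", "Alfred E. Neuman CPA, CFP"]:
    print(line, "->", guess_surname_ext(line))
```

The trade-off is that every title has to be listed explicitly; an unlisted credential such as a hypothetical "MBA" would be returned as the surname, which the all-uppercase test would have avoided.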

-Good luck,
Richard