De-Identifying String Variables...

Jeff-125
I have a somewhat unique (at least to me) string manipulation task.

I'm working with some agency data. I'll simplify my discussion of the
structure of the data to include only the information necessary to
ask my question.

I have 3 string variables: 1) First Name, 2) Last Name, 3) Narrative.
The first two variables are self-explanatory. The last contains text
up to several thousand characters long. Some of that text contains
the first and last names that are contained within the first-name and
last-name variables. I need to "de-identify" those narratives to
remove all actual names before transferring the data to another
machine (the first name and last name variables will simply be
dropped - can easily do that already).

I obtained the code below from Raynald Levesque's web site, and made
only slight modifications. It replaces all "&" contained in the
variable "mystring" with ~FirstName~.  I need to expand upon that
idea by replacing not only a single character or word, but by
replacing all possible first names with ~FirstName~ and all possible
last names with ~LastName~. I can easily get a spss file that
contains all of the first names and last names that might be found
and would like to have spss check for them and replace them as
described above. I can't simply check for one or two names contained
within the same row, but rather from a long list of names contained
within another file, although if necessary, I could convert that list
of names to text and paste it into syntax.

Could someone show me how to modify the syntax below to do something
like this? I've also provided an example of what the results should
look like below the syntax.

Thanks in advance

Jeff


DO IF (INDEX(mystring,"&")>0).
    LOOP.
       COMPUTE mystring =
          CONCAT(SUBSTR(mystring,1,INDEX(mystring,"&")-1),
                 "~FirstName~",
                 SUBSTR(mystring,INDEX(mystring,"&")+1,
                        LENGTH(mystring)-INDEX(mystring,"&"))).
    END LOOP IF (INDEX(mystring,"&")=0).
END IF.
EXECUTE.


OriginalNarrative                              ModifiedNarrative
John Smith went to the store           ~FirstName~ ~LastName~ went to the store

but I would need to search for all names contained in another file
(rather than just a single name) - e.g.,

FirstName
John
Jim
Tim
Dave
etc.

Re: De-Identifying String Variables...

Richard Ristow
At 05:32 PM 8/27/2007, Jeff wrote:

>I have 3 string variables: 1) First Name, 2) Last Name, 3) Narrative.
>The [narrative] contains text up to several thousand characters long.
>Some of that text contains the first and last names that are contained
>within the first-name and last-name variables. I need to "de-identify"
>those narratives to remove all actual names.
>
>I obtained the code below from Raynald Levesque's web site, and made
>only slight modifications. It replaces all "&" contained in the
>variable "mystring" with ~FirstName~.

First, if you have SPSS 14 or later, string function 'REPLACE' gives
much simpler syntax to replace a single name. As in Raynald's example,
it's "&" that will be replaced in this example, but you can use any
string:

COMPUTE mystring = REPLACE(mystring,"&","~FirstName~").

>I need to replace not only a single character or word, but all
>possible first names with ~FirstName~ and all possible last names with
>~LastName~. I can easily get a spss file that contains all of the first
>names and last names that might be found and would like to have spss
>check for them and replace them as described above.

OK, I've preached 'long' organization forever, but here's a case where
you need 'wide'. That's because this includes a many-to-many merge:
you're checking every line in your file against every name on the list.
Changing one file to 'wide' organization is a good way to do this, in
SPSS.

This syntax isn't checked. Generating test data would be more elaborate
than I'm up for just now.

Here's file NAMES, with the list of names:
|-----------------------------|---------------------------|
|Output Created               |27-AUG-2007 18:50:04       |
|-----------------------------|---------------------------|
Name         Type

George       F
Peter        F
Susan        F
Elizabeth    F
Eliza        F
Smith        L
Jones        L
Robinson     L

Number of cases read:  8    Number of cases listed:  8

'Type' is not sex, but classification, as a first name or last name. Do
what you want, with any names that occur in both roles.


If you derive the name list from your original file (as is probably
easiest), the names may come out in the order they occur in the file,
like this, rather than all the first names and then all the last names:


George       F
Smith        L
Peter        F
Jones        L
Susan        F
Robinson     L

That's no problem. You'll probably want to sort and drop duplicates,
though.

.............
Next, a step that may not be obvious: sort names in descending length
order. You have to match the longer names first, or you'll hang up on
short names that occur as parts of longer names:

'Elizabeth' -> '~FirstName~beth'


NUMERIC  NameLen (F4).
COMPUTE  NameLen = LENGTH(RTRIM(NAME)).
SORT CASES
       BY NameLen (D).
LIST.

List
|-----------------------------|---------------------------|
|Output Created               |27-AUG-2007 18:53:06       |
|-----------------------------|---------------------------|
Name         Type NameLen

Elizabeth    F         9
Robinson     L         8
George       F         6
Peter        F         5
Susan        F         5
Eliza        F         5
Smith        L         5
Jones        L         5

Number of cases read:  8    Number of cases listed:  8
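
A minimal sketch of why the ordering matters, written in Python rather than SPSS syntax so it can be run on its own. The helper replace_names and the sample sentence are made up for illustration; it uses plain substring replacement, which is how the REPLACE-based approach in this post behaves.

```python
def replace_names(text, names, tag="~FirstName~"):
    """Replace each name, in the order given, by plain substring replacement."""
    for name in names:
        text = text.replace(name, tag)
    return text

names = ["Eliza", "Elizabeth"]
text = "Elizabeth met Eliza."

# Short name first: "Eliza" is replaced inside "Elizabeth", leaving "beth".
short_first = replace_names(text, names)

# Longest name first, mirroring the SORT CASES BY NameLen (D) step.
long_first = replace_names(text, sorted(names, key=len, reverse=True))

print(short_first)
print(long_first)
```

With the short name first, the leftover "beth" is exactly the leak described above; with the longest name first, both names are fully masked.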


Now, change to wide form:

NUMERIC NoKey  (F2).
COMPUTE NoKey = 1.

CASESTOVARS
  /ID      = NoKey
  /DROP    = NameLen
  /GROUPBY = VARIABLE .


Cases to Variables
|----------------------------|---------------------------|
|Output Created              |27-AUG-2007 19:07:41       |
|----------------------------|---------------------------|
[WideName]

Generated Variables
|---------|------|
|Original |Result|
|Variabl  |------|
|e        |Name  |
|-------|-|------|
|Name   |1|Name.1|
|       |2|Name.2|
|       |3|Name.3|
|       |4|Name.4|
|       |5|Name.5|
|       |6|Name.6|
|       |7|Name.7|
|       |8|Name.8|
|-------|-|------|
|Type   |1|Type.1|
|       |2|Type.2|
|       |3|Type.3|
|       |4|Type.4|
|       |5|Type.5|
|       |6|Type.6|
|       |7|Type.7|
|       |8|Type.8|
|-------|-|------|

Processing Statistics
|---------------|---|
|Cases In       |8  |
|Cases Out      |1  |
|---------------|---|
|Cases In/Cases |8.0|
|Out            |   |
|---------------|---|
|Variables In   |4  |
|Variables Out  |17 |
|---------------|---|
|Index Values   |8  |
|---------------|---|

LIST.

List
|-----------------------------|---------------------------|
|Output Created               |27-AUG-2007 19:07:41       |
|-----------------------------|---------------------------|
[WideName]

The variables are listed in the following order:

LINE   1: NoKey Name.1 Name.2 Name.3 Name.4
LINE   2: Name.5 Name.6 Name.7 Name.8 Type.1 Type.2 Type.3 Type.4 Type.5 Type.6 Type.7
LINE   3: Type.8

       NoKey:  1 Elizabeth    Robinson     George       Peter
       Name.5: Susan        Eliza        Smith        Jones        F L F F F F L
       Type.8: L


Number of cases read:  1    Number of cases listed:  1
.............


Now you can do the replacements; this is the part I haven't tested. The
VECTOR statements and the upper bound on the LOOP depend on how many
names there are; that is, they have to be dynamically-generated code,
based on the data. In the old days, you'd do this by generating the
code in SPSS, writing it to an external file, and then INCLUDing or
INSERTing it. Nowadays, more likely Python.

GET FILE=RealData.

NUMERIC NoKey  (F2).
COMPUTE NoKey = 1.

MATCH FILES
    /FILE=*
    /TABLE=WideName
    /BY NoKey.

VECTOR Name = Name.1 TO Name.8
       /Type = Type.1 TO Type.8.

LOOP #NameIdx = 1 TO 8.
.  DO IF   Type(#NameIdx) EQ 'F'.
*  .. Don't forget 'RTRIM' in this COMPUTE ....... .
.     COMPUTE Narrativ = REPLACE(Narrativ,
                                  RTRIM(Name(#NameIdx)),
                                  "~FirstName~").
.  ELSE.
*  .. Don't forget 'RTRIM' in this COMPUTE ....... .
.     COMPUTE Narrativ = REPLACE(Narrativ,
                                  RTRIM(Name(#NameIdx)),
                                  "~LastName~").
.  END IF.
END LOOP.
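
One way the "dynamically-generated code" could be produced is sketched below in Python: a function that writes out the VECTOR and LOOP block for a given number of names. This is an untested illustration, not part of the original syntax. The function name build_replace_syntax is hypothetical, and the count of names is assumed to be known already (for example, from the number of cases in the name file).

```python
def build_replace_syntax(n_names, narrative_var="Narrativ"):
    """Return SPSS syntax text for the VECTOR/LOOP block, sized for n_names."""
    if n_names < 1:
        raise ValueError("need at least one name")
    v = narrative_var
    lines = [
        "VECTOR Name = Name.1 TO Name.%d" % n_names,
        "       /Type = Type.1 TO Type.%d." % n_names,
        "LOOP #NameIdx = 1 TO %d." % n_names,
        ".  DO IF   Type(#NameIdx) EQ 'F'.",
        '.     COMPUTE %s = REPLACE(%s, RTRIM(Name(#NameIdx)), "~FirstName~").' % (v, v),
        ".  ELSE.",
        '.     COMPUTE %s = REPLACE(%s, RTRIM(Name(#NameIdx)), "~LastName~").' % (v, v),
        ".  END IF.",
        "END LOOP.",
    ]
    return "\n".join(lines)

# Generate the block for the eight-name example and show it.
syntax = build_replace_syntax(8)
print(syntax)
```

The generated text could be written to a file and run with INSERT, or, inside a BEGIN PROGRAM block, presumably passed to spss.Submit.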




=============================
APPENDIX: (partial) test data
=============================

Re: De-Identifying String Variables...

Jeff-125
At 06:20 PM 8/27/2007, you wrote:

>Now you can do the replacements; this is the part I haven't tested.
>The VECTOR statements and the upper bound on the LOOP depend on how
>many names there are; that is, they have to be dynamically-generated
>code, based on the data.


Thanks. I think that I understand the idea and will just have to
spend a few minutes with the code tomorrow to experiment and see it
work first hand. The one thing that I didn't mention is that I will
have about a half-million of these narratives and each may be up to
about 7000 characters. There may also be up to about 30,000 names,
but many will be duplicates and we will remove them. Will the number
of names cause any problem with the vector command or any other
command?  We can easily just page down to see how many names we end
up with, so determining the upper limit shouldn't be a problem.

I do have spss 14 so the replace command will work. ...would never
have thought about the sort by length by myself without running into
the problem.

Jeff

Re: De-Identifying String Variables...

Richard Ristow
In reply to this post by Jeff-125
Second postscript:

At 05:32 PM 8/27/2007, Jeff wrote:

>I obtained the code below from Raynald Levesque's web site, and made
>only slight modifications. It replaces all "&" contained in the
>variable "mystring" with ~FirstName~.
>
>DO IF (INDEX(mystring,"&")>0).
>    LOOP.
>       COMPUTE mystring =
>            CONCAT(SUBSTR(mystring,1,INDEX(mystring,"&")-1),
>                          "~FirstName~",
>                      SUBSTR(mystring,INDEX(mystring,"&")+1,
>                             LENGTH(mystring)-INDEX(mystring,"&"))).
>    END LOOP IF (INDEX(mystring,"&")=0).
>END IF.

Be careful, if you do have to use this one. I don't think it will work
unless the string you're replacing has length 1, as "&" does. (I
haven't tested, and could be wrong; it's easy to misunderstand code
when you're just reading it.)

A rewrite to fix this wouldn't be too difficult. I'm not doing one
right now, since you probably do have the 'REPLACE' function.
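
For reference, a minimal sketch of what such a rewrite might look like, in Python rather than SPSS syntax and using only find, slice, and concatenate (the counterparts of INDEX, SUBSTR, and CONCAT). As far as one can tell from reading the quoted code, it skips exactly one character after each match (the "+1"), which is why it fits a one-character target. The sketch skips the full length of the target instead. The function name replace_all is made up for illustration.

```python
def replace_all(text, target, replacement):
    """Replace every occurrence of target using only find/slice/concatenate."""
    if not target:
        raise ValueError("target must not be empty")
    result = ""
    pos = text.find(target)
    while pos != -1:
        # Keep what precedes the match, append the replacement, then
        # skip len(target) characters rather than a fixed 1.
        result += text[:pos] + replacement
        text = text[pos + len(target):]
        pos = text.find(target)
    # Whatever is left contains no further matches.
    return result + text

print(replace_all("John & John", "John", "~FirstName~"))
```

Only the not-yet-processed remainder is searched on each pass, so the loop also terminates when the replacement text happens to contain the target.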

Re: De-Identifying String Variables...

Richard Ristow
In reply to this post by Jeff-125
Postscript, with a question about what you want:

At 05:32 PM 8/27/2007, Jeff wrote:

>I have 3 string variables: 1) First Name, 2) Last Name, 3) Narrative.
>The first two variables are self-explanatory. Some of [the narrative]
>text contains the first and last names [from] the first-name and
>last-name variables. I need to "de-identify" those narratives to
>remove all actual names.

You wrote that, and also

>I need to replace not only a single character or word, but all
>possible first names with ~FirstName~ and all possible last names with
>~LastName~. I can easily get a spss file that contains all of the
>first names and last names that might be found and would like to have
>spss check for them and replace them as described above.

I wrote my reply to solve the problem, as best I could, as you describe it.

But, do you need to remove "all possible first names and all possible
last names" from the narrative, or only those of the individual - i.e.,
the names in the individual's First Name and Last Name variables?

Your first paragraph suggests that you only need to do the latter. If
that's the case, you don't need a separate file of names, with the
extra logic to handle it. You can just replace the first name and last
name from the current record, in each narrative.

Re: De-Identifying String Variables...

Richard Ristow
In reply to this post by Jeff-125
At 09:15 PM 8/27/2007, Jeff wrote:

>I think that I understand the idea and will just have to spend a few
>minutes to experiment and see it work first hand.

>The one thing that I didn't mention is that I will have about a
>half-million of these narratives and each may be up to about 7000
>characters. There may also be up to about 30,000 names, but many will
>be duplicates and we will remove them. Will the number of names cause
>any problem with the vector command or any other command?

Shouldn't be; I don't see why a vector can't have length 30,000.

HOWEVER, processing may be #slow#. If you have 30,000 names, and you
scan every one of your half-million narratives for every one of 30,000
names (I hope you'll actually have many fewer) -

(0.5E6)*(30E3) = 15*1E9

15 billion of something will take quite a while, even if 'something'
doesn't take very long.
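
The arithmetic can be sketched in Python as below. The per-scan cost is a pure assumption chosen for illustration, not a measurement of SPSS; the function name estimate_hours is likewise made up.

```python
def estimate_hours(n_narratives, n_names, seconds_per_scan):
    """Rough run time if every narrative is scanned once for every name."""
    scans = n_narratives * n_names
    return scans, scans * seconds_per_scan / 3600.0

# Half a million narratives, up to 30,000 names (before de-duplication).
# ASSUMPTION: one microsecond per scan; measure on real data instead.
scans, hours = estimate_hours(500000, 30000, 1e-6)
print(scans)   # number of narrative-by-name scans
print(hours)   # hours under the assumed per-scan cost
```

Since the total is proportional to the number of names, removing duplicates from the name list cuts the run time by the same factor.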

Good luck!
Richard

Re: De-Identifying String Variables...

Peck, Jon
In reply to this post by Richard Ristow
This is a perfect application for a regular expression replace, which can also be flexible regarding case and can easily handle word boundaries.  This is available if you can use Python programmability and have at least SPSS 15.  It deals with repeated occurrences, differences in case, and names with possessives.  It also handles names like O'Neill or van Gogh.  The SPSS variable names in the code below must exactly match the case of the names in SPSS.

This code requires some downloadable supplementary modules from SPSS Developer Central (www.spss.com/devcentral): spssaux, spssdata, and namedTuple.

Here is the code.  Annotations below.  If anyone needs it emailed, since the mail tends to mess up line breaks and indenting, send me a message ([hidden email]).

begin program.
import spss, spssaux, spssdata, re
vard = spssaux.VariableDict()
# Write cursor over the three variables (names are case sensitive).
curs = spssdata.Spssdata(indexes='firstname lastname narrative', accessType='w')
# New string variable, wider than narrative, to hold the de-identified text.
curs.append(spssdata.vdef("anonnarrative",vtype=vard['narrative'].VariableType + 100))
curs.commitdict()
wbound = r"\b"
for case in curs:
    # re.escape keeps any regex metacharacters in a name literal.
    fnregex = re.compile(wbound + re.escape(case.firstname.strip()) + wbound, flags=re.IGNORECASE)
    lnregex = re.compile(wbound + re.escape(case.lastname.strip()) + wbound, flags=re.IGNORECASE)
    newnarr = fnregex.sub("-firstname-", case.narrative)
    newnarr = lnregex.sub("-lastname-", newnarr)
    curs.casevalues([newnarr])
curs.CClose()
end program.

First the code gets an SPSS variable dictionary.  Then it requests three variables and defines one new variable, anonnarrative, which will hold the modified string.  anonnarrative is made the size of the narrative variable plus some extra room in case the names are shorter than the replacement strings.

It loops over the cases and constructs a regular expression for firstname and lastname.  The regular expression form is
"\bjoe\b"
which matches "joe" as a word.

It replaces all occurrences of the firstname and lastname strings.
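
The word-boundary behavior can be tried outside SPSS with plain Python. This is a small sketch: the helper name mask and the sample sentences are invented for illustration, and only the standard re module is used.

```python
import re

def mask(name, tag, text):
    """Replace whole-word, case-insensitive occurrences of name with tag."""
    pattern = re.compile(r"\b" + re.escape(name) + r"\b", flags=re.IGNORECASE)
    return pattern.sub(tag, text)

# Possessive and upper-case forms match; "Joel" does not, because the
# "l" after "Joe" means there is no word boundary at that point.
masked_first = mask("joe", "-firstname-", "Joe's sister saw JOE and Joel.")

# A name containing an apostrophe is still matched as one unit.
masked_last = mask("O'Neill", "-lastname-", "Mr. O'Neill left early.")

print(masked_first)
print(masked_last)
```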

Regards,
Jon Peck




Re: De-Identifying String Variables...

Peck, Jon
In reply to this post by Jeff-125
p.s.  This example did not address the list of other names, but it would be easy to read them from a file and apply the same regular expression replacement technique in a loop.
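
A rough sketch of that idea in plain Python, outside SPSS. It assumes a text source with one name per line followed by its type (F or L), as in Richard's NAMES listing. The function names, the sample lines, and the sample sentence are all hypothetical. Names are applied longest first, following the earlier advice in this thread.

```python
import re

def load_names(lines):
    """Parse 'Name Type' lines into (name, tag) pairs, longest name first."""
    pairs = []
    for line in lines:
        parts = line.rsplit(None, 1)   # the name itself may contain spaces
        if len(parts) != 2:
            continue                   # skip blank or malformed lines
        name, kind = parts
        tag = "-firstname-" if kind == "F" else "-lastname-"
        pairs.append((name.strip(), tag))
    return sorted(pairs, key=lambda p: len(p[0]), reverse=True)

def anonymize(text, pairs):
    """Apply a whole-word, case-insensitive replacement for every name."""
    for name, tag in pairs:
        text = re.sub(r"\b" + re.escape(name) + r"\b", tag, text,
                      flags=re.IGNORECASE)
    return text

# A list stands in for the file; an open file object iterates the same way.
name_lines = ["George F", "Eliza F", "Elizabeth F", "Smith L", "van Gogh L", ""]
pairs = load_names(name_lines)
result = anonymize(
    "George Smith met Elizabeth van Gogh; eliza's file is attached.", pairs)
print(result)
```

Inside SPSS, anonymize would take the place of the two per-case substitutions in the loop over the cursor.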


Re: De-Identifying String Variables...

Jeff-125
In reply to this post by Richard Ristow
At 09:56 PM 8/27/2007, you wrote:
>But, do you need to remove "all possible first names and all
>possible last names" from the narrative, or only those of the
>individual - i.e., the names in the individual's First Name and Last
>Name variables?
>
>Your first paragraph suggests that you only need to do the latter.
>If that's the case, you don't need a separate file of names, with
>the extra logic to handle it. You can just replace the first name
>and last name from the current record, in each narrative.


From what I've seen of the data (I don't have it in the office yet;
the whole point of removing the names is that I have to do so before
taking it out of the agency and into my office), the most likely scenario
will be that the names contained within the narrative will be the
same names as in the first and last name variables from the same
record/row. But it is possible to have the names from other
records/rows in there as well, so your original solution was along
the lines of what I want.

...and I do have spss 14 installed, so I'll have the replace
function available.

...will try this out in the next day or so.

Thanks


Jeff

Re: De-Identifying String Variables...

Richard Ristow
At 10:47 AM 8/28/2007, Jeff wrote:

>At 09:56 PM 8/27/2007, you wrote:
>>Do you need to remove "all possible first names and all possible last
>>names" from the narrative, or only those of the individual - i.e.,
>>the names in the individual's First Name and Last Name variables?
>
>The most likely scenario will be that the names contained within the
>narrative will be the same names as in the first and last name
>variables from the same record/row. But it is possible to have the
>names from other records/rows in there as well, so your original
>solution was along the lines of what I want.

Fair enough. In some circumstances it would be acceptable to retain
*other* names in readable form (since they don't identify the current
subject, and don't clearly identify anybody else), but in others not.
You'll know what your requirements are.

If you do need to replace all names, here's an extension that may be
useful:

First, replace the individual's first and last names, separately, by
'~FirstName~' and '~LastName~'. THEN, go through the whole set of
possible names, replacing first names by '~OtherFirst~' and last names
by '~OtherLast~'. That may make the result more comprehensible, by
making it clear when a name refers to the current person, and when it
refers to somebody else.
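
For anyone who ends up on the Python route, the two-pass idea above
might be sketched roughly like this. It is plain Python, not SPSS
syntax, and the function name `deidentify` and its parameters are
hypothetical, invented for the illustration. It uses simple substring
replacement, so it behaves like REPLACE and shares its weakness with
short names that sit inside longer ones.

```python
def deidentify(narrative, first, last, other_first, other_last):
    """Two-pass replacement: the subject's own names, then everyone else's.

    Hypothetical helper for illustration; plain substring matching only.
    """
    # Pass 1: the names taken from the current record.
    text = narrative.replace(first, "~FirstName~").replace(last, "~LastName~")

    # Pass 2: every other known name. Skip empty strings and the
    # subject's own names, which pass 1 has already handled.
    others = [(n, "~OtherFirst~") for n in other_first if n and n != first]
    others += [(n, "~OtherLast~") for n in other_last if n and n != last]

    # Longest names first, so a short name cannot clobber part of a
    # longer one within this pass.
    for name, tag in sorted(others, key=lambda pair: len(pair[0]), reverse=True):
        text = text.replace(name, tag)
    return text


example = deidentify(
    "Mary Smith met John Doe.",
    "Mary", "Smith",
    ["Mary", "John"], ["Smith", "Doe"],
)
print(example)
```

One caveat with this sketch: a name that happens to be a piece of a
placeholder (say, a surname 'Last') would damage the tags already
inserted, so the list of names is worth checking for that.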

(That does re-raise the question of short names being subsets of longer
ones. If you have a subject named 'Eliza', and that person's narrative
refers to an 'Elizabeth', appearance of '~FirstName~beth' in the
narrative could blow anonymity of BOTH people. This could take some
special-case code. As Jon Peck writes, matching by regular expressions
could solve this: match names only when not preceded or followed by
letters. But without SPSS 15, you may not be able to do this.)
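
To make the "not preceded or followed by letters" rule concrete, here
is a minimal sketch using Python's standard `re` module on its own. It
does not show how to wire this into SPSS through the Python plug-in,
and the function name `replace_whole_names` is hypothetical. It
assumes names are matched case-sensitively and that "letter" means
A-Z or a-z only; accented letters would need a wider character class.

```python
import re


def replace_whole_names(text, names, tag):
    """Replace each name with tag, but only where the name is not
    directly preceded or followed by an ASCII letter."""
    names = [n for n in names if n]          # ignore empty entries
    if not names:
        return text
    # Longest first, so the alternation tries 'Elizabeth' before 'Eliza'.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(n) for n in ordered)
    # Negative lookbehind and lookahead reject matches embedded in a word.
    pattern = re.compile(r"(?<![A-Za-z])(?:" + alternatives + r")(?![A-Za-z])")
    return pattern.sub(tag, text)


print(replace_whole_names("Eliza wrote to Elizabeth.", ["Eliza"], "~FirstName~"))
```

With only 'Eliza' in the list, 'Elizabeth' is left untouched rather
than turned into '~FirstName~beth'; a second call with the full name
list and an "other" tag would then handle it.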

You still may have a daunting speed problem, but you'll find that out,
one way or the other, soon enough.

Good luck to you!
Richard

Re: De-Identifying String Variables...

Jeff-125
At 12:55 PM 8/28/2007, you wrote:
>But without SPSS 15, you may not be able to do this.)
>
>You still may have a daunting speed problem, but you'll find that
>out, one way or the other, soon enough.
>
>Good luck to you!
>Richard


Thanks (both Richard and Jon),

...very valuable information. I have v15 available in my office, but
I will have to get the IT people from the agency I'm working with to
come back and install the new version on the agency computer we are
using. I've worked with regular expressions only a little, and never
within SPSS/Python. I haven't done any Python at all, for that matter,
but I have some experience with similar languages, so perhaps now is
the time to learn some.

In the event that we go with the Python solution, could you tell me
whether the extra modules will require administrator access to
install? I only have that on my own machines.

Thanks

Jeff

Re: De-Identifying String Variables...

Peck, Jon
On your question about administrator access: the plug-in and supplementary modules should not need admin privileges to install, but note that a few files are written to the SPSS installation directory.

-Jon


-----Original Message-----
From: SPSSX(r) Discussion on behalf of Jeff
Sent: Tue 8/28/2007 3:46 PM
To: [hidden email]
Subject:      Re: [SPSSX-L] De-Identifying String Variables...
 