|
I have a somewhat unique (at least to me) string manipulation task.
I'm working with some agency data. I'll simplify my discussion of the structure of the data to include only the information necessary to ask my question. I have 3 string variables: 1) First Name, 2) Last Name, 3) Narrative. The first two variables are self-explanatory. The last contains text up to several thousand characters long. Some of that text contains the first and last names that are contained within the first-name and last-name variables.

I need to "de-identify" those narratives to remove all actual names before transferring the data to another machine (the first name and last name variables will simply be dropped - I can easily do that already). I obtained the code below from Raynald Levesque's web site, and made only slight modifications. It replaces all "&" contained in the variable "mystring" with ~FirstName~.

I need to expand upon that idea by replacing not only a single character or word, but all possible first names with ~FirstName~ and all possible last names with ~LastName~. I can easily get an SPSS file that contains all of the first names and last names that might be found, and would like to have SPSS check for them and replace them as described above. I can't simply check for one or two names contained within the same row; I need to check against a long list of names contained within another file, although if necessary, I could convert that list of names to text and paste it into syntax.

Could someone show me how to modify the syntax below to do something like this? I've also provided an example of what the results should look like below the syntax.

Thanks in advance
Jeff

DO IF (INDEX(mystring,"&")>0).
 LOOP.
  COMPUTE mystring =
     CONCAT(SUBSTR(mystring,1,INDEX(mystring,"&")-1),
            "~FirstName~",
            SUBSTR(mystring,INDEX(mystring,"&")+1,
                   LENGTH(mystring)-INDEX(mystring,"&"))).
 END LOOP IF (INDEX(mystring,"&")=0).
END IF.
EXECUTE.
OriginalNarrative               ModifiedNarrative
John Smith went to the store    ~FirstName~ ~LastName~ went to the store

but I would need to search for all names contained in another file (rather than just a single name) - e.g.,

FirstName
John
Jim
Tim
Dave
etc.
|
At 05:32 PM 8/27/2007, Jeff wrote:
>I have 3 string variables: 1) First Name, 2) Last Name, 3) Narrative.
>The [narrative] contains text up to several thousand characters long.
>Some of that text contains the first and last names that are contained
>within the first-name and last-name variables. I need to "de-identify"
>those narratives to remove all actual names. I obtained the code below
>from Raynald Levesque's web site, and made only slight modifications.
>It replaces all "&" contained in the variable "mystring" with
>~FirstName~.

First, if you have SPSS 14 or later, string function 'REPLACE' gives much simpler syntax to replace a single name. As in Raynald's example, it's "&" that will be replaced in this example, but you can use any string:

COMPUTE mystring = REPLACE(mystring,"&","~FirstName~").

>I need to replace not only a single character or word, but all
>possible first names with ~FirstName~ and all possible last names with
>~LastName~. I can easily get a spss file that contains all of the
>first names and last names that might be found and would like to have
>spss check for them and replace them as described above.

OK, I've preached 'long' organization forever, but here's a case where you need 'wide'. That's because this includes a many-to-many merge: you're checking every line in your file against every name on the list. Changing one file to 'wide' organization is a good way to do this, in SPSS.

This syntax isn't checked. Generating test data would be more elaborate than I'm up for just now.

Here's file NAMES, with the list of names:

|-----------------------------|---------------------------|
|Output Created               |27-AUG-2007 18:50:04       |
|-----------------------------|---------------------------|
Name      Type

George    F
Peter     F
Susan     F
Elizabeth F
Eliza     F
Smith     L
Jones     L
Robinson  L

Number of cases read:  8    Number of cases listed:  8

'Type' is not sex, but classification, as a first name or last name. Do what you want with any names that occur in both roles.
If you derive the name list from your original file (as is probably easiest), the names may come out in order in the file, like this, rather than all the first names and then all the last names:

George    F
Smith     L
Peter     F
Jones     L
Susan     F
Robinson  L

That's no problem. You'll probably want to sort and drop duplicates, though.
.............
Next, a step that may not be obvious: sort names in descending length order. You have to match the longer names first, or you'll hang up on short names that occur as parts of longer names: 'Elizabeth' -> '~FirstName~beth'

NUMERIC NameLen (F4).
COMPUTE NameLen = LENGTH(RTRIM(NAME)).
SORT CASES BY NameLen (D).
LIST.

List
|-----------------------------|---------------------------|
|Output Created               |27-AUG-2007 18:53:06       |
|-----------------------------|---------------------------|
Name      Type NameLen

Elizabeth F        9
Robinson  L        8
George    F        6
Peter     F        5
Susan     F        5
Eliza     F        5
Smith     L        5
Jones     L        5

Number of cases read:  8    Number of cases listed:  8

Now, change to wide form:

NUMERIC NoKey (F2).
COMPUTE NoKey = 1.
CASESTOVARS
 /ID = NoKey
 /DROP = NameLen
 /GROUPBY = VARIABLE .

Cases to Variables
|----------------------------|---------------------------|
|Output Created              |27-AUG-2007 19:07:41       |
|----------------------------|---------------------------|
[WideName]

Generated Variables
|----------|------|
|Original  |Result|
|Variable  |Name  |
|--------|-|------|
|Name    |1|Name.1|
|        |2|Name.2|
|        |3|Name.3|
|        |4|Name.4|
|        |5|Name.5|
|        |6|Name.6|
|        |7|Name.7|
|        |8|Name.8|
|--------|-|------|
|Type    |1|Type.1|
|        |2|Type.2|
|        |3|Type.3|
|        |4|Type.4|
|        |5|Type.5|
|        |6|Type.6|
|        |7|Type.7|
|        |8|Type.8|
|--------|-|------|

Processing Statistics
|------------------|---|
|Cases In          |8  |
|Cases Out         |1  |
|Cases In/Cases Out|8.0|
|Variables In      |4  |
|Variables Out     |17 |
|Index Values      |8  |
|------------------|---|

LIST.
List
|-----------------------------|---------------------------|
|Output Created               |27-AUG-2007 19:07:41       |
|-----------------------------|---------------------------|
[WideName]
The variables are listed in the following order:

LINE 1: NoKey Name.1 Name.2 Name.3 Name.4
LINE 2: Name.5 Name.6 Name.7 Name.8 Type.1 Type.2 Type.3 Type.4 Type.5 Type.6 Type.7
LINE 3: Type.8

  NoKey:  1 Elizabeth Robinson George Peter
 Name.5: Susan Eliza Smith Jones F L F F F F L
 Type.8: L

Number of cases read:  1    Number of cases listed:  1
.............
Now you can do the replacements; this is the part I haven't tested. The VECTOR statements and the upper bound on the LOOP depend on how many names there are; that is, they have to be dynamically-generated code, based on the data. In the old days, you'd do this by generating the code in SPSS, writing it to an external file, and then INCLUDing or INSERTing it. Nowadays, more likely Python.

GET FILE=RealData.
NUMERIC NoKey (F2).
COMPUTE NoKey = 1.
MATCH FILES
  /FILE=*
  /TABLE=WideName
  /BY NoKey.

VECTOR Name = Name.1 TO Name.8
      /Type = Type.1 TO Type.8.
LOOP #NameIdx = 1 TO 8.
.  DO IF Type(#NameIdx) EQ 'F'.
*     .. Don't forget 'RTRIM' in this COMPUTE ....... .
.     COMPUTE Narrativ = REPLACE(Narrativ,
                                 RTRIM(Name(#NameIdx)),
                                 "~FirstName~").
.  ELSE.
*     .. Don't forget 'RTRIM' in this COMPUTE ....... .
.     COMPUTE Narrativ = REPLACE(Narrativ,
                                 RTRIM(Name(#NameIdx)),
                                 "~LastName~").
.  END IF.
END LOOP.

=============================
APPENDIX: (partial) test data
=============================
|
At 06:20 PM 8/27/2007, you wrote:
>Now you can do the replacements; this is the part I haven't tested.
>The VECTOR statements and the upper bound on the LOOP depend on how
>many names there are; that is, they have to be dynamically-generated
>code, based on the data.

Thanks. I think that I understand the idea and will just have to spend a few minutes with the code tomorrow to experiment and see it work first hand.

The one thing that I didn't mention is that I will have about a half-million of these narratives and each may be up to about 7000 characters. There may also be up to about 30,000 names, but many will be duplicates and we will remove them. Will the number of names cause any problem with the vector command or any other command? We can easily just page down to see how many names we end up with, so determining the upper limit shouldn't be a problem.

I do have spss 14 so the replace command will work.

...would never have thought about the sort by length by myself without running into the problem.

Jeff
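The point quoted above, that the VECTOR bounds and the LOOP limit have to be generated from the number of names, can be sketched in Python as plain string building. This is only an illustration under assumptions: the function name `build_replace_syntax` and its parameters are hypothetical, and the variable names (Name.1 ..., Type.1 ..., Narrativ) simply follow the example earlier in the thread.

```python
def build_replace_syntax(n_names, narrative_var="Narrativ"):
    """Return SPSS syntax text for a wide name list with n_names entries.

    Hypothetical helper: it only fills the name count into the VECTOR and
    LOOP statements from the thread; it does not run anything in SPSS.
    """
    if n_names < 1:
        raise ValueError("n_names must be at least 1")
    template = """VECTOR Name = Name.1 TO Name.{n}
      /Type = Type.1 TO Type.{n}.
LOOP #NameIdx = 1 TO {n}.
.  DO IF Type(#NameIdx) EQ 'F'.
.     COMPUTE {v} = REPLACE({v}, RTRIM(Name(#NameIdx)), "~FirstName~").
.  ELSE.
.     COMPUTE {v} = REPLACE({v}, RTRIM(Name(#NameIdx)), "~LastName~").
.  END IF.
END LOOP."""
    return template.format(n=n_names, v=narrative_var)
```

Calling `build_replace_syntax(8)` gives the same statements as the eight-name example. The text could then be written to a file for INSERT, or, inside a BEGIN PROGRAM block, passed to `spss.Submit`.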
|
In reply to this post by Jeff-125
Second postscript:
At 05:32 PM 8/27/2007, Jeff wrote:

>I obtained the code below from Raynald Levesque's web site, and made
>only slight modifications. It replaces all "&" contained in the
>variable "mystring" with ~FirstName~.
>
>DO IF (INDEX(mystring,"&")>0).
> LOOP.
>  COMPUTE mystring =
>     CONCAT(SUBSTR(mystring,1,INDEX(mystring,"&")-1),
>            "~FirstName~",
>            SUBSTR(mystring,INDEX(mystring,"&")+1,
>                   LENGTH(mystring)-INDEX(mystring,"&"))).
> END LOOP IF (INDEX(mystring,"&")=0).
>END IF.

Be careful, if you do have to use this one. I don't think it will work unless the string you're replacing has length 1, as "&" does. (I haven't tested, and could be wrong; it's easy to misunderstand code when you're just reading it.)

A rewrite to fix this wouldn't be too difficult. I'm not doing one right now, since you probably do have the 'REPLACE' function.
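To illustrate the length-1 concern, here is a small Python sketch of the same find-and-splice logic with the skip distance taken from the length of the search string rather than fixed at 1. It is not the rewrite of the SPSS syntax mentioned above, only the idea behind it; the function name `replace_all` is hypothetical, and Python's built-in `str.replace` already does the same job.

```python
def replace_all(text, target, replacement):
    """Replace every occurrence of target, mirroring the INDEX/SUBSTR loop."""
    if not target:
        raise ValueError("target must be non-empty")
    pos = text.find(target)
    while pos >= 0:
        # Skip len(target) characters, not 1: this is the step the
        # quoted syntax hard-codes for a one-character target.
        text = text[:pos] + replacement + text[pos + len(target):]
        # Resume the search after the inserted replacement, so a replacement
        # that itself contains the target cannot cause an endless loop.
        pos = text.find(target, pos + len(replacement))
    return text
```

With a one-character target such as "&" this behaves like the quoted loop; with a longer target such as "John" it removes the whole name instead of only its first letter.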
|
In reply to this post by Jeff-125
Postscript, with a question about what you want:
At 05:32 PM 8/27/2007, Jeff wrote:

>I have 3 string variables: 1) First Name, 2) Last Name, 3) Narrative.
>The first two variables are self-explanatory. Some of [the narrative]
>text contains the first and last names [from] the first-name and
>last-name variables. I need to "de-identify" those narratives to
>remove all actual names.

You wrote that, and also

>I need to replace not only a single character or word, but all
>possible first names with ~FirstName~ and all possible last names with
>~LastName~. I can easily get a spss file that contains all of the
>first names and last names that might be found and would like to have
>spss check for them and replace them as described above.

I wrote to solve the problem as best I could, as you describe it. But do you need to remove "all possible first names and all possible last names" from the narrative, or only those of the individual - i.e., the names in the individual's First Name and Last Name variables?

Your first paragraph suggests that you only need to do the latter. If that's the case, you don't need a separate file of names, with the extra logic to handle it. You can just replace the first name and last name from the current record, in each narrative.
|
In reply to this post by Jeff-125
At 09:15 PM 8/27/2007, Jeff wrote:
>I think that I understand the idea and will just have to spend a few
>minutes to experiment and see it work first hand.

>The one thing that I didn't mention is that I will have about a
>half-million of these narratives and each may be up to about 7000
>characters. There may also be up to about 30,000 names, but many will
>be duplicates and we will remove them. Will the number of names cause
>any problem with the vector command or any other command?

Shouldn't be; I don't see why a vector can't have length 30,000.

HOWEVER, processing may be #slow#. If you have 30,000 names, and you scan every one of your half-million narratives for every one of 30,000 names (I hope you'll actually have many fewer) -

   (0.5E6)*(30E3) = 15*1E9

15 billion of something will take quite a while, even if 'something' doesn't take very long.

Good luck!
Richard
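One way around the narratives-times-names product, offered only as a sketch and not as something proposed in the thread, is to split each narrative into words once and look each word up in a set of names. The work then grows with the amount of text, not with the number of names. The function name `deidentify` is hypothetical, and the sketch assumes single-word names made of unaccented letters; names such as O'Neill or van Gogh would need extra handling.

```python
import re

# A "word" here is a run of ASCII letters; this is an assumption of the sketch.
WORD = re.compile(r"[A-Za-z]+")

def deidentify(narrative, first_names, last_names):
    """Replace whole words found in the name sets (sets of lower-case names)."""
    def swap(match):
        word = match.group(0)
        key = word.lower()
        if key in first_names:
            return "~FirstName~"
        if key in last_names:
            return "~LastName~"
        return word  # not a listed name: leave the word unchanged
    # One pass over the text; each word costs one set lookup.
    return WORD.sub(swap, narrative)
```

Because only whole words are looked up, 'Eliza' inside 'Elizabeth' is never touched, so the descending-length sort is not needed with this approach. A name that is listed as both a first and a last name is treated as a first name here.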
|
In reply to this post by Richard Ristow
This is a perfect application for a regular expression replace, which can also be flexible regarding case and can easily handle word boundaries. This is available if you can use Python programmability and have at least SPSS 15. It deals with repeated occurrences, differences in case, and names with possessives. It also handles names like O'Neill or van Gogh. The SPSS variable names in the code below must exactly match the case of the names in SPSS.
This code requires some downloadable supplementary modules from SPSS Developer Central (www.spss.com/devcentral): spssaux, spssdata, and namedTuple.

Here is the code. Annotations below. If anyone needs it emailed, since the mail tends to mess up line breaks and indenting, send me a message ([hidden email]).

begin program.
import spss, spssaux, spssdata, re

vard = spssaux.VariableDict()
curs = spssdata.Spssdata(indexes='firstname lastname narrative', accessType='w')
# new string variable: width of narrative plus 100 characters of extra room
curs.append(spssdata.vdef("anonnarrative",vtype=vard['narrative'].VariableType + 100))
curs.commitdict()
wbound = r"\b"
for case in curs:
    # re.escape keeps punctuation in a name from being read as regex syntax
    fnregex = re.compile(wbound + re.escape(case.firstname.strip()) + wbound, flags=re.IGNORECASE)
    lnregex = re.compile(wbound + re.escape(case.lastname.strip()) + wbound, flags=re.IGNORECASE)
    newnarr = fnregex.sub("-firstname-", case.narrative)
    newnarr = lnregex.sub("-lastname-", newnarr)
    curs.casevalues([newnarr])
curs.CClose()
end program.

First the code gets an SPSS variable dictionary. Then it requests three variables and defines one new variable, anonnarrative, which will hold the modified string (built in the Python variable newnarr). anonnarrative is made the size of the narrative variable plus some extra room in case the names are shorter than the replacement strings.

It loops over the cases and constructs a regular expression for firstname and lastname. The regular expression form is "\bjoe\b", which matches "joe" as a word. It replaces all occurrences of the firstname and lastname strings.

Regards,
Jon Peck

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Richard Ristow
Sent: Monday, August 27, 2007 6:20 PM
To: [hidden email]
Subject: Re: [SPSSX-L] De-Identifying String Variables...
|
In reply to this post by Jeff-125
p.s. This example did not address the list of other names, but it would be easy to read them from a file and apply the same regular expression replacement technique in a loop.
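As a rough sketch of that idea, the list of names can be turned into a single regular expression with word boundaries, so each narrative is scanned once per name type. The function names `build_name_regex` and `anonymize` and the sample names are assumptions for illustration; reading the names from a file is left out, and any iterable of strings will do.

```python
import re

def build_name_regex(names):
    """Compile one case-insensitive, whole-word pattern from a list of names."""
    # Drop blanks and duplicates; put longer names first so that a multi-word
    # name is tried before any shorter name it begins with.
    cleaned = sorted({n.strip() for n in names if n.strip()},
                     key=lambda n: (-len(n), n))
    if not cleaned:
        return None
    # re.escape keeps punctuation in a name from acting as regex syntax.
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in cleaned) + r")\b"
    return re.compile(pattern, re.IGNORECASE)

# Illustrative lists only; real lists would come from the names file.
first_re = build_name_regex(["Eliza", "Elizabeth", "George", " "])
last_re = build_name_regex(["Smith", "O'Neill"])

def anonymize(text):
    """Replace listed first names, then listed last names."""
    text = first_re.sub("-firstname-", text)
    return last_re.sub("-lastname-", text)
```

With tens of thousands of names the alternation becomes very long, and how well it performs is something to measure, not assume.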
-----Original Message-----
From: Peck, Jon
Sent: Tuesday, August 28, 2007 8:42 AM
To: 'Richard Ristow'; [hidden email]
Subject: RE: Re: [SPSSX-L] De-Identifying String Variables...
|
In reply to this post by Richard Ristow
At 09:56 PM 8/27/2007, you wrote:
>But, do you need to remove "all possible first names and all
>possible last names" from the narrative, or only those of the
>individual - i.e., the names in the individual's First Name and Last
>Name variables?
>
>Your first paragraph suggests that you only need to do the latter.
>If that's the case, you don't need a separate file of names, with
>the extra logic to handle it. You can just replace the first name
>and last name from the current record, in each narrative.

From what I've seen of the data (I don't have it in the office yet; the idea of removing the names is that I have to do so before taking it out of the agency and into my office), the most likely scenario will be that the names contained within the narrative will be the same names as in the first and last name variables from the same record/row. But it is possible to have the names from other records/rows in there as well, so your original solution was along the lines of what I want.

...and I do have spss 14 installed, so I'll have the replace function available.

...will try this out in the next day or so.

Thanks
Jeff
|
At 10:47 AM 8/28/2007, Jeff wrote:
>At 09:56 PM 8/27/2007, you wrote:
>>Do you need to remove "all possible first names and all possible last
>>names" from the narrative, or only those of the individual - i.e.,
>>the names in the individual's First Name and Last Name variables?
>
>The most likely scenario will be that the names contained within the
>narrative will be the same names as in the first and last name
>variables from the same record/row. But it is possible to have the
>names from other records/rows in there as well, so your original
>solution was along the lines of what I want.

Fair enough. In some circumstances it would be acceptable to retain *other* names in readable form (since they don't identify the current subject, and don't clearly identify anybody else), but in others not. You'll know what your requirements are.

If you do need to replace all names, here's an extension that may be useful: First, replace the individual's first and last names, separately, by '~FirstName~' and '~LastName~'. THEN, go through the whole set of possible names, replacing first names by '~OtherFirst~' and last names by '~OtherLast~'. That may make the result more comprehensible, by making it clear when a name refers to the current person, and when it refers to somebody else.

(That does re-raise the question of short names being subsets of longer ones. If you have a subject named 'Eliza', and that person's narrative refers to an 'Elizabeth', appearance of '~FirstName~beth' in the narrative could blow anonymity of BOTH people. This could take some special-case code. As Jon Peck writes, matching by regular expressions could solve this: match names only when not preceded or followed by letters. But without SPSS 15, you may not be able to do this.)

You still may have a daunting speed problem, but you'll find that out, one way or the other, soon enough.

Good luck to you!
Richard
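The two-pass scheme and the "not preceded or followed by letters" rule described above can be sketched in Python as follows. The helper names `whole_word` and `deidentify_record` are hypothetical, the placeholder strings are the ones suggested above, and the sketch treats only unaccented letters as letters.

```python
import re

def whole_word(name):
    """Pattern matching name only when no letter comes right before or after."""
    return re.compile(r"(?<![A-Za-z])" + re.escape(name) + r"(?![A-Za-z])",
                      re.IGNORECASE)

def deidentify_record(narrative, first, last, other_first, other_last):
    """Pass 1: the subject's own names. Pass 2: every other listed name."""
    out = narrative
    for name, tag in ((first, "~FirstName~"), (last, "~LastName~")):
        if name.strip():  # a blank name would match at every gap; skip it
            out = whole_word(name.strip()).sub(tag, out)
    for names, tag in ((other_first, "~OtherFirst~"), (other_last, "~OtherLast~")):
        # Longest first, so a multi-word name wins over a shorter name inside it.
        for name in sorted(names, key=len, reverse=True):
            if name.strip():
                out = whole_word(name.strip()).sub(tag, out)
    return out
```

Because the subject's own names are replaced first, they are already gone by the time the full list is applied, so the subject's names can stay in the full list without being tagged twice.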
|
At 12:55 PM 8/28/2007, you wrote:
>But without SPSS 15, you may not be able to do this.)
>
>You still may have a daunting speed problem, but you'll find that
>out, one way or the other, soon enough.
>
>Good luck to you!
>Richard

Thanks (both Richard and Jon),

...very valuable information. I have v15 available in my office, but I will have to get the IT people from the agency I'm working with back to install the new version at the agency computer we are using. I've only worked with regular expressions a small bit, and never within spss/python. ...haven't done any python for that matter, but have some experience with similar languages so perhaps now is the time to learn some.

In the event that we go with the python solution, could you tell me if the extra modules will require administrator access to install? I only have that on my own machines.

Thanks
Jeff
|
For your question below, the plug-in and supplementary modules should not need admin privileges to install, but a few files are written to the SPSS installation directory.
-Jon

-----Original Message-----
From: SPSSX(r) Discussion on behalf of Jeff
Sent: Tue 8/28/2007 3:46 PM
To: [hidden email]
Subject: Re: [SPSSX-L] De-Identifying String Variables...
