Hello all,
I'm having trouble using the Index function to search for given words in a string variable. I am using the following syntax (pretend example of words to illustrate my point!): COMPUTE flag=INDEX(UPCASE(str1),'Tom Jones'), INDEX(UPCASE(str1),'Katie Price'), INDEX(UPCASE(str1),'Chris Brown')>0. The problem is, the syntax returns cases I am not looking for as well - for example - Tom Odell (who I am not looking for) is flagged because it found the Tom bit of Tom Jones. Similarly, Tom Jones Smith might be flagged too - which I also dont want. Is there anyway of getting the syntax to work so that it finds cases with only the exact words I specify and nothing else? Any help would be really appreciated! Mike |
your compute command as posted is not syntactically correct. the comma a few character after Jones seems to be a problem.
please supply a snippet of syntax that demonstrates your problem. Also please give more words about your goal. Do you want to know if the whole string is equal to one of your targets? etc. Do you only have the three target names?
Art Kendall
Social Research Consultants |
The syntax snippet would look something like this.
data list list/str1 (a50) wanted_flag (f1). begin data 'Tom Jones' 1 'my friend Tom Jones' 1 'Tom Jones Smith' 0 end data. list.
Art Kendall
Social Research Consultants |
Administrator
|
In reply to this post by Mike Paul
Does something like the following do what you want?
* Generate a small data set to illustrate. DATA LIST LIST / str1(A25). BEGIN DATA "Tom Jones" " Tom Jones " "Katie Price" "Chris Brown" "Tom Odell" "Tom Jones Smith" "TOM JONES" "KATIE PRICE" "CHRIS BROWN" "TOM ODELL" "TOM JONES SMITH" "DAVID TOM JONES" END DATA. LIST. STRING #copy (A25). COMPUTE #copy = CONCAT("!",LTRIM(UPCASE(str1)),"!"). **** Add more needles as needed *****. VECTOR #needle(3,A25). COMPUTE #needle(1) = "!TOM JONES!". COMPUTE #needle(2) = "!KATIE PRICE!". COMPUTE #needle(3) = "!CHRIS BROWN!". *************************************. LOOP # = 1 to 3. // Loop from 1 to number of needles. - COMPUTE flag = CHAR.INDEX(#copy,#needle(#)) GT 0. END LOOP IF flag. FORMATS flag(F1). LIST flag str1. Output: flag str1 1 Tom Jones 1 Tom Jones 1 Katie Price 1 Chris Brown 0 Tom Odell 0 Tom Jones Smith 1 TOM JONES 1 KATIE PRICE 1 CHRIS BROWN 0 TOM ODELL 0 TOM JONES SMITH 0 DAVID TOM JONES
--
Bruce Weaver bweaver@lakeheadu.ca http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." PLEASE NOTE THE FOLLOWING: 1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. 2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
Administrator
|
In reply to this post by Mike Paul
You are shooting yourself in the foot by using UPCASE on the haystack but leaving the needle mixed case.
Ever wonder why it is returning NADA at least the way the code currently stands it should return nothing? "Is there anyway of getting the syntax to work so that it finds cases with only the exact words I specify and nothing else?" Not without building additional logic. Computers do what you tell them to. Not what your magical rainbow Unicorn might accidently poop onto the plate.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
In reply to this post by Art Kendall
The data consists of the company names that individuals in my data-set are employed at (there are 400k+ individuals in my data). The end goal is to flag those individuals employed by certain companies (for example, all those employed by Apple, Tesco, Walmart, etc. - different combinations of employers that I pre-specify). The problem I'm having is that when I run the syntax below, other employers that I don't want are flagged - for example, those who have part of the name of the target employer within their name... as illustrated below:
COMPUTE Flag_T100= sum(INDEX(UPCASE(F_EMPNAME), 'APPLE'),INDEX(UPCASE(F_EMPNAME), 'FACEBOOK'))>0. execute. temporary. select if flag_T100=1. list /var F_EMPNAME. F_EMPNAME Apple Applecroft School Appletree Court Apple Apple Apple Apple Blossom Lodge The Apples Medical Centre NHS Any thoughts or advice on how to get around this would be really welcome! Mike |
In reply to this post by Bruce Weaver
Thank you for this - it looks like it is what I am trying to achieve (see above reply) - could you please say a bit more about how this works? I'm not quite sure how to apply the syntax to my data, but it definitely looks a solution!
|
Administrator
|
In reply to this post by Mike Paul
Slow down and take a breath.
Clearly a brute force single COMPUTE will not do your bidding here. As I mentioned previously you will need to build additional code/logic. COMPUTE #found=CHAR.INDEX(UPCASE(F_EMPNAME), 'APPLE'). IF CHAR.SUBSTR(UPCASE(F_EMPNAME), LENGTH('APPLE')+1) NE "" OTHERCRAPExists=TRUE. You could build a DO REPEAT which helps you stuff the right stuff into the right stuff. COMPUTE TheRightStuff=0. DO REPEAT needle='APPLE' 'FARCEBOOK' 'DIDDLEDOG' "YUMMIESFORPOOTIES". COMPUTE #found=CHAR.INDEX(UPCASE(F_EMPNAME), needle). IF CHAR.SUBSTR(UPCASE(F_EMPNAME), LENGTH(needle)+1) NE "" OTHERCRAPExists=1. IF #found AND NOT(OTHERCRAPExists) TheRightStuff=TheRightStuff +1. END REPEAT. COMPUTE MyFilter=TheRightStuff GE 1. YMMV, best. HTH
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
It seems to me that most of the complication goes away if the companies in the data are thoroughly
edited, not just for Upper/Lower case but for punctuation. (Murphy & Sons vs. Murphy and Sons ?) And the names should be edited, anyway, to get rid of mis-spellings or other ambiguities. Is it Apple, Inc?
With unique names at relative column 1, you can look for exact matches, rather "included-as-a-substring-but-also-exact".
(For that matter, asking for the substring that contains numerous blanks after the name seems like a handy adaption for the other code that has been suggested.)
And I would be tempted to put the list of check-files into another file. Then I could use Match Files with /Table=
/in= in order to label the interesting cases.
-- Rich Ulrich From: SPSSX(r) Discussion <[hidden email]> on behalf of David Marso <[hidden email]>
Sent: Wednesday, June 7, 2017 12:51:28 PM To: [hidden email] Subject: Re: Flag cases where a strong variable contains only the EXACT word Slow down and take a breath.
Clearly a brute force single COMPUTE will not do your bidding here. As I mentioned previously you will need to build additional code/logic. COMPUTE #found=CHAR.INDEX(UPCASE(F_EMPNAME), 'APPLE'). IF CHAR.SUBSTR(UPCASE(F_EMPNAME), LENGTH('APPLE')+1) NE "" OTHERCRAPExists=TRUE. You could build a DO REPEAT which helps you stuff the right stuff into the right stuff. COMPUTE TheRightStuff=0. DO REPEAT needle='APPLE' 'FARCEBOOK' 'DIDDLEDOG' "YUMMIESFORPOOTIES". COMPUTE #found=CHAR.INDEX(UPCASE(F_EMPNAME), needle). IF CHAR.SUBSTR(UPCASE(F_EMPNAME), LENGTH(needle)+1) NE "" OTHERCRAPExists=1. IF #found AND NOT(OTHERCRAPExists) TheRightStuff=TheRightStuff +1. END REPEAT. COMPUTE MyFilter=TheRightStuff GE 1. YMMV, best. HTH Mike Paul wrote > The data consists of the company names that individuals in my data-set are > employed at (there are 400k+ individuals in my data). The end goal is to > flag those individuals employed by certain companies (for example, all > those employed by Apple, Tesco, Walmart, etc. - different combinations of > employers that I pre-specify). The problem I'm having is that when I run > the syntax below, other employers that I don't want are flagged - for > example, those who have part of the name of the target employer within > their name... as illustrated below: > > COMPUTE Flag_T100= sum(INDEX(UPCASE(F_EMPNAME), > 'APPLE'),INDEX(UPCASE(F_EMPNAME), 'FACEBOOK'))>0. > execute. > > temporary. > select if flag_T100=1. > list /var F_EMPNAME. > > F_EMPNAME > > Apple > Applecroft School > Appletree Court > Apple > Apple > Apple > Apple Blossom Lodge > The Apples Medical Centre NHS > > Any thoughts or advice on how to get around this would be really welcome! > > Mike ----- Please reply to the list and not to my personal email. Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" -- View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Flag-cases-where-a-strong-variable-contains-only-the-EXACT-word-tp5734322p5734351.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Administrator
|
In reply to this post by Mike Paul
Here's a variation on the syntax I posted earlier. It uses your SUM() approach rather than LOOP-END LOOP. It also assumes you want to flag anyone who works for Apple, Tesco or Facebook.
* Generate a small data set to illustrate. DATA LIST LIST / F_EMPNAME(A50). BEGIN DATA "Apple" "Applecroft School" "Appletree Court" "Apple" " APPLE" " ApPlE" "Apple Blossom Lodge" "The Apples Medical Centre NHS" "Tesco" " TESCO" " TeScO" "Walmart" " WaLmArT" "Facebook" " FaceBook" END DATA. STRING #copy (A50). COMPUTE #copy = CONCAT("!",LTRIM(UPCASE(F_EMPNAME)),"!"). *****************************************************. * Flag anyone who works for Apple, Tesco or Facebook. *****************************************************. COMPUTE Flag_T100 = SUM( CHAR.INDEX(UPCASE(#copy), "!APPLE!"), CHAR.INDEX(UPCASE(#copy), "!TESCO!"), CHAR.INDEX(UPCASE(#copy), "!FACEBOOK!")) GT 0. *****************************************************. FORMATS Flag_T100(F1). LIST Flag_T100 F_EMPNAME. TEMPORARY. SELECT IF Flag_T100. LIST Flag_T100 F_EMPNAME. Output from the first LIST: Flag_T100 F_EMPNAME 1 Apple 0 Applecroft School 0 Appletree Court 1 Apple 1 APPLE 1 ApPlE 0 Apple Blossom Lodge 0 The Apples Medical Centre NHS 1 Tesco 1 TESCO 1 TeScO 0 Walmart 0 WaLmArT 1 Facebook 1 FaceBook Output from the second LIST: Flag_T100 F_EMPNAME 1 Apple 1 Apple 1 APPLE 1 ApPlE 1 Tesco 1 TESCO 1 TeScO 1 Facebook 1 FaceBook HTH.
--
Bruce Weaver bweaver@lakeheadu.ca http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." PLEASE NOTE THE FOLLOWING: 1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. 2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
Administrator
|
And of course, seconds after posting, I noticed that UPCASE() is not needed in the computation of the Flag variable. I.e., you can change this...
COMPUTE Flag_T100 = SUM( CHAR.INDEX(UPCASE(#copy), "!APPLE!"), CHAR.INDEX(UPCASE(#copy), "!TESCO!"), CHAR.INDEX(UPCASE(#copy), "!FACEBOOK!")) GT 0. to this: COMPUTE Flag_T100 = SUM( CHAR.INDEX(#copy, "!APPLE!"), CHAR.INDEX(#copy, "!TESCO!"), CHAR.INDEX(#copy, "!FACEBOOK!")) GT 0.
--
Bruce Weaver bweaver@lakeheadu.ca http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." PLEASE NOTE THE FOLLOWING: 1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. 2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
In reply to this post by Mike Paul
if your variable is typed in rather than populated from a drop-down list.
See if creating a few flag variables like this will help. data list list /F_EmpName (a30). begin data 'apple' 'APPLES' 'Aple' 'Apple' 'APPLE' 'Applecroft School' 'Appletree Court' 'Apple' 'Apple' 'Apple' 'Apple Blossom Lodge' 'Apple Blosom Lodge' 'The Apples Medical Centre NHS' end data. autorecode F_EmpName /into F_EmpName# /print. RECODE F_EmpName# (1,2,3,4=3)(5,6=5)(ELSE=COPY). string clean_EmpName (a30). compute clean_EmpName = valuelabel( F_EmpName#). Compute Flag_Apple = F_EmpName# eq 3. formats Flag_apple (f1). list.
Art Kendall
Social Research Consultants |
If there are more than a few possible values, a lookup table is definitely the way to go. Here is an example using the SPSSINC TRANS extension command. lookup is a table that will be keyed by variable akey. The main dataset is named main. The command returns a lookup value if present in the lookup table and sysmis otherwise. In this application, only found or not found matters, so I made all the return values 1, but if there is some reason to associate a specific values, that would replace the value variable. Testing for sysmis after running this gives the final answer. * The lookup table. data list free/ value(F8.0) akey(A1). begin data 1 'a' 1 'b' 1 'z' end data. dataset name lookup. * The main dataset. data list free/x(f8.0) y(A2). begin data 1 'a' 2 'b' 5 'a ' 10 '' 1 'b' end data. dataset name main. dataset activate main. spssinc trans result = resultcodealpha /initial "extendedTransforms.vlookup('akey', 'value', 'lookup')" /formula func(y). Here is the main dataset afterwards. x y resultcodealpha 1 a 1.00 2 b 1.00 5 a 1.00 10 . 1 b 1.00 The spssinc trans extension command is installed by default, and the extendedTransforms.py module is installed in recent versions. For older versions, it can be obtained from the SPSS Community website or I can email it by request. On Wed, Jun 7, 2017 at 12:58 PM, Art Kendall <[hidden email]> wrote: if your variable is typed in rather than populated from a drop-down list. |
We still do not know whether the variable content was consistent in spelling, casing, spacing etc. We also do not yet know whether there are many or a few values of interest.
In any case, one way to get a draft of the look up table would be to use the TEMPLATE option in AUTORECODE. Another way to get a draft of the look up table would be to use FREQUENCIES and OMS.
Art Kendall
Social Research Consultants |
Thank you all for the very helpful advice - I'm going to try each suggestion this afternoon.
In terms of consistency of company names, spelling/grammar isn't a problem, but sometimes as pointed out by Rich some of the same companies have a few different versions to their name - for example, Apple is sometimes just Apples, but other times listed as Apple Inc, or Apple Europe. With the syntax I've been using I have got over this by just including the many different versions of the name that I know of. Coding such a huge qualitative data-set is tough! |
• Value mappings from the template are applied first. All remaining values are recoded into values higher than the last value in the template, with user-missing values (based on the first variable in the list with defined user-missing values) recoded into values higher than the last valid value. I have not tried this but you should be able to recode the values in the template file, save that file with 3 variables, create a new template file by deleting the original AUTORECODE value, apply the new template. You can help draft your recode in the template by sorting on the recoded value and inspecting which orginal contents would end up in the new value. P. S. I do not have time to test it, but I think the template can have several input values that have the same final group value.
Art Kendall
Social Research Consultants |
In reply to this post by Bruce Weaver
Hi Bruce
Trying your second suggestion and the first bit works, but then I get an error message when trying to run the second bit of the syntax: STRING #copy (A77). COMPUTE #copy = CONCAT("!",LTRIM(UPCASE(F_EMPNAME)),"!"). execute. COMPUTE Flag_T100 = SUM( CHAR.INDEX(#copy, "!APPLE!"), CHAR.INDEX(#copy, "!TESCO!"), CHAR.INDEX(#copy, "!FACEBOOK!")) GT 0. ***Output message*** >Error # 4285 in column 13. Text: #copy >Incorrect variable name: either the name is more than 64 characters, or it is >not defined by a previous command. >Execution of this command stops. EXECUTE. Not sure what is happening here... |
Administrator
|
Hi Mike. Get rid of that EXECUTE after you compute #copy. Scratch variables (i.e., variables that start with #) are abolished by EXECUTE or by any procedure that causes a pass through the data set.
--
Bruce Weaver bweaver@lakeheadu.ca http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." PLEASE NOTE THE FOLLOWING: 1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. 2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
Hi Bruce, many thanks, I deleted the EXECUTE, but when running the list the cases do not get flagged... not sure if you have any thoughts on what could be going wrong?
Here's my syntax/output: STRING #copy (A50). COMPUTE #copy = CONCAT("!",LTRIM(UPCASE(F_EMPNAME)),"!"). COMPUTE Flag_T100 = SUM( CHAR.INDEX(#copy, "!APPLE!"), CHAR.INDEX(#copy, "!TESCO!"), CHAR.INDEX(#copy, "!FACEBOOK!")) GT 0. FORMATS Flag_T100(F1). LIST Flag_T100 F_EMPNAME. Flag_T100 F_EMPNAME 0 Aberdeen City Council 0 University of Leeds 0 Facebook 0 Unknown Etc. .................. ***not copied all the list!**** Number of cases read: 424,373 Number of cases listed: 424,373 TEMPORARY. SELECT IF Flag_T100. LIST Flag_T100 F_EMPNAME. List Number of cases read: 0 Number of cases listed: 0 |
Administrator
|
COMPUTE #copy = CONCAT("!",LTRIM(UPCASE(F_EMPNAME)),"!").
might be better as: COMPUTE #copy = CONCAT("!",LTRIM(UTRIM(UPCASE(F_EMPNAME))),"!"). It wouldn't matter if you were running in UNICODE mode.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Free forum by Nabble | Edit this page |