Hello All,
=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
I poked around searching the listserv, but had trouble finding my exact issue. I have a long list of agencies/employers that I'd like to flag as School-related or not. There are several keywords that I would like to search for, e.g. Elementary, Middle, High, School, School District, Schools, etc. So far, I'm using the syntax below, but it flags MIDDLETON POLICE DEPARTMENT because it contains the word MIDDLE. I'd like to search only for the exact phrases to avoid this issue.

DATA LIST FIXED /TX_AFLTN 1-60(A).
BEGIN DATA
MIDDLETON HIGH SCHOOL
MIDDLETON POLICE DEPARTMENT
EAST HIGH SCHOOL
END DATA.
LIST.
COMPUTE School=INDEX(TX_AFLTN,'MMSD')>0 OR
  INDEX(TX_AFLTN,'ELEMENTARY')>0 OR
  INDEX(TX_AFLTN,'ES')>0 OR
  INDEX(TX_AFLTN,'MIDDLE')>0 OR
  INDEX(TX_AFLTN,'MS')>0 OR
  INDEX(TX_AFLTN,'HIGH')>0 OR
  INDEX(TX_AFLTN,'HS')>0 OR
  INDEX(TX_AFLTN,'SCHOOL')>0 OR
  INDEX(TX_AFLTN,'SCHOOL DISTRICT')>0 OR
  INDEX(TX_AFLTN,'ELEM.')>0 OR
  INDEX(TX_AFLTN,'SCHOOLS')>0.
EXECUTE.

Thanks in advance!
-Ariel
|
Administrator
|
"MIDDLE "?
--
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
In reply to this post by ariel barak
Perhaps others will have more elegant ideas, but my idea is this. The common element in the words you are searching for is that each is a character string followed by a space. That's how you know the difference between middle and middleton. How to exploit that? The simplest change is to search for the text string plus a space character rather than the bare text string, as you do now. You're going to have to look at your list carefully to check for problems.

A simple problem: run-together strings such as 'highschool'. For those, search for the whole string.

A more complicated problem: suppose TX_AFLTN, which is A60, contains names that are exactly 60 characters long. Then the search for 'middle ' won't work when that word ends the name, because no space follows it. Probably the simplest thing to do is first to compute the string length (which is not the variable width) using the CHAR.LENGTH function, find the max value and, if necessary, use ALTER TYPE to widen the variable, e.g., from A60 to A61.

Gene Maguin
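The edge case Gene describes can be tried outside SPSS with a small Python sketch. It only emulates a fixed-width string field by padding with spaces; the helper names and the 60-character test value are made up for illustration, and this is not how SPSS stores strings internally.

```python
WIDTH = 60  # assumed field width, matching the A60 variable in the thread


def as_fixed_width(value: str, width: int = WIDTH) -> str:
    """Emulate a fixed-width string field: pad with spaces, cut off any excess."""
    return value.ljust(width)[:width]


def has_word_plus_space(value: str, word: str, width: int = WIDTH) -> bool:
    """Search for the word followed by one space inside the padded field."""
    return (word + " ") in as_fixed_width(value, width)


# A hypothetical name that fills all 60 characters and ends in MIDDLE.
full_width_name = "X" * 53 + " MIDDLE"

print(has_word_plus_space("EAST MIDDLE", "MIDDLE"))                  # padding supplies the space
print(has_word_plus_space("MIDDLETON POLICE DEPARTMENT", "MIDDLE"))  # MIDDLET... is rejected
print(has_word_plus_space(full_width_name, "MIDDLE"))                # no room for a trailing space
print(has_word_plus_space(full_width_name, "MIDDLE", width=61))      # widened by one character
```

Widening the field by one character guarantees that every value is followed by at least one space, which is the reason for going from A60 to A61.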
|
Administrator
|
In reply to this post by ariel barak
FWIW: I would likely drop that into a DO REPEAT.
COMPUTE School=0.
DO REPEAT needle='MMSD ' 'ELEMENTARY ' 'ES ' 'MIDDLE ' 'MS ' 'HIGH ' 'HS ' 'SCHOOL ' 'SCHOOL DISTRICT ' 'ELEM. ' 'SCHOOLS '.
COMPUTE School=School OR INDEX(TX_AFLTN,needle)>0.
END REPEAT.
|
In reply to this post by ariel barak
The best way to do pattern matching is
with a regular expression. Here's an example. This allows for
terms that occur at the end of the input, commas, etc. You can add
other terms to the terms in re.search below, where | means or. The
\b directive means word boundary.
DATA LIST FIXED /TX_AFLTN 1-60(A).
BEGIN DATA
MIDDLETON HIGH SCHOOL
MIDDLETON POLICE DEPARTMENT
EAST HIGH SCHOOL
YAHOO ELEM.
END DATA.
DATASET NAME DATA.

begin program.
import re, spss

def school(s):
    # True if any listed term occurs in s as a whole word.
    if re.search(r"\b(SCHOOL|MIDDLE|HIGH|HS|ELEM|ELEMENTARY)\b", s):
        return True
    else:
        return False
end program.

SPSSINC TRANS RESULT=school
  /FORMULA "school(TX_AFLTN)".

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621

From: Ariel Barak <[hidden email]>
To: [hidden email]
Date: 03/09/2015 09:47 AM
Subject: [SPSSX-L] Best way to flag multiple Needles in Haystack
Sent by: "SPSSX(r) Discussion" <[hidden email]>
|
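For anyone who wants to try the word-boundary pattern from Jon Peck's post before running it through SPSSINC TRANS, here is a standalone Python sketch using only the standard `re` module. The pattern and the four sample names come from the post; the function name `is_school` and the extra test name are made up for illustration.

```python
import re

# Same alternation as in the post; \b anchors each term at a word boundary.
SCHOOL_PATTERN = re.compile(r"\b(SCHOOL|MIDDLE|HIGH|HS|ELEM|ELEMENTARY)\b")


def is_school(name: str) -> bool:
    """Return True if any listed term occurs in the name as a whole word."""
    return SCHOOL_PATTERN.search(name) is not None


names = [
    "MIDDLETON HIGH SCHOOL",
    "MIDDLETON POLICE DEPARTMENT",
    "EAST HIGH SCHOOL",
    "YAHOO ELEM.",
]

for name in names:
    print(f"{name:30} {is_school(name)}")
```

Because whole words are matched, a plural such as SCHOOLS does not match the SCHOOL term; it would have to be added to the alternation (or the term written as SCHOOLS?) if those names should be flagged too.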
In reply to this post by ariel barak
As well as inclusion terms, you could also build (multiple) exclusion terms, to counteract matches to "MIDDLE" or any other possible clash:

* Flag a case only if it matches an inclusion term and none of the exclusion terms.
COMPUTE School=(INDEX(TX_AFLTN,'MMSD')>0 OR
  INDEX(TX_AFLTN,'ELEMENTARY')>0 OR
  INDEX(TX_AFLTN,'ES')>0 OR
  INDEX(TX_AFLTN,'MIDDLE')>0 OR
  INDEX(TX_AFLTN,'MS')>0 OR
  INDEX(TX_AFLTN,'HIGH')>0 OR
  INDEX(TX_AFLTN,'HS')>0 OR
  INDEX(TX_AFLTN,'SCHOOL')>0 OR
  INDEX(TX_AFLTN,'SCHOOL DISTRICT')>0 OR
  INDEX(TX_AFLTN,'ELEM.')>0 OR
  INDEX(TX_AFLTN,'SCHOOLS')>0)
  AND (INDEX(TX_AFLTN,'MIDDLETON ')=0 AND
  INDEX(TX_AFLTN,'MIDDLEFOO ')=0 AND
  INDEX(TX_AFLTN,'MIDDLEBAR ')=0).
|
The trailing spaces were not intentional in my last post; it should have read:

* Flag a case only if it matches an inclusion term and none of the exclusion terms.
COMPUTE School=(INDEX(TX_AFLTN,'MMSD')>0 OR
  INDEX(TX_AFLTN,'ELEMENTARY')>0 OR
  INDEX(TX_AFLTN,'ES')>0 OR
  INDEX(TX_AFLTN,'MIDDLE')>0 OR
  INDEX(TX_AFLTN,'MS')>0 OR
  INDEX(TX_AFLTN,'HIGH')>0 OR
  INDEX(TX_AFLTN,'HS')>0 OR
  INDEX(TX_AFLTN,'SCHOOL')>0 OR
  INDEX(TX_AFLTN,'SCHOOL DISTRICT')>0 OR
  INDEX(TX_AFLTN,'ELEM.')>0 OR
  INDEX(TX_AFLTN,'SCHOOLS')>0)
  AND (INDEX(TX_AFLTN,'MIDDLETON')=0 AND
  INDEX(TX_AFLTN,'MIDDLEFOO')=0 AND
  INDEX(TX_AFLTN,'MIDDLEBAR')=0).
|
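The inclusion/exclusion idea above can be checked quickly with a Python sketch. The include list is shortened for readability, MIDDLEFOO and MIDDLEBAR are the placeholder terms from the post, and the function name is made up. It assumes plain substring matching, like INDEX.

```python
# Shortened inclusion list (the post has more terms) and the post's exclusions.
INCLUDE_TERMS = ("MIDDLE", "HIGH", "SCHOOL")
EXCLUDE_TERMS = ("MIDDLETON", "MIDDLEFOO", "MIDDLEBAR")


def flag_school(name: str) -> bool:
    """Flag when some inclusion term matches and no exclusion term does."""
    included = any(term in name for term in INCLUDE_TERMS)
    excluded = any(term in name for term in EXCLUDE_TERMS)
    return included and not excluded


for name in ("MIDDLETON POLICE DEPARTMENT", "EAST HIGH SCHOOL", "MIDDLETON HIGH SCHOOL"):
    print(f"{name:30} {flag_school(name)}")
```

One thing the sketch makes visible: a blanket exclusion on MIDDLETON also drops MIDDLETON HIGH SCHOOL, so exclusion terms probably need the same careful review as the inclusion terms.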
Administrator
|
In reply to this post by David Marso
Also, if you have a HUGE file and a large list, the following might be a bit quicker?
Initialize a vector of scratch variables at the beginning and terminate the loop as soon as a match is found, rather than exhaustively searching the list?

VECTOR #needle(11,A20).
DO IF $CASENUM=1.
DO REPEAT x='MMSD ' 'ELEMENTARY ' 'ES ' 'MIDDLE ' 'MS ' 'HIGH ' 'HS ' 'SCHOOL ' 'SCHOOL DISTRICT ' 'ELEM. ' 'SCHOOLS '
  /index=1 TO 11.
COMPUTE #needle(index)=x.
END REPEAT.
END IF.
COMPUTE School=0.
LOOP #=1 TO 11.
COMPUTE School=School OR INDEX(TX_AFLTN,#needle(#))>0.
END LOOP IF School.
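The early-exit idea (stop at the first hit instead of testing every needle) looks like this as a Python sketch. The needle list is the one from the post; the function name, the padding to 60 characters, and the returned comparison count are assumptions added only to show how many needles get tested.

```python
WIDTH = 60  # assumed field width (A60 in the thread)

NEEDLES = (
    "MMSD ", "ELEMENTARY ", "ES ", "MIDDLE ", "MS ", "HIGH ", "HS ",
    "SCHOOL ", "SCHOOL DISTRICT ", "ELEM. ", "SCHOOLS ",
)


def flag_with_early_exit(name, needles=NEEDLES, width=WIDTH):
    """Return (flag, needles_tested), stopping at the first needle found."""
    haystack = name.ljust(width)  # emulate the space-padded string field
    for tested, needle in enumerate(needles, start=1):
        if needle in haystack:
            return True, tested
    return False, len(needles)


print(flag_with_early_exit("MIDDLETON HIGH SCHOOL"))
print(flag_with_early_exit("MIDDLETON POLICE DEPARTMENT"))
```

The saving depends on the order of the list: putting the most frequent terms first means most school names exit after one or two tests, while non-matching names always pay for the full list.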
|
In reply to this post by Jignesh Sutar
By the way, you may have more trouble getting correct matches for "ES" than for "MIDDLE". "ES" could match a whole host of sub-strings of natural words that are not intended to represent the abbreviation of "ELEMENTARY SCHOOL".
Matching on a sub-string as short as "ES" is extremely difficult; you would end up incorrectly flagging many more cases. Either you'll need to monitor the results very carefully and build some mechanism to assign cases accordingly, or revise how the data is being collected (if possible, instruct respondents to avoid abbreviations, for example).
|
which is another reason for using the regular
expression syntax I posted. It understands word boundaries.
Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621

From: Jignesh Sutar <[hidden email]>
To: [hidden email]
Date: 03/09/2015 10:40 AM
Subject: Re: [SPSSX-L] Best way to flag multiple Needles in Haystack
Sent by: "SPSSX(r) Discussion" <[hidden email]>

View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Best-way-to-flag-multiple-Needles-in-Haystack-tp5728938p5728946.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.
|
Hi All,

Thanks for your responses. I can't believe I didn't think of adding the space to the strings I wanted to flag.

I saw that 'ES ' flagged tons of records that I didn't want flagged (FORWARD HEALTH SERVICES, for example), so I added a leading space to the ES so it read ' ES '. In my data, the description of a proper Elementary School never reads 'ES Jefferson', it's always 'Jefferson ES', so that worked well.

Strangely, 'JEFFERSON ES' was still flagged as a school even though the ES in 'JEFFERSON ES' wasn't followed by a space...in my case, that was actually beneficial.

Thanks again,
Ariel

On Mon, Mar 9, 2015 at 11:43 AM, Jon K Peck <[hidden email]> wrote:
> which is another reason for using the regular expression syntax I posted. It understands word boundaries.
|
With respect to

> Strangely, 'JEFFERSON ES' was still flagged as a school even though the ES in 'JEFFERSON ES' wasn't followed by a space...in my case, that was actually beneficial.

I believe your variable was an A60, and since this value (JEFF…) has 12 characters, I would assume that characters 13-60 hold the space character. Thus 'ES ' is found. However, if you look at the value for that record in the data editor, there will appear to be no characters to the right of the 'ES'. I assume this means that the data editor right-trims the stored A60 value for editing purposes.

Gene Maguin
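Gene's explanation of why ' ES ' still finds 'JEFFERSON ES' can be reproduced with a short Python sketch. As before, this only emulates an A60 field by padding with spaces; the helper name is made up, and the sample names are the ones mentioned in the thread.

```python
WIDTH = 60  # assumed field width (A60 in the thread)


def padded(value: str, width: int = WIDTH) -> str:
    """Emulate how a short value sits in a fixed-width string field."""
    return value.ljust(width)


# 'ES ' alone also hits the end of SERVICES once the field is padded.
print("ES " in padded("FORWARD HEALTH SERVICES"))

# With a leading space as well, only a standalone ES token matches.
print(" ES " in padded("FORWARD HEALTH SERVICES"))
print(" ES " in padded("JEFFERSON ES"))   # padding supplies the trailing space
print(" ES " in "JEFFERSON ES")           # the trimmed 12-character value has none
print(" ES " in padded("ES JEFFERSON"))   # no leading space at position 1
```

The last line shows the limitation Ariel mentions: a name that starts with ES is not caught by ' ES ', which was acceptable in that data because the abbreviation always comes last.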