Best way to flag multiple Needles in Haystack


Best way to flag multiple Needles in Haystack

ariel barak
Hello All,

I poked around searching the listserv, but had trouble finding my exact issue.

I have a long list of agencies/employers that I'd like to flag as school-related or not. There are several keywords that I would like to search for, e.g., Elementary, Middle, High, School, School District, Schools, etc. So far I'm using the syntax below, but it flags MIDDLETON POLICE DEPARTMENT because it contains the string MIDDLE. I'd like to match only the exact words or phrases to avoid this issue.

DATA LIST FIXED /TX_AFLTN 1-60(A).
BEGIN DATA 
MIDDLETON HIGH SCHOOL
MIDDLETON POLICE DEPARTMENT
EAST HIGH SCHOOL
END DATA.
LIST.

COMPUTE School=INDEX(TX_AFLTN,'MMSD')>0 OR 
INDEX(TX_AFLTN,'ELEMENTARY')>0 OR
INDEX(TX_AFLTN,'ES')>0 OR
INDEX(TX_AFLTN,'MIDDLE')>0 OR
INDEX(TX_AFLTN,'MS')>0 OR
INDEX(TX_AFLTN,'HIGH')>0 OR
INDEX(TX_AFLTN,'HS')>0 OR
INDEX(TX_AFLTN,'SCHOOL')>0 OR
INDEX(TX_AFLTN,'SCHOOL DISTRICT')>0 OR
INDEX(TX_AFLTN,'ELEM.')>0 OR
INDEX(TX_AFLTN,'SCHOOLS')>0.
EXECUTE.

Thanks in advance!

-Ariel

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Re: Best way to flag multiple Needles in Haystack

David Marso
Administrator
"MIDDLE "?
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: Best way to flag multiple Needles in Haystack

Maguin, Eugene
In reply to this post by ariel barak

Others may have more elegant ideas, but here is mine.

The words you are searching for have one thing in common: in your data, each is followed by a space. That is how you tell MIDDLE from MIDDLETON. The simplest change, then, is to search for the text string plus a trailing space rather than the bare text string, as you do now. You will still have to look through your list carefully to check for problems.

A simple problem: run-together strings such as 'HIGHSCHOOL'. Search for the whole string.

A more complicated problem: suppose TX_AFLTN, which is A60, contains a name that is exactly 60 characters long and ends in one of your keywords. Then the search for 'MIDDLE ' won't work, because there is no room left for the trailing space. Probably the simplest thing to do is first to compute the string length (the length of the value, not the variable width) with the CHAR.LENGTH function and find the maximum. If necessary, use ALTER TYPE to widen the variable, e.g., from A60 to A61.
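To make the width problem concrete, here is a minimal sketch in plain Python, run outside SPSS. It assumes the stored value behaves like a fixed-width string that is right-padded with blanks; `pad`, `has_word` and `WIDTH` are names invented for this illustration.

```python
WIDTH = 60  # assumed width of TX_AFLTN (A60)

def pad(value, width=WIDTH):
    """Mimic a fixed-width string: cut at width, then right-pad with blanks."""
    return value[:width].ljust(width)

def has_word(value, needle, width=WIDTH):
    """Search the padded value for the needle followed by one space."""
    return (needle + " ") in pad(value, width)

short = "EAST MIDDLE"                   # keyword at the end, padding follows it
full = "X" * (WIDTH - 7) + " MIDDLE"    # exactly 60 characters, ends in MIDDLE

print(has_word("MIDDLETON POLICE DEPARTMENT", "MIDDLE"))  # MIDDLETON is not MIDDLE + space
print(has_word(short, "MIDDLE"))             # found: blanks follow the keyword
print(has_word(full, "MIDDLE"))              # missed: no room for the trailing space
print(has_word(full, "MIDDLE", width=61))    # found again once the width is 61
```

Widening the variable by one character is enough, because the only case that fails is a keyword sitting in the very last positions.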

 

Gene Maguin

 

 

 

 


Re: Best way to flag multiple Needles in Haystack

David Marso
Administrator
In reply to this post by ariel barak
FWIW:  I would likely drop that into a DO REPEAT.
COMPUTE School=0.
DO REPEAT
  needle='MMSD ' 'ELEMENTARY ' 'ES ' 'MIDDLE ' 'MS ' 'HIGH ' 'HS ' 'SCHOOL ' 'SCHOOL DISTRICT ' 'ELEM. ' 'SCHOOLS '.
COMPUTE School=School OR INDEX(TX_AFLTN,needle)>0.
END REPEAT.

Re: Best way to flag multiple Needles in Haystack

Jon K Peck
In reply to this post by ariel barak
The best way to do pattern matching is with a regular expression.  Here's an example.  This allows for terms that occur at the end of the input, commas, etc.  You can add other terms to the terms in re.search below, where | means or.  The \b directive means word boundary.

DATA LIST FIXED /TX_AFLTN 1-60(A).
BEGIN DATA
MIDDLETON HIGH SCHOOL
MIDDLETON POLICE DEPARTMENT
EAST HIGH SCHOOL
YAHOO ELEM.
END DATA.
DATASET NAME DATA.


begin program.
import re, spss

def school(s):
   if re.search(r"\b(SCHOOL|MIDDLE|HIGH|HS|ELEM|ELEMENTARY)\b", s):
       return True
   else:
       return False
end program.

SPSSINC TRANS RESULT=school
/FORMULA "school(TX_AFLTN)".
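For anyone who wants to try the pattern before running it through SPSSINC TRANS, the following is a small standalone sketch in plain Python (no `spss` module needed). It reuses the same regular expression on the four sample names; the `names` list and `flags` variable exist only for this illustration.

```python
import re

# Same alternation as in the post; \b marks a word boundary on each side.
PATTERN = re.compile(r"\b(SCHOOL|MIDDLE|HIGH|HS|ELEM|ELEMENTARY)\b")

def school(s):
    """Return True if any keyword occurs as a whole word in s."""
    return bool(PATTERN.search(s))

names = [
    "MIDDLETON HIGH SCHOOL",
    "MIDDLETON POLICE DEPARTMENT",
    "EAST HIGH SCHOOL",
    "YAHOO ELEM.",
]
flags = [school(n) for n in names]
for name, flag in zip(names, flags):
    print(name, flag)
```

MIDDLETON does not match MIDDLE because there is no word boundary between the E and the T, while ELEM. matches because the period counts as a boundary.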


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621





Re: Best way to flag multiple Needles in Haystack

Jignesh Sutar
In reply to this post by ariel barak
As well as inclusion terms, you could also build (multiple) exclusion terms to counteract false matches on "MIDDLE" or any other possible clash:


COMPUTE School=(INDEX(TX_AFLTN,'MMSD')>0 OR
INDEX(TX_AFLTN,'ELEMENTARY')>0 OR
INDEX(TX_AFLTN,'ES')>0 OR
INDEX(TX_AFLTN,'MIDDLE')>0 OR
INDEX(TX_AFLTN,'MS')>0 OR
INDEX(TX_AFLTN,'HIGH')>0 OR
INDEX(TX_AFLTN,'HS')>0 OR
INDEX(TX_AFLTN,'SCHOOL')>0 OR
INDEX(TX_AFLTN,'SCHOOL DISTRICT')>0 OR
INDEX(TX_AFLTN,'ELEM.')>0 OR
INDEX(TX_AFLTN,'SCHOOLS')>0)
AND
(INDEX(TX_AFLTN,'MIDDLETON ')=0 AND
 INDEX(TX_AFLTN,'MIDDLEFOO ')=0 AND
 INDEX(TX_AFLTN,'MIDDLEBAR ')=0).

Re: Best way to flag multiple Needles in Haystack

Jignesh Sutar
The trailing spaces in my last post were not intentional; it should have read:


COMPUTE School=(INDEX(TX_AFLTN,'MMSD')>0 OR
INDEX(TX_AFLTN,'ELEMENTARY')>0 OR
INDEX(TX_AFLTN,'ES')>0 OR
INDEX(TX_AFLTN,'MIDDLE')>0 OR
INDEX(TX_AFLTN,'MS')>0 OR
INDEX(TX_AFLTN,'HIGH')>0 OR
INDEX(TX_AFLTN,'HS')>0 OR
INDEX(TX_AFLTN,'SCHOOL')>0 OR
INDEX(TX_AFLTN,'SCHOOL DISTRICT')>0 OR
INDEX(TX_AFLTN,'ELEM.')>0 OR
INDEX(TX_AFLTN,'SCHOOLS')>0)
AND
(INDEX(TX_AFLTN,'MIDDLETON')=0 AND
 INDEX(TX_AFLTN,'MIDDLEFOO')=0 AND
 INDEX(TX_AFLTN,'MIDDLEBAR')=0).
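One thing to watch with exclusion terms is that a blanket exclusion also drops names that contain both an excluded word and a genuine keyword, such as MIDDLETON HIGH SCHOOL. The sketch below, in plain Python rather than SPSS syntax, contrasts that with one possible alternative: blanking out the excluded words first and then testing the inclusion terms. The function names and the shortened term lists are assumptions made for the illustration; MIDDLEFOO and MIDDLEBAR are the placeholder terms from the post.

```python
INCLUDE = ["MIDDLE", "HIGH", "SCHOOL", "ELEMENTARY"]
EXCLUDE = ["MIDDLETON", "MIDDLEFOO", "MIDDLEBAR"]

def blanket_exclusion(name):
    """Flag only if an inclusion term is present and no exclusion term is."""
    included = any(term in name for term in INCLUDE)
    excluded = any(term in name for term in EXCLUDE)
    return included and not excluded

def mask_then_match(name):
    """Blank out exclusion terms first, then look for inclusion terms."""
    cleaned = name
    for term in EXCLUDE:
        cleaned = cleaned.replace(term, " ")
    return any(term in cleaned for term in INCLUDE)

for name in ["MIDDLETON HIGH SCHOOL", "MIDDLETON POLICE DEPARTMENT", "LINCOLN MIDDLE SCHOOL"]:
    print(name, blanket_exclusion(name), mask_then_match(name))
```

Both versions reject MIDDLETON POLICE DEPARTMENT, but only the masking version keeps MIDDLETON HIGH SCHOOL flagged.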

Re: Best way to flag multiple Needles in Haystack

David Marso
Administrator
In reply to this post by David Marso
Also, if you have a HUGE file and a large list, the following might be a bit quicker.
Initialize a vector of scratch variables once, at the first case, and terminate the loop as soon as a match is found rather than searching the list exhaustively.

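* Eleven needles, so the vector and both loops run to 11.
VECTOR #needle(11,A20).
* Fill the scratch vector once; scratch variables keep their values across cases.
DO IF $CASENUM=1.
DO REPEAT x='MMSD ' 'ELEMENTARY ' 'ES ' 'MIDDLE ' 'MS ' 'HIGH ' 'HS ' 'SCHOOL ' 'SCHOOL DISTRICT ' 'ELEM. ' 'SCHOOLS ' /index=1 TO 11.
COMPUTE #needle(index)=x.
END REPEAT.
END IF.

COMPUTE School=0.
* Stop looping as soon as one needle is found.
LOOP #=1 TO 11.
COMPUTE School=School OR INDEX(TX_AFLTN,#needle(#) )>0.
END LOOP IF School .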
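The early-exit idea can be sketched in plain Python as well. This is only an illustration of why stopping at the first hit saves work, not a claim about SPSS timings; `first_hit` and the comparison counter are invented for the example.

```python
def first_hit(haystack, needles):
    """Return (found, comparisons), stopping at the first needle that matches."""
    comparisons = 0
    for needle in needles:
        comparisons += 1
        if needle in haystack:
            return True, comparisons   # early exit: skip the remaining needles
    return False, comparisons

needles = ["MMSD ", "HIGH ", "SCHOOL "]
print(first_hit("EAST HIGH SCHOOL ", needles))    # stops after the second needle
print(first_hit("POLICE DEPARTMENT ", needles))   # has to try all three
```

The saving depends on how often a match occurs and how early in the list it sits, so putting the most common needles first helps most.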






Re: Best way to flag multiple Needles in Haystack

Jignesh Sutar
In reply to this post by Jignesh Sutar
By the way, you may have more trouble getting correct matches for "ES" than for "MIDDLE". "ES" could match a whole host of substrings of ordinary words that are not intended as the abbreviation for "ELEMENTARY SCHOOL".

Matching on such a short substring is extremely difficult; you would end up misclassifying many more cases. Either monitor the results very carefully and build some mechanism to assign cases correctly, or revise how the data are collected (for example, if possible, instruct respondents to avoid abbreviations).
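A quick way to see how noisy a short needle is: compare a plain substring test with a whole-word test on a few names. The sketch below is plain Python with made-up sample names (only JEFFERSON ES and FORWARD HEALTH SERVICES come from this thread).

```python
names = ["JEFFERSON ES", "FORWARD HEALTH SERVICES", "LAKES DISTRICT LIBRARY"]

# Substring test: "ES" anywhere in the name, including inside longer words.
substring_hits = [n for n in names if "ES" in n]

# Whole-word test: "ES" must be one of the blank-separated words.
token_hits = [n for n in names if "ES" in n.split()]

print(substring_hits)
print(token_hits)
```

The substring test flags all three names, while the whole-word test keeps only JEFFERSON ES.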

Re: Best way to flag multiple Needles in Haystack

Jon K Peck
That is another reason to use the regular expression syntax I posted: it understands word boundaries.


--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/Best-way-to-flag-multiple-Needles-in-Haystack-tp5728938p5728946.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.


Re: Best way to flag multiple Needles in Haystack

ariel barak
Hi All,

Thanks for your responses. I can't believe I didn't think of adding the space to strings I wanted to flag.

I saw that 'ES ' flagged tons of records that I didn't want flagged (FORWARD HEALTH SERVICES, for example), so I added a leading space so that it read ' ES '. In my data, a proper elementary school is never written 'ES Jefferson'; it's always 'Jefferson ES', so that worked well.

Strangely, 'JEFFERSON ES' was still flagged as a school even though the ES in 'JEFFERSON ES' wasn't followed by a space...in my case, that was actually beneficial.

Thanks again,
Ariel






Re: Best way to flag multiple Needles in Haystack

Maguin, Eugene

With respect to

> Strangely, 'JEFFERSON ES' was still flagged as a school even though the ES in 'JEFFERSON ES' wasn't followed by a space...in my case, that was actually beneficial.

 

I believe your variable is an A60. Since this value (JEFF…) has 12 characters, I would assume that characters 13-60 are filled with space characters, so 'ES ' is found. However, if you look at the value for that record in the Data Editor, there will appear to be no characters to the right of the 'ES'. I assume this means that the Data Editor right-trims the stored A60 value for display and editing.
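The same reasoning can be checked with a few lines of plain Python, under the assumption stated above that the stored value is padded with blanks out to the variable width. `stored` and `shown` are illustrative names, and `rstrip` only stands in for whatever trimming the Data Editor actually does.

```python
value = "JEFFERSON ES"        # 12 characters as typed
stored = value.ljust(60)      # assumed A60 storage: blanks fill characters 13-60
shown = stored.rstrip()       # what a right-trimmed display would show

print(len(value), len(stored))
print(" ES " in value)        # no trailing blank in the bare 12-character text
print(" ES " in stored)       # the padding supplies the trailing blank
print(shown == value)         # trimmed display hides the padding
```

So the match at the end of the name is expected whenever the value is shorter than the variable width.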

Gene Maguin

 

 
