alternative to INDEX - wild cards in SPSS

progster
Dear all, is there a more sophisticated alternative to the INDEX function?

I would like to classify some verbatim responses (not in English) according to fuzzier criteria.

E.g., being able to match mo***ain with mountain.

Until now, my strategy has been to extract tokens such as "mount" or "untain" with the INDEX function:

COMPUTE newvar=INDEX(origin_var,"mount") GT 0.

Re: alternative to INDEX - wild cards in SPSS

Jon Peck
INDEX only works with exact matches. However, regular expressions can be used via the SPSSINC TRANS extension command with a small regular expression for each pattern. If you have a lot of these, you might want to create a function that matches against a whole set of patterns.

Regular expression pattern matching is very powerful, but getting the hang of how to write these takes some work.
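For the mo***ain case in the question, the wildcard can be translated into a regular expression mechanically. The sketch below is plain Python and can be tried outside Statistics. It assumes each * stands for exactly one letter, digit, or underscore, which is only one possible reading of the question, and wildcard_to_regex is a made-up helper name, not part of SPSS or the re module.

```python
import re

def wildcard_to_regex(pattern, wildcard="*"):
    """Hypothetical helper: turn a wildcard pattern into a compiled regex.

    Assumption: each wildcard character stands for exactly ONE word
    character (letter, digit, or underscore). Everything else is literal.
    """
    parts = []
    for ch in pattern:
        if ch == wildcard:
            parts.append(r"\w")           # one word character per wildcard
        else:
            parts.append(re.escape(ch))   # escape literals so they match as-is
    return re.compile("".join(parts), flags=re.IGNORECASE)

rx = wildcard_to_regex("mo***ain")
for word in ["mountain", "moabcdain", "MOUNTTAIN"]:
    # search() looks for the pattern anywhere in the word
    print(word, rx.search(word) is not None)
```

Under that assumption only "mountain" matches, because the other two words do not have exactly three characters between "mo" and "ain". If * should instead mean "any number of characters", replacing \w with \w* or \w+ in the helper would be the corresponding change.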

Here is a simple mountain example.  It returns 1 or 0 depending on whether the pattern moun...ain is found, ignoring case, where one or more letters, digits, or underscores, but no blanks or punctuation, can occur in the middle.

data list list /word(a20).
begin data
mountain
moabcdain
MOUNTTAIN
end data.

begin program.
import re
def mountain(arg):
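    # Look anywhere in the value for "moun", one or more word characters, then "ain".
    found = re.search(r"moun\w+?ain", arg, flags=re.I)
    print(found)  # debugging output: the match object, or None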
    return found is not None
end program.

spssinc trans result=hasmountain
/formula "mountain(word)".
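The idea of one function holding a whole set of patterns could look roughly like the sketch below. The category codes, the second pattern, and the names PATTERNS and classify are invented for illustration; only the mountain pattern comes from the example above.

```python
import re

# Hypothetical pattern table: (category code, compiled regex).
# Codes and the "river" pattern are illustrative assumptions.
PATTERNS = [
    (1, re.compile(r"moun\w+?ain", re.IGNORECASE)),
    (2, re.compile(r"riv\w*r", re.IGNORECASE)),
]

def classify(text):
    """Return the code of the first pattern found in text, or 0 if none match."""
    for code, rx in PATTERNS:
        if rx.search(text):
            return code
    return 0

for word in ["mountain", "MOUNTTAIN", "the river", "moabcdain"]:
    print(word, classify(word))
```

Inside Statistics, the same definitions would presumably go between begin program. and end program., and be called in the same way as the mountain function, for example with spssinc trans result=category /formula "classify(word)". where category is a made-up variable name. Patterns are tried in order, so more specific ones should be listed first.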


On Mon, Mar 14, 2016 at 12:55 PM, progster <[hidden email]> wrote:
[...]

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/alternative-to-INDEX-wild-cards-in-SPSS-tp5731735.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD



--
Jon K Peck
[hidden email]


Re: alternative to INDEX - wild cards in SPSS

Jon Peck
Another possibility that might be helpful would be to use the spell checker feature in the Data Editor (DE) to correct spelling errors before proceeding.  It would, of course, require manual acceptance of corrections, but that might still be the fastest way.  The dictionaries shipped with Statistics cover ten languages.  You can select the language to use in the spelling dialog that comes up from the spell check toolbar icon.

On Mon, Mar 14, 2016 at 1:16 PM, Jon Peck <[hidden email]> wrote:
[...]