Dear all, is there a more sophisticated alternative to the INDEX function?
I would like to classify some verbatim responses (not in English) according to fuzzier criteria, e.g. being able to match mo***ain with mountain. Until now my strategy has been to extract tokens such as "mount" or "untain" with the INDEX function:

COMPUTE newvar=INDEX(origin_var,"mount") GT 0.
INDEX only works with exact matches. However, regular expressions can be used via the SPSSINC TRANS extension command and a small regular expression for the patterns. If you have a lot of these, you might want to create a function that has a whole set of patterns to match against. Regular expression pattern matching is very powerful, but getting the hang of writing the patterns takes some work.

Here is a simple mountain example. It returns 1 or 0 depending on whether the pattern moun...ain is found, ignoring case, where one or more letters or digits (or underscores), but no blanks or punctuation, can occur in the middle.

data list list /word(a20).
begin data
mountain
moabcdain
MOUNTTAIN
end data.

begin program.
import re
def mountain(arg):
    # Look for "moun", then one or more word characters, then "ain", ignoring case.
    found = re.search(r"moun\w+?ain", arg, flags=re.I)
    print(found)  # print the match object (or None) for inspection
    return found is not None
end program.

spssinc trans result=hasmountain
  /formula "mountain(word)".

On Mon, Mar 14, 2016 at 12:55 PM, progster <[hidden email]> wrote:
> Dear all is there a more sophisticated alternative to INDEX function?
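As a rough sketch of the "whole set of patterns" idea mentioned in the reply, a function along these lines could return a category code rather than just 1 or 0. The name classify, the PATTERNS table, the category codes, and the second (river) pattern are all made up for illustration; only the re module calls are standard Python. The first pattern is deliberately looser than the one in the reply, so that it also accepts the mo***ain style of variant from the original question.

```python
import re

# Hypothetical category codes and patterns, for illustration only.
# Each entry is (code, compiled pattern); the first pattern that matches wins.
PATTERNS = [
    (1, re.compile(r"mo\w+?ain", re.I)),  # mountain and mo...ain variants
    (2, re.compile(r"riv\w*r", re.I)),    # river and similar spellings
]

def classify(text):
    """Return the code of the first pattern found in text, or 0 if none match."""
    for code, pattern in PATTERNS:
        if pattern.search(text):
            return code
    return 0
```

If this were defined inside a begin program block in the same way as the mountain function, it could presumably be applied with the same kind of spssinc trans call, using "classify(word)" as the formula. Because the first match wins, the order of the entries in PATTERNS matters when patterns overlap.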
Another possibility that might be helpful would be to use the spell checker feature in the Data Editor (DE) to correct spelling errors before proceeding. It would, of course, require manual acceptance of corrections, but that might still be the fastest way. The dictionaries shipped with Statistics cover ten languages. You can select the language to use in the spelling dialog that comes up from the spell check toolbar icon.

On Mon, Mar 14, 2016 at 1:16 PM, Jon Peck <[hidden email]> wrote:
