SPSSX Discussion

Can python be used to apply a spell check to comments in syntax.


Art Kendall

Can python be used to apply a spell check to comments in syntax.

We had some postings recently on comments in syntax.

Is there perhaps already a Python example that checks the spelling in comments?

If not, is there a spell checker in Python? One could write SPSS syntax that reads another syntax file and passes the string to Python.

-- 
Art Kendall
Social Research Consultants


Jon K Peck

Re: Can python be used to apply a spell check to comments in syntax.

There is no spell checker in the Python standard library, but there are a number of third-party modules that could be used to build such a tool. I have not used any of them myself.
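As a rough illustration of what can be done with the standard library alone, the sketch below uses difflib.get_close_matches to suggest near matches from a word list. It is a typo suggester, not a real spell checker. The helper names build_vocabulary and suggest, the tiny reference text, and the cutoff of 0.7 are all assumptions chosen for the example; a practical word list would be much larger.

```python
import difflib
import re


def build_vocabulary(text):
    """Return the set of lowercase words found in a reference text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def suggest(word, vocabulary, n=3, cutoff=0.7):
    """Return up to n close matches for a word that is not in the vocabulary."""
    word = word.lower()
    if word in vocabulary:
        return []  # known word, nothing to suggest
    return difflib.get_close_matches(word, sorted(vocabulary), n=n, cutoff=cutoff)


# Hypothetical reference text; in practice this would be a large word list.
reference = "names should reveal intent and it is said that good names save time"
vocabulary = build_vocabulary(reference)

for word in "Id iz saj thad names shoult reveal intent".split():
    hits = suggest(word, vocabulary)
    if hits:
        print(word, "--> Did you mean:", ", ".join(hits))
```

Words with no sufficiently close match are simply skipped here, so a small vocabulary will miss many real errors.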

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621

From: Art Kendall <[hidden email]>
To: [hidden email],
Date: 01/18/2014 07:34 AM
Subject: [SPSSX-L] Can python be used to apply a spell check to comments in syntax.
Sent by: "SPSSX(r) Discussion" <[hidden email]>


Albert-Jan Roskam

Re: Can python be used to apply a spell check to comments in syntax.

>Subject: Re: [SPSSX-L] Can python be used to apply a spell check to comments in syntax.

>We had some postings recently on comments in syntax.
>
>Is there perhaps already a Python example that checks the spelling in comments?
>
>If not, is there a Spell Check in Python. One could write SPSS syntax that
>reads another syntax file and passes the string to Python.

I checked out 'whoosh' (for the fun of it!) and used one book to build an index. Then I gave it some text with deliberate errors; the output is below. I think the results will improve if you give it more data to index. Not sure why 'reads' comes up. Note, however, what the Whoosh documentation says: "Currently the suggestion engine is more like a “typo corrector” than a real “spell checker” since it doesn’t do the kind of sophisticated phonetic matching or semantic/contextual analysis a good spell checker might. However, it is still very useful." [http://pythonhosted.org/Whoosh/spelling.html]

The code is here: http://pastebin.com/k4ACrNgy, but I have also pasted it below, after the output.

Id --> Did you mean: did
iz --> Did you mean: viz
saj --> Did you mean: say
thad --> Did you mean: had
shoult --> Did you mean: should
intent --> Did you mean: infant
impress --> Did you mean: express
saves --> Did you mean: save
reads --> Did you mean: read
happier --> Did you mean: happen
variable --> Did you mean: valuable
function --> Did you mean: fiction
big --> Did you mean: beg
com --> Did you mean: come
ment --> Did you mean: men
intent --> Did you mean: infant

import os.path
import urllib
import codecs
import cStringIO as StringIO
import re
import string
from whoosh import index, qparser
from whoosh.fields import Schema, ID, TEXT
from whoosh.index import open_dir

def write_index(src_paths, dst_dir):
    """Write search index"""
    schema = Schema(path=ID(unique=True, stored=True),
                    content=TEXT(spelling=True))
    ix = index.create_in(dst_dir, schema=schema)
    writer = ix.writer()
    for src_path in src_paths:
        add_doc(writer, src_path)
    writer.commit()

def strip_punctuation(s):
    """strip all the punctuation from a string"""
    return re.sub("[%s]*" % re.escape(string.punctuation), "", s)

def add_doc(writer, path):
    """Add utf-8 encoded document to index"""
    fileObj = codecs.open(path, encoding="utf-8")
    content = strip_punctuation(fileObj.read())
    fileObj.close()
    for word in content.split():
        writer.add_document(path=path, content=word)

def parse_string(qstring, index):
    """Parse the user query string"""
    parser = qparser.QueryParser("content", index.schema)
    q = parser.parse(qstring)
    with index.searcher() as s:  # Try correcting the query
        corrected = s.correct_query(q, qstring)
        if corrected.query != q:
            print qstring, "--> Did you mean:", corrected.string

def extract_comments(fileObj):
    """Extract comments from spss syntax and strip out punctuation"""
    subst = strip_punctuation
    comments = [subst(line).split() for line in fileObj if line[0] == u"*"]
    fileObj.close()
    return reduce(list.__add__, comments)

syntax = u"""\
* Id iz jajaja to saj thad names shoult reveal intent. What we want to impress upon you is that
* we are serious about this. Choosing good names takes time but saves more than it takes.
* So take care with your names and change them when you find better ones. Everyone who
* reads your code (including you) will be happier if you do.
* The name of a variable, function, or class, should answer all the big questions. It
* should tell you why it exists, what it does, and how it is used. If a name requires a com-
* ment, then the name does not reveal its intent."""

if __name__ == "__main__":
    url = "http://www.gutenberg.org/ebooks/97.txt.utf-8"  # Flatland
    dst_dir = u"d:/temp"
    book = os.path.join(dst_dir, u"book.txt")
    if not os.path.exists(book):
        urllib.urlretrieve(url, book)
        write_index([book], dst_dir)
    comments = extract_comments(StringIO.StringIO(syntax))
    index = open_dir(dst_dir)  # open index from file
    for comment in comments:
        parse_string(comment, index)
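
The extract_comments function above keeps only lines that begin with "*", so a comment that continues onto following lines would be cut short. The sketch below is one possible way to collect whole comments from a syntax file. It is written for Python 3 (unlike the Python 2 listing above), and it assumes that a comment starts with "*" or COMMENT in column one and ends at a line that finishes with a period or at a blank line. The function name extract_comment_blocks and the sample syntax are made up for the illustration, and inline /* ... */ comments are not handled.

```python
import io


def extract_comment_blocks(lines):
    """Yield each SPSS comment as one string, joining continuation lines.

    Assumption: a comment starts with '*' or 'COMMENT' in column one and
    ends at a line that finishes with a period, or at a blank line.
    """
    block = None  # parts of the comment being collected, or None
    for raw in lines:
        line = raw.rstrip()
        if block is None:
            upper = line.upper()
            if line.startswith("*"):
                block = [line[1:].strip()]
            elif upper == "COMMENT" or upper.startswith("COMMENT "):
                block = [line[len("COMMENT"):].strip()]
            else:
                continue  # not inside a comment, skip ordinary syntax
        elif not line:
            # A blank line ends the comment without a terminating period.
            yield " ".join(part for part in block if part)
            block = None
            continue
        else:
            block.append(line.strip())
        if line.endswith("."):
            yield " ".join(part for part in block if part)
            block = None
    if block is not None:
        # The file ended while a comment was still open.
        yield " ".join(part for part in block if part)


# Hypothetical syntax file contents, used only to exercise the function.
sample = io.StringIO(
    "* Recode the age variable\n"
    "  into three groups.\n"
    "RECODE age (LO THRU 29=1) (30 THRU 59=2) (ELSE=3) INTO agegrp.\n"
    "COMMENT check the result.\n"
    "FREQUENCIES agegrp.\n"
)
comments = list(extract_comment_blocks(sample))
for comment in comments:
    print(comment)
```

Each yielded string could then be split into words and passed to whichever checker is used, for example parse_string in the listing above.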
