SPSSX Discussion

OT text processing question


Art Kendall

OT text processing question

Do you know of a Python-based or other free method for finding all of the within-sentence bigrams (2 consecutive words), trigrams (3 words), and perhaps quadragrams (4 words) in any one of the following: a Word, WordPerfect, PDF, or HTML document; an SPSS file with one sentence per case; or a CSV file with one sentence per case?

I am thinking of it (1) as a prelude to creating an index for a book and (2) as a source of frequency counts for multidimensional scaling, e.g., of the Federalist Papers, the ethics codes of scientific organizations, or the annual reports of countries to UNESCO on human rights.

The output would just be 2 fields: the frequency and the string (bigram, trigram, etc.).

-- 
Art Kendall
Social Research Consultants


David Marso

Re: OT text processing question

Administrator

Since this is an SPSS list, I don't imagine you would object to a native SPSS syntax solution ;-)
--
data list / sentence (A120).
begin data
h a s j d a s d a s d j a a d a h d g h g a s j d g h a s h s j k d f k s a k f k s l f k l a s k f s a
k f k a s j f k a s j k f k l a s j k j k a s j k f d j k l a j k f j k a s j f l k a j k f a f j k l a
j f k l a j k f a j k f a j l l f g s a f d g a f g d f a g g a f d g a d g a a d e f e g s d s e w f s
end data.

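* Raise the loop limit so that long sentences are split completely.
SET MXLOOPS=10000000.
STRING A (A10).
* Peel the first word off the sentence on each pass and write it out as its own case.
LOOP.
+ COMPUTE #=INDEX(sentence," ").
+ COMPUTE A=SUBSTR(sentence,1,#-1).
+ COMPUTE Sentence=LTRIM(SUBSTR(sentence,#+1)).
+ XSAVE OUTFILE "G:\TEMP2\words.sav" / KEEP A.
END LOOP IF sentence=" ".
EXECUTE.

* Build bigrams and trigrams from consecutive words with LAG, then count each distinct string.
* LAG ignores sentence boundaries here, because words.sav holds only the word variable A.
GET FILE "G:\TEMP2\words.sav" .
STRING Frag2 Frag3 (A50).

COMPUTE Frag2=LTRIM(CONCAT(RTRIM(LAG(A))," ", RTRIM(A) )).
COMPUTE Frag3=LTRIM(CONCAT(RTRIM(LAG(A,2))," ",RTRIM(Frag2))).
VARSTOCASES Make frag FROM frag2 frag3 /INDEX=ind .
AGGREGATE OUTFILE * / BREAK frag / N=N.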

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
"Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Albert-Jan Roskam

Re: OT text processing question

In reply to this post by Art Kendall

Maybe the ngram library? I can't find a frequency count in it, though.

s = r"""Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean
commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et
magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis,
ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa
quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget,
arcu. In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo.
Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras
dapibus. Vivamus elementum semper nisi.""".split()

import ngram
# List every pair of consecutive words (bigrams) in s.
print(list(ngram.NGram(N=2).ngrams(s)))

[['Lorem', 'ipsum'], ['ipsum', 'dolor'], ['dolor', 'sit'], ['sit', 'amet,'], ['amet,', 'consectetuer'], ['consectetuer', 'adipiscing'], ['adipiscing', 'elit.'], ['elit.', 'Aenean'], ['Aenean', 'commodo'], ['commodo', 'ligula'], ['ligula', 'eget'], ['eget', 'dolor.'], ['dolor.', 'Aenean'], ['Aenean', 'massa.'], ['massa.', 'Cum'], ['Cum', 'sociis'], ['sociis', 'natoque'], ['natoque', 'penatibus'], ['penatibus', 'et'], ['et', 'magnis'], ['magnis', 'dis'], ['dis', 'parturient'], ['parturient', 'montes,'], ['montes,', 'nascetur'], ['nascetur', 'ridiculus'], ['ridiculus', 'mus.'], ['mus.', 'Donec'], ['Donec', 'quam'], ['quam', 'felis,'], ['felis,', 'ultricies'], ['ultricies', 'nec,'], ['nec,', 'pellentesque'], ['pellentesque', 'eu,'], ['eu,', 'pretium'], ['pretium', 'quis,'], ['quis,', 'sem.'], ['sem.', 'Nulla'], ['Nulla', 'consequat'], ['consequat', 'massa'], ['massa', 'quis'], ['quis', 'enim.'], ['enim.', 'Donec'], ['Donec', 'pede'], ['pede', 'justo,'],
['justo,', 'fringilla'], ['fringilla', 'vel,'], ['vel,', 'aliquet'], ['aliquet', 'nec,'], ['nec,', 'vulputate'], ['vulputate', 'eget,'], ['eget,', 'arcu.'], ['arcu.', 'In'], ['In', 'enim'], ['enim', 'justo,'], ['justo,', 'rhoncus'], ['rhoncus', 'ut,'], ['ut,', 'imperdiet'], ['imperdiet', 'a,'], ['a,', 'venenatis'], ['venenatis', 'vitae,'], ['vitae,', 'justo.'], ['justo.', 'Nullam'], ['Nullam', 'dictum'], ['dictum', 'felis'], ['felis', 'eu'], ['eu', 'pede'], ['pede', 'mollis'], ['mollis', 'pretium.'], ['pretium.', 'Integer'], ['Integer', 'tincidunt.'], ['tincidunt.', 'Cras'], ['Cras', 'dapibus.'], ['dapibus.', 'Vivamus'], ['Vivamus', 'elementum'], ['elementum', 'semper'], ['semper', 'nisi.']]
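
For the missing frequency count, one option that needs no third-party package is collections.Counter from the standard library. What follows is only a minimal sketch, assuming plain whitespace tokenisation as in the example above; the helper name ngram_counts and the sample sentence are made up for illustration and are not part of the ngram library.

```python
from collections import Counter

def ngram_counts(words, n):
    """Count every run of n consecutive items in a list of words."""
    # Zipping n staggered views of the list yields the n-grams as tuples.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

# Hypothetical sample text, already split into words.
words = "the cat sat on the mat and the cat slept".split()
counts = ngram_counts(words, 2)

# Print the two requested fields: frequency, then the n-gram string.
for gram, freq in counts.most_common():
    print(freq, gram)
```

Calling ngram_counts(words, 3) or ngram_counts(words, 4) would give trigrams and quadragrams the same way. Note that this treats the whole word list as one stream, so it does not yet respect sentence boundaries.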

Regarding the formats: perhaps antiword (http://www.winfield.demon.nl/index.html) will help with parsing the word processing docs (doc, wp). LibreOffice can also be scripted with Python. For PDF there is also a library, but it won't always work well (e.g. with encrypted PDFs). HTML, sav and CSV should be (almost) trivially simple.
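
To illustrate the CSV case, here is a rough sketch of reading one sentence per row and writing the two requested fields (frequency and string). It rests on several assumptions: the sentence is in the first column, words are lowercased and stripped of surrounding punctuation, and the names sentence_ngrams, count_csv and sentences.csv are hypothetical. An in-memory io.StringIO stands in for the real file so the example runs as is.

```python
import csv
import io
import string
from collections import Counter

def sentence_ngrams(sentence, n):
    """Yield the n-grams of one sentence as space-joined strings."""
    # Assumption: lowercase and strip surrounding punctuation so that
    # "people," and "people" count as the same word.
    words = [w.strip(string.punctuation).lower() for w in sentence.split()]
    words = [w for w in words if w]
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def count_csv(infile, sizes=(2, 3)):
    """Count n-grams in a CSV stream with one sentence in the first column."""
    counts = Counter()
    for row in csv.reader(infile):
        if not row:
            continue
        for n in sizes:
            # Counting row by row keeps every n-gram inside one sentence.
            counts.update(sentence_ngrams(row[0], n))
    return counts

# Stand-in for open("sentences.csv", newline=""); the file name is hypothetical.
demo = io.StringIO('"We the people."\n"The people, we said."\n')
counts = count_csv(demo)

# Write the two output fields as CSV: frequency and n-gram string.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["frequency", "ngram"])
for gram, freq in counts.most_common():
    writer.writerow([freq, gram])
print(out.getvalue())
```

Because each row is processed separately, no n-gram spans two sentences, which is what the original question asked for. Passing sizes=(2, 3, 4) would add quadragrams.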

Regards,
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a
fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

> ________________________________
> From: Art Kendall <[hidden email]>
> To: [hidden email]
> Sent: Wednesday, April 17, 2013 2:44 PM
> Subject: [SPSSX-L] OT text processing question
