OT text processing question


Art Kendall
Do you know of a Python or other free method for finding all of the within-sentence bigrams (2 consecutive words), trigrams (3 words), and perhaps quadragrams (4 words) in any one of the following: a Word, WordPerfect, PDF, or HTML document; an SPSS file with one sentence per case; or a CSV file with one sentence per case?


I am thinking of it (1) as a prelude to creating an index for a book and (2) as a source of frequency counts for multidimensional scaling of frequencies in, e.g., the Federalist Papers, the ethics codes of scientific organizations, or the annual reports of countries to UNESCO on human rights.



The output would just be 2 fields: the frequency and the string (bigram, trigram, etc).
-- 
Art Kendall
Social Research Consultants
Re: OT text processing question

David Marso
Administrator
Since this is an SPSS list, I don't imagine you would object to a native SPSS syntax solution ;-)
--
data list / sentence (A120).
begin data
h a s j d a s d a s d j a a d a h d g h g a s j d g h a s h s j k d f k s a k f k s l f k l a s k f s a
k f k a s j f k a s j k f k l a s j k j k a s j k f d j k l a j k f j k a s j f l k a j k f a f j k l a
j f k l a j k f a j k f a j l l f g s a f d g a f g d f a g g a f d g a d g a a d e f e g s d s e w f s
end data.

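* Raise the loop limit so that long sentences can be split completely.
SET MXLOOPS=10000000.
STRING  A (A10).
* Peel one word at a time off the front of each sentence and write it out as its own case.
LOOP.
+  COMPUTE #=INDEX(sentence," ").
+  COMPUTE A=SUBSTR(sentence,1,#-1).
+  COMPUTE Sentence=LTRIM(SUBSTR(sentence,#+1)).
+  XSAVE OUTFILE "G:\TEMP2\words.sav" / KEEP A.
END LOOP IF sentence=" ".
EXECUTE.

GET FILE "G:\TEMP2\words.sav" .
STRING Frag2 Frag3 (A50).

* Build bigrams and trigrams by joining each word to the one or two words before it.
* Note: words.sav carries no sentence identifier, so LAG also joins the last word of one sentence to the first word of the next.
COMPUTE Frag2=LTRIM(CONCAT(RTRIM(LAG(A))," ", RTRIM(A) )).
COMPUTE Frag3=LTRIM(CONCAT(RTRIM(LAG(A,2))," ",RTRIM(Frag2))).
* Stack the fragments into one variable and count each distinct string.
VARSTOCASES Make frag FROM frag2 frag3 /INDEX=ind .
AGGREGATE OUTFILE * / BREAK frag / N=N.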
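
For comparison, here is a rough Python sketch of the same split, join, and count idea, written so that no n-gram crosses a sentence boundary. It is only an illustration under stated assumptions: the function name within_sentence_ngrams and the two sample sentences are made up for this example, words are split on whitespace only, and punctuation is not handled.

```python
from collections import Counter


def within_sentence_ngrams(sentences, sizes=(2, 3)):
    """Yield every n-gram (as a space-joined string) that lies inside one sentence."""
    for sentence in sentences:
        words = sentence.split()
        for n in sizes:
            # A window of n words can start at any position that still has n words left.
            for start in range(len(words) - n + 1):
                yield " ".join(words[start:start + n])


# Hypothetical sample data: one sentence per element, like one sentence per case.
sentences = ["the cat sat", "the cat ran away"]

# Counter plays the role of AGGREGATE ... / BREAK frag / N=N.
counts = Counter(within_sentence_ngrams(sentences))

# Two fields per row: the frequency and the string.
for gram, freq in counts.most_common():
    print(freq, gram)
```

Because each sentence is processed on its own, a pair such as "sat the" (end of one sentence plus start of the next) never appears in the counts.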


Re: OT text processing question

Albert-Jan Roskam
In reply to this post by Art Kendall
Maybe the ngram library? I can't find a frequency count in it, though.

s = r"""Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean
commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et
magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis,
ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa
quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget,
arcu. In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo.
Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras
dapibus. Vivamus elementum semper nisi.""".split()

import ngram
# print() with parentheses runs under both Python 2 and Python 3.
print(list(ngram.NGram(N=2).ngrams(s)))

[['Lorem', 'ipsum'], ['ipsum', 'dolor'], ['dolor', 'sit'], ['sit', 'amet,'], ['amet,', 'consectetuer'], ['consectetuer', 'adipiscing'], ['adipiscing', 'elit.'], ['elit.', 'Aenean'], ['Aenean', 'commodo'], ['commodo', 'ligula'], ['ligula', 'eget'], ['eget', 'dolor.'], ['dolor.', 'Aenean'], ['Aenean', 'massa.'], ['massa.', 'Cum'], ['Cum', 'sociis'], ['sociis', 'natoque'], ['natoque', 'penatibus'], ['penatibus', 'et'], ['et', 'magnis'], ['magnis', 'dis'], ['dis', 'parturient'], ['parturient', 'montes,'], ['montes,', 'nascetur'], ['nascetur', 'ridiculus'], ['ridiculus', 'mus.'], ['mus.', 'Donec'], ['Donec', 'quam'], ['quam', 'felis,'], ['felis,', 'ultricies'], ['ultricies', 'nec,'], ['nec,', 'pellentesque'], ['pellentesque', 'eu,'], ['eu,', 'pretium'], ['pretium', 'quis,'], ['quis,', 'sem.'], ['sem.', 'Nulla'], ['Nulla', 'consequat'], ['consequat', 'massa'], ['massa', 'quis'], ['quis', 'enim.'], ['enim.', 'Donec'], ['Donec', 'pede'], ['pede', 'justo,'],
 ['justo,', 'fringilla'], ['fringilla', 'vel,'], ['vel,', 'aliquet'], ['aliquet', 'nec,'], ['nec,', 'vulputate'], ['vulputate', 'eget,'], ['eget,', 'arcu.'], ['arcu.', 'In'], ['In', 'enim'], ['enim', 'justo,'], ['justo,', 'rhoncus'], ['rhoncus', 'ut,'], ['ut,', 'imperdiet'], ['imperdiet', 'a,'], ['a,', 'venenatis'], ['venenatis', 'vitae,'], ['vitae,', 'justo.'], ['justo.', 'Nullam'], ['Nullam', 'dictum'], ['dictum', 'felis'], ['felis', 'eu'], ['eu', 'pede'], ['pede', 'mollis'], ['mollis', 'pretium.'], ['pretium.', 'Integer'], ['Integer', 'tincidunt.'], ['tincidunt.', 'Cras'], ['Cras', 'dapibus.'], ['dapibus.', 'Vivamus'], ['Vivamus', 'elementum'], ['elementum', 'semper'], ['semper', 'nisi.']]
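
One possible way to get the missing frequency count is collections.Counter from the standard library. The sketch below does not use the ngram library at all: it builds the same kind of list of two-word lists with zip, so it runs on its own. The short sample text and the choice to lowercase and strip punctuation are assumptions for the example, and like the whitespace split above it ignores sentence boundaries.

```python
import string
from collections import Counter

# Short sample text (an assumption for this sketch, not the full passage above).
text = "Aenean massa. Donec pede justo, aenean massa quis enim."

# Lowercase and strip surrounding punctuation so "massa." and "massa" count as one word.
tokens = [w.strip(string.punctuation).lower() for w in text.split()]
tokens = [w for w in tokens if w]

# Same shape as the output above: a list of two-word lists.
bigrams = [list(pair) for pair in zip(tokens, tokens[1:])]

# Lists are not hashable, so each bigram is converted to a tuple before counting.
bigram_counts = Counter(tuple(b) for b in bigrams)

# Print the three most frequent bigrams as: frequency, then the two words.
for (first, second), freq in bigram_counts.most_common(3):
    print(freq, first, second)
```

The tuple conversion is the only step needed to feed list-shaped n-grams into Counter; the normalization step is optional and depends on whether "Aenean" and "aenean" should count as the same word.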

Regarding the formats: perhaps antiword (http://www.winfield.demon.nl/index.html) will help with parsing the word-processing documents (doc, wp). LibreOffice can also be scripted with Python. There is also a library for PDF, but it won't always work well (e.g. with encrypted PDFs). HTML, sav, and csv should be (almost) trivially simple.
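
As an illustration of the HTML case, here is a minimal sketch that pulls the visible text out of a page using only html.parser from the standard library. The class name TextExtractor and the sample markup are invented for this example. It assumes reasonably clean HTML and simply concatenates text nodes, so words separated only by block tags (with no whitespace between them) would run together; real pages may need more care.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script or style element.
        if not self._skip_depth:
            self.parts.append(data)


# Hypothetical sample page.
html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>We hold these <b>truths</b> to be self-evident.</p></body></html>"
)

parser = TextExtractor()
parser.feed(html)
parser.close()

# Join the text nodes and collapse runs of whitespace into single spaces.
text = " ".join("".join(parser.parts).split())
print(text)
```

The resulting string can then be split into sentences and words before any n-gram counting.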


Regards,
Albert-Jan


>________________________________
> From: Art Kendall <[hidden email]>
>To: [hidden email]
>Sent: Wednesday, April 17, 2013 2:44 PM
>Subject: [SPSSX-L] OT       text processing question