Do you know of a Python or other free method for finding all of the within-sentence bigrams (2 consecutive words), trigrams (3 words), and perhaps quadragrams (4 words) in any one of the following: a Word, WordPerfect, PDF, or HTML file, an SPSS file with one sentence per case, or a CSV file with one sentence per case?

I am thinking of it (1) as a prelude to creating an index for a book and (2) as a source of frequency counts as a prelude to multidimensional scaling of frequencies in, e.g., the Federalist Papers, the ethics codes of scientific organizations, or the annual reports of countries to UNESCO with respect to human rights.

The output would have just 2 fields: the frequency and the string (bigram, trigram, etc.).

--
Art Kendall
Social Research Consultants
Since this is an SPSS list, I don't imagine you would object to a native SPSS syntax solution ;-)
data list / sentence (A120).
begin data
h a s j d a s d a s d j a a d a h d g h g a s j d g h a s h s j k d f k s a k f k s l f k l a s k f
s a k f k a s j f k a s j k f k l a s j k j k a s j k f d j k l a j k f j k a s j f l k a j k f a f
j k l a j f k l a j k f a j k f a j l l f g s a f d g a f g d f a g g a f d g a d g a a d e f e g s d s e w f s
end data.

SET MXLOOPS=10000000.
STRING A (A10).
* Peel one word at a time off the front of each sentence and save it as its own case.
LOOP.
+ COMPUTE #=INDEX(sentence," ").
+ COMPUTE A=SUBSTR(sentence,1,#-1).
+ COMPUTE Sentence=LTRIM(SUBSTR(sentence,#+1)).
+ XSAVE OUTFILE "G:\TEMP2\words.sav" / KEEP A.
END LOOP IF sentence=" ".
EXECUTE.

GET FILE "G:\TEMP2\words.sav" .
STRING Frag2 Frag3 (A50).
* Frag2 is the previous word plus the current word; Frag3 is the word two back plus Frag2.
COMPUTE Frag2=LTRIM(CONCAT(RTRIM(LAG(A))," ", RTRIM(A) )).
COMPUTE Frag3=LTRIM(CONCAT(RTRIM(LAG(A,2))," ",RTRIM(Frag2))).
* Stack the fragments into one variable, then count how often each one occurs.
VARSTOCASES Make frag FROM frag2 frag3 /INDEX=ind .
AGGREGATE OUTFILE * / BREAK frag / N=N.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
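As far as I can tell, the syntax above writes every word to one file before applying LAG, so a fragment can straddle two sentences. For anyone who prefers Python, here is a minimal sketch of the same counting idea that keeps each n-gram inside its own sentence. It uses only the standard library. The function names and the sample sentences are made up for illustration, and tokenization is a plain whitespace split, so punctuation and capitalization are left untouched.

```python
from collections import Counter


def sentence_ngrams(sentence, n):
    """Return the n-grams of one sentence as space-joined strings."""
    words = sentence.split()
    # A sentence shorter than n words yields an empty list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_frequencies(sentences, sizes=(2, 3, 4)):
    """Count bigrams, trigrams, and 4-grams, never crossing a sentence boundary."""
    counts = Counter()
    for sentence in sentences:
        for n in sizes:
            counts.update(sentence_ngrams(sentence, n))
    return counts


# Hypothetical sample data: one sentence per item, as in "one sentence per case".
sentences = [
    "the cat sat on the mat",
    "the cat sat down",
]

freq = ngram_frequencies(sentences)

# Two output fields per line: the frequency and the string.
for ngram, count in freq.most_common():
    print(count, ngram)
```

Each printed line carries the two fields asked for in the original question. Lowercasing or stripping punctuation before the split would merge variants such as "dolor" and "dolor."; whether that is wanted depends on the analysis.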
In reply to this post by Art Kendall
Maybe the ngram library? I can't find a frequency count in it, though.
s = r"""Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor.
Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.
Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim.
Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo, rhoncus ut,
imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt.
Cras dapibus. Vivamus elementum semper nisi.""".split()

import ngram
# Parentheses make this print call work in both Python 2 and Python 3.
print(list(ngram.NGram(N=2).ngrams(s)))

[['Lorem', 'ipsum'], ['ipsum', 'dolor'], ['dolor', 'sit'], ['sit', 'amet,'],
['amet,', 'consectetuer'], ['consectetuer', 'adipiscing'], ['adipiscing', 'elit.'],
['elit.', 'Aenean'], ['Aenean', 'commodo'], ['commodo', 'ligula'], ['ligula', 'eget'],
['eget', 'dolor.'], ['dolor.', 'Aenean'], ['Aenean', 'massa.'], ['massa.', 'Cum'],
['Cum', 'sociis'], ['sociis', 'natoque'], ['natoque', 'penatibus'], ['penatibus', 'et'],
['et', 'magnis'], ['magnis', 'dis'], ['dis', 'parturient'], ['parturient', 'montes,'],
['montes,', 'nascetur'], ['nascetur', 'ridiculus'], ['ridiculus', 'mus.'], ['mus.', 'Donec'],
['Donec', 'quam'], ['quam', 'felis,'], ['felis,', 'ultricies'], ['ultricies', 'nec,'],
['nec,', 'pellentesque'], ['pellentesque', 'eu,'], ['eu,', 'pretium'], ['pretium', 'quis,'],
['quis,', 'sem.'], ['sem.', 'Nulla'], ['Nulla', 'consequat'], ['consequat', 'massa'],
['massa', 'quis'], ['quis', 'enim.'], ['enim.', 'Donec'], ['Donec', 'pede'], ['pede', 'justo,'],
['justo,', 'fringilla'], ['fringilla', 'vel,'], ['vel,', 'aliquet'], ['aliquet', 'nec,'],
['nec,', 'vulputate'], ['vulputate', 'eget,'], ['eget,', 'arcu.'], ['arcu.', 'In'],
['In', 'enim'], ['enim', 'justo,'], ['justo,', 'rhoncus'], ['rhoncus', 'ut,'],
['ut,', 'imperdiet'], ['imperdiet', 'a,'], ['a,', 'venenatis'], ['venenatis', 'vitae,'],
['vitae,', 'justo.'], ['justo.', 'Nullam'], ['Nullam', 'dictum'], ['dictum', 'felis'],
['felis', 'eu'], ['eu', 'pede'], ['pede', 'mollis'], ['mollis', 'pretium.'],
['pretium.', 'Integer'], ['Integer', 'tincidunt.'], ['tincidunt.', 'Cras'],
['Cras', 'dapibus.'], ['dapibus.', 'Vivamus'], ['Vivamus', 'elementum'],
['elementum', 'semper'], ['semper', 'nisi.']]

Regarding the formats: perhaps antiword (http://www.winfield.demon.nl/index.html) will help with parsing the word processing docs (doc, wp). LibreOffice also works with Python. For PDF there is also a library, but it won't always work well (e.g., with encrypted PDFs). HTML, sav, and CSV should be (almost) trivially simple.

Regards,
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

> From: Art Kendall <[hidden email]>
> To: [hidden email]
> Sent: Wednesday, April 17, 2013 2:44 PM
> Subject: [SPSSX-L] OT text processing question
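To make the "almost trivially simple" cases a little more concrete, here is a minimal sketch for two of them, CSV and HTML, using only the Python standard library. The helper names, the sample strings, and the assumption that the sentence sits in the first CSV column are all made up for illustration. SPSS, Word, WordPerfect, and PDF files need third-party tools and are not covered here.

```python
import csv
import io
from html.parser import HTMLParser


def sentences_from_csv(handle, column=0):
    """Yield the sentence held in one column of each row (assumed: first column)."""
    for row in csv.reader(handle):
        # Skip blank rows and rows that are too short to have the column.
        if len(row) > column and row[column].strip():
            yield row[column].strip()


class TextExtractor(HTMLParser):
    """Collect text chunks from HTML, ignoring <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def text_from_html(markup):
    """Return the text of an HTML document as one space-joined string."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical inputs standing in for real files.
csv_text = 'First sentence here.\n"Second one, with a comma."\n'
html_text = (
    "<html><body><p>Hello <b>big</b> world.</p>"
    "<script>var x = 1;</script></body></html>"
)

csv_sentences = list(sentences_from_csv(io.StringIO(csv_text)))
plain = text_from_html(html_text)
print(csv_sentences)
print(plain)
```

With a real file, `io.StringIO(csv_text)` would be replaced by `open(path, newline="")`. The HTML text still has to be split into sentences before any within-sentence counting, and that step is deliberately left out because a good sentence splitter is a problem of its own.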