I am trying to learn about using stat on text sources. Suppose I have a set of texts and want a count of specific words in each.

For instance, the classical work on the 85 Federalist papers used 30 words. This particular input is all in one file; each paper starts with "FEDERALIST No. ", e.g. "FEDERALIST No. 1" for the first. There would be 1 record for each text, with 31 variables: an id for the paper and the counts for the 30 words.

This syntax creates made-up data that looks like what I would want. I added underscores to reserved words so that the variables could have legitimate names.

new file.
input program.
vector wordcount (30,f8).
loop id = 1 to 85.
loop #w = 1 to 30.
compute wordcount(#w) = rnd(rv.poisson(50)).
end loop.
end case.
end loop.
end file.
end input program.
EXECUTE.
dataset name kounts30.
rename vars (wordcount1 to wordcount30 =
   upon also an by_ of on there this to_ although
   both enough while whilst always though commonly consequently
   considerable according apt direction innovation language vigor
   kind matter particularly probability work).

There are more extensive applications I would be interested in, such as: as many variables as words that occur in any of the texts (not just the 30); using 2-, 3-, and 4-word strings that occur within sentences; the texts being in separate files; texts in PDF; etc. So if I can avoid re-inventing the wheel, I would like to hear about existing Python functions, etc. But I would like to start with this kind of application.

I can send an ASCII file with the text if anyone wants it.

Art Kendall
Social Research Consultants

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
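For the concrete Federalist case, a minimal Python sketch of the counting step might look like the following. Everything here is illustrative: the function name, the sample text, and the shortened word list are invented for this example; a real run would read the full ASCII file and use all 30 words.

```python
import re
from collections import Counter

# Illustrative subset of the 30 function words (not the full list)
WORDS = ["upon", "also", "an", "of", "on", "there", "this", "to"]

def count_words_per_paper(corpus):
    """Split the corpus on 'FEDERALIST No. <n>' headers, then count
    how often each target word occurs in each paper's body."""
    # re.split with a capturing group keeps the paper numbers:
    # ['<preamble>', '1', '<body 1>', '2', '<body 2>', ...]
    parts = re.split(r"FEDERALIST No\.\s*(\d+)", corpus)
    rows = []
    for i in range(1, len(parts) - 1, 2):
        paper_id, body = int(parts[i]), parts[i + 1]
        counts = Counter(re.findall(r"[a-z]+", body.lower()))
        rows.append({"id": paper_id, **{w: counts[w] for w in WORDS}})
    return rows

# One record per paper: an id plus a count per word, like kounts30
papers = ('FEDERALIST No. 1 Upon this, an appeal to the people. '
          'FEDERALIST No. 2 Of this there can be no doubt.')
rows = count_words_per_paper(papers)
```

The resulting rows could then be written out as a delimited file that SPSS reads back with GET DATA.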
Check out Nigel Gilbert at Surrey University. He has special software for text searching in qualitative research. http://www.soc.surrey.ac.uk/staff/ngilbert/index.html
In reply to this post by Art Kendall
Python has extensive functionality for working with text, both directly as simple strings and as regular expressions, which can be smart about word boundaries, case sensitivity, and such. A simple way to do this, if you have the data as long strings in SPSS, would be to define a Python function that returns a count for each of a list of expressions of interest, then use the SPSSINC TRANS extension command to create a variable for each. I can do an example to get you started.

If you are just working with text files, a similar sort of counting function would simply read those and write a record of word counts for each.

HTH,

Jon Peck
SPSS, an IBM Company
Note: my email has changed. Please update your address book.
[hidden email]
312-651-3435
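As a rough sketch of the counting function Jon describes (the SPSSINC TRANS wiring is omitted, and `make_counter` is an invented name, not an existing API):

```python
import re

def make_counter(*patterns):
    """Build a function that returns one count per pattern for a
    single string, respecting word boundaries and ignoring case."""
    compiled = [re.compile(r"\b%s\b" % p, re.IGNORECASE) for p in patterns]
    def counter(text):
        text = text or ""          # SPSS may hand over blanks/None
        return [len(rx.findall(text)) for rx in compiled]
    return counter

# A function like this could be referenced from SPSSINC TRANS so that
# each pattern's count becomes a new variable per case.
count3 = make_counter("upon", "also", "whilst")
print(count3("Upon reflection, whilst we also agreed upon terms"))  # [2, 1, 1]
```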
Hi,
You can find a free sample chapter, which happens to be about regexes, at:
http://www.informit.com/content/images/9780137129294/samplepages/0137129297_Sample.pdf

Also, the Python distribution includes the program redemo.py, which I find very helpful for debugging regexes. Regexes are like a mini-language and are extremely powerful, but they can look pretty cryptic and intimidating. How about this one? ;-)

match = re.search(r"""(?<![-\w])
                      (?:(?:en)?coding|charset)
                      (?:=(["'])?([-\w]+)(?(1)\1)
                       |:\s*([-\w]+))""".encode("utf8"),
                  binary, re.IGNORECASE|re.VERBOSE)

Luckily, on most occasions (including what you want to achieve) they're far easier. *phew*

Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before you criticize someone, walk a mile in their shoes, that way when
you do criticize them, you're a mile away and you have their shoes!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
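For the curious: that cryptic example is an encoding sniffer, and it does run. A small demonstration (the sample byte string is invented here):

```python
import re

# The verbose regex from the post: it hunts for an encoding/charset
# declaration such as encoding="utf-8" or charset: latin-1 in raw bytes.
pattern = re.compile(r"""(?<![-\w])
                         (?:(?:en)?coding|charset)
                         (?:=(["'])?([-\w]+)(?(1)\1)
                          |:\s*([-\w]+))""".encode("utf8"),
                     re.IGNORECASE | re.VERBOSE)

binary = b'<?xml version="1.0" encoding="utf-8"?>'
match = pattern.search(binary)
print(match.group(2))  # b'utf-8'
```

The `(?(1)\1)` piece is a conditional: it demands a closing quote only if an opening quote was captured in group 1.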
In reply to this post by John F Hall
I went to that link, and found some interesting things, but searching
the page for "text" or for "qualitative" did not work.
Would you please give more specifics?

Art
Art Kendall
Social Research Consultants
In reply to this post by Art Kendall
At 11:11 AM 10/23/2009, Art Kendall wrote:
>I am trying to learn about using stat on text sources.

This is 'concording' - preparing a concordance, a list of words used in a text, with the number of times they are used. You want to keep the counts of only a specific set of words.

I don't know how good the Python solutions are; are they working for you? However, data-stream packages like SPSS (and SAS) are well suited to concording. Below is a solution in native SPSS. Counting for the concordance (AGGREGATE), keeping counts only for the selected words (MATCH FILES), and making each count a variable (CASESTOVARS) are natural SPSS operations. The SPSS transformation language and VARSTOCASES serve to break the text into words, though I'm sure Python could do the parsing more easily.

* II. Concord the text ... .

List
|-----------------------------|---------------------------|
|Output Created               |30-OCT-2009 01:46:03       |
|-----------------------------|---------------------------|
[Text]

Paragraph Line Text

Par. 1      1  I am trying to learn about using stat on text sources.
Par. 1      2  Suppose I have a set of texts and want a count of specific
Par. 1      3  words in each.
Par. 2      1  For instance the classical work on the 85 Federalist papers
Par. 2      2  used 30 words. This particular input is all in one file,
Par. 2      3  each paper starts with "FEDERALIST No. " e.g. "FEDERALIST
Par. 2      4  No. 1" for the first. There would be 1 record for each text.
Par. 2      5  there would be 31 variables an id for the paper and the
Par. 2      6  counts for the 30 words.
Par. 3      1  This syntax creates made up data that looks like what I
Par. 3      2  would want. I added hyphens to reserved words so that
Par. 3      3  variables could have legit names.

Number of cases read: 12    Number of cases listed: 12

NEW FILE.
ADD FILES /FILE=Text.
DATASET NAME WordList.

* II.A Parse the text lines into words. ... .
STRING  #Buffer  (A100).
STRING  #OneWord (A15).
NUMERIC #Index   (F3).
NUMERIC #WordEnd (F3).
COMPUTE #Buffer = LTRIM(Text).
VECTOR RawWord (50A12).
LOOP #Index = 1 TO 50.
.  COMPUTE #WordEnd = INDEX (#Buffer,' ').
.  COMPUTE #OneWord = SUBSTR(#Buffer,1,#WordEnd).
.  COMPUTE RawWord (#Index) = #OneWord.
.  COMPUTE #Buffer  = LTRIM(SUBSTR(#Buffer,#WordEnd)).
END LOOP IF #Buffer EQ ' '.
EXECUTE /* This appears to be necessary, to activate the file */ .

* II.B Unroll, to one word per line, and ... .
*      standardize the forms of words. ... .
VARSTOCASES
  /MAKE RawWord FROM
     RawWord1  RawWord2  RawWord3  RawWord4  RawWord5
     RawWord6  RawWord7  RawWord8  RawWord9  RawWord10
     RawWord11 RawWord12 RawWord13 RawWord14 RawWord15
     RawWord16 RawWord17 RawWord18 RawWord19 RawWord20
     RawWord21 RawWord22 RawWord23 RawWord24 RawWord25
     RawWord26 RawWord27 RawWord28 RawWord29 RawWord30
     RawWord31 RawWord32 RawWord33 RawWord34 RawWord35
     RawWord36 RawWord37 RawWord38 RawWord39 RawWord40
     RawWord41 RawWord42 RawWord43 RawWord44 RawWord45
     RawWord46 RawWord47 RawWord48 RawWord49 RawWord50
  /INDEX = Position(50)
  /KEEP  = Paragraph Line
  /NULL  = DROP.

Variables to Cases
|-----------------------------|---------------------------|
|Output Created               |30-OCT-2009 01:46:04       |
|-----------------------------|---------------------------|
[WordList]

Generated Variables
|--------|------|
|Name    |Label |
|--------|------|
|Position|<none>|
|RawWord |<none>|
|--------|------|

Processing Statistics
|-------------|--|
|Variables In |53|
|Variables Out|4 |
|-------------|--|

STRING Word (A12).
COMPUTE Word=LOWER(RawWord).
DO REPEAT Remove = ' ' '.' '"'.
.  COMPUTE Word = REPLACE(Word,Remove,'').
END REPEAT.
SELECT IF Word NE ' '.

* II.C Count occurrences, to create the concordance ... .
DATASET DECLARE Concordance.
AGGREGATE OUTFILE= Concordance
  /BREAK = Paragraph Word
  /Count 'Number of occurrences of this word' = NU.
DATASET ACTIVATE Concordance WINDOW=FRONT.

* III.  Keep only the particularly sought words ... .
* III.A Prepare the keywords for merging ... .
DATASET ACTIVATE Keywords WINDOW=FRONT.
LIST.
List
|-----------------------------|---------------------------|
|Output Created               |30-OCT-2009 01:46:05       |
|-----------------------------|---------------------------|
[Keywords]

Word

Federal
Federalist
paper
words
I
instance
syntax

Number of cases read: 7    Number of cases listed: 7

NEW FILE.
ADD FILES /FILE=Keywords.
COMPUTE Word = LOWER(LTRIM(Word)).
SORT CASES BY Word.
DATASET NAME KeysReady.

* III.B Prepare the concordance for merging ... .
NEW FILE.
ADD FILES /FILE=Concordance.
SORT CASES BY Word Paragraph.

* III.C Merge, keep only the keywords, and re-sort ... .
MATCH FILES
  /FILE =*         /IN=InText
  /TABLE=KeysReady /IN=Keyword
  /BY Word.
DATASET NAME KeyWordCounts.
SELECT IF InText AND Keyword.
SORT CASES BY Paragraph Word.

* IV. Make the keyword counts variables ... .
CASESTOVARS
  /ID      = Paragraph
  /INDEX   = Word
  /GROUPBY = INDEX
  /DROP    = InText Keyword.

Cases to Variables
|-----------------------------|---------------------------|
|Output Created               |30-OCT-2009 01:46:09       |
|-----------------------------|---------------------------|
[KeyWordCounts]

Generated Variables
|-----------|----------|--------------------------|
|Original   |Word      |Result                    |
|Variable   |          |----------|---------------|
|           |          |Name      |Label          |
|-----------|----------|----------|---------------|
|Count      |federalist|federalist|federalist:    |
|Number of  |          |          |Number of      |
|occurrences|          |          |occurrences of |
|of this    |          |          |this word      |
|word       |----------|----------|---------------|
|           |i         |i         |i: Number of   |
|           |          |          |occurrences of |
|           |          |          |this word      |
|           |----------|----------|---------------|
|           |instance  |instance  |instance:      |
|           |          |          |Number of      |
|           |          |          |occurrences of |
|           |          |          |this word      |
|           |----------|----------|---------------|
|           |paper     |paper     |paper: Number  |
|           |          |          |of occurrences |
|           |          |          |of this word   |
|           |----------|----------|---------------|
|           |syntax    |syntax    |syntax: Number |
|           |          |          |of occurrences |
|           |          |          |of this word   |
|           |----------|----------|---------------|
|           |words     |words     |words: Number  |
|           |          |          |of occurrences |
|           |          |          |of this word   |
|-----------|----------|----------|---------------|

Processing Statistics
|---------------|---|
|Cases In       |9  |
|Cases Out      |3  |
|---------------|---|
|Cases In/Cases |3.0|
|Out            |   |
|---------------|---|
|Variables In   |5  |
|Variables Out  |7  |
|---------------|---|
|Index Values   |6  |
|---------------|---|

LIST.

List
|-----------------------------|---------------------------|
|Output Created               |30-OCT-2009 01:46:09       |
|-----------------------------|---------------------------|
[KeyWordCounts]

Paragraph federalist   i instance paper syntax words

Par. 1             .   2        .     .      .     1
Par. 2             3   .        1     2      .     2
Par. 3             .   2        .     .      1     1

Number of cases read: 3    Number of cases listed: 3

============================
APPENDIX: Code and test data
============================
* C:\Documents and Settings\Richard\My Documents .
*  \Technical\spssx-l\Z-2009d\ .
*  2009-10-23 Kendall - Do python procedures exist for dealing with text.SPS.
* In response to posting (original subject line shortened): .
*  Date:    Fri, 23 Oct 2009 11:11:02 -0400 .
*  From:    Art Kendall <[hidden email]> .
*  Subject: Do python procedures exist for dealing with text .
*  To:      [hidden email] .
* .
* Original subject line: .
* Subject: Do python procedures already exist for dealing with text. .
* "I am trying to learn about using stat on text sources. Suppose .
* I have a set of texts and want a count of specific words in .
* each. For instance the classical work on the 85 Federalist .
* papers used 30 words." .
* .
* "This input is all in one file, each paper starts with .
* "FEDERALIST No. " e.g. "FEDERALIST No. 1". There would be 1 .
* record for each text, [with] 31 variables: an id for the paper .
* and the counts for the 30 words." .
* ................................................................. .
* .................     Test data     ..................... .

* I.A Text to be concorded ... .
NEW FILE.
INPUT PROGRAM.
.  STRING  Paragraph (A6).
.  NUMERIC Line (F3).
.  DATA LIST FIXED/ Paragraph 05-10 (A) Text 12-75 (A).
END INPUT PROGRAM.
*---|---10----|---20----|---30----|---40----|---50----|---60----|---70.
BEGIN DATA
    Par. 1 I am trying to learn about using stat on text sources.
    Par. 1 Suppose I have a set of texts and want a count of specific
    Par. 1 words in each.
    Par. 2 For instance the classical work on the 85 Federalist papers
    Par. 2 used 30 words. This particular input is all in one file,
    Par. 2 each paper starts with "FEDERALIST No. " e.g. "FEDERALIST
    Par. 2 No. 1" for the first. There would be 1 record for each text.
    Par. 2 there would be 31 variables an id for the paper and the
    Par. 2 counts for the 30 words.
    Par. 3 This syntax creates made up data that looks like what I
    Par. 3 would want. I added hyphens to reserved words so that
    Par. 3 variables could have legit names.
END DATA.
DO IF $CASENUM EQ 1.
.  COMPUTE Line = 1.
ELSE IF Paragraph NE LAG(Paragraph).
.  COMPUTE Line = 1.
ELSE.
.  COMPUTE Line = LAG(Line) + 1.
END IF.
DATASET NAME Text.
.  /*-- LIST /*-*/.

* I.B Particular words to be counted ... .
NEW FILE.
DATA LIST FREE /Word(A12).
BEGIN DATA
Federal Federalist paper words I instance syntax
END DATA.
DATASET NAME Keywords.
.  /*-- LIST /*-*/.

* .................  Post after this point  ..................... .
* ................................................................. .
* II. Concord the text ... .
DATASET ACTIVATE Text WINDOW=FRONT.
LIST.

NEW FILE.
ADD FILES /FILE=Text.
DATASET NAME WordList.

* II.A Parse the text lines into words. ... .
STRING  #Buffer  (A100).
STRING  #OneWord (A15).
NUMERIC #Index   (F3).
NUMERIC #WordEnd (F3).
COMPUTE #Buffer = LTRIM(Text).
VECTOR RawWord (50A12).
LOOP #Index = 1 TO 50.
.  COMPUTE #WordEnd = INDEX (#Buffer,' ').
.  COMPUTE #OneWord = SUBSTR(#Buffer,1,#WordEnd).
.  COMPUTE RawWord (#Index) = #OneWord.
.  COMPUTE #Buffer  = LTRIM(SUBSTR(#Buffer,#WordEnd)).
END LOOP IF #Buffer EQ ' '.
EXECUTE /* This appears to be necessary, to activate the file */ .

* II.B Unroll, to one word per line, and ... .
*      standardize the forms of words. ... .
VARSTOCASES
  /MAKE RawWord FROM
     RawWord1  RawWord2  RawWord3  RawWord4  RawWord5
     RawWord6  RawWord7  RawWord8  RawWord9  RawWord10
     RawWord11 RawWord12 RawWord13 RawWord14 RawWord15
     RawWord16 RawWord17 RawWord18 RawWord19 RawWord20
     RawWord21 RawWord22 RawWord23 RawWord24 RawWord25
     RawWord26 RawWord27 RawWord28 RawWord29 RawWord30
     RawWord31 RawWord32 RawWord33 RawWord34 RawWord35
     RawWord36 RawWord37 RawWord38 RawWord39 RawWord40
     RawWord41 RawWord42 RawWord43 RawWord44 RawWord45
     RawWord46 RawWord47 RawWord48 RawWord49 RawWord50
  /INDEX = Position(50)
  /KEEP  = Paragraph Line
  /NULL  = DROP.

STRING Word (A12).
COMPUTE Word=LOWER(RawWord).
DO REPEAT Remove = ' ' '.' '"'.
.  COMPUTE Word = REPLACE(Word,Remove,'').
END REPEAT.
SELECT IF Word NE ' '.
.  /*-- LIST /CASES=30 /*-*/.

* II.C Count occurrences, to create the concordance ... .
DATASET DECLARE Concordance.
AGGREGATE OUTFILE= Concordance
  /BREAK = Paragraph Word
  /Count 'Number of occurrences of this word' = NU.
DATASET ACTIVATE Concordance WINDOW=FRONT.

* III.  Keep only the particularly sought words ... .
* III.A Prepare the keywords for merging ... .
DATASET ACTIVATE Keywords WINDOW=FRONT.
LIST.

NEW FILE.
ADD FILES /FILE=Keywords.
COMPUTE Word = LOWER(LTRIM(Word)).
SORT CASES BY Word.
DATASET NAME KeysReady.
.  /*-- LIST /*-*/.

* III.B Prepare the concordance for merging ... .
NEW FILE.
ADD FILES /FILE=Concordance.
SORT CASES BY Word Paragraph.

* III.C Merge, keep only the keywords, and re-sort ... .
MATCH FILES
  /FILE =*         /IN=InText
  /TABLE=KeysReady /IN=Keyword
  /BY Word.
DATASET NAME KeyWordCounts.
SELECT IF InText AND Keyword.
SORT CASES BY Paragraph Word.
.  /*-- LIST /*-*/.

* IV. Make the keyword counts variables ... .
CASESTOVARS
  /ID      = Paragraph
  /INDEX   = Word
  /GROUPBY = INDEX
  /DROP    = InText Keyword.
LIST.
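For comparison, the same concord-filter-pivot pipeline can be sketched in a few lines of Python with collections.Counter. This is an illustrative sketch over two of the three test paragraphs, not a drop-in replacement; the dict-of-dicts stands in for the CASESTOVARS result.

```python
import re
from collections import Counter

KEYWORDS = {"federal", "federalist", "paper", "words", "i", "instance", "syntax"}

# Two of the three test paragraphs from the appendix (Par. 2 omitted)
paragraphs = {
    "Par. 1": "I am trying to learn about using stat on text sources. "
              "Suppose I have a set of texts and want a count of specific "
              "words in each.",
    "Par. 3": "This syntax creates made up data that looks like what I "
              "would want. I added hyphens to reserved words so that "
              "variables could have legit names.",
}

# II: concord each paragraph; III: keep only keywords; IV: one row each
table = {}
for par, text in paragraphs.items():
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    table[par] = {w: counts[w] for w in sorted(KEYWORDS) if counts[w]}
```

For Par. 1 this yields {'i': 2, 'words': 1}, and for Par. 3 {'i': 2, 'syntax': 1, 'words': 1}, matching those rows of the KeyWordCounts listing above.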