Do python procedures already exist for dealing with text.

Do python procedures already exist for dealing with text.

Art Kendall
I am trying to learn about using stat on text sources.
Suppose I have a set of texts and want a count of specific words in each. 

For instance, the classical work on the 85 Federalist papers used 30 words. This particular input is all in one file; each paper starts with
"FEDERALIST No. ", e.g. "FEDERALIST No. 1" for the first. There would be 1 record for each text; there would be 31 variables: an id for the paper and the counts for the 30 words.


This syntax creates made-up data that looks like what I would want. I added underscores to reserved words so that the variables could have legitimate names.

new file.
input program.
vector wordcount (30,f8).
loop id = 1 to 85.
loop #w = 1 to 30.
compute wordcount(#w) = rnd(rv.poisson(50)).
end loop.
end case.
end loop.
end file.
end input program.
EXECUTE .
dataset name kounts30.
rename variables (wordcount1 to wordcount30 =
 upon also an by_ of on there this to_ although
 both enough while whilst always though commonly consequently
 considerable according apt direction innovation language vigor
 kind matter particularly probability work).

There are more extensive applications I would be interested in, such as: as many variables as there are words occurring in any of the texts (not just the 30); using 2-, 3-, and 4-word strings that occur within sentences; the texts being in separate files, in PDF, etc. So if I can avoid re-inventing the wheel I would like to hear about existing Python functions, etc. But I would like to start with this kind of application.
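The 2-, 3-, and 4-word strings mentioned above can be counted with a short stretch of plain Python. This is only a minimal sketch (the function name and the naive sentence splitting are illustrative, not an existing routine):

```python
import re
from collections import Counter

def ngram_counts(text, n=2):
    """Count n-grams occurring within (naively split) sentences."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

print(ngram_counts("To be or not to be. To be!", 2)["to be"])  # 3
```

Splitting on sentence punctuation first keeps n-grams from spanning sentence boundaries, which matches the within-sentence requirement.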

I can send an ASCII file with the text if anyone wants it.

Art Kendall
Social Research Consultants
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Re: Do python procedures already exist for dealing with text.

John F Hall
Check out Nigel Gilbert at Surrey University.  He has special software for text searching from qualitative research.  http://www.soc.surrey.ac.uk/staff/ngilbert/index.html
Re: Do python procedures already exist for dealing with text.

Jon K Peck
In reply to this post by Art Kendall

Python has extensive functionality for working with text, both directly as simple strings and as regular expressions, which can be smart about word boundaries, case sensitivity and such.  A simple way to do this, if you have the data as long strings in SPSS, would be to define a Python function that returns a count for each of a list of expressions of interest.  Then use the SPSSINC TRANS extension command to create a variable for each.  I can do an example to get you started.

If you are just working with text files, a similar sort of counting function would just read those and write a record of word counts for each.
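As a rough sketch of the kind of counting function described here (the names are made up, and the wiring into SPSSINC TRANS is omitted), one might write:

```python
import re

def make_counter(words):
    """Build a function that returns whole-word, case-insensitive
    counts for each word in `words`, given one text string."""
    patterns = [re.compile(r"\b%s\b" % re.escape(w), re.IGNORECASE)
                for w in words]
    def count(text):
        text = text or ""          # tolerate missing values
        return [len(p.findall(text)) for p in patterns]
    return count

counter = make_counter(["upon", "whilst", "vigor"])
print(counter("Whilst upon the hill, upon reflection"))  # [2, 1, 0]
```

Compiling the patterns once up front keeps the per-case work cheap when the same counter is applied to every record.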

HTH,

Jon Peck
SPSS, an IBM Company

Note: my email has changed.   Please update your address book.
[hidden email]
312-651-3435




Re: Do python procedures already exist for dealing with text.

Albert-Jan Roskam
Hi,

You can find a free sample chapter, which happens to be about regexes, at:
http://www.informit.com/content/images/9780137129294/samplepages/0137129297_Sample.pdf

Also, the Python source distribution includes the program redemo.py, which I find very helpful for debugging regexes. Regexes are like a mini-language: extremely powerful, but they can look pretty cryptic and intimidating. How about this one? ;-)
match = re.search(r"""(?<![-\w])
                      (?:(?:en)?coding|charset)
                      (?:=(["'])?([-\w]+)(?(1)\1)
                      |:\s*([-\w]+))""".encode("utf8"),
                  binary, re.IGNORECASE|re.VERBOSE)
Luckily, on most occasions (including what you want to achieve) they're far easier. *phew*
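For simple word counting the regex really can stay small. A minimal illustration of why the \b word boundary matters, compared with a plain substring count:

```python
import re

text = "He said, 'Said and resaid.'"
print(text.lower().count("said"))                # 3 -- also matches inside 'resaid'
print(len(re.findall(r"\bsaid\b", text, re.I)))  # 2 -- whole words only
```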


Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before you criticize someone, walk a mile in their shoes, that way
when you do criticize them, you're a mile away and you have their shoes!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Re: Do python procedures already exist for dealing with text.

Art Kendall
In reply to this post by John F Hall
I went to that link and found some interesting things, but searching the page for "text" or for "qualitative" turned up nothing.

Would you please give more specifics?

Art

Art Kendall
Social Research Consultants
Re: Do python procedures exist for dealing with text

Richard Ristow
In reply to this post by Art Kendall
At 11:11 AM 10/23/2009, Art Kendall wrote:

I am trying to learn about using stat on text sources.
Suppose I have a set of texts and want a count of specific words in each.

This is 'concording' - preparing a concordance, a list of words used in a text, with the number of times they are used. You want to keep the counts of only a specific set of words.

I don't know how good the Python solutions are, or whether they are working for you. However, data-stream packages like SPSS (and SAS) are well suited to concording.

Below is a solution in native SPSS. Counting for the concordance (AGGREGATE), keeping counts only for the selected words (MATCH FILES), and making each count a variable (CASESTOVARS) are natural SPSS operations. The SPSS transformation language and VARSTOCASES serve to break the text into words, though I'm sure Python could do the parsing more easily.
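For comparison, the parse/unroll/aggregate pipeline described above can be approximated in a few lines of standard-library Python (a sketch only; the names are illustrative):

```python
import re
from collections import Counter

def concord(paragraphs, keywords):
    """Per-paragraph word counts, restricted to the chosen keywords."""
    keep = {w.lower() for w in keywords}
    out = {}
    for name, text in paragraphs.items():
        words = re.findall(r"[a-z]+", text.lower())
        out[name] = {w: c for w, c in Counter(words).items() if w in keep}
    return out

paras = {"Par. 1": "words and more words in each.",
         "Par. 2": "For instance the Federalist papers."}
print(concord(paras, ["words", "instance", "federalist"]))
```

The regex split plays the role of the VECTOR/LOOP parse, Counter plays AGGREGATE, and the keyword filter plays the MATCH FILES table lookup.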

*  II.   Concord the text                                        ... .

List
|-----------------------------|---------------------------|
|Output Created               |30-OCT-2009 01:46:03       |
|-----------------------------|---------------------------|
[Text]
Paragraph Line Text

Par. 1       1 I am trying to learn about using stat on text sources.
Par. 1       2 Suppose I have a set of texts and want a count of specific
Par. 1       3 words in each.
Par. 2       1 For instance the classical work on the 85 Federalist papers
Par. 2       2 used 30 words. This particular input is all in one file,
Par. 2       3 each paper starts with "FEDERALIST No. " e.g. "FEDERALIST
Par. 2       4 No. 1" for the first. There would be 1 record for each text.
Par. 2       5 there would be 31 variables an id for the paper and the
Par. 2       6 counts for the 30 words.
Par. 3       1 This syntax creates made up data that looks like what I
Par. 3       2 would want. I added hyphens to reserved words so that
Par. 3       3 variables could have legit names.

Number of cases read:  12    Number of cases listed:  12

 
NEW FILE.
ADD FILES  /FILE=Text.
DATASET NAME     WordList.

*  II.A  Parse the text lines into words.                        ... .

STRING     #Buffer  (A100).
STRING     #OneWord (A15).
NUMERIC    #Index   (F3).
NUMERIC    #WordEnd (F3).

COMPUTE    #Buffer = LTRIM(Text).

VECTOR     RawWord  (50A12).

LOOP       #Index   =  1 TO 50.
.  COMPUTE #WordEnd = INDEX (#Buffer,' ').
.  COMPUTE #OneWord = SUBSTR(#Buffer,1,#WordEnd).
.  COMPUTE RawWord (#Index)
                    = #OneWord.
.  COMPUTE #Buffer  = LTRIM(SUBSTR(#Buffer,#WordEnd)).
END LOOP
        IF #Buffer EQ ' '.

EXECUTE /* This appears to be necessary, to activate the file    */  .


*  II.B  Unroll, to one word per line, and                       ... .
*        standardize the forms of words.                         ... .  

VARSTOCASES
      /MAKE   RawWord
       FROM   RawWord1  RawWord2  RawWord3  RawWord4  RawWord5
              RawWord6  RawWord7  RawWord8  RawWord9  RawWord10
              RawWord11 RawWord12 RawWord13 RawWord14 RawWord15
              RawWord16 RawWord17 RawWord18 RawWord19 RawWord20
              RawWord21 RawWord22 RawWord23 RawWord24 RawWord25
              RawWord26 RawWord27 RawWord28 RawWord29 RawWord30
              RawWord31 RawWord32 RawWord33 RawWord34 RawWord35
              RawWord36 RawWord37 RawWord38 RawWord39 RawWord40
              RawWord41 RawWord42 RawWord43 RawWord44 RawWord45
              RawWord46 RawWord47 RawWord48 RawWord49 RawWord50
     /INDEX = Position(50)
     /KEEP  = Paragraph Line
     /NULL  = DROP.

 
Variables to Cases
|-----------------------------|---------------------------|
|Output Created               |30-OCT-2009 01:46:04       |
|-----------------------------|---------------------------|
[WordList]
 
Generated Variables
|--------|------|
|Name    |Label |
|--------|------|
|Position|<none>|
|RawWord |<none>|
|--------|------|
 
Processing Statistics
|-------------|--|
|Variables In |53|
|Variables Out|4 |
|-------------|--|

 
STRING  Word      (A12).
COMPUTE Word=LOWER(RawWord).
DO  REPEAT  Remove = ' '  '.'  '"'.
.   COMPUTE Word   = REPLACE(Word,Remove,'').
END REPEAT.

SELECT IF   Word NE ' '.


*  II.C  Count occurrences, to create the concordance            ... .

DATASET   DECLARE  Concordance.
AGGREGATE OUTFILE= Concordance
   /BREAK = Paragraph Word
   /Count 'Number of occurrences of this word' = NU.

DATASET   ACTIVATE Concordance WINDOW=FRONT.


*  III.  Keep only the particularly sought words                 ... .

*  III.A Prepare the keywords for merging                        ... .

DATASET ACTIVATE Keywords WINDOW=FRONT.
LIST.
List
|-----------------------------|---------------------------|
|Output Created               |30-OCT-2009 01:46:05       |
|-----------------------------|---------------------------|
[Keywords]
 
Word

Federal
Federalist
paper
words
I
instance
syntax

Number of cases read:  7    Number of cases listed:  7

 
NEW FILE.
ADD FILES /FILE=Keywords.

COMPUTE  Word = LOWER(LTRIM(Word)).
SORT CASES BY Word.
DATASET  NAME KeysReady.


*  III.B  Prepare the concordance for merging                    ... .

NEW FILE.
ADD FILES /FILE=Concordance.
SORT CASES BY Word Paragraph.

*  III.C  Merge, keep only the keywords, and re-sort             ... .

MATCH FILES
  /FILE =*         /IN=InText
  /TABLE=KeysReady /IN=Keyword
  /BY   Word.

DATASET NAME  KeyWordCounts.
SELECT  IF    InText AND Keyword.
SORT CASES BY Paragraph Word.


*  IV.    Make the keyword counts variables                      ... .

CASESTOVARS
 /ID      = Paragraph
 /INDEX   = Word
 /GROUPBY = INDEX
 /DROP    = InText Keyword.

Cases to Variables
|-----------------------------|---------------------------|
|Output Created               |30-OCT-2009 01:46:09       |
|-----------------------------|---------------------------|
[KeyWordCounts]
 
Generated Variables
|-----------|----------|--------------------------|
|Original   |Word      |Result                    |
|Variable   |          |----------|---------------|
|           |          |Name      |Label          |
|-----------|----------|----------|---------------|
|Count      |federalist|federalist|federalist:    |
|Number of  |          |          |Number of      |
|occurrences|          |          |occurrences of |
|of this    |          |          |this word      |
|word       |----------|----------|---------------|
|           |i         |i         |i: Number of   |
|           |          |          |occurrences of |
|           |          |          |this word      |
|           |----------|----------|---------------|
|           |instance  |instance  |instance:      |
|           |          |          |Number of      |
|           |          |          |occurrences of |
|           |          |          |this word      |
|           |----------|----------|---------------|
|           |paper     |paper     |paper: Number  |
|           |          |          |of occurrences |
|           |          |          |of this word   |
|           |----------|----------|---------------|
|           |syntax    |syntax    |syntax: Number |
|           |          |          |of occurrences |
|           |          |          |of this word   |
|           |----------|----------|---------------|
|           |words     |words     |words: Number  |
|           |          |          |of occurrences |
|           |          |          |of this word   |
|-----------|----------|----------|---------------|
 
Processing Statistics
|---------------|---|
|Cases In       |9  |
|Cases Out      |3  |
|---------------|---|
|Cases In/Cases |3.0|
|Out            |   |
|---------------|---|
|Variables In   |5  |
|Variables Out  |7  |
|---------------|---|
|Index Values   |6  |
|---------------|---|

 
LIST.
List
|-----------------------------|---------------------------|
|Output Created               |30-OCT-2009 01:46:09       |
|-----------------------------|---------------------------|
[KeyWordCounts]
 
Paragraph federalist       i instance   paper  syntax   words

Par. 1            .        2        .       .       .       1
Par. 2            3        .        1       2       .       2
Par. 3            .        2        .       .       1       1

Number of cases read:  3    Number of cases listed:  3
============================
APPENDIX: Code and test data
============================
*  C:\Documents and Settings\Richard\My Documents                              .
*    \Technical\spssx-l\Z-2009d\                                               .
*     2009-10-23 Kendall - Do python procedures exist for dealing with text.SPS.

*  In response to posting (original subject line shortened):         .
*  Date:    Fri, 23 Oct 2009 11:11:02 -0400                          .
*  From:    Art Kendall <[hidden email]>                          .
*  Subject: Do python procedures exist for dealing with text         .
*  To:      [hidden email]                                 .
*                                                                    .
*  Original subject line:                                            .
*  Subject: Do python procedures already exist for dealing with text..

*  "I am trying to learn about using stat on text sources. Suppose   .
*  I have a set of texts and want a count of specific words in       .
*  each. For instance the classical work on the 85 Federalist        .
*  papers used 30 words."                                            .
*                                                                    .
*  This input is all in one file, each paper starts with             .
*  "FEDERALIST No. " e.g. "FEDERALIST No. 1". There would be [one]   .
*  record for each text, [with] 31 variables: an id for the paper    .
*  and the counts for the 30 words."                                 .

*  ................................................................. .
*  .................   Test data               ..................... .

*  I.A   Text to be concorded                                    ... .

NEW FILE.
INPUT PROGRAM.
.  STRING  Paragraph (A6).
.  NUMERIC Line      (F3).
.  DATA LIST FIXED/
    Paragraph 05-10 (A)
    Text      12-75 (A).
END INPUT PROGRAM.
*---|---10----|---20----|---30----|---40----|---50----|---60----|---70.
BEGIN DATA
    Par. 1 I am trying to learn about using stat on text sources.
    Par. 1 Suppose I have a set of texts and want a count of specific
    Par. 1 words in each. 
    Par. 2 For instance the classical work on the 85 Federalist papers
    Par. 2 used 30 words. This particular input is all in one file,
    Par. 2 each paper starts with "FEDERALIST No. " e.g. "FEDERALIST
    Par. 2 No. 1" for the first. There would be 1 record for each text.
    Par. 2 there would be 31 variables an id for the paper and the
    Par. 2 counts for the 30 words.
    Par. 3 This syntax creates made up data that looks like what I
    Par. 3 would want. I added hyphens to reserved words so that
    Par. 3 variables could have legit names.
END DATA.  
DO IF    $CASENUM EQ 1.
.  COMPUTE Line = 1.
ELSE IF  Paragraph NE LAG(Paragraph).
.  COMPUTE Line = 1.
ELSE.
.  COMPUTE Line = LAG(Line) + 1.
END IF.

DATASET NAME     Text.  
.  /*-- LIST /*-*/.

*  I.B   Particular words to be counted                          ... .

NEW FILE.
DATA LIST FREE /Word(A12).
BEGIN DATA
Federal Federalist paper words I instance syntax
END DATA.
DATASET NAME     Keywords.
.  /*-- LIST /*-*/.


*  .................   Post after this point   ..................... .
*  ................................................................. .

*  II.   Concord the text                                        ... .

DATASET ACTIVATE Text WINDOW=FRONT.
LIST.

NEW FILE.
ADD FILES  /FILE=Text.
DATASET NAME     WordList.

*  II.A  Parse the text lines into words.                        ... .

STRING     #Buffer  (A100).
STRING     #OneWord (A15).
NUMERIC    #Index   (F3).
NUMERIC    #WordEnd (F3).

COMPUTE    #Buffer = LTRIM(Text).

VECTOR     RawWord  (50A12).

LOOP       #Index   =  1 TO 50.
.  COMPUTE #WordEnd = INDEX (#Buffer,' ').
.  COMPUTE #OneWord = SUBSTR(#Buffer,1,#WordEnd).
.  COMPUTE RawWord (#Index)
                    = #OneWord.
.  COMPUTE #Buffer  = LTRIM(SUBSTR(#Buffer,#WordEnd)).
END LOOP
        IF #Buffer EQ ' '.

EXECUTE /* This appears to be necessary, to activate the file    */  .


*  II.B  Unroll, to one word per line, and                       ... .
*        standardize the forms of words.                         ... .  

VARSTOCASES 
      /MAKE   RawWord
       FROM   RawWord1  RawWord2  RawWord3  RawWord4  RawWord5
              RawWord6  RawWord7  RawWord8  RawWord9  RawWord10
              RawWord11 RawWord12 RawWord13 RawWord14 RawWord15
              RawWord16 RawWord17 RawWord18 RawWord19 RawWord20
              RawWord21 RawWord22 RawWord23 RawWord24 RawWord25
              RawWord26 RawWord27 RawWord28 RawWord29 RawWord30
              RawWord31 RawWord32 RawWord33 RawWord34 RawWord35
              RawWord36 RawWord37 RawWord38 RawWord39 RawWord40
              RawWord41 RawWord42 RawWord43 RawWord44 RawWord45
              RawWord46 RawWord47 RawWord48 RawWord49 RawWord50
     /INDEX = Position(50)
     /KEEP  = Paragraph Line
     /NULL  = DROP.
    
STRING  Word      (A12).
COMPUTE Word=LOWER(RawWord).
DO  REPEAT  Remove = ' '  '.'  '"'. 
.   COMPUTE Word   = REPLACE(Word,Remove,'').
END REPEAT.

SELECT IF   Word NE ' '.

.  /*-- LIST /CASES=30 /*-*/.


*  II.C  Count occurrences, to create the concordance            ... .

DATASET   DECLARE  Concordance.
AGGREGATE OUTFILE= Concordance
   /BREAK = Paragraph Word
   /Count 'Number of occurrences of this word' = NU.
DATASET   ACTIVATE Concordance WINDOW=FRONT.  


*  III.  Keep only the particularly sought words                 ... .

*  III.A Prepare the keywords for merging                        ... .

DATASET ACTIVATE Keywords WINDOW=FRONT.
LIST.

NEW FILE.
ADD FILES /FILE=Keywords.

COMPUTE  Word = LOWER(LTRIM(Word)).
SORT CASES BY Word.
DATASET  NAME KeysReady.
 
.  /*-- LIST /*-*/.


*  III.B  Prepare the concordance for merging                    ... .

NEW FILE.
ADD FILES /FILE=Concordance.
SORT CASES BY Word Paragraph.


*  III.C  Merge, keep only the keywords, and re-sort             ... .

MATCH FILES
  /FILE =*         /IN=InText
  /TABLE=KeysReady /IN=Keyword
  /BY   Word.
 
DATASET NAME  KeyWordCounts.
SELECT  IF    InText AND Keyword.
SORT CASES BY Paragraph Word.
 
.  /*-- LIST /*-*/.


*  IV.    Make the keyword counts variables                      ... .

CASESTOVARS
 /ID      = Paragraph
 /INDEX   = Word
 /GROUPBY = INDEX
 /DROP    = InText Keyword.

LIST.
