SPSSX Discussion

single and joint frequencies of words


Empi

single and joint frequencies of words

Dear all,

my aim is to calculate the single and joint frequencies of words from texts
saved as a string-variable (“text_var”); each cell of the string-variable
contains multiple sentences (ultimately, I’d like to use these frequencies
to calculate a Jaccard-index to assess the strength of the co-occurrence of
words).

Ideally, the results would indicate per cell (1) how often word “x” occurs
(2) how often word “y” occurs and (3) how often words “x” and “y” occur
together as “xy” in a text.

I assume that the single frequencies of “x” and “y” and the joint frequency
of “xy” could be stored in three new variables - but it is not really clear
to me how to request the quantities.

I think that this syntax
compute var_x =char.index(lower(text), "cats") > 0.
compute var_y =char.index(lower(text), "dogs") > 0.

gives the single frequencies of the words “cats” and “dogs” per text. But I
failed to adjust this syntax (or any other syntax) in order to obtain the
joint frequencies of “cats” and “dogs” – can anybody help me out here???
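
Note that `char.index(...) > 0` yields a 0/1 indicator of whether the search string appears anywhere in the text (including inside longer words), not a count of how often it occurs. As a rough illustration of the difference, sketched in Python rather than SPSS syntax (the function names and the sample text are made up for this example, and whole-word matching is an assumption about what is wanted):

```python
import re

def word_present(text, word):
    """Return 1 if `word` occurs as a whole word in `text`, else 0."""
    pattern = r"\b%s\b" % re.escape(word)
    return int(re.search(pattern, text, flags=re.I) is not None)

def word_count(text, word):
    """Return how many times `word` occurs as a whole word in `text`."""
    pattern = r"\b%s\b" % re.escape(word)
    return len(re.findall(pattern, text, flags=re.I))

sample = "Cats chase dogs. Dogs chase cats. Cats win."
present = word_present(sample, "cats")  # indicator: does the word occur at all?
count = word_count(sample, "cats")      # frequency: how often does it occur?
```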

Thank you very much & regards,
Empi

--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Empi

Re: single and joint frequencies of words

...and of course, the problem is to identify the joint occurrence per
sentence, not just per text - so how can one identify the sentences in a
text where two (or more) words of interest occur together?

Best,
Empi


spss.giesel@yahoo.de

Re: single and joint frequencies of words

In reply to this post by Empi

Dear Empi,

This feels like a Sisyphean task without a programming language like R or Python and some NLP package.
But maybe you can send some text examples and the result you want to achieve.

Regards,
Mario


Kirill Orlov

Re: single and joint frequencies of words

Just break up the text so that each cell holds one word. Your dataset will then have a single, very long column in which the cases are the words in their original sequence. Sentences can be separated by a blank cell or indicated by a separate categorical variable. Then remove waste words (if needed) and apply stemming/lemmatization. Then AUTORECODE the words into numeric codes. After that you can do whatever you want.
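
The restructuring described above might look roughly like this in Python. This is only a sketch under simple assumptions: sentences end at ".", "?" or "!", words consist of letters and apostrophes, and the names `to_long_format` and `autorecode` are made up for illustration. Stop-word removal and stemming/lemmatization are left out.

```python
import re

def to_long_format(texts):
    """Return one (case_id, sentence_no, word_no, word) row per word."""
    rows = []
    for case_id, text in enumerate(texts, start=1):
        # naive sentence split; empty fragments are dropped
        sentences = [s for s in re.split(r"[.?!]", text) if s.strip()]
        for sent_no, sentence in enumerate(sentences, start=1):
            words = re.findall(r"[a-z']+", sentence.lower())
            for word_no, word in enumerate(words, start=1):
                rows.append((case_id, sent_no, word_no, word))
    return rows

def autorecode(rows):
    """Map each distinct word to a numeric code in alphabetical order,
    starting at 1 (similar in spirit to AUTORECODE)."""
    vocab = sorted({row[3] for row in rows})
    return {word: code for code, word in enumerate(vocab, start=1)}

rows = to_long_format(["Dogs and cats. No dogs here?"])
codes = autorecode(rows)
```

Each row keeps the case and sentence numbers, so later steps can group by sentence or by text.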


Empi

Re: single and joint frequencies of words

In reply to this post by spss.giesel@yahoo.de

Hi Mario,

many thanks for your reply - let me try to offer a (hopefully) somewhat more
precise description of my aim.
My question:

Using SPSS, how could one examine if two words within one sentence occur
together or not?

Let's imagine a researcher is interested in counting how often the
words "illegal" and "immigrant*" occur together, or not, within sentences
in a politician's tweets.

To count the single occurrences of "illegal" and "immigrant" per
tweet, my earlier example should suffice:

compute var_immigrants =char.index(lower(text), "immigrants") > 0.
compute var_illegal =char.index(lower(text), "illegal") > 0.

But how can the char.index function - or any other function - be used to
(a) restrict the search to single sentences (as delimited by a dot "." or
maybe a question mark "?") and

(b) indicate the joint occurrence of the words, such as the phrase
"illegal immigrants"?

We could then calculate Jaccard's index as
f_illegal&immigrant / (f_illegal + f_immigrant - f_illegal&immigrant)
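
The formula above can be written as a small function. A minimal sketch in Python, assuming the three frequencies are counts of sentences (or texts) that contain the first word, the second word, and both; the function name `jaccard` and the example numbers are made up for illustration:

```python
def jaccard(f_x, f_y, f_xy):
    """Jaccard index from the single frequencies f_x and f_y
    and the joint frequency f_xy."""
    union = f_x + f_y - f_xy
    if union == 0:
        # neither word occurs anywhere, so the index is undefined
        return None
    return f_xy / union

# hypothetical counts: 5 sentences with "illegal", 4 with "immigrant", 3 with both
example = jaccard(5, 4, 3)
```

With these hypothetical counts the index is 3 / (5 + 4 - 3) = 0.5.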

PS: Just let me know if I should provide some real tweets :)


Maguin, Eugene

Re: single and joint frequencies of words

Start where Kirill left off, but with a modification: number each message and each sentence within a message. You have a dictionary of words and their numbers. In that long, single-variable file, use AGGREGATE to find the first occurrence of word x and of word y in each sentence. After filling in the sysmis values where one or both words were not in a sentence, you have a crosstab.
Gene Maguin
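
The end result of the aggregation described above is a 2x2 table of sentences by whether each word is present. A rough Python sketch of that table follows; it is not the AGGREGATE-based procedure itself. The function name `sentence_crosstab`, the naive sentence split at ".", "?" and "!", and the sample texts are assumptions for illustration.

```python
import re
from collections import Counter

def sentence_crosstab(texts, word_x, word_y):
    """Count sentences by (contains word_x, contains word_y) over all texts."""
    table = Counter()
    for text in texts:
        for sentence in re.split(r"[.?!]", text):
            if not sentence.strip():
                continue  # skip empty fragments left by the split
            words = set(re.findall(r"[a-z']+", sentence.lower()))
            table[(word_x in words, word_y in words)] += 1
    return table

texts = [
    "dogs and cats are enemies. but dogs sometimes like cats.",
    "there are no dogs here.",
    "are there cats or dogs here? Maybe just cats.",
    "there are elephants.",
]
table = sentence_crosstab(texts, "dogs", "cats")
```

The cell `table[(True, True)]` is the joint frequency; the cells with `False` correspond to the filled-in missing values.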


spss.giesel@yahoo.de

Re: single and joint frequencies of words

In reply to this post by Kirill Orlov

Hi, Empi,

To answer your question directly: it is too complicated, too time-consuming and too error-prone to do this with SPSS syntax alone. If sentences are your basic analytical units, I would start by splitting your tweets with a regular expression, e.g. in Python
---
import re
text = "Are there cats or dogs here? Maybe just cats."  # example tweet
sep = r"[.?!]"  # separator is '.' or '?' or '!'
# split at each separator and drop empty fragments
x = [s.strip() for s in re.split(sep, text) if s.strip()]
---

You will then get separate sentences, and several rows per person.

If you would rather not bother with programming, you can take a more hands-on approach:
•	Copy your variable content into Notepad++
•	Select all
•	Go to Search -> Replace
•	Find what: the sentence separator (repeat the replacement for each of . ? ! etc.)
•	Search mode: Extended
•	Replace with: the same sentence separator followed by “\n”
This will insert a new line after each separator.

You can copy the result into a new SPSS data file. Then you can use your cats&dogs syntax. Of course, you will lose the link to the other variables in the dataset. But, sorry, there’s no easy way I'm aware of to do it otherwise.
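
One possible way to keep the link to the other variables is to carry a case identifier along with every sentence when splitting. A minimal Python sketch, assuming each case has some ID; the function name `split_with_id` and the IDs 101 and 102 are made up for illustration:

```python
import re

def split_with_id(cases):
    """cases: iterable of (case_id, text) pairs.
    Returns a list of (case_id, sentence_no, sentence) rows."""
    rows = []
    for case_id, text in cases:
        # naive split at runs of '.', '?' or '!'; empty fragments are dropped
        parts = [p.strip() for p in re.split(r"[.?!]+", text) if p.strip()]
        for sentence_no, sentence in enumerate(parts, start=1):
            rows.append((case_id, sentence_no, sentence))
    return rows

rows = split_with_id([
    (101, "Are there cats or dogs here? Maybe just cats."),
    (102, "There are elephants."),
])
```

The resulting rows could be matched back to the original file on the case ID. A split this simple will also break at abbreviations, decimal numbers and URLs, which are common in tweets.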

Mario Giesel
Munich, Germany


spss.giesel@yahoo.de

Re: single and joint frequencies of words

Sorry, it looks as though my answer was cut off. Second try:

To answer your question directly: it is too complicated, too time-consuming and too error-prone to do this with SPSS syntax alone. If sentences are your basic analytical units, I would start by splitting your tweets with a regular expression, e.g. in Python
---
import re
text = "Are there cats or dogs here? Maybe just cats."  # example tweet
sep = r"[.?!]"  # separator is '.' or '?' or '!'
# split at each separator and drop empty fragments
x = [s.strip() for s in re.split(sep, text) if s.strip()]
---

You will then get separate sentences, and several rows per person.

If you would rather not bother with programming, you can take a more hands-on approach:
· Copy your variable content into Notepad++
· Select all
· Go to Search -> Replace
· Find what: the sentence separator (repeat the replacement for each of . ? ! etc.)
· Search mode: Extended
· Replace with: the same sentence separator followed by “\n”
This will insert a new line after each separator.

You can copy the result into a new SPSS data file. Then you can use your cats&dogs syntax. Of course, you will lose the link to the other variables in the dataset. But there’s no easy way to do it otherwise.


Mario Giesel
Munich, Germany


Jon Peck

Re: single and joint frequencies of words

I posted a solution for the two-word problem on the IBM Predictive Analytics site, but I am copying it here. It uses the SPSSINC TRANS extension command with a small Python function to find counts of joint occurrences per sentence of two specified words. It could be generalized in a number of ways.

* Encoding: UTF-8.
data list list/text(a60).
begin data
"dogs and cats are enemies. but dogs sometimes like cats."
"there are no dogs here."
"are there cats or dogs here? Maybe just cats."
"there are elephants."
end data.
dataset name text.

begin program.
import re
def counter(text, word1, word2):
    # split the text into sentences ending in "." or "?"
    sentences = re.findall(r"(.*?)(?:\.|\?)", text)
    paircount = 0
    for s in sentences:
        # whole-word, case-insensitive search for each word
        has1 = re.search(r"\b%s\b" % word1.strip(), s, flags=re.I) is not None
        has2 = re.search(r"\b%s\b" % word2.strip(), s, flags=re.I) is not None
        # count the sentence if both words occur in it
        if has1 and has2:
            paircount = paircount + 1
    return paircount
end program.

spssinc trans result=counts
/formula 'counter(text, word1="dogs", word2="cats")'.
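
One way the function above could be generalized is to return the single frequencies along with the joint one, so that a Jaccard index can be computed afterwards. The following is a plain-Python sketch, not part of the posted solution: the name `counts3` is made up, and it splits at ".", "?" and "!" with `re.split`, so a trailing sentence without final punctuation is also counted, unlike the `findall` pattern above.

```python
import re

def counts3(text, word1, word2):
    """Return (n1, n2, n12): the number of sentences in `text` that
    contain word1, word2, and both words, respectively."""
    sentences = [s for s in re.split(r"[.?!]", text) if s.strip()]
    n1 = n2 = n12 = 0
    for s in sentences:
        # whole-word, case-insensitive matches; re.escape guards special characters
        has1 = re.search(r"\b%s\b" % re.escape(word1.strip()), s, flags=re.I) is not None
        has2 = re.search(r"\b%s\b" % re.escape(word2.strip()), s, flags=re.I) is not None
        n1 += has1
        n2 += has2
        n12 += has1 and has2
    return n1, n2, n12

first = counts3("dogs and cats are enemies. but dogs sometimes like cats.", "dogs", "cats")
third = counts3("are there cats or dogs here? Maybe just cats.", "dogs", "cats")
```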


Jon K Peck
[hidden email]

David Marso-2

Re: single and joint frequencies of words

In reply to this post by Empi

Parse into a single new record per word retaining caseid, Cartesian match records within each caseid, aggregate... Done. All of these steps can be found in this group's archives.
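
The parse, match and aggregate steps can be sketched in Python as follows. This is only an illustration of the idea, not the SPSS procedure from the archives: the function name `pair_counts` is made up, each text is treated as one case, and each unordered pair of distinct words is counted once per case.

```python
import re
from collections import Counter
from itertools import combinations

def pair_counts(texts):
    """Count, over all texts, how many texts contain each unordered word pair."""
    pairs = Counter()
    for text in texts:
        # parse: one entry per distinct word in this case
        words = sorted(set(re.findall(r"[a-z']+", text.lower())))
        # match: every pair of distinct words within the case
        # aggregate: add one to the running count for each pair
        pairs.update(combinations(words, 2))
    return pairs

pairs = pair_counts(["dogs and cats", "cats and birds"])
```

Pairs are stored in alphabetical order, so the count for "dogs" and "cats" is found under `("cats", "dogs")`.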
