Dear all,

my aim is to calculate the single and joint frequencies of words from texts saved as a string variable (“text_var”); each cell of the string variable contains multiple sentences. (Ultimately, I’d like to use these frequencies to calculate a Jaccard index to assess the strength of the co-occurrence of words.)

Ideally, the results would indicate per cell (1) how often word “x” occurs, (2) how often word “y” occurs, and (3) how often words “x” and “y” occur together as “xy” in a text. I assume that the single frequencies of “x” and “y” and the joint frequency of “xy” could be stored in three new variables, but it is not really clear to me how to request the quantities.

I think that this syntax

compute var_x = char.index(lower(text), "cats") > 0.
compute var_y = char.index(lower(text), "dogs") > 0.

gives the single frequencies of the words “cats” and “dogs” per text. But I failed to adjust this syntax (or any other syntax) in order to obtain the joint frequencies of “cats” and “dogs”. Can anybody help me out here?

Thank you very much & regards,
Empi

--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/
=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
...and of course, the problem is to identify the joint occurrence per sentence, not just per text. So how can one identify the sentences in each text where two (or more) words of interest occur together?

Best,
Empi
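A side note on the syntax in the first post: `char.index(...) > 0` yields a 0/1 presence flag per text rather than a count of occurrences. The following is a minimal Python sketch of the difference, assuming whole-word, case-insensitive matching is what is wanted. The helper names `count_word` and `has_word` are illustrative and do not come from the thread.

```python
import re

def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def has_word(text, word):
    """Presence flag (0/1): the analogue of char.index(lower(text), word) > 0,
    except that this version matches whole words only."""
    return int(count_word(text, word) > 0)

text = "Dogs and cats are enemies. But dogs sometimes like cats."
print(count_word(text, "dogs"), has_word(text, "dogs"))
```

Note that the SPSS version is a substring search, so it would also flag "hotdogs" for "dogs"; the sketch above would not.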
In reply to this post by Empi
Dear Empi,

this feels like a Sisyphean task without using a programming language like R or Python and some NLP package. But maybe you can send some text examples and the result you want to achieve.

Regards,
Mario
(In reply to Empi's post of Friday, 25 October 2019, 12:18:17 CEST.)
Just break up the text: make each cell one word. You will get a single,
very long column in your dataset, in which the cases are the words in
their original sequence. Sentences can be separated by a blank cell or
indicated by a separate categorical variable. Then, if needed, remove
waste words and apply stemming/lemmatization. Then AUTORECODE the words
into numeric codes. After that you can do everything you want.
(In reply to the message from [hidden email] of 25.10.2019 14:14.)
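To make the "one word per case" layout concrete, here is a rough Python sketch of what the resulting long file could look like. It assumes sentences end with ".", "?" or "!" and that words are runs of letters and digits; the function names `to_long` and `recode` are made up for illustration, and `recode` only imitates AUTORECODE in spirit (alphabetical numeric codes starting at 1). Stop-word removal and stemming are left out.

```python
import re

def to_long(texts):
    """Return one record per word: (case_id, sentence_no, word)."""
    rows = []
    for case_id, text in enumerate(texts, start=1):
        # Split on sentence-ending punctuation and drop empty fragments.
        sentences = [s for s in re.split(r"[.?!]+", text) if s.strip()]
        for sent_no, sentence in enumerate(sentences, start=1):
            for word in re.findall(r"\w+", sentence.lower()):
                rows.append((case_id, sent_no, word))
    return rows

def recode(rows):
    """Map each distinct word to a numeric code in alphabetical order."""
    vocab = sorted({word for _, _, word in rows})
    return {word: code for code, word in enumerate(vocab, start=1)}

rows = to_long(["dogs and cats are enemies. but dogs sometimes like cats.",
                "there are no dogs here."])
codes = recode(rows)
print(rows[:3])
print(codes["cats"], codes["dogs"])
```

Here the sentence number plays the role of the separate categorical variable mentioned above.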
In reply to this post by spss.giesel@yahoo.de
Hi Mario,

many thanks for your reply. Let me try to offer a (hopefully) somewhat more precise description of my aim.

My question: using SPSS, how can one examine whether two words occur together within one sentence? Let's imagine a researcher is interested in counting how often the words "illegal" and "immigrant*" occur together in sentences in tweets from a politician. To count the single occurrences of "illegal" and "immigrant" per tweet, my earlier example should suffice:

compute var_immigrants = char.index(lower(text), "immigrants") > 0.
compute var_illegal = char.index(lower(text), "illegal") > 0.

But how can the char.index function, or any other function, be used (a) to restrict the search to single sentences (as delimited by a dot "." or maybe a question mark "?") and (b) to indicate the joint occurrence of the words, such as the phrase "illegal immigrants"?

We could then calculate Jaccard's index as

J = f_illegal&immigrant / (f_illegal + f_immigrant - f_illegal&immigrant)

PS: Just let me know if I should provide some real tweets :)
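For illustration, the Jaccard formula above can be sketched in Python with the sentence as the counting unit. This is only a sketch under stated assumptions: sentences are taken to end with ".", "?" or "!", matching is whole-word and case-insensitive (so a wildcard such as "immigrant*" would need the trailing `\b` dropped), and the names `sentence_flags` and `jaccard` are invented here. The sample texts are the cats-and-dogs lines that appear later in this thread.

```python
import re

def sentence_flags(text, word_x, word_y):
    """Yield (has_x, has_y) for each non-empty sentence of `text`."""
    px = re.compile(r"\b" + re.escape(word_x) + r"\b", re.IGNORECASE)
    py = re.compile(r"\b" + re.escape(word_y) + r"\b", re.IGNORECASE)
    for sentence in re.split(r"[.?!]+", text):
        if sentence.strip():
            yield bool(px.search(sentence)), bool(py.search(sentence))

def jaccard(texts, word_x, word_y):
    """Jaccard index f_xy / (f_x + f_y - f_xy), counted over sentences."""
    f_x = f_y = f_xy = 0
    for text in texts:
        for has_x, has_y in sentence_flags(text, word_x, word_y):
            f_x += has_x
            f_y += has_y
            f_xy += has_x and has_y
    union = f_x + f_y - f_xy
    # Undefined when neither word occurs anywhere.
    return f_xy / union if union else float("nan")

texts = [
    "dogs and cats are enemies. but dogs sometimes like cats.",
    "there are no dogs here.",
    "are there cats or dogs here? Maybe just cats.",
    "there are elephants.",
]
print(jaccard(texts, "dogs", "cats"))  # 3 joint / (4 + 4 - 3) sentences
```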
Start where Kirill left off, but with a modification: number each message and each sentence within the message. You have a dictionary of words and their numbers. In that long, single-variable file, use AGGREGATE to get the first occurrence of word x and of word y within each sentence. After filling in the sysmis values for sentences where one or both words did not occur, you have the basis for a crosstab.
Gene Maguin

(In reply to Empi's message of Friday, October 25, 2019 10:17 AM; subject: Re: singe and joint frequencies of words.)
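The aggregate-then-crosstab idea can be mimicked outside SPSS roughly as follows. This is a hedged sketch, not the AGGREGATE syntax itself: it uses a simple presence flag per sentence in place of "first occurrence", the default of 0 stands in for filling the sysmis values, and the records and the function name `crosstab` are invented for illustration.

```python
from collections import Counter

# Long-format records: (message_no, sentence_no, word). Illustrative data only.
records = [
    (1, 1, "dogs"), (1, 1, "and"), (1, 1, "cats"),
    (1, 2, "dogs"), (1, 2, "like"), (1, 2, "cats"),
    (2, 1, "no"), (2, 1, "dogs"), (2, 1, "here"),
    (3, 1, "just"), (3, 1, "cats"),
    (4, 1, "elephants"),
]

def crosstab(records, word_x, word_y):
    """Aggregate to one row per (message, sentence), then tabulate presence."""
    flags = {}
    for msg, sent, word in records:
        has = flags.setdefault((msg, sent), [0, 0])  # absent words stay 0
        if word == word_x:
            has[0] = 1
        if word == word_y:
            has[1] = 1
    # Keys are (has_x, has_y); values are numbers of sentences.
    return Counter((x, y) for x, y in flags.values())

table = crosstab(records, "dogs", "cats")
print(table)
```

The cell `(1, 1)` is the joint frequency, and the margins give the single frequencies needed for a Jaccard index.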
In reply to this post by Kirill Orlov
Hi, Empi,

To answer your question straight: it's too complicated, too time-consuming and too error-prone to do this with SPSS syntax alone. If sentences are your basic analytical units, I would start by breaking up your tweets with some regular expression construction, e.g. in Python:

---
import re
text = "First sentence. Second one? Third!"  # placeholder for one tweet
separators = r"[.?!]"  # separator is '.' or '?' or '!'
sentences = [s.strip() for s in re.split(separators, text) if s.strip()]
---

You'll then get separate sentences, and several rows per person.

If you'd rather not bother with programming, you can take a more shirtsleeve approach:

• Copy your variable content into Notepad++
• Select all
• Go to Search -> Replace
• Find: enter the sentence separator (repeat for each separator: . ? ! etc.)
• Search mode: Extended
• Replace with: your sentence separator followed by “\n”

This will insert a new line after each separator. You can copy the result into a new SPSS data file. Then you can use your cats&dogs syntax. Of course, you will lose the relations to other variables in the dataset. But, sorry, there's no easy way I'm aware of to do it otherwise.

Mario Giesel
Munich, Germany
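If the split is done in a script rather than in an editor, the link to the other variables need not be lost. The sketch below assumes each text comes with a case id and carries that id onto every sentence row, so the rows could later be matched back to the original file. The function name `split_cases` and the id 101 are hypothetical.

```python
import re

def split_cases(cases):
    """cases: iterable of (case_id, text).
    Returns a list of (case_id, sentence_no, sentence) rows."""
    out = []
    for case_id, text in cases:
        # Split at whitespace that follows '.', '?' or '!',
        # so the punctuation stays attached to its sentence.
        parts = re.split(r"(?<=[.?!])\s+", text.strip())
        for n, sentence in enumerate((p for p in parts if p), start=1):
            out.append((case_id, n, sentence))
    return out

rows = split_cases([(101, "are there cats or dogs here? Maybe just cats.")])
for row in rows:
    print(row)
```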
(In reply to Kirill Orlov's message of Saturday, 26 October 2019, 01:09:21 CEST.)
Sorry, my answer has been cut, it seems. Second try:

To answer your question straight: it's too complicated, too time-consuming and too error-prone to do this with SPSS syntax alone. If sentences are your basic analytical units, I would start by breaking up your tweets with some regular expression construction, e.g. in Python:

---
import re
text = "First sentence. Second one? Third!"  # placeholder for one tweet
separators = r"[.?!]"  # separator is '.' or '?' or '!'
sentences = [s.strip() for s in re.split(separators, text) if s.strip()]
---

You'll then get separate sentences, and several rows per person.

If you'd rather not bother with programming, you can take a more shirtsleeve approach:

• Copy your variable content into Notepad++
• Select all
• Go to Search -> Replace
• Find: enter the sentence separator (repeat for each separator: . ? ! etc.)
• Search mode: Extended
• Replace with: your sentence separator followed by “\n”

This will insert a new line after each separator. You can copy the result into a new SPSS data file. Then you can use your cats&dogs syntax. Of course, you will lose the relations to other variables in the dataset. But there's no easy way to do it otherwise.

Mario Giesel
Munich, Germany
(In reply to Mario Giesel's message of Saturday, 26 October 2019, 12:57:54 CEST.)
I posted a solution for the two-word problem on the IBM Predictive Analytics site, but I am copying it here. It uses the SPSSINC TRANS extension command with a small Python function to find counts of joint occurrences per sentence of two specified words. It could be generalized in a number of ways.

* Encoding: UTF-8.
data list list/text(a60).
begin data
"dogs and cats are enemies. but dogs sometimes like cats."
"there are no dogs here."
"are there cats or dogs here? Maybe just cats."
"there are elephants."
end data.
dataset name text.

begin program.
import re

def counter(text, word1, word2):
    sentences = re.findall(r"(.*?)(?:\.|\?)", text)
    paircount = 0
    for s in sentences:
        has1 = re.search(r"\b%s\b" % word1.strip(), s, flags=re.I) is not None
        has2 = re.search(r"\b%s\b" % word2.strip(), s, flags=re.I) is not None
        if has1 and has2:
            paircount = paircount + 1
    return paircount
end program.

spssinc trans result=counts
  /formula 'counter(text, word1="dogs", word2="cats")'.

(In reply to Mario Giesel's message of Sat, Oct 26, 2019 at 5:00 AM.)
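One of the possible generalizations, sketched here as plain Python that can be tried outside SPSS: count the joint occurrences per sentence for every pair drawn from a list of words. It reuses the sentence pattern from the function above (sentences end with "." or "?"); the function name `pair_counts` is an assumption of this sketch, and wiring it into SPSSINC TRANS is not shown.

```python
import re
from collections import Counter
from itertools import combinations

def pair_counts(text, words):
    """For every pair in `words`, count the sentences of `text` containing both."""
    patterns = {w: re.compile(r"\b%s\b" % re.escape(w), re.I) for w in words}
    counts = Counter()
    for s in re.findall(r"(.*?)(?:\.|\?)", text):
        # Words of interest present in this sentence, in the order given.
        present = [w for w in words if patterns[w].search(s)]
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts

words = ["dogs", "cats", "elephants"]
print(pair_counts("dogs and cats are enemies. but dogs sometimes like cats.", words))
print(pair_counts("are there cats or dogs here? Maybe just cats.", words))
```

Pairs are keyed in the order the words appear in `words`, so ("dogs", "cats") and ("cats", "dogs") are not counted separately.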
In reply to this post by Empi
Parse into a single new record per word retaining caseid, Cartesian match records within each caseid, aggregate... Done. All of these steps can be found in this group's archives.
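As a rough illustration of these three steps outside SPSS, the sketch below parses each case into its distinct words, pairs every word with every other word within the case, and aggregates over cases. It works per case rather than per sentence, it uses unordered pairs (half of the full Cartesian match), and the function name `cooccurrence` is invented for this example.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence(cases):
    """cases: iterable of (case_id, text).
    Returns a Counter keyed by (word_a, word_b), word_a < word_b,
    giving the number of cases in which both words occur."""
    counts = Counter()
    for case_id, text in cases:
        # Step 1: one record per distinct word within the case.
        words = sorted(set(re.findall(r"\w+", text.lower())))
        # Step 2: match each word with every other word of the same case.
        # Step 3: aggregate the matched pairs over all cases.
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts

cases = [
    (1, "dogs and cats are enemies. but dogs sometimes like cats."),
    (2, "there are no dogs here."),
    (3, "are there cats or dogs here? Maybe just cats."),
    (4, "there are elephants."),
]
counts = cooccurrence(cases)
print(counts[("cats", "dogs")])
```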