SPSSX Discussion

basic 'string' question

Talma

basic 'string' question

Hello list members,

I fear that because my question is so basic, I couldn't find any previous
discussion on this list.

What I just wanted to do is to count how often a specific term is used
across a number of answers to an open survey question, and to recode it in a
new, numeric variable.

In the SPSS file, the answers are in string format, e.g.

VAR_X
I like vanilla ice.
I prefer chocolate ice cream.
I love strawberry ice cream and vanilla ice cream.

and so on.

Now I need to check how often the term "vanilla" is used across all answers
and to recode it to a new variable which takes on the value 1 if the term
vanilla is used and zero if not.

I used

Compute VAR_Z=0.
If Var_X = 'vanilla' VAR_Z = 1.
exe.

But this doesn't work.

Any ideas how to solve my problem?

Many thanks!
T.


--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Rick Oliver

Re: basic 'string' question

Try:

compute var_z=(char.index(lower(var_x), "vanilla"))>0.

In reply to Talma's post of Sat, May 5, 2018 at 12:51 AM.

Talma

Re: basic 'string' question

Dear Rick,

many thanks, your suggestion worked indeed!

However, as a follow-up question, may I ask how I might extend the syntax to
convert multiple words to the numeric value '1' in the new variable var_z?
For example, I might need to identify sentences containing the term
'vanilla', but also those containing the terms 'chocolate' or 'strawberry'?

Best,
Talma


Jon Peck

Re: basic 'string' question

You can generalize Rick's syntax like this.

compute var_z=char.index(lower(var_x), "vanilla") > 0 or char.index(lower(var_x), "chocolate") >0.

But if you have a lot of conditions to check, this gets unwieldy. It also does not account for words like creamery, i.e., longer words that contain the word you are looking for; a substring search for cream would match them too.

A more general framework can easily be accommodated, but more information is needed on the real problem first.
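
To make the whole-word caveat concrete outside SPSS, here is a minimal Python sketch. It is only an illustration, not SPSS syntax: the function name contains_any_term is hypothetical, the first three sample answers come from the original post, and the creamery sentence is made up for the example. It uses regular-expression word boundaries so that a term is matched only as a complete word, ignoring case.

```python
import re

def contains_any_term(text, terms):
    """Return 1 if any term occurs in text as a whole word (ignoring case), else 0."""
    if not terms:
        return 0
    # \b anchors the match at word boundaries, so "cream" does not match "creamery".
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return int(re.search(pattern, text, flags=re.IGNORECASE) is not None)

answers = [
    "I like vanilla ice.",
    "I prefer chocolate ice cream.",
    "I love strawberry ice cream and vanilla ice cream.",
    "The creamery was closed.",  # made-up sentence for the substring caveat
]

# One 0/1 flag per answer, analogous to var_z in the thread.
flags = [contains_any_term(a, ["vanilla", "chocolate"]) for a in answers]

# A plain substring test finds "cream" inside "creamery"; the whole-word test does not.
substring_hit = "cream" in answers[3].lower()
whole_word_hit = contains_any_term(answers[3], ["cream"])
```

With a long term list, the same function can be called with the list read from a file; whether whole-word matching is actually wanted depends on the real problem.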

In reply to Talma's post of Sat, May 5, 2018 at 8:17 AM.

Jon K Peck
[hidden email]

Talma

Re: basic 'string' question

Dear Jon,

many thanks for your example, which was already very useful – and you are
right, the real problem refers to many more terms…

Specifically, I’d like to analyse comments from a social media site using
freely available dictionaries that count certain terms contained in the
comments. These terms are identified with certain emotions.
For example, a post containing the adjective “angry” could be classified as
belonging to the category “anger” and so on (for this illustration, ignore
the multiple problems associated with this approach, such as negations
etc.).

However, such dictionaries (often available in *.txt or *.csv format, which
can be changed) easily contain several thousand terms… and requesting each
term separately would indeed become unwieldy : ). For illustration, here’s
a sample (the first 40 words) of a similar dictionary (not just
adjectives) taken from

; Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews."
; Proceedings of the ACM SIGKDD International Conference on Knowledge
; Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle,
; Washington, USA,

abound
abounds
abundance
abundant
accessable
accessible
acclaim
acclaimed
acclamation
accolade
accolades
accommodative
accomodative
accomplish
accomplished
accomplishment
accomplishments
accurate
accurately
achievable
achievement
achievements
achievible
acumen
adaptable
adaptive
adequate
adjustable
admirable
admirably
admiration
admire
admirer
admiring
admiringly
adorable
adore
adored
adorer
adoring

***

If there are any more general options to use SPSS syntax for finding out
whether a string variable contains one of the terms above or not, it would be
extremely helpful if you could share your thoughts here in this already
super-helpful forum...

Many thanks & regards!!
Talma


Jon Peck

Re: basic 'string' question

Here is a solution using the SPSSINC TRANS extension command. It is normally installed with Statistics, but if you don't already have it you can install it from the Extensions menu or, in older versions, the Utilities menu.

First you define a dataset of words - I called it lookup - and make sure that your main dataset is active.

data list fixed/words(a30).
begin data
abound
abounds
abundance
abundant
accessable
accessible
acclaim
acclaimed
acclamation
accolade
accolades
accommodative
accomodative
accomplish
accomplished
accomplishment
accomplishments
accurate
accurately
achievable
achievement
achievements
achievible
acumen
adaptable
adaptive
adequate
adjustable
admirable
admirably
admiration
admire
admirer
admiring
admiringly
adorable
adore
adored
adorer
adoring
end data
dataset name lookup.

data list fixed/text(a50).
begin data
adorer
adoring
end data
dataset name main.
dataset activate main.

Next you define a Python class for use with SPSSINC TRANS. It reads the lookup dataset and creates a set containing the words, ignoring case. It also creates a function, func, that will be called for each case in the main dataset. func splits the indicated variable's value at each blank and checks whether any of the resulting words appears in the set (ignoring case). In this example, the strings to check are in a variable named text.

begin program.
import spss

class vlookup(object):
    """Check values according to a dictionary specified as an SPSS dataset"""

    def __init__(self, dataset):
        """dataset is a dataset of words

        Lookups are made after trimming any trailing blanks and ignoring case
        The class creates a function named func that can be referenced for lookups"""

        spss.StartDataStep()
        try:
            # Read the lookup dataset and collect its first variable into a set.
            ds = spss.Dataset(dataset)
            cases = ds.cases
            self.table = set()
            for i in range(len(cases)):
                self.table.add(cases[i, 0][0].rstrip().lower())

            def func(x):
                # Split the value at blanks; True if any word is in the set.
                x = x.rstrip().split()
                for word in x:
                    if word.lower() in self.table:
                        return True
                return False
            self.func = func
        finally:
            spss.EndDataStep()
end program.

This is the call to invoke all this. It first creates the word set from the named dataset and then processes a variable named text for each case. The result is a 1 or 0 (true or false) for each case according to whether any word in text is found in the lookup set.

spssinc trans result=hasword
/initial "vlookup('lookup')"
/formula "func(text)".
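
For readers who want to try the lookup logic without SPSS, here is a minimal standalone Python sketch of the same idea: build a lowercase set of words, split each text at blanks, and report whether any word is in the set. The names make_lookup and make_lookup_ignoring_punctuation are hypothetical and not part of SPSSINC TRANS. The second variant is an assumption about what may be wanted with real comments: because the text is split only at blanks, a token such as "adoring," (with a trailing comma) would not match unless punctuation is stripped first.

```python
import string

def make_lookup(words):
    """Build a lookup function from an iterable of dictionary words."""
    table = {w.rstrip().lower() for w in words}

    def func(text):
        # Split at whitespace and test each token against the set, ignoring case.
        for token in text.rstrip().split():
            if token.lower() in table:
                return True
        return False

    return func

def make_lookup_ignoring_punctuation(words):
    """Variant (an assumption, not in the posted code): strip surrounding punctuation."""
    table = {w.rstrip().lower() for w in words}

    def func(text):
        for token in text.split():
            if token.strip(string.punctuation).lower() in table:
                return True
        return False

    return func

lookup_words = ["abound", "adorer", "adoring"]
func = make_lookup(lookup_words)
func_punct = make_lookup_ignoring_punctuation(lookup_words)

plain_hit = func("She is Adoring")               # matches, case is ignored
comma_miss = func("Simply adoring, really")      # "adoring," is not in the set
comma_hit = func_punct("Simply adoring, really") # punctuation stripped first
```

Set membership keeps each check fast even when the dictionary holds several thousand terms.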

Regards,

Jon

In reply to Talma's post of Sun, May 6, 2018 at 2:29 AM.

David Marso

Re: basic 'string' question

Administrator

In reply to this post by Talma

Here is an approach which uses standard SPSS syntax ;-)
--
DATA LIST /word (A30).
BEGIN DATA
abound
abounds
abundance
abundant
accessable
accessible
acclaim
acclaimed
acclamation
accolade
accolades
accommodative
accomodative
accomplish
accomplished
accomplishment
accomplishments
accurate
accurately
achievable
achievement
achievements
achievible
acumen
adaptable
adaptive
adequate
adjustable
admirable
admirably
admiration
admire
admirer
admiring
admiringly
adorable
adore
adored
adorer
adoring
END DATA.
DATASET NAME Lookup.

DATA LIST /phrase (A200).
BEGIN DATA
data to evaluate goes here or GET FILE.....
END DATA.

DATASET NAME rawdata.
COMPUTE LineNumber=$CASENUM.
COMPUTE phrase=CONCAT(LTRIM(LOWER(phrase))," ").
SET MXLOOP=100000.
STRING Word (A30).
LOOP.
COMPUTE #=CHAR.INDEX(phrase," ").
DO IF # GT 0.
COMPUTE Word=CHAR.SUBSTR(phrase,1,#-1).
COMPUTE phrase=CHAR.SUBSTR(phrase,#+1).
XSAVE OUTFILE "C:\TEMP\parsedwords.sav" /KEEP LineNumber Word.
END IF.
END LOOP IF #=0.
EXECUTE.
GET FILE "C:\TEMP\parsedwords.sav".
SORT CASES BY Word.
MATCH FILES /FILE * /TABLE=LOOKUP /IN=InDictionary/BY Word.
AGGREGATE OUTFILE * /BREAK Word /WordCount=SUM(InDictionary).
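
The parse, match, and aggregate steps of the syntax above can be sketched in Python to show the intended result. This is an illustration of the idea, not a translation verified against SPSS output; the function name count_dictionary_words and the sample phrases are hypothetical. Each phrase is lowercased and split at blanks, each word is flagged by whether it is in the dictionary, and the flags are summed per distinct word, so words that are not in the dictionary end up with a count of 0.

```python
from collections import Counter

def count_dictionary_words(phrases, dictionary):
    """Per distinct word, count its occurrences if it is a dictionary word, else 0."""
    lookup = {w.lower() for w in dictionary}
    counts = Counter()
    for phrase in phrases:
        for word in phrase.lower().split():
            # Adding 0 still creates the key, mirroring a sum of an in-dictionary flag.
            counts[word] += 1 if word in lookup else 0
    return dict(counts)

sample_phrases = ["She is adoring and adored", "Adoring fans abound"]  # made-up data
sample_dictionary = ["abound", "adored", "adoring"]
word_counts = count_dictionary_words(sample_phrases, sample_dictionary)
```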


-----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Jon Peck

Re: basic 'string' question

In reply to this post by Jon Peck

The extendedTransforms.py module has two functions similar to the solution I posted.

vlookup looks up values in a Python dictionary constructed from an SPSS dataset. It differs from the posted solution in taking a key and returning an associated value.

vlookupinterval is similar but instead of an exact key match, it finds a value in a set of intervals and returns the associated value.
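
As a rough picture of the two kinds of lookup just described, here is a small Python sketch. It is not the code or the call signature of extendedTransforms.py, which builds its tables from an SPSS dataset; the names key_lookup and interval_lookup, the sample tables, and the interval convention (each value applies from its lower bound up to the next lower bound) are all assumptions made for illustration.

```python
from bisect import bisect_right

def key_lookup(table, key, default=None):
    """Exact-match lookup: return the value stored for key, or default."""
    return table.get(key, default)

def interval_lookup(lower_bounds, values, x, default=None):
    """Interval lookup over ascending lower bounds.

    values[i] applies when lower_bounds[i] <= x < lower_bounds[i + 1];
    the last value applies to everything at or above the last bound.
    """
    i = bisect_right(lower_bounds, x) - 1
    return values[i] if i >= 0 else default

# Hypothetical term-to-category table, in the spirit of the emotion dictionaries.
emotion = {"angry": "anger", "adoring": "joy"}
category = key_lookup(emotion, "angry")
missing = key_lookup(emotion, "table", default="none")

# Hypothetical age bands.
bounds = [0, 18, 65]
labels = ["child", "adult", "senior"]
band = interval_lookup(bounds, labels, 30)
below = interval_lookup(bounds, labels, -1)
```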

These functions as well as many others in this module can be used with SPSSINC TRANS. Here is a list of the contents.

subs: replace occurrences of a regular expression pattern with specified values
templatesub: substitute values in a template expression
levenshteindistance: calculate similarity between two strings
soundex: calculate the soundex value of a string (a rough phonetic encoding)
nysiis: enhanced sound encoding (claimed superior to soundex for surnames)
soundexallwords: calculate the soundex value for each word in a string and return a blank-separated string
median: median of a list of values
mode: mode of a list of values
multimode: up to n modes of a list of values
matchcount: compare value with list of values and count matches using standard or custom comparison function
strtodatetime: convert a date/time string to an SPSS datetime value using a pattern
datetimetostr: convert an SPSS date/time value to a string using a pattern
lookup: return a value from a table lookup
vlookup: return a value from a table lookup (more convenient than lookup with SPSSINC TRANS)
vlookupinterval: return a value from a table lookup using intervals
sphDist: calculate distance between two points on earth using spherical approximation
ellipseDist: calculate distance between two points on earth using ellipsoidal approximation
jaroWinkler: calculate Jaro-Winkler string similarity measure
extractDummies: extract a set of binary variables from a value coded in powers of 2
packDummies: pack a sequence of numeric and/or string values into a single float
translatechar: map characters according to a conversion table
countWkdays: count number of days between two dates that are not excluded
vlookupgroupinterval: return a value associated with a group and a set of intervals for that group
countDaysWExclusions: count days in interval excluding specified weekdays and other dates
DiceStringSimilarity: compare strings using Dice bigram metric
Dictdict: find best match of strings using Dice metric
setRandomSeed: initialize random number generator
invGaussian: inverse Gaussian distribution random numbers
triangular: triangular random numbers

On Mon, May 7, 2018 at 6:33 AM, William Dudley <[hidden email]> wrote:

Jon,

This is terrific.
I have a project for which this method will be very useful.

Bill

In reply to Jon Peck's post of Sun, May 6, 2018 at 3:42 PM.

--
William N. Dudley, PhD
Professor - Public Health Education
The School of Health and Human Sciences
The University of North Carolina at Greensboro
437-L Coleman Building
Greensboro, NC 27402-6170
See my research on
GoogleScholar
ResearchGate
VOICE 336.256 2475


bdates

Re: basic 'string' question

David,

When I run your syntax with the words in the file rawdata that Jon supplied, adorer and adoring, I get the following message:

Warning # 10954
The AGGREGATE command has produced an output file which has no cases -
probably as the result of a SELECT IF or WEIGHT command.

The parsedwords file has no words identified.

Brian

From: SPSSX(r) Discussion <[hidden email]> on behalf of Jon Peck <[hidden email]>
Sent: Monday, May 7, 2018 12:54:03 PM
To: [hidden email]
Subject: Re: basic 'string' question