Hello list members,
I fear that because my question is so basic, I couldn't find any previous discussion on this list. I want to count how often a specific term is used across a number of answers to an open survey question, and to recode it into a new, numeric variable. In the SPSS file, the answers are in string format, e.g.

VAR_X
I like vanilla ice.
I prefer chocolate ice cream.
I love strawberry ice cream and vanilla ice cream.

and so on. Now I need to check how often the term "vanilla" is used across all answers and to recode it into a new variable which takes the value 1 if the term vanilla is used and 0 if not. I used

Compute VAR_Z=0.
If Var_X = 'vanilla' VAR_Z = 1.
exe.

But this doesn't work. Any ideas how to solve my problem? Many thanks! T.

-- Sent from: http://spssx-discussion.1045642.n5.nabble.com/
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
|
Try: compute var_z=(char.index(lower(var_x), "vanilla"))>0.
|
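For readers who want to see the logic outside SPSS, here is a minimal Python sketch of the same idea. It is only an illustration, not SPSS syntax: the function name contains_term and the sample answers are made up for this example. An equality test compares the whole answer with 'vanilla' and therefore almost never matches, whereas a position search (what char.index does, returning 0 when nothing is found) asks whether the term occurs anywhere in the answer.

```python
def contains_term(answer, term):
    """Return 1 if term occurs anywhere in answer (case-insensitive), else 0."""
    # str.find returns -1 when the term is absent; SPSS char.index returns 0 instead.
    return int(answer.lower().find(term.lower()) >= 0)

# Hypothetical sample answers, modelled on the ones in the question.
answers = [
    "I like vanilla ice.",
    "I prefer chocolate ice cream.",
    "I love strawberry ice cream and vanilla ice cream.",
]

var_z = [contains_term(a, "vanilla") for a in answers]
print(var_z)       # one 0/1 flag per answer
print(sum(var_z))  # number of answers that mention the term
```

Summing the flags gives the number of answers that mention the term, which is the count asked for in the original question.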
Dear Rick,
many thanks, your suggestion worked indeed! However, as a follow-up question, may I ask how I might extend the syntax to convert multiple words to the numeric value '1' in the new variable var_z? For example, I might need to identify sentences containing the term 'vanilla', but also those containing the terms 'chocolate' or 'strawberry'. Best, Talma
|
You can generalize Rick's syntax like this.
compute var_z=char.index(lower(var_x)

But if you have a lot of conditions to check, this gets unwieldy. It also does not handle words like creamery, i.e., longer words that contain the word you are looking for and would therefore be counted as matches. A more general framework can easily be accommodated, but more information is needed on the real problem first.
|
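The two points made above (several terms at once, and the "creamery" problem) can be sketched in Python as follows. This is an illustration under assumptions, not the syntax from the post: contains_any and its whole_words parameter are hypothetical names, and whole-word matching is done here with the regular-expression word boundary \b, which is one possible choice among several.

```python
import re

def contains_any(answer, terms, whole_words=True):
    """Return 1 if any of the terms occurs in answer (case-insensitive), else 0."""
    alternatives = "|".join(re.escape(t) for t in terms)
    # \b anchors the match at word boundaries, so "cream" does not match "creamery".
    pattern = r"\b(?:%s)\b" % alternatives if whole_words else alternatives
    return int(re.search(pattern, answer, flags=re.IGNORECASE) is not None)

terms = ["vanilla", "chocolate", "strawberry"]
print(contains_any("I prefer Chocolate ice cream.", terms))
print(contains_any("I like pistachio.", terms))
print(contains_any("We visited the creamery.", ["cream"]))
print(contains_any("We visited the creamery.", ["cream"], whole_words=False))
```

With whole_words=False the function behaves like a plain substring search and flags "creamery" as containing "cream"; with the default it does not.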
Dear Jon,
many thanks for your example, which was already very useful – and you are right, the real problem involves many more terms…

Specifically, I’d like to analyse comments from a social media site using freely available dictionaries that count certain terms contained in the comments. These terms are associated with certain emotions. For example, a post containing the adjective “angry” could be classified as belonging to the category “anger” and so on (for this illustration, ignore the multiple problems associated with this approach, such as negations etc.).

However, such dictionaries (often available in *.txt or *.csv format, which can be changed) easily contain several thousand terms… and requesting each term separately would indeed become unwieldy : ). For illustration, here’s a sample (first 40 words) of a similar dictionary (not just adjectives) taken from

; Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews."
; Proceedings of the ACM SIGKDD International Conference on Knowledge
; Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle,
; Washington, USA,

abound abounds abundance abundant accessable accessible acclaim acclaimed acclamation accolade accolades accommodative accomodative accomplish accomplished accomplishment accomplishments accurate accurately achievable achievement achievements achievible acumen adaptable adaptive adequate adjustable admirable admirably admiration admire admirer admiring admiringly adorable adore adored adorer adoring

***

If there are any more general options to use SPSS syntax for finding out whether a string variable contains one of the terms above or not, it would be extremely helpful if you could share your thoughts here in this already super helpful forum...

Many thanks & regards!!
Talma
|
Here is a solution using the SPSSINC TRANS extension command. It is normally installed with Statistics, but if you don't already have it, you can install it from the Extensions menu or, in older versions, the Utilities menu.

First you define a dataset of words - I called it lookup - and make sure that your main dataset is active.

data list fixed/words(a30).
begin data
abound
abounds
abundance
abundant
accessable
accessible
acclaim
acclaimed
acclamation
accolade
accolades
accommodative
accomodative
accomplish
accomplished
accomplishment
accomplishments
accurate
accurately
achievable
achievement
achievements
achievible
acumen
adaptable
adaptive
adequate
adjustable
admirable
admirably
admiration
admire
admirer
admiring
admiringly
adorable
adore
adored
adorer
adoring
end data
dataset name lookup.

data list fixed/text(a50).
begin data
adorer
adoring
end data
dataset name main.
dataset activate main.

Next you define a Python class for use with SPSSINC TRANS. It reads the lookup dataset and creates a set containing the words, ignoring case. It also creates a function, func, that will be called for each case in the main dataset. func splits the indicated variable's value at each blank and checks whether any of the resulting words appears in the set (ignoring case). In this example, the strings to check are in a variable named text.

begin program.
# the spss module must be available in the program namespace
import spss

class vlookup(object):
    """Check values according to a dictionary specified as an SPSS dataset"""
    def __init__(self, dataset):
        """dataset is a dataset of words

        Lookups are made after trimming any trailing blanks and ignoring case
        The class creates a function named func that can be referenced for lookups"""

        spss.StartDataStep()
        try:
            ds = spss.Dataset(dataset)
            cases = ds.cases
            self.table = set()
            for i in range(len(cases)):
                self.table.add(cases[i, 0][0].rstrip().lower())

            def func(x):
                x = x.rstrip().split()
                for word in x:
                    if word.lower() in self.table:
                        return True
                return False
            self.func = func
        finally:
            spss.EndDataStep()
end program.

This is the call to invoke all this. It first creates the word set from the named dataset and then processes a variable named text for each case. The result is a 1 or 0 (true or false) for each case according to whether any word in text is found in the lookup set.

spssinc trans result=hasword
/initial "vlookup('lookup')"
/formula "func(text)".

Regards, Jon
|
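The core of this approach can be tried without SPSS. The following is a minimal stand-alone sketch, assuming the dictionary words and the texts are already available as Python lists; make_lookup and has_word are hypothetical names, and nothing here uses the spss module. It shows why a set scales to dictionaries with thousands of terms: each token needs one membership test, regardless of how many words the dictionary holds.

```python
def make_lookup(words):
    """Build a case-insensitive membership test from an iterable of dictionary words."""
    # Trim blanks and lowercase once, when the set is built.
    table = {w.strip().lower() for w in words if w.strip()}

    def has_word(text):
        # Split at whitespace and report whether any token is in the word set.
        return any(token.lower() in table for token in text.split())

    return has_word

has_word = make_lookup(["abound", "abounds", "adorer", "adoring "])
texts = ["An ADORING crowd", "nothing here", "adoring."]
flags = [int(has_word(t)) for t in texts]
print(flags)
```

Note that the last text is not flagged: splitting at blanks leaves the full stop attached to the token, so "adoring." is not the same string as "adoring". Real comments would probably need punctuation stripped before the lookup.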
In reply to this post by Talma
Here is an approach which uses standard SPSS syntax ;-)
DATA LIST /word (A30).
BEGIN DATA
abound
abounds
abundance
abundant
accessable
accessible
acclaim
acclaimed
acclamation
accolade
accolades
accommodative
accomodative
accomplish
accomplished
accomplishment
accomplishments
accurate
accurately
achievable
achievement
achievements
achievible
acumen
adaptable
adaptive
adequate
adjustable
admirable
admirably
admiration
admire
admirer
admiring
admiringly
adorable
adore
adored
adorer
adoring
END DATA.
DATASET NAME Lookup.

DATA LIST /phrase (A200).
BEGIN DATA
data to evaluate goes here or GET FILE.....
END DATA.
DATASET NAME rawdata.

COMPUTE LineNumber=$CASENUM.
COMPUTE phrase=CONCAT(LTRIM(LOWER(phrase))," ").
SET MXLOOP=100000.
STRING Word (A30).
LOOP.
COMPUTE #=CHAR.INDEX(phrase," ").
DO IF # GT 0.
COMPUTE Word=CHAR.SUBSTR(phrase,1,#-1).
COMPUTE phrase=CHAR.SUBSTR(phrase,#+1).
XSAVE OUTFILE "C:\TEMP\parsedwords.sav" /KEEP LineNumber Word.
END IF.
END LOOP IF #=0.
EXECUTE.

GET FILE "C:\TEMP\parsedwords.sav".
SORT CASES BY Word.
MATCH FILES /FILE * /TABLE=LOOKUP /IN=InDictionary/BY Word.
AGGREGATE OUTFILE * /BREAK Word /WordCount=SUM(InDictionary).

-----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
|
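The three stages of the syntax above (parse each phrase into one word per case, match the words against the lookup table, aggregate per word) can be sketched in Python like this. It is only a rough analogue under assumptions: count_dictionary_words and the sample phrases are made up, and unlike the AGGREGATE step, which produces a row for every parsed word (with a sum of 0 for words not in the dictionary), this sketch keeps only the dictionary words.

```python
from collections import Counter

def count_dictionary_words(phrases, dictionary):
    """Split each phrase into lowercase words and count occurrences of dictionary words."""
    lookup = {w.lower() for w in dictionary}
    # One (line_number, word) pair per token, like the cases written by XSAVE.
    parsed = [(line_number, word)
              for line_number, phrase in enumerate(phrases, start=1)
              for word in phrase.lower().split()]
    # Keep only dictionary words and total them per word (MATCH FILES plus AGGREGATE).
    return Counter(word for _, word in parsed if word in lookup)

# Hypothetical phrases for illustration.
phrases = ["Adoring fans adore the adorer", "an adoring crowd"]
counts = count_dictionary_words(phrases, ["adore", "adorer", "adoring"])
print(sorted(counts.items()))
```

Keeping the line number next to each word, as the syntax does with LineNumber, makes it possible to aggregate per comment instead of per word if a per-case indicator is wanted.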
In reply to this post by Jon Peck
The extendedTransforms.py module has two functions similar to the solution I posted.

vlookup looks up values in a Python dictionary constructed from an SPSS dataset. It differs from the posted solution in taking a key and returning an associated value.

vlookupinterval is similar, but instead of an exact key match, it finds a value in a set of intervals and returns the associated value.

These functions, as well as many others in this module, can be used with SPSSINC TRANS. Here is a list of the contents.

subs: replace occurrences of a regular expression pattern with specified values
templatesub: substitute values in a template expression
levenshteindistance: calculate similarity between two strings
soundex: calculate the soundex value of a string (a rough phonetic encoding)
nysiis: enhanced sound encoding (claimed superior to soundex for surnames)
soundexallwords: calculate the soundex value for each word in a string and return a blank-separated string
median: median of a list of values
mode: mode of a list of values
multimode: up to n modes of a list of values
matchcount: compare value with list of values and count matches using standard or custom comparison function
strtodatetime: convert a date/time string to an SPSS datetime value using a pattern
datetimetostr: convert an SPSS date/time value to a string using a pattern
lookup: return a value from a table lookup
vlookup: return a value from a table lookup (more convenient than lookup w SPSSINC TRANS)
vlookupinterval: return a value from a table lookup using intervals
sphDist: calculate distance between two points on earth using spherical approximation
ellipseDist: calculate distance between two points on earth using ellipsoidal approximation
jaroWinkler: calculate Jaro-Winkler string similarity measure
extractDummies: extract a set of binary variables from a value coded in powers of 2
packDummies: pack a sequence of numeric and/or string values into a single float
translatechar: map characters according to a conversion table
countWkdays: count number of days between two dates that are not excluded
vlookupgroupinterval: return a value associated with a group and a set of intervals for that group
countDaysWExclusions: count days in interval excluding specified weekdays and other dates
DiceStringSimilarity: compare strings using Dice bigram metric
Dictdict: find best match of strings using Dice metric
setRandomSeed: initialize random number generator
invGaussian: inverse Gaussian distribution random numbers
triangular: triangular random numbers
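To make the difference between a key lookup and an interval lookup concrete, here is a small Python sketch. It is not the code of extendedTransforms.py and does not reproduce its function signatures: make_vlookup and make_interval_lookup are hypothetical helpers, the word-to-emotion pairs are invented, and the interval version assumes that each interval is defined by its lower bound.

```python
import bisect

def make_vlookup(pairs, default=None):
    """Exact-match lookup: return the value stored for a key, else default."""
    table = dict(pairs)
    return lambda key: table.get(key, default)

def make_interval_lookup(lower_bounds, values, default=None):
    """Interval lookup: return the value of the last interval whose lower bound is <= x.

    lower_bounds must be sorted ascending and have the same length as values.
    """
    def lookup(x):
        # bisect_right counts how many lower bounds are <= x.
        i = bisect.bisect_right(lower_bounds, x) - 1
        return values[i] if i >= 0 else default
    return lookup

# Invented word-to-category pairs, in the spirit of an emotion dictionary.
emotion = make_vlookup([("angry", "anger"), ("adore", "joy")], default="none")
print(emotion("angry"), emotion("table"))

# Invented intervals: [0, 18) child, [18, 65) adult, [65, ...) senior.
age_group = make_interval_lookup([0, 18, 65], ["child", "adult", "senior"])
print(age_group(17), age_group(18), age_group(70), age_group(-1))
```

A key-to-value lookup of this kind would return the emotion category for a word rather than a plain 0/1 flag, which is closer to the dictionary-based classification described earlier in the thread.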
|
David,
When I run your syntax with the words in the file rawdata that Jon supplied, adorer and adoring, I get the following message:
Warning # 10954 The AGGREGATE command has produced an output file which has no cases - probably as the result of a SELECT IF or WEIGHT command.
The parsedwords file has no words identified.
Brian
Good catch Brian. I forgot that the padding gets trashed when running in Unicode. FIXED here.

DATA LIST /phrase (A200).
BEGIN DATA
adorer
adoring
END DATA.
DATASET NAME rawdata.
COMPUTE LineNumber=$CASENUM.
COMPUTE phrase=CONCAT(LTRIM(LOWER(phrase))," ").
SET MXLOOP=100000.
STRING Word (A30).
LOOP.
+ COMPUTE #=CHAR.INDEX(phrase," ").
+ DO IF # GT 0.
+ COMPUTE Word=CHAR.SUBSTR(phrase,1,#-1).
+ COMPUTE phrase=CHAR.SUBSTR(phrase,#+1).
+ ELSE.
+ IF (phrase NE "") Word=phrase.
+ END IF.
+ XSAVE OUTFILE "C:\TEMP\parsedwords.sav" /KEEP LineNumber Word.
END LOOP IF #=0.
EXECUTE.

GET FILE "C:\TEMP\parsedwords.sav".
SORT CASES BY Word.
MATCH FILES /FILE * /TABLE=LOOKUP /IN=InDictionary/BY Word.
AGGREGATE OUTFILE * /BREAK Word /WordCount=SUM(InDictionary).
|
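As I read the fix, the point is that the loop cannot rely on a trailing blank after the last word, so the remainder of the phrase has to be emitted as a word when no blank is found. The Python sketch below mirrors that loop logic; peel_words is a hypothetical name, and skipping repeated blanks with lstrip is an addition of this sketch, not something the SPSS syntax does.

```python
def peel_words(phrase):
    """Split a phrase into words by repeatedly cutting at the first blank."""
    words = []
    rest = phrase.strip().lower()
    while rest:
        cut = rest.find(" ")
        if cut >= 0:
            # A blank was found: take the text before it and continue with the remainder.
            word, rest = rest[:cut], rest[cut + 1:].lstrip()
        else:
            # No blank left: the remainder is the last word (the ELSE branch above).
            word, rest = rest, ""
        words.append(word)
    return words

print(peel_words("adorer"))
print(peel_words("Adoring fans  adore"))
```

Without the else branch, a phrase consisting of a single word with no blank after it would produce no output at all, which matches the empty parsedwords file reported earlier in the thread.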
David,
Thanks! This syntax is really useful.
Brian
For illustration, > here’s > a sample example (first 40 words) of a similar dictionary (not just > adjectives) taken from > > ; Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." > ; Proceedings of the ACM SIGKDD International Conference on > Knowledge > ; Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, > ; Washington, USA, > > abound > abounds > abundance > abundant > accessable > accessible > acclaim > acclaimed > acclamation > accolade > accolades > accommodative > accomodative > accomplish > accomplished > accomplishment > accomplishments > accurate > accurately > achievable > achievement > achievements > achievible > acumen > adaptable > adaptive > adequate > adjustable > admirable > admirably > admiration > admire > admirer > admiring > admiringly > adorable > adore > adored > adorer > adoring > > *** > > In case there is any more general options to use SPSS syntax for finding > out > whether a string variable contains one of the terms above or not, it would > extremely helpful if you could your thoughts here in this already > superhelpful forum... > > Many thanks & regards!! > Talma > > > > > > > -- > Sent from: http://spssx-discussion.1045642.n5.nabble.com/ > > ===================== > To manage your subscription to SPSSX-L, send a message to > LISTSERV@.UGA > <mailto: > LISTSERV@.UGA > > (not to SPSSX-L), with no body text except the > command. To leave the list, send the command > SIGNOFF SPSSX-L > For a list of commands to manage subscriptions, send the command > INFO REFCARD > > > > -- > Jon K Peck > jkpeck@ > <mailto: > jkpeck@ > > > > ===================== To manage your subscription to SPSSX-L, send a > message to > LISTSERV@.UGA > <mailto: > LISTSERV@.UGA > > (not to SPSSX-L), with no body text except the command. To leave the > list, send the command SIGNOFF SPSSX-L For a list of commands to manage > subscriptions, send the command INFO REFCARD > > > > -- > William N. 
Dudley, PhD
> Professor - Public Health Education
> The School of Health and Human Sciences
> The University of North Carolina at Greensboro
> 437-L Coleman Building
> Greensboro, NC 27402-6170
> See my research on
> GoogleScholar<https://scholar.google.com/citations?user=ZiYmyb4AAAAJ&hl=en>
> ResearchGate<https://www.researchgate.net/profile/William_Dudley>
> VOICE 336.256 2475
>
> --
> Jon K Peck

-----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
|
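For anyone who wants to check the logic of the parsing approach at the top of this message outside SPSS, here is a rough Python sketch of the same three steps: split each phrase at blanks, match the words against the lookup list, and total the hits per word. It is only an illustration, not a translation of the syntax. The function name count_dictionary_words and the sample phrases are made up for this example, and unlike the AGGREGATE step it reports dictionary words only.

```python
from collections import Counter


def count_dictionary_words(phrases, dictionary):
    """Count how often each dictionary word occurs across all phrases.

    Hypothetical helper for illustration; not part of any SPSS module.
    """
    # Normalise the lookup list the same way the phrases are normalised.
    lookup = {w.strip().lower() for w in dictionary}
    counts = Counter()
    for phrase in phrases:
        # Lower-case the phrase and break it at blanks, as the LOOP does.
        for word in phrase.lower().split():
            if word in lookup:
                counts[word] += 1
    return counts


phrases = ["adorer", "adoring", "An adoring crowd"]
dictionary = ["adore", "adored", "adorer", "adoring"]
counts = count_dictionary_words(phrases, dictionary)
for word in sorted(counts):
    print(word, counts[word])
```

Because the split is on blanks only, a word followed by punctuation ("adoring.") would not be counted; that limitation is shared with any blank-based parse.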
In reply to this post by Jon Peck
Hi
can this syntax be modified in such a way that only complete words are analyzed? For example, compute var_z=char.index(lower(var_x), "vanilla") > 0 identifies not only "vanilla", but also "sweetvanilla" or "vanillacream". Suppose one is only interested in finding the term 'vanilla'. Is there any subcommand for char.index to achieve this? Many thanks for your response, nina |
The SPSSINC TRANS solution I posted already eliminates the false partial word matches. |
In reply to this post by nina
So, search for " vanilla ". However, that will fail if vanilla is the last word or, I suspect, the first word of the sentence, or more generally if there is any character other than a " " in either location. Gene Maguin
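To make the whole-word issue concrete, here is a small Python sketch comparing a blank-padded search for " vanilla " with a regular-expression word-boundary search. This is plain Python rather than SPSS syntax, and the function names contains_word and contains_padded are invented for the illustration. Note that \b treats letters, digits and the underscore as word characters, so "vanilla_cream" would not count as a whole-word match.

```python
import re


def contains_word(text, term):
    """Return 1 if term occurs in text as a whole word (ignoring case), else 0."""
    # \b matches at the edge of a word, so punctuation and string ends are fine.
    pattern = r"\b" + re.escape(term) + r"\b"
    return 1 if re.search(pattern, text, flags=re.IGNORECASE) else 0


def contains_padded(text, term):
    """Blank-padded search: matches only when a blank sits on both sides."""
    return 1 if (" " + term.lower() + " ") in text.lower() else 0


for sample in ["I like vanilla ice", "I like vanilla.", "Vanilla is best",
               "sweetvanilla", "vanillacream"]:
    print(sample, contains_padded(sample, "vanilla"), contains_word(sample, "vanilla"))
```

The padded search misses the term at the start or end of the string and before a period, while the word-boundary search catches those cases and still rejects "sweetvanilla" and "vanillacream".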
Here is a more robust version of the spssinc trans usage. It handles things like 's and a terminal period on a string, in case that matters. As before, it is not fooled by matches within words. I'm repeating the whole example, but the only substantive change is the way the input string is split.

data list fixed/words(a30).
begin data
abound
abounds
abundance
abundant
accessable
accessible
acclaim
acclaimed
acclamation
accolade
accolades
accommodative
accomodative
accomplish
accomplished
accomplishment
accomplishments
accurate
accurately
achievable
achievement
achievements
achievible
acumen
adaptable
adaptive
adequate
adjustable
admirable
admirably
admiration
admire
admirer
admiring
admiringly
adorable
adore
adored
adorer
adoring
end data
dataset name lookup.

data list fixed/text(a50).
begin data
adorer
ADORER
adoring
admire.
admirer's
vanilla
inadequate
end data
dataset name main.
dataset activate main.

begin program.
import spss, re
class vlookup(object):
    """Check values according to a dictionary specified as an SPSS dataset"""
    def __init__(self, dataset):
        """dataset is a dataset of words

        Lookups are made after breaking at word boundaries and ignoring letter case
        The class creates a function named func that can be referenced for lookups"""

        spss.StartDataStep()
        try:
            ds = spss.Dataset(dataset)
            cases = ds.cases
            self.table = set()
            for i in range(len(cases)):
                self.table.add(cases[i, 0][0].rstrip().lower())

            def func(x):
                x = re.findall(r"\w+", x.lower())
                for word in x:
                    if word in self.table:
                        return True
                return False
            self.func = func
        finally:
            spss.EndDataStep()
end program.

spssinc trans result=hasword
/initial "vlookup('lookup')"
/formula "func(text)".
|
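The effect of changing how the input string is split can be tried outside Statistics. The sketch below is a stand-alone approximation, not the posted class: it has no spss import, the lookup set is a hand-picked subset of the word list, and the names LOOKUP, has_word_split and has_word_regex are invented here. It runs the blank-split test and the re.findall(r"\w+", ...) test over the same sample strings used in the example.

```python
import re

# Hand-picked subset of the lookup words, for illustration only.
LOOKUP = {"admire", "admirer", "adorer", "adoring", "adequate"}


def has_word_split(text):
    """Earlier approach: split at blanks, compare each piece ignoring case."""
    return any(piece.lower() in LOOKUP for piece in text.rstrip().split())


def has_word_regex(text):
    """Revised approach: pull out runs of word characters, then compare."""
    return any(word in LOOKUP for word in re.findall(r"\w+", text.lower()))


samples = ["adorer", "ADORER", "admire.", "admirer's", "vanilla", "inadequate"]
results = [(s, has_word_split(s), has_word_regex(s)) for s in samples]
for row in results:
    print(row)
```

With blank splitting, "admire." and "admirer's" are compared with their punctuation attached and so are missed; with the regular expression they are reduced to "admire" and "admirer" and found. In both versions "inadequate" is not matched by "adequate", since whole tokens are compared.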
In reply to this post by Jon Peck
Hi, I copied the syntax from Jon Peck's first SPSSINC TRANS post (quoted earlier in this thread), and when I run it I get the following error. I am using SPSS 22, and SPSSINC TRANS is installed.

Messages
Warnings
name 'vlookup' is not defined

Please help.

Regards
Manoj
|
Administrator
|
Perhaps you need to get a more recent version of SPSSINC TRANS?
Just a wild guess. In all likelihood the code I posted will run faster and it is not a black box.

> Hi, I have copied the below syntax and at the time of execution, I am getting the following error. I am using SPSS 22 and SPSSINC TRANS is installed
> Messages
> Warnings
> name 'vlookup' is not defined
> Please help.
> Regards
> Manoj
> <SNIP>
|
In reply to this post by Arora, Manoj (IMDLR)
The vlookup class was added to the extendedTransforms module in December, 2009, so the version of that module you have probably has this function. You have to reference it qualified by the module name in the SPSSINC TRANS formula. Here is an example.

* The lookup table.
data list free/ value(F8.0) akey(A1).
begin data
10 'a'
20 'b'
100 'z'
end data.
dataset name lookup.

* The main dataset.
data list free/x(f8.0) y(A2).
begin data
1 'a'
2 'b'
5 'a '
10 ''
1 'b'
end data.
dataset name main.
dataset activate main.

spssinc trans result = resultcodealpha
/initial "extendedTransforms.vlookup('akey', 'value', 'lookup')"
/formula func(y).

extendedTransforms.py is installed with Statistics as of several versions ago but is also posted on the SPSS Community website. If your version is actually too old for this, let me know, and I will send you the current version off list.
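For readers unfamiliar with the key/value idea, here is a plain-Python sketch of what a table lookup of this kind does with the example data above. It is not the extendedTransforms code: make_vlookup is a hypothetical name, and two behaviours in it are assumptions made for the illustration rather than documented facts about the module, namely that trailing blanks in keys are ignored and that an unmatched key yields None.

```python
def make_vlookup(keys, values):
    """Build a lookup function from parallel lists of keys and values.

    Hypothetical stand-in for illustration; not the extendedTransforms API.
    """
    # Assumption: keys are compared after stripping trailing blanks.
    table = {str(k).rstrip(): v for k, v in zip(keys, values)}

    def func(x):
        # Assumption: a key that is not in the table gives None (missing).
        return table.get(str(x).rstrip())

    return func


# Lookup table: akey -> value, as in the example datasets.
func = make_vlookup(["a", "b", "z"], [10, 20, 100])

# The y values of the main dataset.
results = [func(y) for y in ["a", "b", "a ", "", "b"]]
print(results)
```

The pattern mirrors the SPSSINC TRANS call: the /initial expression builds the table once, and the /formula expression calls func for each case.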
|
Hi Jon, Thanks for your reply. I have checked, and extendedTransforms.py is not installed with SPSS. Could you please share it with me? Regards |