|
Thank you very much for your help!
I am using SPSS 24. I have two groups of people writing definitions for the same course. The definitions differ in length and use different words, but of course some keywords may match. I want to find out how similar each row of one column is to each row of the other column. If they are similar, do they match on one keyword or on three? Can I get a count of rows for each number of matches, say how many rows have three keywords matching, or how many rows have five keywords matching? |
|
Do you have pairs of responses, or do you have a single column and two groups of cases?
Please create a small subset of your data so we can better understand how it is set up. Then create a syntax file that says DISPLAY DICTIONARY. Copy that output and paste it into a reply on this list.
Art Kendall
Social Research Consultants |
|
For example, one column says,
Examines the psychological development of individuals moving from their early twenties into old age.
The other says,
Developmental and Child Psychology
Here the word Psychology may be counted as a match. There are over 40,000 cases in both columns, and we don't know in advance which words appear in each row. |
|
|
But psychological and Psychology are *NOT* the same word.
How do you propose to resolve this?
--
Some ideas.
1. SPLIT the strings into two VECTORS (search this archive for Parse).
Two alternatives:
2a. Take these vectors from wide to long using VARSTOCASES.
3a. Do a Cartesian merge of the two vectors (search the archives for this).
2-3b. Compare the two vectors with a nested LOOP.
4ab. Decide whether the various substrings should be considered the same by applying an appropriate distance function (see the archives; this has been discussed).
5ab. Enumerate matches with AGGREGATE and merge to the original file.
Sorry for the lack of specific detail, but I'm slammed. Maybe this will get the ball rolling in the right direction. HTH.
--
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
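For readers who want to see the logic of steps 1 to 5 in the post above before writing SPSS syntax, here is a minimal sketch in plain Python. It is not the poster's code and not SPSS syntax. The names `words`, `similar`, `matched_pairs` and the sample `rows` are made up for illustration, and using `difflib.SequenceMatcher` with a 0.75 cutoff as the "distance function" is an assumption, not something the thread prescribes.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def words(text):
    """Step 1: split a string into a sorted list of unique lowercase words."""
    return sorted(set(re.findall(r"[a-z]+", text.lower())))


def similar(a, b, threshold=0.75):
    """Step 4: treat two words as the same if they are close enough.

    The 0.75 cutoff is an arbitrary assumption; tune it on real data.
    """
    return SequenceMatcher(None, a, b).ratio() >= threshold


def matched_pairs(text_a, text_b):
    """Steps 2-3b: nested loop over the two word lists."""
    return [(wa, wb)
            for wa in words(text_a)
            for wb in words(text_b)
            if similar(wa, wb)]


# Hypothetical two-column data; the first row is the example from this thread.
rows = [
    ("Examines the psychological development of individuals moving "
     "from their early twenties into old age.",
     "Developmental and Child Psychology"),
    ("Introduction to statistics", "Art history survey"),
]

# Step 5: count matches per row, then count rows per number of matches.
per_row = [len(matched_pairs(a, b)) for a, b in rows]
rows_by_match_count = Counter(per_row)
print(per_row)
print(rows_by_match_count)
```

With this cutoff the first row yields two matched pairs (psychological/psychology and development/developmental) and the second row yields none. `rows_by_match_count` is the table the original question asks for: how many rows have a given number of matching words.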
|
You should really have full-fledged text analysis software to handle stemming of word forms, but a rough approach could be carried out if you can provide more details. For instance, how do you want to deal with strings where the same word appears more than once? Presumably you would want to ignore common words such as a, the, if, and, but, ..., and perhaps do something rough about plurals, such as always ignoring a final s. Similarity of words could be an exact match or, say, within a few keystrokes. If you wanted to make lists of word forms for important words, that could also be addressed. Are you always comparing two variables in the same case, or is cross-case comparison required? |
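The rough approach described in the post above (ignore common words, ignore a final s, count a repeated word once, match exactly or within a few keystrokes) could look something like the following plain-Python sketch. The stop list, the function names, and the choice of Levenshtein distance as the meaning of "within a few keystrokes" are all assumptions made for illustration.

```python
import re

# A tiny illustrative stop list (an assumption); a real one would be longer.
STOP_WORDS = {"a", "an", "the", "if", "and", "but", "of", "to", "in"}


def normalize(text):
    """Lowercase, split into words, drop stop words, strip a final 's'.

    Returning a set means a word repeated in one string counts once.
    """
    result = set()
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOP_WORDS:
            continue
        if len(word) > 3 and word.endswith("s"):
            word = word[:-1]  # crude plural handling: "cats" -> "cat"
        result.add(word)
    return result


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def count_matches(text_a, text_b, max_edits=0):
    """Count words of text_a that match some word of text_b.

    max_edits=0 means exact match; a larger value tolerates typos.
    """
    words_b = normalize(text_b)
    return sum(
        1 for wa in normalize(text_a)
        if any(edit_distance(wa, wb) <= max_edits for wb in words_b)
    )
```

For example, `count_matches("Child psycholgy", "Child Psychology", max_edits=1)` counts both words, while the default exact match counts only "child". Note that "psychology" and "psychological" are four edits apart, so a tight keystroke limit will not unify word forms; that is the stemming problem the post mentions.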
|
My first thought, too, was "text analysis software." Without that, the first step must be "Fix the spelling." Then I would equate synonyms (unless you care about these distinctions; distinguishing them will lengthen the list of 'important' words).
I think I would start with VarsToCases, drop [a, and, the, ...], and aggregate to count. I would start with, say, 1000 cases, to keep the first results to a more readable length. That helps to check spelling and synonyms, and might show other ambiguities.
Then I would probably base my comparisons on words or partial words, taking the top 50 or so most relevant words. 100? More? Fewer? Be guided by the counts.
Then: strip each list to its relevant words; cross-compare; compute a coefficient of some sort for the similarity.
-- Rich Ulrich
From: SPSSX(r) Discussion <[hidden email]> on behalf of Jon Peck <[hidden email]>
Sent: Wednesday, May 10, 2017 3:26:23 PM
To: [hidden email]
Subject: Re: How can I compare two columns of Text of different length? |
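The workflow in Rich Ulrich's post (count words across all cases, keep the most frequent relevant ones, strip each text to those words, then compute a coefficient) can be sketched in plain Python as below. This is only an illustration: the post says "a coefficient of some sort", and the Jaccard coefficient used here is one assumed choice. The stop list, the function names, and the sample texts are likewise hypothetical.

```python
import re
from collections import Counter

# Illustrative stop list (an assumption); extend it after inspecting counts.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in"}


def tokens(text):
    """Split a string into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())


def top_words(texts, n=50):
    """Count every non-stop word across all texts; keep the n most frequent."""
    counts = Counter()
    for text in texts:
        counts.update(w for w in tokens(text) if w not in STOP_WORDS)
    return {word for word, _ in counts.most_common(n)}


def jaccard(text_a, text_b, vocabulary):
    """Shared relevant words divided by all relevant words in either text."""
    a = set(tokens(text_a)) & vocabulary
    b = set(tokens(text_b)) & vocabulary
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical sample; real use would start with about 1000 cases.
texts = [
    "child psychology and development",
    "adult psychology",
    "child development",
]
vocabulary = top_words(texts, n=3)
score = jaccard(texts[0], texts[1], vocabulary)
print(sorted(vocabulary), score)
```

Printing the counts before fixing `n` is the point of the exercise: the frequency list is where misspellings and synonyms show up, and it guides how many words are worth keeping.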
|
Thank you, Rich! I will try.
Jun
|
