SPSSX Discussion

Identifying similiar text and categorizing them

Classic

List

Threaded

14 messages Options

mayank99

Identifying similiar text and categorizing them

Hi ,

I am looking for a way to categorize text data to parametric fields in the data I have .
I am working on an e-commerce data where on the profile page people are asked to fill their education . There are lot of parametric input available to them like "BA, BCom , BE etc" and there is also an option of "other degree".
If person selects other degree then a text box is provided to him to input his education. But customers fill the degrees available in parametric fields in that text box.
Now what I want to do is to categorize them in the available parametric fields.
So, is there any way that functions like Vlookup and read the degrees in text fields and categorize it to the parametric fields available.
for eg. the moment someone puts BA Geography it should be categorized to BA in parametric field and that code is assigned to it.

I had been stuck with this since last 2 weeks , please suggest how can i do it.

mayank99

Re: Identifying similiar text and categorizing them

This is the text input I have and I want to categorize them in the degrees like "BA, BCOM, BSC "
B A
Bsc Nursing
diploma
MBA
Hotel Management
bachelor of arts
Graduation
commerce
bachelor of commerce
ITI
B.B.A
b. com
BA economics
bsc maths
BHM
BACHELOR IN ARTS
+2
B.Pharm
BSc Chemistry
Arts
b.pharmacy
b.com.
B.Com Computers

Ruben Geert van den Berg

Re: Identifying similiar text and categorizing them

Short answer: use IF commands for setting the values of the "fixed" fields for those respondents who have a given answer on the "other" field. See http://www.spss-tutorials.com/if/.

However, when people can freely type in answers, they're virtually certain to be "messy" as your data illustrate. It may be best to "clean up" the open answers first before using IF commands.

You could create a new string variable and use SPSS string manipulations (especially CHAR.INDEX and LOWER) for "cleaning up" the original answers into a new (string or numeric) variable. For a quick walk through SPSS' main string functions, see http://www.spss-tutorials.com/string-tutorial/.

Last, use the aforementioned IF statements to assign values to the "fixed" fields based on the edited (rather than the original) open answers.

HTH,

Ruben

Albert-Jan Roskam-2

Re: Identifying similiar text and categorizing them

In reply to this post by mayank99

I would also ditch a periods and replace all whitespaces (space, tab, etc) by one space. Then you can try to match this to a lookup table with commonly found abbrevs as keys and harmonized abbrevs as values. You can also use Spssinc Trans + extendedTransforms.vlookup for this. That module also contains string similarity functions which may come in handy here.

------------------------------
On Mon, Jul 21, 2014 11:58 AM CEST Ruben Geert van den Berg wrote:

>Short answer: use IF commands for setting the values of the "fixed" fields
>for those respondents who have a given answer on the "other" field. See
>http://www.spss-tutorials.com/if/.
>
>However, when people can freely type in answers, they're virtually certain
>to be "messy" as your data illustrate. It may be best to "clean up" the open
>answers first before using IF commands.
>
>You could create a new string variable and use SPSS string manipulations
>(especially CHAR.INDEX and LOWER) for "cleaning up" the original answers
>into a new (string or numeric) variable. For a quick walk through SPSS' main
>string functions, see http://www.spss-tutorials.com/string-tutorial/.
>
>Last, use the aforementioned IF statements to assign values to the "fixed"
>fields based on the edited (rather than the original) open answers.
>
>HTH,
>
>Ruben
>
>
>
>--
>View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Identifying-similiar-text-and-categorizing-them-tp5726773p5726775.html
>Sent from the SPSSX Discussion mailing list archive at Nabble.com.
>
>=====================
>To manage your subscription to SPSSX-L, send a message to
>[hidden email] (not to SPSSX-L), with no body text except the
>command. To leave the list, send the command
>SIGNOFF SPSSX-L
>For a list of commands to manage subscriptions, send the command
>INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

mayank99

Re: Identifying similiar text and categorizing them

Thanks a lot .
But the problem i am dealing with is very complex . People write random text input with lot of spelling errors !! I need a way in SPSS text to correct all the spelling errors first in the data and then start categorizing it.

Also , there are 1000 ways in which people write bsc !! bachelor of science/ bachelor of nautical science/ bechelor of sience ! etc etc..
So is there a way to deal with spelling errors???

Bruce Weaver

Re: Identifying similiar text and categorizing them

Administrator

It is highly unlikely (probably impossible) that you would be able to program something that will catch and fix all errors before you've seen the input. In situations like this, I usually do something like the following:

1. Convert everything to lowercase (or uppercase). See the LOWER and UPCASE functions.
2. Automatically recode string variables to numeric. See AUTORECODE.
3. Use RECODE to combine categories of the new numeric variable.
4. Use ADD VALUE LABELS (if necessary) to tweak the value labels.

Before carrying out steps 3 and 4, you have to look at the results of 1 and 2. I don't know any easy way around that.

HTH.

mayank99 wrote

Thanks a lot .
But the problem i am dealing with is very complex . People write random text input with lot of spelling errors !! I need a way in SPSS text to correct all the spelling errors first in the data and then start categorizing it.

Also , there are 1000 ways in which people write bsc !! bachelor of science/ bachelor of nautical science/ bechelor of sience ! etc etc..
So is there a way to deal with spelling errors???

--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Ruben Geert van den Berg

Re: Identifying similiar text and categorizing them

In reply to this post by mayank99

Jon K Peck

Re: Identifying similiar text and categorizing them

I suggest that the first step should be to address the spelling errors in the data. From the help:

To check the spelling of string data values:
1. Select the Data View tab of the Data Editor.
2. Optionally, select one or more variables (columns) to check. To select a variable, click the variable name at the top of the column.
3. From the menus choose:

Utilities > Spelling

If there are no selected variables in Data View, all string variables will be checked.

After that, regular expressions can be used to search for particular words of interest. That requires some Python work, but we can go into more detail if you want to go down that road.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621

From: Ruben Geert van den Berg <[hidden email]>
To: [hidden email],
Date: 07/22/2014 09:15 AM
Subject: Re: [SPSSX-L] Identifying similiar text and categorizing them
Sent by: "SPSSX(r) Discussion" <[hidden email]>

I agree that the extent to which you could automate this may be limited. So again, I'd first copy the original string values into a new string variable so you can always compare later results to original values. Bruce's suggestion of first converting everything to LOWER is a good catch. Do so. I'd also use INDEX to scan for different spellings of different words. Remember that it returns 0 if the substring is not present at all. So you could use it like IF CHAR.INDEX('bechelor') > 0 OR CHAR.INDEX('bsc') > 0 OR (...) to see whether cases have any indication of having any kind of bachelor's degree. And so on. Also, seehttp://www.spss-tutorials.com/index/. You can use SORT CASES to move cases who have no "hits" after each step to the top of the dataset for visual inspection. It may still be a lot of work (do work from syntax) but I'm not sure to what extent that's even avoidable. On the sunny side, you may become very proficient with SPSS' string functions which may turn out to be more useful than you'd expected. HTH, Ruben -- View this message in context:http://spssx-discussion.1045642.n5.nabble.com/Identifying-similiar-text-and-categorizing-them-tp5726773p5726781.htmlSent from the SPSSX Discussion mailing list archive at Nabble.com. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

David Short

Re: Identifying similiar text and categorizing them

In reply to this post by mayank99

It's drudgery, but this works:

String Var2(a45).
Compute var2=UPCASE(var2).
*and all the other suggestions for standardizing the variable.
If INDEX(var2,"bsc")>0 Var1="Bachelor of Science".
* a billion more "IF INDEX...." statements go here. chuckle
Execute.

Just using the key word in a lot of "fill in the blank" data could make the task go faster.

David Short

Re: Identifying similiar text and categorizing them

It seems a stretch, but you might experiment with converting the varible to Soundex and then doing some fuzzy matching. Here's a link to code to create Soundex:
http://www.spsstools.net/Syntax/Strings/SoundexPhoneticComparison.txt

Jon K Peck

Re: Identifying similar text and categorizing them

Soundex was designed to work with English names. It isn't a great general word encoder. You can do it, though, using the SPSSINC TRANS extension command with the extendedTransforms.py module, which also includes a number of word encoding and string difference functions such as might be used in a spell checker.

However, I still think starting with the spelling checker built in to Statistics (or some other source prior to reading the data into Statistics) would be a good start.

The Text Analysis for Surveys product for SPSS, I should point out, is designed to do exactly the sort of tasks that are being discussed here using real linguistic analysis software, and would be a much better solution for serious work.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621

From: David Short <[hidden email]>
To: [hidden email],
Date: 07/22/2014 07:12 PM
Subject: Re: [SPSSX-L] Identifying similiar text and categorizing them
Sent by: "SPSSX(r) Discussion" <[hidden email]>

It seems a stretch, but you might experiment with converting the varible to Soundex and then doing some fuzzy matching. Here's a link to code to create Soundex:http://www.spsstools.net/Syntax/Strings/SoundexPhoneticComparison.txt-- View this message in context:http://spssx-discussion.1045642.n5.nabble.com/Identifying-similiar-text-and-categorizing-them-tp5726773p5726790.htmlSent from the SPSSX Discussion mailing list archive at Nabble.com. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

David Short

Re: Identifying similiar text and categorizing them

In reply to this post by David Short

I suggest looking at Google Refine. If nothing else, it's algorithms can suggest classifications of the variations of the categories in your list. Google Refine can help your coding task if you prefer to code the syntax in SPSS. If you're doing a one time clean up of messy data, Goggle Refine is free, easy and powerful. https://code.google.com/p/google-refine/

Albert-Jan Roskam-2

Re: Identifying similiar text and categorizing them

Hi,

I cobbled this together last night. I standardizes 80 % of the sample records. Please note that I don't know much about the titles you are looking for. Lesson for the next time: make this a multiple choice question! :-))

* sample data.

file handle degrees_file /name = "%temp%\degrees.sav".
dataset close all.

data list free /dirty_degree (a30).
begin data
'B A'
'Bsc Nursing'
'diploma'
'MBA'
'Hotel Management'
'bachelor of arts'
'Graduation'
'commerce'
'bachelor of commerce'
'ITI'
'B.B.A'
'b. com'
'BA economics'
'bsc maths'
'BHM'
'BACHELOR IN ARTS'
'+2'
'B.Pharm'
'BSc Chemistry'
'Arts'
'b.pharmacy'
'b.com.'
'B.Com Computer'
'PhDd'
'P h D'
end data.
save outfile = degrees_file.

* code, including lookup table (you could also use extendedTransforms.vlookup, perhaps).

get file = degrees_file.
begin program.
from difflib import get_close_matches as match
import re

# add abbreviated and full degree names, optionally use custom regex as second element
degrees = \
[('A.A.S.', 'Associate of Applied Science'),
('A.A.A.', 'Associate of Applied Arts'),
('A.P.S.', 'Associate of Political Science'),
('A.A.', 'Associate of Arts'),
('A.E.', 'Associate of Engineering'),
('A.S.', 'Associate of Science'),
('B.Sc.', 'Bachelor of Science'),
('B.F.A.', 'Bachelor of Fine Arts'),
('B.B.A.', 'Bachelor of Business Administration'),
('B.Arch.', 'Bachelor of Architecture'),
('B.A.', '(.*Arts|B[. ]?A)', 'Bachelor of Arts'),
('B.Com.', '.*Com(merce)?', 'Bachelor of Commerce'),
('B.S.', 'Bachelor of Science'),
('B.', 'Bachelor'),
('M.A.', 'Master of Arts'),
('M.F.A.', 'Master of Fine Arts'),
('M.S.', 'Master of Science'),
('M.Res.', 'Master of Research'),
('M.Phil.', 'Master of Philosophy'),
('LL.M.', 'Master of Laws'),
('M.B.A.', 'Master of Business Administration'),
('Ph.D', 'Doctor of Philosophy'),
('M.D.', 'Doctor of Medicine'),
('Ed.D.', 'Doctor of Education'),
('J.D.', 'Juris Doctor')]
degrees = sorted(degrees, key=lambda x: (len(x[0]), x[0]), reverse=True)
degrees_dict = {item[0]: item[-1] for item in degrees}
reverse_degrees = {item[-1].lower(): item[0] for item in degrees}

def harmonize_degree(dirty_degree):
dirty_degree = dirty_degree.rstrip()
for degree in degrees:
     if len(degree) == 3: # custom regex
        m = re.match(degree[1], dirty_degree, re.I)
        if m:
           return degree[-1], 0
     else: # simple regex: make period optional, period could also be space
        m = re.match(r"\b" + degree[0].replace(".", "[. ]?"), dirty_degree, re.I)
        if m:
           return degree[-1], 1
possibilities = [item[0].lower().replace(".", "") for item in degrees]
result = match(word=dirty_degree, possibilities=possibilities, n=1, cutoff=0.6) # lower cutoff = more false positives
if result:
     return degrees_dict.get(result[0]), 2
else:
     possibilities = [item[1].lower() for item in degrees]
     result = match(word=dirty_degree, possibilities=possibilities, n=1, cutoff=0.6)
     return reverse_degrees.get(result[0]) if result else "", 3

end program.

**some basic cleaning.
compute dirty_degree = replace(dirty_degree, ". ", ".").
* replace spaces single standing letters with nothing (P H D --> PHD).
spssinc trans result = dirty_degree type = 30 /formula "re.sub(r'(?i)\b([a-z])\b[. ]?', r'\1', dirty_degree)".
* replace one or more tabs, newlines etc by one space.
spssinc trans result = dirty_degree type = 30 /formula "re.sub('\s+', ' ', dirty_degree)".

** apply various strategies to try and figure out the standardized degree.
spssinc trans result = cleaner_degree strategy type = 50 0 /formula "harmonize_degree(dirty_degree)".
if (cleaner_degree eq "") strategy = 4.
value labels strategy 0 'custom regex' 1 'regex matching' 2 'difflib forward' 3 'difflib backward' 4 'nothing worked'.
frequencies strategy. /* especially check categories 2 & 3.

Regards,

Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a

fresh water system, and public health, what have the Romans ever done for us?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

----- Original Message -----
> From: David Short <[hidden email]>
> To: [hidden email]
> Cc:
> Sent: Wednesday, July 23, 2014 3:41 AM
> Subject: Re: [SPSSX-L] Identifying similiar text and categorizing them
>

> I suggest looking at Google Refine. If nothing else, it's algorithms can
> suggest classifications of the variations of the categories in your list.
> Google Refine can help your coding task if you prefer to code the syntax in
> SPSS. If you're doing a one time clean up of messy data, Goggle Refine is
> free, easy and powerful. https://code.google.com/p/google-refine/
>
>
>
> --
> View this message in context:
> http://spssx-discussion.1045642.n5.nabble.com/Identifying-similiar-text-and-categorizing-them-tp5726773p5726792.html
>
> Sent from the SPSSX Discussion mailing list archive at Nabble.com.
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD
>

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

mayank99

Re: Identifying similiar text and categorizing them

WOW Albert !!
Thanks a lot for doing so much efforts ... its Actually amazing !! let me just try it now :)
Thanks again ....

P.S. I wish everything can be converted to multiple choice , then i would not have been facing this problem at all