SPSSX Discussion

Text Field

Michael Kreißig

Text Field

Hi all,

I have the job to evaluate skin cancer data which I received from an
external clinic. Those nice colleagues created a text field with the
following contents (or similar):

Breslow: 1.5, Level 4
TD 2.05, Inv.-level 3

and so on.

Is there any possibility to extract the figures for tumor thickness (1.5 or
2.05) and invasion level (4 or 3) and put them in numerical fields? To my
knowledge, it's not possible, but maybe some of you have an idea.

There are thousands of cases, so I can't do it manually.

Mike

Anton Balabanov

Re: Text Field

Hi, Mike,

my solution is simple, but I am not sure how practical it is, because to make it
work properly you must guarantee for your thousands of cases that...
1) tumor and invasion numbers always appear in the same order as in your
example;
2) no other numeric values (other than tumor and invasion) separated from
the text appear in the case line;
3) tumor and invasion numbers have at least one leading space and at
least one comma or trailing space or tab or end of string or ... after
them;
4) no more than 10 tokens (separated portions of text/numbers) appear in
the case line (this could be relaxed by setting x1 TO xN in the code).

If at least one of these assumptions is violated, the solution will be
more sophisticated, involving parsing the string.

Well, I tested it with your example successfully. Put your data within
BEGIN DATA/END DATA statements.

* Read up to 10 tokens per line as numbers; text tokens become system-missing.
DATA LIST LIST /X1 TO X10 var1 var2 (12F4.2) .
BEGIN DATA
Breslow: 1.5, Level 4 something else (not numbers)
TD: 2.05,something else (not numbers) Inv.-level 3
END DATA.

* var1 receives the first numeric token, var2 the last one.
DO REPEAT x=x1 TO x10.
DO IF ~MISS(x).
- IF MISS(var1) var1=x.
- COMPUTE var2=x.
END IF.
END REPEAT.
LIST var1 var2.
DELETE VARS x1 to x10.
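
For comparison, the same "first number / last number" idea can be sketched in plain Python. This is only an illustration under the four assumptions listed above; the function name first_and_last_number and the exact token rule (digits with an optional decimal part) are my own choices, not part of the SPSS syntax.

```python
import re

# A token counts as numeric if it is digits with an optional decimal part.
NUMERIC_TOKEN = re.compile(r"\d+(?:\.\d+)?")

def first_and_last_number(text):
    """Return (first, last) numeric token of text, or (None, None)."""
    # Split on whitespace and commas, roughly as DATA LIST LIST does.
    tokens = [t for t in re.split(r"[\s,]+", text) if t]
    # Non-numeric tokens are skipped, like the system-missing x values.
    numbers = [float(t) for t in tokens if NUMERIC_TOKEN.fullmatch(t)]
    if not numbers:
        return None, None
    return numbers[0], numbers[-1]

rows = [
    "Breslow: 1.5, Level 4 something else (not numbers)",
    "TD: 2.05,something else (not numbers) Inv.-level 3",
]
results = [first_and_last_number(r) for r in rows]
for row, (thickness, level) in zip(rows, results):
    print(row, thickness, level)
```

As with the DO REPEAT version, a line containing only one number would put that number in both fields, so such cases need a separate check.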

Best,
Anton

> -----Original Message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]]
> On Behalf Of Michael Kreißig
> Sent: Thursday, November 02, 2006 11:23 AM
> To: [hidden email]
> Subject: Text Field

Spousta Jan

Re: Text Field

In reply to this post by Michael Kreißig

Hi Mike,

Try the program below.

BTW, I am just reading a book by your emeritus colleague from Uni Tübingen, the theologian Hans Küng :-)

Greetings

Jan

* prepare the data.
data list fixed /diagn (a25).
begin data.
Breslow: 1.5, Level 4
TD 2.05, Inv.-level 3
end data.
exe.

* run a Python program which finds the numbers using regular expressions.
begin program python.
import spss
import re

i = [0]  # position of the variable in the file
dataCursor = spss.Cursor(i)  # get the data from the active SPSS file
oneVar = dataCursor.fetchall()
dataCursor.close()

myPattern = re.compile(r'^\D*(\d+\.?\d*)\D+(\d)\D*$')  # define the regular expression
# edit the regexp if there are other patterns in the data file
# - I wrote it so that it works for the two examples you gave us

print("Patterns found:\n")
for item in oneVar:  # cycle through the rows
    desc = item[0].strip()
    m = myPattern.search(desc)  # find the numbers
    if m is None:  # row does not fit the pattern: print it with empty fields
        print(desc + "\t\t")
        continue
    a = m.groups()
    # print results in the tab-separated format
    # (you can paste them into an Excel sheet then)
    print(desc + "\t" + a[0] + "\t" + a[1])

print("\nHave a nice day, Mike")
end program.
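
With thousands of free-text cases, some rows will probably not fit the pattern. The sketch below, which runs outside SPSS, reuses the same regular expression with named groups, converts the matches to numbers, and collects the rows that fail to match so they can be inspected. The names parse_diagnosis, samples and unmatched are hypothetical, and the third sample row is invented only to show the no-match case.

```python
import re

# Same pattern as in the program above, with named groups for readability.
PATTERN = re.compile(r'^\D*(?P<thickness>\d+\.?\d*)\D+(?P<level>\d)\D*$')

def parse_diagnosis(desc):
    """Return (thickness, level) as numbers, or (None, None) if no match."""
    m = PATTERN.search(desc.strip())
    if m is None:
        return None, None
    return float(m.group("thickness")), int(m.group("level"))

samples = ["Breslow: 1.5, Level 4", "TD 2.05, Inv.-level 3", "not recorded"]
parsed = [parse_diagnosis(s) for s in samples]

# Rows the pattern does not cover; review these before trusting the results.
unmatched = [s for s, p in zip(samples, parsed) if p == (None, None)]
print(parsed)
print(unmatched)
```

Checking the unmatched list first shows which additional patterns the regular expression still has to cover.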

Peck, Jon

Re: Text Field

Jan has nicely illustrated the power and convenience of regular expressions as provided in Python programmability. The version below shows how you can use SPSS 15 and the Bonus Pack modules to create the new values directly in SPSS.

Regular expression search is provided by the search class in the extendedTransforms module. This example creates the thickness variable based on a regular expression that means one or more digits (\d+), optionally followed immediately by a decimal point and more digits (\.*\d*). If no digits are found, the result is SYSMIS.

The tproc.execute() line takes care of reading the data case by case and writing the search result back to the existing dataset.

* prepare the data.
data list fixed /diagn (a25).
begin data.
Breslow: 1.5, Level 4
TD 2.05, Inv.-level 3
end data.

begin program.
import spss, trans, extendedTransforms

tproc = trans.Tfunction()
tproc.append(extendedTransforms.search, "thickness", "f",
             ["diagn", trans.const(r"\d+\.*\d*")])
tproc.execute()
end program.
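
To see what the pattern \d+\.*\d* picks up, it can be tried with the standard re module outside SPSS. The sketch below is only an illustration of the regular expression and does not use or imitate the extendedTransforms API; the helper names first_number and last_number are hypothetical, and taking the last match as the invasion level is my assumption based on the two example rows.

```python
import re

NUMBER = re.compile(r"\d+\.*\d*")  # the pattern used in the program above

def first_number(text):
    """Return the first match as text, or None if there are no digits."""
    m = NUMBER.search(text)
    return m.group() if m else None

def last_number(text):
    """Return the last match as text, or None if there are no digits."""
    found = NUMBER.findall(text)
    return found[-1] if found else None

rows = ["Breslow: 1.5, Level 4", "TD 2.05, Inv.-level 3", "no figures"]
thickness = [first_number(r) for r in rows]
level = [last_number(r) for r in rows]
print(thickness)
print(level)
```

The None returned for a row without digits plays the role that SYSMIS plays in the SPSS version. Because \.* accepts any number of dots, a typo such as "1..5" would also match, so the matched text should be validated before it is converted to a number.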

-Jon Peck

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Spousta Jan
Sent: Thursday, November 02, 2006 3:50 AM
To: [hidden email]
Subject: Re: [SPSSX-L] Text Field
