Hi all,
I have the job of evaluating skin cancer data that I received from an external clinic. Those nice colleagues created a text field with the following contents (or similar):

Breslow: 1.5, Level 4
TD 2.05, Inv.-level 3

and so on.

Is there any way to extract the figures for tumor thickness (1.5 or 2.05) and invasion level (4 or 3) and put them in numeric fields? To my knowledge it's not possible, but maybe some of you have an idea.

There are thousands of cases, so I can't do it manually.

Mike
Hi, Mike,
My solution is simple, but I am not sure how practical it is, because for it to work properly you must guarantee for your thousands of cases that:

1) the tumor and invasion numbers always appear in the same order as in your example;
2) no other numeric values (other than tumor and invasion) appear as separate tokens in the case line;
3) the tumor and invasion numbers have at least one leading space, and at least one comma, trailing space, tab or end of string after them;
4) no more than 10 tokens (separated portions of text or numbers) appear in the case line (this can be relaxed by changing x1 TO xN in the code).

If at least one of these assumptions is violated, the solution will have to be more sophisticated and involve parsing the string.

Well, I tested it successfully with your example. Put your data between the BEGIN DATA and END DATA statements.

DATA LIST LIST /X1 TO X10 var1 var2 (12F4.2) .
BEGIN DATA
Breslow: 1.5, Level 4 something else (not numbers)
TD: 2.05,something else (not numbers) Inv.-level 3
END DATA.
DO REPEAT x=x1 TO x10.
DO IF ~MISS(x).
- IF MISS(var1) var1=x.
- COMPUTE var2=x.
END IF.
END REPEAT.
LIST var1 var2.
DELETE VARS x1 to x10.

Best,
Anton
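For readers who want to check their data against Anton's assumptions before running the syntax, the same first-number/last-number logic can be sketched in plain Python. This is only an illustration, not part of Anton's post: the function names are hypothetical, and the tokenizing (split on spaces and commas) is an assumption that approximates free-field input rather than reproducing it exactly.

```python
import re

# A token counts as a number only if it is digits with an optional decimal part.
NUMERIC_TOKEN = re.compile(r"\d+(?:\.\d+)?")


def is_number(token):
    """Return True for tokens such as '4' or '2.05', False for 'Level' or 'TD:'."""
    return NUMERIC_TOKEN.fullmatch(token) is not None


def first_and_last_number(line):
    """Mimic the DO REPEAT loop: var1 = first numeric token, var2 = last one."""
    # Assumption: commas and whitespace both separate tokens.
    tokens = line.replace(",", " ").split()
    numbers = [float(t) for t in tokens if is_number(t)]
    if not numbers:
        return None, None  # no numeric token at all
    return numbers[0], numbers[-1]


if __name__ == "__main__":
    for text in ("Breslow: 1.5, Level 4", "TD 2.05, Inv.-level 3"):
        print(text, first_and_last_number(text))
```

As in the syntax above, a line with a single number would put that number into both results, so such cases still need a manual check.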
In reply to this post by Michael Kreißig
Hi Mike,
Try the program below. BTW, I am just reading a book by your emeritus colleague from Uni Tübingen, the theologian Hans Küng :-)

Greetings

Jan

* prepare the data.
data list fixed /diagn (a25).
begin data.
Breslow: 1.5, Level 4
TD 2.05, Inv.-level 3
end data.
exe.

* run a Python program which finds the numbers using regular expressions.
begin program python.
import spss
import re

i=([0])                           # position of the variable in the file
dataCursor=spss.Cursor(i)         # get the data from the active SPSS file
oneVar=dataCursor.fetchall()
dataCursor.close()

# define the regular expression
# edit the regexp if there are other patterns in the data file
# - I wrote it so that it works for the two examples you gave us
myPattern = re.compile(r'^\D*(\d+\.?\d*)\D+(\d)\D*$')

print("Patterns found:\n")
for item in oneVar:               # cycle through the rows
    desc = item[0].strip()
    found = myPattern.search(desc)    # find the numbers
    if found is None:             # skip rows the pattern does not cover
        print(desc+"\tno match")
        continue
    a = found.groups()
    # print results in the tab-separated format
    # (you can paste them into an Excel sheet then)
    print(desc+"\t"+a[0]+"\t"+a[1])
print("\nHave a nice day, Mike")
end program.
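The pattern from Jan's program can also be tried outside SPSS before it is run on thousands of cases. The following is a minimal sketch using only the standard re module; the function name parse_diagnosis is hypothetical, and the conversion to float and int is an assumption about how the results would be used.

```python
import re

# Same pattern as in the posted program: leading non-digits, a number with an
# optional decimal part, some non-digits, one digit, trailing non-digits.
PATTERN = re.compile(r'^\D*(\d+\.?\d*)\D+(\d)\D*$')


def parse_diagnosis(text):
    """Return (thickness, level), or None when the pattern does not match."""
    match = PATTERN.search(text.strip())
    if match is None:
        return None
    thickness, level = match.groups()
    return float(thickness), int(level)


if __name__ == "__main__":
    samples = ["Breslow: 1.5, Level 4", "TD 2.05, Inv.-level 3", "unknown"]
    for text in samples:
        print(text, "->", parse_diagnosis(text))
```

One thing worth checking on real data: a line that holds only a thickness, such as "Breslow 1.5", still matches, because the decimal point can be read as the separator between the two groups. The result is then (1.0, 5) instead of a missing level, so lines with a single number should be reviewed separately.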
Jan has nicely illustrated the power and convenience of regular expressions as provided in Python programmability. The version below shows how you can use SPSS 15 and the Bonus Pack modules to create the new values directly in SPSS.
Regular expression search is provided by the search class in the extendedTransforms module. This example creates the thickness variable based on a regular expression that means one or more digits (\d+), optionally followed immediately by a decimal point and more digits. If no digits are found, the result is SYSMIS. The tproc.execute() line takes care of reading the data case by case and writing the search result back to the existing dataset.

* prepare the data.
data list fixed /diagn (a25).
begin data.
Breslow: 1.5, Level 4
TD 2.05, Inv.-level 3
end data.

begin program.
import spss, trans, extendedTransforms

tproc = trans.Tfunction()
tproc.append(extendedTransforms.search, "thickness", "f", ["diagn", trans.const(r"\d+\.*\d*")])
tproc.execute()
end program.

-Jon Peck
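To see what the thickness pattern picks up without the Bonus Pack modules installed, the search can be imitated with the standard re module. This is a sketch of the behaviour described above, not the extendedTransforms implementation: first_number is a hypothetical helper, and None stands in for SYSMIS.

```python
import re

# The pattern passed to trans.const in the posted example.
NUMBER = re.compile(r"\d+\.*\d*")


def first_number(text):
    """Return the first number in text as a float, or None if there is none."""
    match = NUMBER.search(text)
    if match is None:
        return None  # plays the role of SYSMIS
    try:
        return float(match.group())
    except ValueError:
        # \.* allows several dots in a row (e.g. "3..5"), which is not a number.
        return None


if __name__ == "__main__":
    for text in ("Breslow: 1.5, Level 4", "TD 2.05, Inv.-level 3", "no digits"):
        print(text, "->", first_number(text))
```

Because the search stops at the first match, this yields only the thickness; the invasion level would need a second pattern or a second search.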