parsing strings

Doug Keyser
I'm trying to parse a string variable with thousands of cases into new
variables.  For example, a variable named "demotext" has the following
string:

|NodeID:10115|TenureID:4|GenderID:2

I'd like to be able to create separate variables titled NodeID, TenureID,
GenderID with their respective values...10115, 4, 2.

As mentioned, I'll need to parse all of the numbers into separate
variables, and to do this for many cases. There is also a good chance
that the values won't always be the same length as in the given example.

Thoughts?

Thanks,

dk

Re: parsing strings

Richard Ristow
At 08:47 AM 3/28/2007, Doug Keyser wrote:

>I'm trying to parse a string variable with thousands of cases into new
>variables.  For example, a variable named "demotext" has the following
>string:
>
>|NodeID:10115|TenureID:4|GenderID:2
>
>I'd like to be able to create separate variables titled NodeID,
>TenureID, GenderID with their respective values...10115, 4, 2.
>
>Thoughts?

Well (smile), another 'transformation program vs. Python'. Python's
going to have an edge here, because it can read your string of keywords
and generate the code to declare the variables. HOWEVER, here's a
transformation-program solution.
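
To illustrate the point about Python reading the keywords and generating the declaration, here is a minimal sketch in plain Python. It does not use the SPSS Python plug-in; the function names discover_keywords and numeric_declaration are hypothetical, and F5 is assumed to suit every variable.

```python
def discover_keywords(strings):
    """Return keywords in first-seen order from '|key:value|key:value' strings."""
    seen = []
    for s in strings:
        for pair in s.split("|"):
            key, sep, _value = pair.partition(":")
            key = key.strip()
            # Keep only well-formed pairs (a colon is present) with a new keyword.
            if sep and key and key not in seen:
                seen.append(key)
    return seen


def numeric_declaration(keywords, fmt="F5"):
    """Build the text of an SPSS NUMERIC command, one variable per keyword."""
    return "NUMERIC " + " ".join(keywords) + " (" + fmt + ")."


samples = ["|NodeID:10115|TenureID:4|GenderID:2"]
print(numeric_declaration(discover_keywords(samples)))
# prints: NUMERIC NodeID TenureID GenderID (F5).
```

The generated text would still have to be submitted to SPSS by whatever means is available; that step is left out here.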

I'm writing assuming

. The only variables you're creating are NodeID, TenureID, and
GenderID;

. The input string value is in variable StringV, which is no more than
100 characters long;

. The string's internal structure is indefinite repetition of
"|<keyword>:<value>". (Code doesn't check for syntax errors, which is a
bad deficiency in a parser; among other things, it can let parser bugs
get by.)

. The combination of a keyword and its value is never more than 25
characters long;

. There are no more than 10 keyword-value pairs;

. All values are numeric, and F5 is an OK format for all the variables;

. A keyword can't occur more than once in the string. (If it does, the
value from the latest occurrence will be used, with no warning.)

These can be changed or adjusted, of course.
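
For readers who want to check the intended behaviour outside SPSS, the assumptions above can be sketched in plain Python. This is an illustration, not the SPSS solution: parse_demotext, KNOWN_KEYS and MAX_PAIRS are hypothetical names, values are assumed to be integers, and None stands in for system-missing.

```python
KNOWN_KEYS = ("NodeID", "TenureID", "GenderID")
MAX_PAIRS = 10  # assumption from the list above: at most 10 pairs per string


def parse_demotext(text):
    """Parse '|<keyword>:<value>|...' into a dict of integer values."""
    result = {key: None for key in KNOWN_KEYS}
    result["BadKey"] = ""  # (last) unrecognized keyword, as in the SPSS code
    pairs = [p.strip() for p in text.strip().split("|") if p.strip()]
    for pair in pairs[:MAX_PAIRS]:
        key, _sep, value = pair.partition(":")
        if key in KNOWN_KEYS:
            try:
                result[key] = int(value)  # a later occurrence overwrites silently
            except ValueError:
                result[key] = None        # blank or non-integer value
        else:
            result["BadKey"] = key
    return result


print(parse_demotext("|NodeID:10115|TenureID:4|GenderID:2"))
# prints: {'NodeID': 10115, 'TenureID': 4, 'GenderID': 2, 'BadKey': ''}
```

Like the syntax that follows, the sketch does no real syntax checking: a pair without a colon is simply reported as an unrecognized keyword.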

Code is untested; and, I'm afraid, for even a simple parser like this,
that means there'll be an error somewhere.

NUMERIC   NodeID TenureID GenderID (F5).
STRING    BadKey (A10).
VAR LABEL BadKey
           '(Last) unrecognized keyword found in string'.

STRING  #Parsing         (A100)
        /#Assign          (A25)
        /#KeyStr  #ValStr (A12).
NUMERIC #Value           (F5).

COMPUTE #Parsing = LTRIM(StringV).
COMPUTE #Parsing = LTRIM(#Parsing,'|').

LOOP #AsgnNum = 1 TO 10
      IF #Parsing NE ' '.
.  COMPUTE #Index   = INDEX(#Parsing,'|').
.  DO IF   #Index GT 0.
.     COMPUTE #Assign  = SUBSTR(#Parsing,1,#Index-1).
.     COMPUTE #Parsing = SUBSTR(#Parsing,#Index).
.  ELSE.
.     COMPUTE #Assign  = #Parsing.
.     COMPUTE #Parsing = ''.
.  END IF.
.  COMPUTE #Parsing = LTRIM(#Parsing,'|').
.  COMPUTE #Assign  = LTRIM(#Assign).

.  COMPUTE #Index   = INDEX (#Assign,':').
.  COMPUTE #KeyStr  = SUBSTR(#Assign,1,#Index-1).
.  COMPUTE #ValStr  = SUBSTR(#Assign,#Index+1).
.  COMPUTE #Value   = NUMBER(#ValStr,F12).

.  COMPUTE #Matched = 0.
.  DO REPEAT      KeyWord = 'NodeID' 'TenureID' 'GenderID'
                  /TgtVbl  =  NodeID   TenureID   GenderID.
.     DO IF       KeyWord = #KeyStr.
.        COMPUTE  TgtVbl  = #Value.
.        COMPUTE #Matched = 1.
.     END IF.
.  END REPEAT.

.  IF #Matched NE 1 BadKey = #KeyStr.
END LOOP.

Re: parsing strings

Richard Ristow
In reply to this post by Doug Keyser
Update: test data, and tested code.

At 08:47 AM 3/28/2007, Doug Keyser wrote:

>I'm trying to parse a string variable with thousands of cases into new
>variables.  For example, a variable named "demotext" has the following
>string:
>
>|NodeID:10115|TenureID:4|GenderID:2
>
>I'd like to be able to create separate variables titled NodeID,
>TenureID, GenderID with their respective values...10115, 4, 2.

This is tested; it fixes one simple bug that prevented functioning, and
adds a couple of features. It's SPSS 15 draft output, but all language
features used should work back at least through release 9.

In this version, intermediate variables are regular variables whose
names begin with '@' (the flag Matched is also a regular variable).
Change the '@' prefix to '#', and rename Matched to #Matched, to make
them scratch variables and leave the output file uncluttered.

|-----------------------------|---------------------------|
|Output Created               |29-MAR-2007 18:59:45       |
|-----------------------------|---------------------------|
[Keyser] C:\Documents and Settings\Richard\My Documents
            \Eudora mail\Attachments\parsing1.sav

demoText

|NodeID:10132|RecDate:1/18/2007|TenureID:|GenderID:
|NodeID:10115|RecDate:1/18/2007|TenureID:4|GenderID:2
|NodeID:10134|RecDate:1/18/2007|TenureID:2|GenderID:2
|NodeID:10134|RecDate:1/18/2007|TenureID:1|GenderID:2
|NodeID:10134|RecDate:1/18/2007|TenureID:4|GenderID:2
|NodeID:10133|RecDate:1/18/2007|TenureID:1|GenderID:1
|NodeID:10134|RecDate:1/18/2007|TenureID:3|GenderID:2
|NodeID:10133|RecDate:1/18/2007|TenureID:7|GenderID:2
|NodeID:10115|RecDate:1/18/2007|TenureID:1|GenderID:2
|NodeID:10134|RecDate:1/18/2007|TenureID:7|GenderID:3


Number of cases read:  10    Number of cases listed:  10


NUMERIC   NodeID RecDate TenureID genderid (F5).
FORMATS          RecDate                   (DATE11).
STRING    BadKey (A10).
VAR LABEL BadKey
           '(Last) unrecognized keyword found in string'.


*  Working variables                                             ***.
STRING  @Parsing  /* Remaining unparsed part of input       */ (A255)
        /@Assign   /* Assignment pair (<keyword>:<value>)    */  (A25)
        /@KeyStr   /* Keyword part of <keyword>:<value> pair */  (A12)
        /@ValStr   /* Value   part of <keyword>:<value> pair */  (A12).
NUMERIC @Value    /* Value   part, converted to numeric     */   (F5)
        /@Index    /* Result of "INDEX" search in a string   */   (F5)
        /@AsgnNum  /* Counter, through 'assignment' pairs    */   (F5).

COMPUTE @Parsing = LTRIM(demotext).
COMPUTE @Parsing = LTRIM(@Parsing,'|').

LOOP @AsgnNum = 1 TO 10
       IF @Parsing NE ' '.
.  COMPUTE @Index   = INDEX(@Parsing,'|').
.  DO IF   @Index GT 0.
.     COMPUTE @Assign  = SUBSTR(@Parsing,1,@Index-1).
.     COMPUTE @Parsing = SUBSTR(@Parsing,@Index).
.  ELSE.
.     COMPUTE @Assign  = @Parsing.
.     COMPUTE @Parsing = ''.
.  END IF.

.  COMPUTE @Parsing = LTRIM(@Parsing,'|').
.  COMPUTE @Assign  = LTRIM(@Assign).

.  COMPUTE @Index   = INDEX (@Assign,':').
.  COMPUTE @KeyStr  = SUBSTR(@Assign,1,@Index-1).
.  COMPUTE @ValStr  = SUBSTR(@Assign,@Index+1).

.  COMPUTE Matched = 0.
.  DO REPEAT      KeyWord = 'NodeID' 'RecDate' 'TenureID' 'genderid'
                  /TgtVbl  =  NodeID   RecDate   TenureID   genderid.
.     DO IF       UPCASE(@KeyStr) =  UPCASE(KeyWord).
.        DO IF    UPCASE(@KeyStr) = 'RECDATE'.
*           Special case: value is a date, not an integer *** .
.           COMPUTE  @Value   = NUMBER(@ValStr,ADATE12).
.        ELSE.
*           Integer values:                               *** .
.           COMPUTE  @Value   = NUMBER(@ValStr,F12).
.        END IF.
.        COMPUTE  TgtVbl  = @Value.
.        COMPUTE  Matched = 1.
.     END IF.
.  END REPEAT.

.  IF Matched NE 1 BadKey = @KeyStr.
END LOOP.

LIST demoText TO BadKey.

List
|-----------------------------|---------------------------|
|Output Created               |29-MAR-2007 18:59:47       |
|-----------------------------|---------------------------|
[Keyser] C:\Documents and Settings\Richard\My Documents
            \Eudora mail\Attachments\parsing1.sav

The variables are listed in the following order:

LINE   1: demoText
LINE   2: NodeID RecDate TenureID genderid BadKey


     demoText: |NodeID:10132|RecDate:1/18/2007|TenureID:|GenderID:
       NodeID: 10132 18-JAN-2007     .     .

     demoText: |NodeID:10115|RecDate:1/18/2007|TenureID:4|GenderID:2
       NodeID: 10115 18-JAN-2007     4     2

     demoText: |NodeID:10134|RecDate:1/18/2007|TenureID:2|GenderID:2
       NodeID: 10134 18-JAN-2007     2     2

     demoText: |NodeID:10134|RecDate:1/18/2007|TenureID:1|GenderID:2
       NodeID: 10134 18-JAN-2007     1     2

     demoText: |NodeID:10134|RecDate:1/18/2007|TenureID:4|GenderID:2
       NodeID: 10134 18-JAN-2007     4     2

     demoText: |NodeID:10133|RecDate:1/18/2007|TenureID:1|GenderID:1
       NodeID: 10133 18-JAN-2007     1     1

     demoText: |NodeID:10134|RecDate:1/18/2007|TenureID:3|GenderID:2
       NodeID: 10134 18-JAN-2007     3     2

     demoText: |NodeID:10133|RecDate:1/18/2007|TenureID:7|GenderID:2
       NodeID: 10133 18-JAN-2007     7     2

     demoText: |NodeID:10115|RecDate:1/18/2007|TenureID:1|GenderID:2
       NodeID: 10115 18-JAN-2007     1     2

     demoText: |NodeID:10134|RecDate:1/18/2007|TenureID:7|GenderID:3
       NodeID: 10134 18-JAN-2007     7     3

Number of cases read:  10    Number of cases listed:  10
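
As a cross-check on the listing above, the two features added in the tested version (case-insensitive keyword matching and the date special case for RecDate) can be sketched in plain Python. The names parse_case and TARGETS are hypothetical, dates are assumed to be month/day/year as in the test data, and None stands in for system-missing.

```python
from datetime import date, datetime

TARGETS = ("NodeID", "RecDate", "TenureID", "genderid")


def parse_case(text):
    """Parse one demoText value, matching keywords without regard to case."""
    by_upper = {name.upper(): name for name in TARGETS}
    row = {name: None for name in TARGETS}
    row["BadKey"] = ""
    for pair in text.strip().split("|"):
        pair = pair.strip()
        if not pair:
            continue
        key, _sep, value = pair.partition(":")
        target = by_upper.get(key.strip().upper())
        if target is None:
            row["BadKey"] = key  # remember the last unrecognized keyword
            continue
        try:
            if target == "RecDate":
                # Special case: the value is a date, not an integer.
                row[target] = datetime.strptime(value.strip(), "%m/%d/%Y").date()
            else:
                row[target] = int(value)
        except ValueError:
            row[target] = None   # blank or malformed value
    return row


first = parse_case("|NodeID:10132|RecDate:1/18/2007|TenureID:|GenderID:")
print(first["NodeID"], first["RecDate"], first["TenureID"], first["genderid"])
# prints: 10132 2007-01-18 None None
```

The first test case comes out with TenureID and genderid missing, which agrees with the first record in the listing.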