I'm trying to parse a string variable with thousands of cases into new variables. For example, a variable named "demotext" has the following string:

|NodeID:10115|TenureID:4|GenderID:2

I'd like to be able to create separate variables titled NodeID, TenureID, GenderID with their respective values: 10115, 4, 2. As mentioned, I'll need to parse all of the numbers out into separate variables, and do this for many cases, and there's a good chance the values won't always be the same length as in the given example.

Thoughts?

Thanks,
dk
At 08:47 AM 3/28/2007, Doug Keyser wrote:
>I'm trying to parse a string variable with thousands of cases into new
>variables. For example, a variable named "demotext" has the following
>string:
>
>|NodeID:10115|TenureID:4|GenderID:2
>
>I'd like to be able to create separate variables titled NodeID,
>TenureID, GenderID with their respective values...10115, 4, 2.
>
>Thoughts?

Well (smile), another 'transformation program vs. Python'. Python's going to have an edge here, because it can read your string of keywords and generate the code to declare the variables.

HOWEVER, here's a transformation-program solution. I'm writing assuming

.  The only variables you're creating are NodeID, TenureID, and GenderID;
.  The input string value is in variable StringV, which is no more than 100 characters long;
.  The string's internal structure is indefinite repetition of "|<keyword>:<value>". (Code doesn't check for syntax errors, which is a bad deficiency in a parser; among other things, it can let parser bugs get by.)
.  The combination of a keyword and its value is never more than 25 characters long;
.  There are no more than 10 keyword-value pairs;
.  All values are numeric, and F5 is an OK format for all the variables;
.  A keyword can't occur more than once in the string. (If it does, the value from the latest occurrence will be used, with no warning.)

These can be changed or adjusted, of course. Code is untested; and, I'm afraid, for even a simple parser like this, that means there'll be an error somewhere.

NUMERIC NodeID TenureID GenderID (F5).
STRING BadKey (A10).
VAR LABEL BadKey '(Last) unrecognized keyword found in string'.

STRING #Parsing (A100)
      /#Assign (A25)
      /#KeyStr #ValStr (A12).
NUMERIC #Value (F5).

COMPUTE #Parsing = LTRIM(StringV).
COMPUTE #Parsing = LTRIM(#Parsing,'|').

LOOP #AsgnNum = 1 TO 10 IF #Parsing NE ' '.
.  COMPUTE #Index = INDEX(#Parsing,'|').
.  DO IF #Index GT 0.
.     COMPUTE #Assign = SUBSTR(#Parsing,1,#Index-1).
.     COMPUTE #Parsing = SUBSTR(#Parsing,#Index).
.  ELSE.
.     COMPUTE #Assign = #Parsing.
.     COMPUTE #Parsing = ''.
.  END IF.
.  COMPUTE #Parsing = LTRIM(#Parsing,'|').
.  COMPUTE #Assign = LTRIM(#Assign).
.  COMPUTE #Index = INDEX (#Assign,':').
.  COMPUTE #KeyStr = SUBSTR(#Assign,1,#Index).
.  COMPUTE #ValStr = SUBSTR(#Assign,#Index+1).
.  COMPUTE #Value = NUMBER(#ValStr,F12).
.  COMPUTE #Matched = 0.
.  DO REPEAT KeyWord = 'NodeID' 'TenureID' 'GenderID'
            /TgtVbl = NodeID TenureID GenderID.
.     DO IF KeyWord = #KeyStr.
.        COMPUTE TgtVbl = #Value.
.        COMPUTE #Matched = 1.
.     END IF.
.  END REPEAT.
.  IF #Matched NE 1 BadKey = #KeyStr.
END LOOP.
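
For comparison, the same split-on-'|', then split-on-':' logic can be sketched in plain Python. This is only an illustration of the parsing steps under the assumptions listed above, not the SPSS-Python integration the reply alludes to; the function name parse_pairs and its known_keys parameter are made up for this example.

```python
def parse_pairs(text, known_keys=("NodeID", "TenureID", "GenderID")):
    """Split '|key:value|key:value' text into integer values.

    Returns (values, bad_key). values maps each known keyword to an int,
    or None when the keyword is absent or its value is not an integer.
    bad_key is the last unrecognized keyword, or '' if there was none.
    """
    values = {key: None for key in known_keys}
    bad_key = ""
    for pair in text.strip().split("|"):
        pair = pair.strip()
        if not pair:
            continue  # skip the empty piece before the leading '|'
        key, _, raw = pair.partition(":")
        if key in values:
            try:
                values[key] = int(raw.strip())
            except ValueError:
                values[key] = None  # empty or non-numeric value
        else:
            bad_key = key  # remember only the last unknown keyword
    return values, bad_key


values, bad_key = parse_pairs("|NodeID:10115|TenureID:4|GenderID:2")
print(values)  # {'NodeID': 10115, 'TenureID': 4, 'GenderID': 2}
```

As in the transformation-program version, a keyword that occurs twice silently keeps its last value, and nothing checks the string for syntax errors.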
In reply to this post by Doug Keyser
Update: test data, and tested code.
At 08:47 AM 3/28/2007, Doug Keyser wrote:

>I'm trying to parse a string variable with thousands of cases into new
>variables. For example, a variable named "demotext" has the following
>string:
>
>|NodeID:10115|TenureID:4|GenderID:2
>
>I'd like to be able to create separate variables titled NodeID,
>TenureID, GenderID with their respective values...10115, 4, 2.

This is tested; it fixes one simple bug that prevented functioning, and adds a couple of features. It's SPSS 15 draft output, but all language features used should work back at least through release 9.

In this version, intermediate variables are regular variables whose names begin with '@'. Change those to '#' to make them scratch variables, to leave the output file uncluttered.

|-----------------------------|---------------------------|
|Output Created               |29-MAR-2007 18:59:45       |
|-----------------------------|---------------------------|
[Keyser] C:\Documents and Settings\Richard\My Documents
           \Eudora mail\Attachments\parsing1.sav

demoText

|NodeID:10132|RecDate:1/18/2007|TenureID:|GenderID:
|NodeID:10115|RecDate:1/18/2007|TenureID:4|GenderID:2
|NodeID:10134|RecDate:1/18/2007|TenureID:2|GenderID:2
|NodeID:10134|RecDate:1/18/2007|TenureID:1|GenderID:2
|NodeID:10134|RecDate:1/18/2007|TenureID:4|GenderID:2
|NodeID:10133|RecDate:1/18/2007|TenureID:1|GenderID:1
|NodeID:10134|RecDate:1/18/2007|TenureID:3|GenderID:2
|NodeID:10133|RecDate:1/18/2007|TenureID:7|GenderID:2
|NodeID:10115|RecDate:1/18/2007|TenureID:1|GenderID:2
|NodeID:10134|RecDate:1/18/2007|TenureID:7|GenderID:3

Number of cases read:  10    Number of cases listed:  10

NUMERIC NodeID RecDate TenureID genderid (F5).
FORMATS RecDate (DATE11).
STRING BadKey (A10).
VAR LABEL BadKey '(Last) unrecognized keyword found in string'.

*  Working variables                                    ***.
STRING @Parsing  /* Remaining unparsed part on input        */ (A255)
      /@Assign   /* Assignment pair (<keyword>:<value>)     */ (A25)
      /@KeyStr   /* Keyword part of <keyword>:<value> pair  */ (A12)
      /@ValStr   /* Value part of <keyword>:<value> pair    */ (A12).
NUMERIC @Value   /* Value part, converted to numeric        */ (F5)
       /@Index   /* Result of "INDEX" search in a string    */ (F5)
       /@AsgnNum /* Counter, through 'assignment' pairs     */ (F5).

COMPUTE @Parsing = LTRIM(demotext).
COMPUTE @Parsing = LTRIM(@Parsing,'|').

LOOP @AsgnNum = 1 TO 10 IF @Parsing NE ' '.
.  COMPUTE @Index = INDEX(@Parsing,'|').
.  DO IF @Index GT 0.
.     COMPUTE @Assign = SUBSTR(@Parsing,1,@Index-1).
.     COMPUTE @Parsing = SUBSTR(@Parsing,@Index).
.  ELSE.
.     COMPUTE @Assign = @Parsing.
.     COMPUTE @Parsing = ''.
.  END IF.
.  COMPUTE @Parsing = LTRIM(@Parsing,'|').
.  COMPUTE @Assign = LTRIM(@Assign).
.  COMPUTE @Index = INDEX (@Assign,':').
.  COMPUTE @KeyStr = SUBSTR(@Assign,1,@Index-1).
.  COMPUTE @ValStr = SUBSTR(@Assign,@Index+1).
.  COMPUTE Matched = 0.
.  DO REPEAT KeyWord = 'NodeID' 'RecDate' 'TenureID' 'genderid'
            /TgtVbl = NodeID RecDate TenureID genderid.
.     DO IF UPCASE(@KeyStr) = UPCASE(KeyWord).
.        DO IF UPCASE(@KeyStr) = 'RECDATE'.
*           Special case: value is a date, not an integer *** .
.           COMPUTE @Value = NUMBER(@ValStr,ADATE12).
.        ELSE.
*           Integer values:                               *** .
.           COMPUTE @Value = NUMBER(@ValStr,F12).
.        END IF.
.        COMPUTE TgtVbl = @Value.
.        COMPUTE Matched = 1.
.     END IF.
.  END REPEAT.
.  IF Matched NE 1 BadKey = @KeyStr.
END LOOP.

LIST demoText TO BadKey.

List
|-----------------------------|---------------------------|
|Output Created               |29-MAR-2007 18:59:47       |
|-----------------------------|---------------------------|
[Keyser] C:\Documents and Settings\Richard\My Documents
           \Eudora mail\Attachments\parsing1.sav

The variables are listed in the following order:

LINE   1: demoText
LINE   2: NodeID RecDate TenureID genderid BadKey

demoText: |NodeID:10132|RecDate:1/18/2007|TenureID:|GenderID:
  NodeID: 10132 18-JAN-2007     .     .

demoText: |NodeID:10115|RecDate:1/18/2007|TenureID:4|GenderID:2
  NodeID: 10115 18-JAN-2007     4     2

demoText: |NodeID:10134|RecDate:1/18/2007|TenureID:2|GenderID:2
  NodeID: 10134 18-JAN-2007     2     2

demoText: |NodeID:10134|RecDate:1/18/2007|TenureID:1|GenderID:2
  NodeID: 10134 18-JAN-2007     1     2

demoText: |NodeID:10134|RecDate:1/18/2007|TenureID:4|GenderID:2
  NodeID: 10134 18-JAN-2007     4     2

demoText: |NodeID:10133|RecDate:1/18/2007|TenureID:1|GenderID:1
  NodeID: 10133 18-JAN-2007     1     1

demoText: |NodeID:10134|RecDate:1/18/2007|TenureID:3|GenderID:2
  NodeID: 10134 18-JAN-2007     3     2

demoText: |NodeID:10133|RecDate:1/18/2007|TenureID:7|GenderID:2
  NodeID: 10133 18-JAN-2007     7     2

demoText: |NodeID:10115|RecDate:1/18/2007|TenureID:1|GenderID:2
  NodeID: 10115 18-JAN-2007     1     2

demoText: |NodeID:10134|RecDate:1/18/2007|TenureID:7|GenderID:3
  NodeID: 10134 18-JAN-2007     7     3

Number of cases read:  10    Number of cases listed:  10
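
The two features added in the tested version, case-insensitive keyword matching and a special case for the date-valued RecDate, might look like this in plain Python. It is a rough sketch for comparison only: the CONVERTERS table and the names parse_record and _to_date are invented for the example, and the month/day/year reading of "1/18/2007" is assumed from the ADATE format used above.

```python
from datetime import datetime


def _to_date(raw):
    # Assumes American-style month/day/year, as ADATE does in the SPSS code.
    return datetime.strptime(raw, "%m/%d/%Y").date()


# Hypothetical lookup: upper-cased keyword -> (output name, converter).
CONVERTERS = {
    "NODEID": ("NodeID", int),
    "RECDATE": ("RecDate", _to_date),
    "TENUREID": ("TenureID", int),
    "GENDERID": ("genderid", int),
}


def parse_record(text):
    """Parse one '|key:value|...' string into a dict, None meaning missing."""
    record = {name: None for name, _ in CONVERTERS.values()}
    record["BadKey"] = ""
    for pair in text.strip().split("|"):
        if not pair.strip():
            continue  # skip the empty piece before the leading '|'
        key, _, raw = pair.partition(":")
        entry = CONVERTERS.get(key.strip().upper())  # case-insensitive match
        if entry is None:
            record["BadKey"] = key.strip()  # keep the last unknown keyword
            continue
        name, convert = entry
        try:
            record[name] = convert(raw.strip())
        except ValueError:
            record[name] = None  # empty or unparseable value stays missing
    return record


print(parse_record("|NodeID:10132|RecDate:1/18/2007|TenureID:|GenderID:"))
```

Empty values such as "TenureID:" come back as None here, which plays the role of the system-missing "." in the listing above.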