Parsing a delimited string with variable length

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

Parsing a delimited string with variable length

William Dudley
I have a data set with the results of data for a "check all that apply" survey.  This is a string variable.  The unique responses are separated by a | sign like in these cases

1|4|17|8
2
17/1
-99

Note that the values do not have to be in order and that I have both single and double digit numbers.

The values in the string may indicate a symptom with 1 = Nausea, 2 = Fatigue, etc.

I want to arrive at a series of binary variables with 0 = No and 1 = Yes, such as Nausea, Fatigue, etc. 

If I can figure out how to extract the strings between |'s then I can figure out the rest (the creation of binary variables).

I have tried the CHAR.Index command but this will require a series of Do If's and it is complicated by the variable length of the string.

I would like to stay within SPSS syntax for my clients' benefits.I can export just the id and this variable as a cvs file and open in SPSS and declare | as a delimiter, and then match files  but Id rather avoid these extra  steps.

​Any thoughts?
Thanks in advance, Bill

--
William N Dudley, PhD
President
Piedmont Research Strategies, Inc
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Parsing a delimited string with variable length

Jon Peck
Try this code.

data list list/x(a10).
begin data
1|4|17|8
2
17|1
-99
end data.
dataset name data.

vector y(10, a3).
loop #i = 1 to 10.
compute #bar = char.index(x, "|").
do if #bar > 0.
compute y(#i) = substr(x, 1, #bar-1).
compute x = substr(x, #bar+1).
else.
compute y(#i) = x.
break.
end if.
end loop.
exec.

(In Python, it would be one command :-))




On Sun, Jul 24, 2016 at 10:09 AM, William Dudley <[hidden email]> wrote:
I have a data set with the results of data for a "check all that apply" survey.  This is a string variable.  The unique responses are separated by a | sign like in these cases

1|4|17|8
2
17/1
-99

Note that the values do not have to be in order and that I have both single and double digit numbers.

The values in the string may indicate a symptom with 1 = Nausea, 2 = Fatigue, etc.

I want to arrive at a series of binary variables with 0 = No and 1 = Yes, such as Nausea, Fatigue, etc. 

If I can figure out how to extract the strings between |'s then I can figure out the rest (the creation of binary variables).

I have tried the CHAR.Index command but this will require a series of Do If's and it is complicated by the variable length of the string.

I would like to stay within SPSS syntax for my clients' benefits.I can export just the id and this variable as a cvs file and open in SPSS and declare | as a delimiter, and then match files  but Id rather avoid these extra  steps.

​Any thoughts?
Thanks in advance, Bill

--
William N Dudley, PhD
President
Piedmont Research Strategies, Inc
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD



--
Jon K Peck
[hidden email]

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Parsing a delimited string with variable length

Jon Peck
p.s.  Once the items are split, you can define a MC set and analyze directly with CTABLES, e.g.,
MRSETS
  /MCGROUP NAME=$y VARIABLES=y1 y10 y2 y3 y4 y5 y6 y7 y8 y9
  /DISPLAY NAME=[$y].
MISSING VALUES y1 TO y10("").
CTABLES
  /TABLE $y.


On Sun, Jul 24, 2016 at 11:28 AM, Jon Peck <[hidden email]> wrote:
Try this code.

data list list/x(a10).
begin data
1|4|17|8
2
17|1
-99
end data.
dataset name data.

vector y(10, a3).
loop #i = 1 to 10.
compute #bar = char.index(x, "|").
do if #bar > 0.
compute y(#i) = substr(x, 1, #bar-1).
compute x = substr(x, #bar+1).
else.
compute y(#i) = x.
break.
end if.
end loop.
exec.

(In Python, it would be one command :-))




On Sun, Jul 24, 2016 at 10:09 AM, William Dudley <[hidden email]> wrote:
I have a data set with the results of data for a "check all that apply" survey.  This is a string variable.  The unique responses are separated by a | sign like in these cases

1|4|17|8
2
17/1
-99

Note that the values do not have to be in order and that I have both single and double digit numbers.

The values in the string may indicate a symptom with 1 = Nausea, 2 = Fatigue, etc.

I want to arrive at a series of binary variables with 0 = No and 1 = Yes, such as Nausea, Fatigue, etc. 

If I can figure out how to extract the strings between |'s then I can figure out the rest (the creation of binary variables).

I have tried the CHAR.Index command but this will require a series of Do If's and it is complicated by the variable length of the string.

I would like to stay within SPSS syntax for my clients' benefits.I can export just the id and this variable as a cvs file and open in SPSS and declare | as a delimiter, and then match files  but Id rather avoid these extra  steps.

​Any thoughts?
Thanks in advance, Bill

--
William N Dudley, PhD
President
Piedmont Research Strategies, Inc
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD



--
Jon K Peck
[hidden email]




--
Jon K Peck
[hidden email]

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Parsing a delimited string with variable length

William Dudley
Thanks Jon
I will give it a try!

Just curious
What would python command be?



On Sun, Jul 24, 2016 at 1:49 PM, Jon Peck <[hidden email]> wrote:
p.s.  Once the items are split, you can define a MC set and analyze directly with CTABLES, e.g.,
MRSETS
  /MCGROUP NAME=$y VARIABLES=y1 y10 y2 y3 y4 y5 y6 y7 y8 y9
  /DISPLAY NAME=[$y].
MISSING VALUES y1 TO y10("").
CTABLES
  /TABLE $y.


On Sun, Jul 24, 2016 at 11:28 AM, Jon Peck <[hidden email]> wrote:
Try this code.

data list list/x(a10).
begin data
1|4|17|8
2
17|1
-99
end data.
dataset name data.

vector y(10, a3).
loop #i = 1 to 10.
compute #bar = char.index(x, "|").
do if #bar > 0.
compute y(#i) = substr(x, 1, #bar-1).
compute x = substr(x, #bar+1).
else.
compute y(#i) = x.
break.
end if.
end loop.
exec.

(In Python, it would be one command :-))




On Sun, Jul 24, 2016 at 10:09 AM, William Dudley <[hidden email]> wrote:
I have a data set with the results of data for a "check all that apply" survey.  This is a string variable.  The unique responses are separated by a | sign like in these cases

1|4|17|8
2
17/1
-99

Note that the values do not have to be in order and that I have both single and double digit numbers.

The values in the string may indicate a symptom with 1 = Nausea, 2 = Fatigue, etc.

I want to arrive at a series of binary variables with 0 = No and 1 = Yes, such as Nausea, Fatigue, etc. 

If I can figure out how to extract the strings between |'s then I can figure out the rest (the creation of binary variables).

I have tried the CHAR.Index command but this will require a series of Do If's and it is complicated by the variable length of the string.

I would like to stay within SPSS syntax for my clients' benefits.I can export just the id and this variable as a cvs file and open in SPSS and declare | as a delimiter, and then match files  but Id rather avoid these extra  steps.

​Any thoughts?
Thanks in advance, Bill

--
William N Dudley, PhD
President
Piedmont Research Strategies, Inc
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD



--
Jon K Peck
[hidden email]




--
Jon K Peck
[hidden email]




--
William N Dudley, PhD
President
Piedmont Research Strategies, Inc
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Parsing a delimited string with variable length

Jon Peck
spssinc trans result = y1 to y10 type=3 /formula 're.split(r"\|", x)'.

This, as in the native code I posted earlier, assumes up to 10 values and a width of each code <=3.  Change as needed.

The spssinc trans extension command is installed as part of the Python Essentials.
type=3 declares the result variables as A3.
The formula uses the regular expression r"\|" to split on the vertical bar via the re.split function.  The preceding r says not to interpret \ as a Python escape character, which isn't actually needed here, but since re's often have \ in them, it's good practice to always use it.
Inside the re, \| means to take the vertical bar literally rather than in its grammatical meaning in an re.  In an re, | means "or".

On Sun, Jul 24, 2016 at 12:00 PM, William Dudley <[hidden email]> wrote:
Thanks Jon
I will give it a try!

Just curious
What would python command be?



On Sun, Jul 24, 2016 at 1:49 PM, Jon Peck <[hidden email]> wrote:
p.s.  Once the items are split, you can define a MC set and analyze directly with CTABLES, e.g.,
MRSETS
  /MCGROUP NAME=$y VARIABLES=y1 y10 y2 y3 y4 y5 y6 y7 y8 y9
  /DISPLAY NAME=[$y].
MISSING VALUES y1 TO y10("").
CTABLES
  /TABLE $y.


On Sun, Jul 24, 2016 at 11:28 AM, Jon Peck <[hidden email]> wrote:
Try this code.

data list list/x(a10).
begin data
1|4|17|8
2
17|1
-99
end data.
dataset name data.

vector y(10, a3).
loop #i = 1 to 10.
compute #bar = char.index(x, "|").
do if #bar > 0.
compute y(#i) = substr(x, 1, #bar-1).
compute x = substr(x, #bar+1).
else.
compute y(#i) = x.
break.
end if.
end loop.
exec.

(In Python, it would be one command :-))




On Sun, Jul 24, 2016 at 10:09 AM, William Dudley <[hidden email]> wrote:
I have a data set with the results of data for a "check all that apply" survey.  This is a string variable.  The unique responses are separated by a | sign like in these cases

1|4|17|8
2
17/1
-99

Note that the values do not have to be in order and that I have both single and double digit numbers.

The values in the string may indicate a symptom with 1 = Nausea, 2 = Fatigue, etc.

I want to arrive at a series of binary variables with 0 = No and 1 = Yes, such as Nausea, Fatigue, etc. 

If I can figure out how to extract the strings between |'s then I can figure out the rest (the creation of binary variables).

I have tried the CHAR.Index command but this will require a series of Do If's and it is complicated by the variable length of the string.

I would like to stay within SPSS syntax for my clients' benefits.I can export just the id and this variable as a cvs file and open in SPSS and declare | as a delimiter, and then match files  but Id rather avoid these extra  steps.

​Any thoughts?
Thanks in advance, Bill

--
William N Dudley, PhD
President
Piedmont Research Strategies, Inc
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD



--
Jon K Peck
[hidden email]




--
Jon K Peck
[hidden email]




--
William N Dudley, PhD
President
Piedmont Research Strategies, Inc



--
Jon K Peck
[hidden email]

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Parsing a delimited string with variable length

Rich Ulrich

Jon,
Is there a reason not to return the values as Numeric?  (Don't you get that from "type= 0"?)

I can do a lot more with Numbers than with Strings.  Right now, he is still left with translation
before he can do handy assignment if he wants the originally-mentioned string of 0/1 indicator variables:
Using the Y1 to y10 defined with numbers, here is (untested) code to assign values, along with error protection.

* assuming 25 symptions.
Vector z(25).
RECODE z1 to z25, NSymp(else=0).

LOOP  #= 1 to 10.
  DO IF (y(#) eq -99).
* leave all the values 0, and count as zero.
BREAK.
* quit if end of actual values.
  ELSE IF (y(#) eq 0).
BREAK
  ELSE IF y(#) le 25).
* just skip if  GT 25 and thus out of range?.
* increment from 0 to 1; or even from 1 to 2 if duplicated scoring.
  COMPUTE z(y(#)) = z(y(#)) + 1.
* the count is always useful.
COMPUTE NSymp= NSymp+1.
END LOOP.

--
Rich Ulrich



Date: Sun, 24 Jul 2016 12:11:18 -0600
From: [hidden email]
Subject: Re: Parsing a delimited string with variable length
To: [hidden email]

spssinc trans result = y1 to y10 type=3 /formula 're.split(r"\|", x)'.

This, as in the native code I posted earlier, assumes up to 10 values and a width of each code <=3.  Change as needed.

The spssinc trans extension command is installed as part of the Python Essentials.
type=3 declares the result variables as A3.
The formula uses the regular expression r"\|" to split on the vertical bar via the re.split function.  The preceding r says not to interpret \ as a Python escape character, which isn't actually needed here, but since re's often have \ in them, it's good practice to always use it.
Inside the re, \| means to take the vertical bar literally rather than in its grammatical meaning in an re.  In an re, | means "or".

On Sun, Jul 24, 2016 at 12:00 PM, William Dudley <[hidden email]> wrote:
Thanks Jon
I will give it a try!

Just curious
What would python command be?



On Sun, Jul 24, 2016 at 1:49 PM, Jon Peck <[hidden email]> wrote:
p.s.  Once the items are split, you can define a MC set and analyze directly with CTABLES, e.g.,
MRSETS
  /MCGROUP NAME=$y VARIABLES=y1 y10 y2 y3 y4 y5 y6 y7 y8 y9
  /DISPLAY NAME=[$y].
MISSING VALUES y1 TO y10("").
CTABLES
  /TABLE $y.

...deleted
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Parsing a delimited string with variable length

Jon Peck
I left the output variables as strings as I didn't know whether the values were always numeric, but setting the type parameter to 0 would work fine and create numeric variables.  For tabulation purposes, though, since the values are categorical, either numeric or string would work pretty much the same.  If the magnitudes have semantics, then numeric might be better.

On Mon, Jul 25, 2016 at 1:04 AM, Rich Ulrich <[hidden email]> wrote:

Jon,
Is there a reason not to return the values as Numeric?  (Don't you get that from "type= 0"?)

I can do a lot more with Numbers than with Strings.  Right now, he is still left with translation
before he can do handy assignment if he wants the originally-mentioned string of 0/1 indicator variables:
Using the Y1 to y10 defined with numbers, here is (untested) code to assign values, along with error protection.

* assuming 25 symptions.
Vector z(25).
RECODE z1 to z25, NSymp(else=0).

LOOP  #= 1 to 10.
  DO IF (y(#) eq -99).
* leave all the values 0, and count as zero.
BREAK.
* quit if end of actual values.
  ELSE IF (y(#) eq 0).
BREAK
  ELSE IF y(#) le 25).
* just skip if  GT 25 and thus out of range?.
* increment from 0 to 1; or even from 1 to 2 if duplicated scoring.
  COMPUTE z(y(#)) = z(y(#)) + 1.
* the count is always useful.
COMPUTE NSymp= NSymp+1.
END LOOP.

--
Rich Ulrich



Date: Sun, 24 Jul 2016 12:11:18 -0600
From: [hidden email]
Subject: Re: Parsing a delimited string with variable length
To: [hidden email]

spssinc trans result = y1 to y10 type=3 /formula 're.split(r"\|", x)'.

This, as in the native code I posted earlier, assumes up to 10 values and a width of each code <=3.  Change as needed.

The spssinc trans extension command is installed as part of the Python Essentials.
type=3 declares the result variables as A3.
The formula uses the regular expression r"\|" to split on the vertical bar via the re.split function.  The preceding r says not to interpret \ as a Python escape character, which isn't actually needed here, but since re's often have \ in them, it's good practice to always use it.
Inside the re, \| means to take the vertical bar literally rather than in its grammatical meaning in an re.  In an re, | means "or".

On Sun, Jul 24, 2016 at 12:00 PM, William Dudley <[hidden email]> wrote:
Thanks Jon
I will give it a try!

Just curious
What would python command be?



On Sun, Jul 24, 2016 at 1:49 PM, Jon Peck <[hidden email]> wrote:
p.s.  Once the items are split, you can define a MC set and analyze directly with CTABLES, e.g.,
MRSETS
  /MCGROUP NAME=$y VARIABLES=y1 y10 y2 y3 y4 y5 y6 y7 y8 y9
  /DISPLAY NAME=[$y].
MISSING VALUES y1 TO y10("").
CTABLES
  /TABLE $y.

...deleted



--
Jon K Peck
[hidden email]

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Parsing a delimited string with variable length

David Marso
Administrator
In reply to this post by William Dudley
http://spssx-discussion.1045642.n5.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1068821&query=Parse

William Dudley wrote
I have a data set with the results of data for a "check all that apply"
survey.  This is a string variable.  The unique responses are separated by
a | sign like in these cases

1|4|17|8
2
17/1
-99

Note that the values do not have to be in order and that I have both single
and double digit numbers.

The values in the string may indicate a symptom with 1 = Nausea, 2 =
Fatigue, etc.

I want to arrive at a series of binary variables with 0 = No and 1 = Yes,
such as Nausea, Fatigue, etc.

If I can figure out how to extract the strings between |'s then I can
figure out the rest (the creation of binary variables).

I have tried the CHAR.Index command but this will require a series of Do
If's and it is complicated by the variable length of the string.

I would like to stay within SPSS syntax for my clients' benefits.I can
export just the id and this variable as a cvs file and open in SPSS and
declare | as a delimiter, and then match files  but Id rather avoid these
extra  steps.

​Any thoughts?
Thanks in advance, Bill


--
William N Dudley, PhD
President
Piedmont Research Strategies, Inc

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Parsing a delimited string with variable length

MLIves
In reply to this post by Jon Peck

Except for sorting—Strings sort numbers differently than numeric order.

e.g.

1

10

11

2

20

3

 

Melissa

 

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Jon Peck
Sent: Monday, July 25, 2016 9:02 AM
To: [hidden email]
Subject: Re: [SPSSX-L] Parsing a delimited string with variable length

 

I left the output variables as strings as I didn't know whether the values were always numeric, but setting the type parameter to 0 would work fine and create numeric variables.  For tabulation purposes, though, since the values are categorical, either numeric or string would work pretty much the same.  If the magnitudes have semantics, then numeric might be better.

 

On Mon, Jul 25, 2016 at 1:04 AM, Rich Ulrich <[hidden email]> wrote:


Jon,
Is there a reason not to return the values as Numeric?  (Don't you get that from "type= 0"?)

I can do a lot more with Numbers than with Strings.  Right now, he is still left with translation
before he can do handy assignment if he wants the originally-mentioned string of 0/1 indicator variables:
Using the Y1 to y10 defined with numbers, here is (untested) code to assign values, along with error protection.

* assuming 25 symptions.
Vector z(25).
RECODE z1 to z25, NSymp(else=0).

LOOP  #= 1 to 10.
  DO IF (y(#) eq -99).
* leave all the values 0, and count as zero.
BREAK.
* quit if end of actual values.
  ELSE IF (y(#) eq 0).
BREAK
  ELSE IF y(#) le 25).
* just skip if  GT 25 and thus out of range?.
* increment from 0 to 1; or even from 1 to 2 if duplicated scoring.
  COMPUTE z(y(#)) = z(y(#)) + 1.
* the count is always useful.
COMPUTE NSymp= NSymp+1.
END LOOP.

--
Rich Ulrich


Date: Sun, 24 Jul 2016 12:11:18 -0600
From: [hidden email]
Subject: Re: Parsing a delimited string with variable length
To: [hidden email]

spssinc trans result = y1 to y10 type=3 /formula 're.split(r"\|", x)'.

 

This, as in the native code I posted earlier, assumes up to 10 values and a width of each code <=3.  Change as needed.

 

The spssinc trans extension command is installed as part of the Python Essentials.

type=3 declares the result variables as A3.

The formula uses the regular expression r"\|" to split on the vertical bar via the re.split function.  The preceding r says not to interpret \ as a Python escape character, which isn't actually needed here, but since re's often have \ in them, it's good practice to always use it.

Inside the re, \| means to take the vertical bar literally rather than in its grammatical meaning in an re.  In an re, | means "or".

 

On Sun, Jul 24, 2016 at 12:00 PM, William Dudley <[hidden email]> wrote:

Thanks Jon

I will give it a try!

 

Just curious

What would python command be?

 

 

 

On Sun, Jul 24, 2016 at 1:49 PM, Jon Peck <[hidden email]> wrote:

p.s.  Once the items are split, you can define a MC set and analyze directly with CTABLES, e.g.,

MRSETS

  /MCGROUP NAME=$y VARIABLES=y1 y10 y2 y3 y4 y5 y6 y7 y8 y9

  /DISPLAY NAME=[$y].

MISSING VALUES y1 TO y10("").

CTABLES

  /TABLE $y.


...deleted



 

--

Jon K Peck
[hidden email]

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD




This correspondence contains proprietary information some or all of which may be legally privileged; it is for the intended recipient only. If you are not the intended recipient you must not use, disclose, distribute, copy, print, or rely on this correspondence and completely dispose of the correspondence immediately. Please notify the sender if you have received this email in error. NOTE: Messages to or from the State of Connecticut domain may be subject to the Freedom of Information statutes and regulations.

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD