How to split text properly into new variables?

How to split text properly into new variables?

88videos


Hello again :)

I have a dataset like this:

ID,var1
1,a) żaba żabka albo żabeczka b) łapka
2,a) ryba rybka rybeńka maleńka (np.sledzik) b) kotek c) piesek d) chomieczek
3,a) zenon b) marian i hela c) alekasadra(ola)

and I want to get this:

ID, var2, var3, var4, var5
1, a) żaba żabka albo żabeczka, b) łapka
2, a) ryba rybka rybeńka maleńka (np.sledzik), b) kotek, c) piesek, d) chomieczek
3, a) zenon, b) marian i hela, c) alekasadra(ola),

To do this I use CHAR.INDEX to find where "a)", "b)", "c)" and "d)" occur, and SUBSTR to split the text.
It works as long as the text contains no Polish letters such as "ż" or "ł".
The cause is that SUBSTR counts each of those letters as 2 characters.
An example is below.



Can you show me another method that does this and keeps the Polish letters?



**********************************
*without polish letters
**********************************

data list list
/ID(f8.0) var1(a90).
begin data.
1 'a) zaba zabka albo zabeczka b) lapka'
2 'a) ryba rybka rybenka malenka (np.sledzik) b) kotek c) piesek d) chomieczek'
3 'a) zenon b) marian i hela c) alekasadra(ola)'
4
5
6
7
8
9
10
end data.
execute.
DATASET NAME base1.
DATASET ACTIVATE base1.

compute a=CHAR.INDEX(var1, 'a)').
compute b=CHAR.INDEX(var1, 'b)').
compute c=CHAR.INDEX(var1, 'c)').
compute d=CHAR.INDEX(var1, 'd)').
execute.

string var2 to var5(a60).

do if a<>0 and b<>0.
compute var2=SUBSTR(var1, a, b-a).
else if a<>0 and b=0.
compute var2=SUBSTR(var1, a, 90).
end if.
execute.


do if b<>0 and c<>0.
compute var3=SUBSTR(var1, b, c-b).
else if b<>0 and c=0.
compute var3=SUBSTR(var1, b, 90).
end if.
execute.


do if c<>0 and d<>0.
compute var4=SUBSTR(var1, c, d-c).
else if c<>0 and d=0.
compute var4=SUBSTR(var1, c, 90).
end if.
execute.


do if d<>0.
compute var5=SUBSTR(var1, d, 90).
end if.
execute.



**********************************
*with polish letters
**********************************

data list list
/ID(f8.0) var1(a90).
begin data.
1 'a) żaba żabka albo żabeczka b) łapka'
2 'a) ryba rybka rybeńka maleńka (np.sledzik) b) kotek c) piesek d) chomieczek'
3 'a) zenon b) marian i hela c) alekasadra(ola)'
4
5
6
7
8
9
10
end data.
execute.
DATASET NAME base2.
DATASET ACTIVATE base2.

compute a=CHAR.INDEX(var1, 'a)').
compute b=CHAR.INDEX(var1, 'b)').
compute c=CHAR.INDEX(var1, 'c)').
compute d=CHAR.INDEX(var1, 'd)').
execute.

string var2 to var5(a60).

do if a<>0 and b<>0.
compute var2=SUBSTR(var1, a, b-a).
else if a<>0 and b=0.
compute var2=SUBSTR(var1, a, 90).
end if.
execute.


do if b<>0 and c<>0.
compute var3=SUBSTR(var1, b, c-b).
else if b<>0 and c=0.
compute var3=SUBSTR(var1, b, 90).
end if.
execute.


do if c<>0 and d<>0.
compute var4=SUBSTR(var1, c, d-c).
else if c<>0 and d=0.
compute var4=SUBSTR(var1, c, 90).
end if.
execute.


do if d<>0.
compute var5=SUBSTR(var1, d, 90).
end if.
execute.


===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Re: How to split text properly into new variables?

Jon Peck
Use CHAR.SUBSTR. It works on characters regardless of the number of bytes each one takes. This could also be done with a regular expression and SPSSINC TRANS with much less code.
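To see why a byte-based substring goes wrong, here is a small Python sketch. Python is used only for illustration, the variable names are made up, and it assumes the text is stored as UTF-8, where ż and ł each take two bytes. A position counted in characters then points too early once it is applied to bytes.

```python
# Row 1 of the sample data; the names below are illustrative only.
text = "a) żaba żabka albo żabeczka b) łapka"
raw = text.encode("utf-8")      # assumed storage: UTF-8 bytes

char_pos = text.index("b)")     # position of "b)" counted in characters
byte_pos = raw.index(b"b)")     # position of "b)" counted in bytes

# The three ż before "b)" take 2 bytes each, so the positions differ by 3.
print(len(text), len(raw))
print(char_pos, byte_pos)

# Slicing characters with the character position gives the intended part.
print(text[:char_pos].rstrip())

# Slicing bytes with the character position cuts the text short.
print(raw[:char_pos].decode("utf-8", errors="replace").rstrip())
```

Mixing a character position with a byte-based slice is the kind of mismatch described in the question; using character-based functions for both steps avoids it.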


Re: How to split text properly into new variables?

Bruce Weaver
Administrator
In reply to this post by 88videos
Here is a shorter version of your syntax that appears to work.  

* Locate each marker; CHAR.INDEX returns 0 when a marker is absent.
DO REPEAT v = a b c d / s = 'a)' 'b)' 'c)' 'd)'.
- COMPUTE v=CHAR.INDEX(var1,s).
END REPEAT.

STRING var2 to var5(a60).

* Slice from each marker to the next one, or to the end when the next marker is absent.
DO REPEAT a = a b c / b = b c d / v = var2 var3 var4.
- DO IF a NE 0.
-  IF b NE 0 v=CHAR.SUBSTR(var1, a, b-a).
-  IF b EQ 0 v=CHAR.SUBSTR(var1, a, 90).
- END IF.
END REPEAT.
IF d NE 0 var5=CHAR.SUBSTR(var1, d, 90).
FORMATS a to d (F5.0).
LIST var2 to var5.


--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: How to split text properly into new variables?

Jon Peck
These patterns are all at risk, though, from parentheses in the main text. Case 2 has k) in (np.sledzik), and case 3 has a) in (ola). If a) through d) could appear inside the text, a smarter algorithm that ignores matched parentheses would be needed.
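A short Python sketch of that risk, using a made-up row (not from the posted data) in which a parenthetical happens to end in "b)". A plain first-occurrence search, which is what an index function does, lands inside the parenthetical rather than on the real marker.

```python
# Hypothetical row: the parenthetical "(ryb)" ends in "b)", just like a marker.
text = "a) kot (ryb) b) pies"

naive = text.find("b)")     # first "b)" anywhere in the string
print(naive)
print(text[naive:])         # starts inside "(ryb)", not at the marker

# The real marker is the later occurrence, located here only for comparison.
real = text.rfind("b)")
print(real)
print(text[real:])
```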

--
Jon K Peck
[hidden email]


Re: How to split text properly into new variables?

David Marso
Administrator
In reply to this post by 88videos
First of all, use CHAR.SUBSTR and CHAR.INDEX.
Second, your embedded ) is tripping you up.
I suggest searching for ( and then locating the matching ).
Replace these with [ and ] respectively.
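The bracket replacement could be sketched in Python roughly as follows. The function name protect_parentheses is hypothetical, and the sketch assumes that parentheses are not nested and that every ( is closed before the next marker; an unclosed ( would pair with a marker's ) and give a wrong result.

```python
def protect_parentheses(text):
    """Replace each ( ... ) pair with [ ... ]; a ")" with no "(" before it is left alone."""
    chars = list(text)
    start = text.find("(")
    while start != -1:
        end = text.find(")", start + 1)   # first ")" after this "("
        if end == -1:
            break                         # no closing parenthesis: stop
        chars[start], chars[end] = "[", "]"
        start = text.find("(", end + 1)   # continue after the pair
    return "".join(chars)

rows = [
    "a) ryba rybka rybeńka maleńka (np.sledzik) b) kotek c) piesek d) chomieczek",
    "a) zenon b) marian i hela c) alekasadra(ola)",
]
for row in rows:
    print(protect_parentheses(row))
```

After this step the only ) characters left in these rows belong to the markers, so searching for a), b), c) and d) can no longer hit the parenthetical text.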

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: How to split text properly into new variables?

88videos
In reply to this post by 88videos
As always, I got very useful advice here. Thanks!

As Jon and David mentioned, the most problematic issue is that "a)/b)/c)..." could be part of the text and appear several times within a case. Some of the work may well have to be checked by hand.
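One way to narrow down the manual checking, sketched in Python under the assumption that a case is suspicious whenever any marker occurs more than once. The function name needs_review and the rule itself are illustrative, not something proposed in the thread.

```python
MARKERS = ("a)", "b)", "c)", "d)")

def needs_review(text):
    """True when any marker string occurs more than once in the text."""
    return any(text.count(marker) > 1 for marker in MARKERS)

# The three rows from the original question, as (ID, var1) pairs.
rows = [
    (1, "a) żaba żabka albo żabeczka b) łapka"),
    (2, "a) ryba rybka rybeńka maleńka (np.sledzik) b) kotek c) piesek d) chomieczek"),
    (3, "a) zenon b) marian i hela c) alekasadra(ola)"),
]

flagged = [case_id for case_id, text in rows if needs_review(text)]
print(flagged)
```

Here only case 3 is flagged, because (ola) adds a second "a)". A case flagged this way is not necessarily split wrongly; it is only a candidate for a closer look.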


Re: How to split text properly into new variables?

Jon Peck
Here is code that can handle grouped parentheses. It assumes that a pair (...) should not match the markers such as a) even if it contains that string. First it finds the pairs and changes their parentheses to left and right chevrons (« and »), assuming that these do not occur in the text. Then it splits at a), b), ... and finally puts the original parentheses back.

The code follows, but if it gets mangled by the listserv, send me an email ([hidden email]), and I will send the code as a file.

Note that for test purposes I changed the input in case 2 to contain sledzia) instead of the original.  The spssinc trans command creates string variables v1...v4 of length 50.  Change the 50 as needed.  This code can easily be tweaked for any number of blocks.  If there are fewer than 4 blocks, the extra variables are blank.  Statistics should be in Unicode mode for this.

* Encoding: UTF-8.
data list list
/ID(f8.0) var1(a90).
begin data.
1 'a) żaba żabka albo żabeczka b) łapka' 
2 'a) ryba rybka rybeńka maleńka (np.sledzia) b) kotek c) piesek d) chomieczek'
3 'a) zenon b) marian i hela c) alekasadra(ola)'
4
5
6
7
8
9
10
end data.
DATASET NAME base2.

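begin program.
import re
# Stand-ins for the parentheses of matched (...) pairs; assumed absent from the data.
# The u"" literals work in both Python 2 and Python 3, unlike unichr.
LEFT, RIGHT = u"\u00ab", u"\u00bb"
def splitter(s):
    # Hide matched (...) pairs so that text such as (np.sledzia) cannot match a marker.
    s = re.sub(r"\((.*?)\)", LEFT + r"\1" + RIGHT, s)
    # Start position of each remaining marker a), b), c), d), plus the end of the string.
    pos = [m.start() for m in re.finditer(r"[a-d]\)", s)]
    pos.append(len(s))
    parts = [s[pos[i]:pos[i+1]] for i in range(len(pos)-1)]
    # Put the original parentheses back.
    return [part.replace(LEFT, "(").replace(RIGHT, ")") for part in parts]
end program.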

spssinc trans result=v1 v2 v3 v4 type=50
/formula "splitter(var1)".
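To check the splitting logic outside Statistics, the same idea can be run as a standalone Python 3 script. This is a sketch, not the code as run through SPSSINC TRANS: it uses match positions from re.finditer, strips the surrounding blanks from each part, and takes the three test rows from the data above.

```python
import re

LEFT, RIGHT = "\u00ab", "\u00bb"   # « and », assumed absent from the data

def splitter(s):
    # Hide matched (...) pairs so their contents cannot look like markers.
    s = re.sub(r"\((.*?)\)", LEFT + r"\1" + RIGHT, s)
    # Start of every remaining marker a) ... d), plus the end of the string.
    pos = [m.start() for m in re.finditer(r"[a-d]\)", s)] + [len(s)]
    parts = [s[start:end].strip() for start, end in zip(pos, pos[1:])]
    # Put the original parentheses back.
    return [p.replace(LEFT, "(").replace(RIGHT, ")") for p in parts]

rows = [
    "a) żaba żabka albo żabeczka b) łapka",
    "a) ryba rybka rybeńka maleńka (np.sledzia) b) kotek c) piesek d) chomieczek",
    "a) zenon b) marian i hela c) alekasadra(ola)",
]
for row in rows:
    print(splitter(row))
```

Each row comes back as a list with one entry per marker, and the parenthetical text stays attached to the block it belongs to.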

