SPSSX Discussion

parsing out letters from a string variable

msherman

parsing out letters from a string variable

Dear list: I have a data set with a variable SOURCESOFINCOME (C = childcare vouchers, F = food stamps, J = full-time job, L = living with family, etc.). Each participant can have anywhere from no sources of income up to 11 sources, so a participant could have up to 11 characters in this variable. I want to parse out each one so that I have a column for each source: a column for C = childcare, a column for F = food stamps, etc. I have looked at the SPSSTools web site, but none of the examples seem to fit. Suggestions appreciated. Martin Sherman

The data looks something like this:

sources of income

FMT

FJM

CMPW

FMSW

CFMSW

FJM

CFMT

CJO

FMTW

FMSTW

FMPS

FMPT

FMT

FMPT

FMU

FMW

FMPT

CFMUW

FLMW

MPS

CFMUW

FMTW

CFMSW

FLMW

FMTW

CFLM

CFMOW

FMSTW

LMW

FMTW

LOW

LOP

FOS

FLMTW

FJM

FMT

CLMT

LMOW

FMPW

CFM

FMTW

FMW

FMSW

FMT

FMP

FJLMW

FMTW

FMT

LMTW

LMPW

LMO

FJMW

CFMUW

LMU

CUW

Albert-Jan Roskam

Re: parsing out letters from a string variable

Hello,

This will work, provided that a participant cannot have the same source of income twice (e.g., vouchers listed two times). If that can happen, you might want to count each occurrence instead.

data list free / invar (a11).
begin data
FMT
FMT
FM
FMT
M

P
M
FMT
FM
F
FM
FT
M
FM
FMT
FMT
FM
CMPW
FMSW
CU
L
P
CFMSW
FJM
CFMT
JL
CJO
FM
J
FMTW
FMTW
P
FMSTW
S
FT
FMPS
JM
J
FMPT
O
FMT
J
FMPT
LO
J
FMU
JL
FMW
P
CF
C
FMPT
CFMUW
FLMW
MPS
PW
S
CFMUW
FMTW
FU
P
CFMSW
end data.

* One 0/1 flag per source code; all 11 codes that occur in the data are listed.
do repeat #x = c f j l m o p s t u w / #y = 'C' 'F' 'J' 'L' 'M' 'O' 'P' 'S' 'T' 'U' 'W'.
+ compute #x = (index(invar, #y) > 0).
end repeat.
exe.

Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

David Marso

Re: parsing out letters from a string variable

Slightly different approach.
Traverse the input string and map each character to a vector.
Also 'counts' each occurrence.
If invar holds only k characters (k < 12), this does k operations, whereas the DO REPEAT searches for all 11 codes even when that is unnecessary.
---
NUMERIC c f j l m o p s t u w.
VECTOR Work=c TO w.
LOOP #=1 TO LENGTH(RTRIM(invar)).
+ COMPUTE #found=INDEX("CFJLMOPSTUW",CHAR.SUBSTR(invar,#,1)).
+ IF (#found > 0) WORK(#found)=SUM(WORK(#found), 1).
END LOOP.
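
One caveat with the loop above: a source that never occurs in a case is left system-missing rather than 0. A minimal follow-up sketch, assuming the eleven variables c TO w sit next to each other in the file exactly as the NUMERIC command created them, fills those in so each column is a clean count:

```spss
* Assumes c TO w are the 11 contiguous count variables from the NUMERIC command.
RECODE c TO w (SYSMIS = 0).
* Quick check: one frequency table per source (this also runs the pending RECODE).
FREQUENCIES VARIABLES = c TO w.
```

A value above 1 in any of these tables would mean some participant has the same letter more than once.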
HTH, David
--

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Richard Ristow

Re: parsing out letters from a string variable

In reply to this post by msherman

At 03:59 PM 6/30/2011, Martin Sherman wrote:

>I have a data set [with] sourcesofincome:
>c=childcare vouchers,
>f=food stamps,
>j=full-time job,
>l =living with family,
>etc. Each participant can have any where from no sources of income
>up to 11 sources, hence a participant could have 11 characters for
>this variable. I want to parse out each so that I have a [variable]
>for each source.
>
>The [data] looks something like this:

|-----------------------------|---------------------------|
|Output Created |01-JUL-2011 17:01:02 |
|-----------------------------|---------------------------|
[TestData]

PcptID Sources

001 FMT
002 FMT
003 FM
004 FMT
005 M
006
007 P
008 M
009
010 FMT

Number of cases read: 10 Number of cases listed: 10

This problem can be solved fairly neatly, and without needing to know
beforehand all letter codes used, by unrolling to one record per
participant per source, and then rolling back up with CASESTOVARS.
The code is tested. It uses more DATASET commands than necessary, to
leave a trail for debugging.

* Spread the sources from one variable to 11 variables: ... .

VECTOR S(11,A1).
LOOP #Idx = 1 TO 11.
. COMPUTE S(#Idx) = SUBSTR(Sources,#Idx,1).
END LOOP.

* ... plus one dummy variable, so all cases have records: .

STRING S0 (A1).
COMPUTE S0 = '@'.

LIST /CASES=10.

List
|-----------------------------|---------------------------|
|Output Created |01-JUL-2011 17:01:03 |
|-----------------------------|---------------------------|
[Unroll]
PcptID Sources S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S0

001    FMT     F  M  T                            @
002    FMT     F  M  T                            @
003    FM      F  M                               @
004    FMT     F  M  T                            @
005    M       M                                  @
006                                               @
007    P       P                                  @
008    M       M                                  @
009                                               @
010    FMT     F  M  T                            @

Number of cases read: 10 Number of cases listed: 10

* Then unroll, to one record per participant per source: ... .

VARSTOCASES
/MAKE Source FROM S0 S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11
/KEEP = PcptID
/NULL = DROP.

NUMERIC Got_It (F2).
COMPUTE Got_It = 1.

LIST /CASES=15.
List
|-----------------------------|---------------------------|
|Output Created |01-JUL-2011 17:01:03 |
|-----------------------------|---------------------------|
[Unroll]

PcptID Source Got_It

001 @ 1
001 F 1
001 M 1
001 T 1
002 @ 1
002 F 1
002 M 1
002 T 1
003 @ 1
003 F 1
003 M 1
004 @ 1
004 F 1
004 M 1
004 T 1

Number of cases read: 15 Number of cases listed: 15

DATASET ACTIVATE Unroll WINDOW=FRONT.
DATASET COPY Summary.
DATASET ACTIVATE Summary WINDOW=FRONT.

* Then roll up to one variable per TYPE of source: ... .

SORT CASES BY PcptID Source .
CASESTOVARS
/ID = PcptID
/INDEX = Source
/GROUPBY = VARIABLE
/AUTOFIX = NO.

RECODE ALL (SYSMIS=0).

LIST /CASES=10.

List
|-----------------------------|---------------------------|
|Output Created |01-JUL-2011 17:01:07 |
|-----------------------------|---------------------------|
[Summary]
PcptID @ C F J L M O P S T U W

001 1 0 1 0 0 1 0 0 0 1 0 0
002 1 0 1 0 0 1 0 0 0 1 0 0
003 1 0 1 0 0 1 0 0 0 0 0 0
004 1 0 1 0 0 1 0 0 0 1 0 0
005 1 0 0 0 0 1 0 0 0 0 0 0
006 1 0 0 0 0 0 0 0 0 0 0 0
007 1 0 0 0 0 0 0 1 0 0 0 0
008 1 0 0 0 0 1 0 0 0 0 0 0
009 1 0 0 0 0 0 0 0 0 0 0 0
010 1 0 1 0 0 1 0 0 0 1 0 0

Number of cases read: 10 Number of cases listed: 10
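
Once the roll-up has run, the '@' placeholder column has done its job. A small cleanup sketch (untested), assuming the Summary dataset listed above is the active one; the labels cover only the four code meanings spelled out in the original question, and the remaining codes would need labels from the poster's own codebook:

```spss
* Flush the pending RECODE before changing the dictionary.
EXECUTE.
* Drop the dummy column that kept participants with no sources in the file.
DELETE VARIABLES @.
* Labels for the four codes defined in the original post; the rest are not given there.
VARIABLE LABELS
   C 'Childcare vouchers'
  /F 'Food stamps'
  /J 'Full-time job'
  /L 'Living with family'.
```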
=========================================
APPENDIX: Test data, and code
(Contains all data from original posting)
=========================================
* C:\Documents and Settings\Richard\My Documents .
* \Technical\spssx-l\Z-2011\ .
* 2011-06-30 Sherman - parsing out letters from a string variable.SPS .

* In response to posting .
* Date: Thu, 30 Jun 2011 15:59:01 -0400 .
* From: Martin Sherman <[hidden email]> .
* Subject: parsing out letters from a string variable .
* To: [hidden email] .

* "I have a data set [with] sourcesofincome: .
* c=childcare vouchers, .
* f=food stamps, .
* j=full-time job, .
* l =living with family, .
* etc. Each participant can have any where from no sources of .
* income up to 11 sources, hence a participant could have 11 .
* characters for this variable. I want to parse out each so that .
* I have a [variable] for each source." .

NEW FILE.
PRESERVE.
SET MXWARNS=0.
INPUT PROGRAM.
. NUMERIC PcptID (N3).
. STRING Sources (A11).
. LEAVE PcptID.
. DATA LIST LIST / Sources.
. COMPUTE PcptID = PcptID + 1.
END INPUT PROGRAM.
BEGIN DATA
FMT
FMT
FM
FMT
M

P
M

FMT
FM
F

FM
FT
M
FM
FMT
FMT
FM
M

FM
FM

FM
FMT
FM
P
FMT

FM
M
FMT
FM
FMT
FM
FM

MT
FM
FM

FM
FM
FMT

M

FJM

CMPW
FMSW
CU
L
P
CFMSW
FJM
CFMT
JL
CJO
FM
J
FMTW
FMTW
P
FMSTW
S
FT
FMPS
JM
J
FMPT
O

FMT
J
FMPT
LO
J
FMU
JL
FMW
P
CF
C
FMPT
CFMUW
FLMW
MPS
PW
S
CFMUW
FMTW
FU
P
CFMSW

FLMW
LP
FMTW
CFLM
J
CFMOW
FMSTW
LMW
FMTW
LOW
FT
FM
LOP
FOS
FLMTW
LU
FO
FP
L
FJM
FMT
LU
CLMT
J

LMOW
F
FMPW
FT
FO
FM
CFM

FMTW
S
LP

FMW
FMSW

OW
FMT
FMP
T
FMP

FJLMW

FS
LF
FMTW

J
FJ
CJ
L
FMT
S
P
LMTW
OP
LMPW
LMPW
J
JL

M
LMO
FO
L

FJMW
CFMUW
FM
CO
F
LMU
CUW
END DATA.
RESTORE.
DATASET NAME TestData WINDOW=FRONT.

LIST /CASES=10.

DATASET COPY Unroll.
DATASET ACTIVATE Unroll WINDOW=FRONT.

* Spread the sources from one variable to 11 variables: ... .

VECTOR S(11,A1).
LOOP #Idx = 1 TO 11.
. COMPUTE S(#Idx) = SUBSTR(Sources,#Idx,1).
END LOOP.

* ... plus one dummy variable, so all cases have records: .

STRING S0 (A1).
COMPUTE S0 = '@'.

LIST /CASES=10.

* Then unroll, to one record per participant per source: ... .

VARSTOCASES
/MAKE Source FROM S0 S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11
/KEEP = PcptID
/NULL = DROP.

NUMERIC Got_It (F2).
COMPUTE Got_It = 1.

LIST /CASES=15.

DATASET ACTIVATE Unroll WINDOW=FRONT.
DATASET COPY Summary.
DATASET ACTIVATE Summary WINDOW=FRONT.

LIST /CASES=15.

* Then roll up to one variable per TYPE of source: ... .

SORT CASES BY PcptID Source .
CASESTOVARS
/ID = PcptID
/INDEX = Source
/GROUPBY = VARIABLE
/AUTOFIX = NO.

RECODE ALL (SYSMIS=0).

LIST /CASES=10.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Richard Ristow

Re: parsing out letters from a string variable

At 05:25 PM 7/1/2011, Martin Sherman wrote, off-list:

>Richard: Neat. That works. I now have [a second problem].
>
>The problem now is that I have folks who apply to the various
>programs more than one time. So I have all of the basic demographic
>data again on a second record (third or fourth or fifth, etc). The
>program they applied to, along with the date, and then their
>reported sources of income. I want to have a final data set that
>has the basic demographics and then data for program 1, date, and
>the sources of income (var1 to var11), then program 2, date and
>sources of income, then program 3, date, and sources of income, etc.

IF you can identify which of your application records refer to the
same person, and IF the demographic data is always consistent on
every record for the same person, it's a straightforward CASESTOVARS.
First, you create a variable like "PcptID", that is always the same
for the same person and different for different people, something
like this (code in this post is not tested):

DATASET NAME Applications WINDOW=FRONT.

SORT CASES BY {all demographic variables}.

DATASET DECLARE Participants.

AGGREGATE OUTFILE=Participants
/BREAK = {all demographic variables}
/NAppl 'No. of times this person has applied' = NU.

DATASET ACTIVATE Participants WINDOW=FRONT.
NUMERIC PcptID (N5).
VAR LABEL PcptID 'Arbitrary unique ID number for participants'.
COMPUTE PcptID = $CASENUM.

MATCH FILES
/TABLE=Participants
/FILE =Applications
/BY {all demographic variables}
/KEEP =PcptID NAppl {all demographic variables} ALL.

DATASET NAME TaggedAppl WINDOW=FRONT.

Then, run CASESTOVARS with "PcptID" as the ID variable. It may be as simple as

CASESTOVARS
/ID=PcptID.

That leaves AUTOFIX=ON, the default, and you should get the
demographic variables treated as fixed, and the variables dealing
with programs rolled up to one set of variables for each program applied to.

It may not be clear which variables are fixed within the data and
which aren't, so you may want to control the process more tightly:

CASESTOVARS
/ID =PcptID
/FIXED ={all demographic variables}
/AUTOFIX=NO.

That's the easy case.

You'll have more trouble if names and demographic variables are
sometimes recorded differently on different records for the same
person; and this is only too likely. Then, telling which records
belong to the same person becomes a fuzzy-match problem. How you
solve it depends on the size of your file (is visual inspection
practical?), and on the number of identifying and demographic
variables you have, and the general accuracy of your data.


David Marso

Re: parsing out letters from a string variable

In reply to this post by Richard Ristow

Let's make this version even simpler.
No need to create the 11 variables.
Just do the VARSTOCASES to disk with XSAVE.

STRING Source(A1).
COMPUTE Got_It=1.
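* One XSAVE record per letter in Sources.
* MAX(1, ...) still writes one blank-Source record for a participant with no sources.
LOOP #Idx = 1 TO MAX(1, LENGTH(RTRIM(Sources))).
+ COMPUTE Source = SUBSTR(Sources,#Idx,1).
+ XSAVE OUTFILE "Temp" / KEEP PcptID Source Got_It.
END LOOP.
* XSAVE only writes when the data are read, so force a pass before GET.
EXECUTE.

GET FILE "TEMP".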
SORT CASES BY PcptID Source .
CASESTOVARS
/ID = PcptID
/INDEX = Source
/GROUPBY = VARIABLE
/AUTOFIX = NO.

ERASE FILE "TEMP".

David Marso

Re: parsing out letters from a string variable

Oops, remind me not to post untested code on the 4th of July after a few cold ones and standing too close to the smoker ;-)
STRING Source(A1).
COMPUTE Got_It=1.
COMPUTE #sLen = LENGTH(RTRIM(Sourcesofincome)).
LOOP #=0 TO #slen-1.
COMPUTE Source=SUBSTR(Sourcesofincome,#+1,1).
+ XSAVE OUTFILE "Temp"
/ KEEP participantuniqueidentifier Source Got_It.
END LOOP.
EXECUTE.
GET FILE "TEMP".
SORT CASES BY participantuniqueidentifier Source .
* From Richard R's code below.
CASESTOVARS
/ID = participantuniqueidentifier
/INDEX = Source
/GROUPBY = VARIABLE
/AUTOFIX = NO.

ERASE FILE "TEMP".

Bruce Weaver

Re: parsing out letters from a string variable

For God's sake, man! Don't post untested code on the 4th of July after a few cold ones and standing too close to the smoker! :-|

--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

David Marso

Re: parsing out letters from a string variable

Thanks Bruce (I think)...
OTOH, I'm surprised nobody else caught the brain fart and opened a window ;-)

Richard Ristow

Re: parsing out letters from a string variable

In reply to this post by Richard Ristow

At 05:25 PM 7/1/2011-11:55 AM 7/5/2011, Martin Sherman wrote, off-list:

>I have the data with the sources of income spread across the 11
>variables; each source is a separate variable now. But I have folks
>who apply to the various programs more than one time.

>I want to have a final data set that has the basic demographics and
>then data for program 1, date, and the sources of income (var1 to
>var11), then program 2, date and sources of income, then program 3,
>date, and sources of income, etc. For example, for ID = 3959, she
>was enrolled in the Alumnae program and had four sources of income
>which got expressed as 4 records. She was also in the CWF program
>and [her records] list the four sources of income while enrolled in
>that program:
|-----------------------------|---------------------------|
|Output Created |08-JUL-2011 13:20:50 |
|-----------------------------|---------------------------|
[Sherman]
PcptID program StartDate PfdPgm IncSrce IncCode

3959 Alumnae 30-OCT-2005 Upholstery Food stamps 3
3959 Alumnae 30-OCT-2005 Upholstery Med. Assist 6
3959 Alumnae 30-OCT-2005 Upholstery Part time job 9
3959 Alumnae 30-OCT-2005 Upholstery TANF/TCA 13
3959 CWF 01-OCT-2005 Upholstery Food stamps 3
3959 CWF 01-OCT-2005 Upholstery Med. Assist 6
3959 CWF 01-OCT-2005 Upholstery Part time job 9
3959 CWF 01-OCT-2005 Upholstery TANF/TCA 13
3960 CWF 24-OCT-2005 Upholstery (none) 17
3963 Upholstery(dis) 19-APR-2005 Upholstery (none) 17
3964 Upholstery(dis) 06-SEP-2005 Upholstery Food stamps 3
3964 Upholstery(dis) 06-SEP-2005 Upholstery Med. Assist 6
3964 Upholstery(dis) 06-SEP-2005 Upholstery TANF/TCA 13
3966 Alumnae 20-DEC-2005 Upholstery (none) 17
3966 CWF 01-OCT-2005 Upholstery (none) 17
3966 Upholstery(dis) 01-AUG-2005 Upholstery (none) 17

Number of cases read: 16 Number of cases listed: 16

>I want to create a single row/record for each participant which
>allows me to see which programs they were in, [and] sources of
>income while in that program.

As I've remarked in other contexts, this is probably a bad idea. Most
analyses you might want to do are easier with 'long' form. Rolling up
the data twice, as you ask, gives a very 'wide' and correspondingly
awkward set.
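
To illustrate the point about 'long' form: with one record per participant, program, and income source, as in the listing above, a per-application count of income sources takes a single AGGREGATE. This is an untested sketch; HasSource, NSources, and the dataset name SrcCounts are made-up names, and it assumes IncCode 17 always means '(none)', as it does in the rows shown.

```spss
* 1 if the record names a real income source, 0 for the '(none)' placeholder.
* Assumption: IncCode 17 is the only code for '(none)'.
COMPUTE HasSource = (IncCode NE 17).

DATASET DECLARE SrcCounts.
AGGREGATE OUTFILE=SrcCounts
  /BREAK    = PcptID program StartDate
  /NSources 'No. of income sources reported' = SUM(HasSource).
```

The same count on the fully 'wide' file would mean scanning every one of the program-by-source columns for each participant.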

However, it can be done with two CASESTOVARS commands:

STRING IncVname (A07).
VAR LABEL IncVname 'Income source, named suitably for CASESTOVARS'.
RECODE IncSrce
('Child support ' = 'ChldSup')
('Empl. Exchg. ' = 'EmpExch')
('Food stamps ' = 'FoodStm')
('Full time job ' = 'FT_job')
('Live w/famly ' = 'LivFmly')
('Med. Assist ' = 'MedAsst')
('No income ' = 'No_Incm')
('Other person ' = 'OthPers')
('Part time job ' = 'PT_job')
('Soc. Sec. ' = 'SoclSec')
('SSDI ' = 'SSDI')
('SSI ' = 'SSI')
('TANF/TCA ' = 'TANF')
('TEHMA ' = 'TEHMA')
('Unemployment ' = 'Unmplym')
('WIC ' = 'WIC')
('(none) ' = 'Z.none')
INTO IncVname.

STRING Has_It (A1).
VAR LABEL Has_It 'Participant has this source of income'.
RECODE IncVname
('Z.none'= '-')
(ELSE = 'Y')
INTO Has_It.

SELECT IF IncVname NE ''.

SORT CASES BY PcptID program StartDate IncVname .
CASESTOVARS
/ID = PcptID program StartDate
/INDEX = IncVname
/FIXED = PfdPgm
/DROP = IncSrce IncCode
/GROUPBY = VARIABLE
/AUTOFIX = NO.

Cases to Variables
|-----------------------------|---------------------------|
|Output Created |08-JUL-2011 13:20:51 |
|-----------------------------|---------------------------|
[Sherman]

Warnings
|---------------------------------------------------------|
|Variable Has_It is constant in every case group, but was |
|not specified in the FIXED subcommand. |
|---------------------------------------------------------|

Generated Variables [list suppressed]
Processing Statistics
|---------------|-----|
|Cases In |24043|
|Cases Out |13249|
|---------------|-----|
|Cases In/Cases |1.8 |
|Out | |
|---------------|-----|
|Variables In |8 |
|Variables Out |21 |
|---------------|-----|
|Index Values |17 |
|---------------|-----|

FORMATS StartDate (DATE9).
LIST /CASES=8.

List
|-----------------------------|---------------------------|
|Output Created |08-JUL-2011 13:20:51 |
|-----------------------------|---------------------------|
[Sherman]

Income-source columns, in listing order: ChldSup EmpExch FoodStm FT_job
LivFmly MedAsst No_Incm OthPers PT_job SoclSec SSDI SSI TANF TEHMA Unmplym
WIC Z.none. Only the non-blank cells are shown for each case.

PcptID program         StartDate PfdPgm     Non-blank cells
3959   Alumnae         30-OCT-05 Upholstery FoodStm=Y MedAsst=Y PT_job=Y TANF=Y
3959   CWF             01-OCT-05 Upholstery FoodStm=Y MedAsst=Y PT_job=Y TANF=Y
3960   CWF             24-OCT-05 Upholstery Z.none=-
3963   Upholstery(dis) 19-APR-05 Upholstery Z.none=-
3964   Upholstery(dis) 06-SEP-05 Upholstery FoodStm=Y MedAsst=Y TANF=Y
3966   Alumnae         20-DEC-05 Upholstery Z.none=-
3966   CWF             01-OCT-05 Upholstery Z.none=-
3966   Upholstery(dis) 01-AUG-05 Upholstery Z.none=-

Number of cases read: 8 Number of cases listed: 8

FORMATS StartDate (DATE11).

CASESTOVARS
/ID = PcptID
/DROP = PfdPgm
/GROUPBY = INDEX
/AUTOFIX = NO.

Cases to Variables
|-----------------------------|---------------------------|
|Output Created |08-JUL-2011 13:20:51 |
|-----------------------------|---------------------------|
[Sherman]

Warnings [suppressed]
Generated Variables [list suppressed]
Processing Statistics
|---------------|-----|
|Cases In |13249|
|---------------|-----|
|Cases Out |8781 |
|---------------|-----|
|Cases In/Cases |1.5 |
|Out | |
|---------------|-----|
|Variables In |21 |
|---------------|-----|
|Variables Out |191 |
|---------------|-----|
|Index Values |10 |
|---------------|-----|

TEMPORARY.
STRING SPACE (A50).
LIST /CASES=5
/VARIABLES = PcptID SPACE
program.1 TO Z.none.10 .

List
|-----------------------------|---------------------------|
|Output Created |08-JUL-2011 13:20:52 |
|-----------------------------|---------------------------|
[Sherman]

The variables are listed in the following order:

LINE 1: PcptID SPACE

LINE 2: program.1 StartDate.1 ChldSup.1 EmpExch.1 FoodStm.1
FT_job.1 LivFmly.1
MedAsst.1 No_Incm.1 OthPers.1 PT_job.1 SoclSec.1 SSDI.1 SSI.1 TANF.1
TEHMA.1 Unmplym.1 WIC.1 Z.none.1

LINE 3: program.2 StartDate.2 ChldSup.2 EmpExch.2 FoodStm.2
FT_job.2 LivFmly.2
MedAsst.2 No_Incm.2 OthPers.2 PT_job.2 SoclSec.2 SSDI.2 SSI.2 TANF.2
TEHMA.2 Unmplym.2 WIC.2 Z.none.2

LINE 4: program.3 StartDate.3 ChldSup.3 EmpExch.3 FoodStm.3
FT_job.3 LivFmly.3
MedAsst.3 No_Incm.3 OthPers.3 PT_job.3 SoclSec.3 SSDI.3 SSI.3 TANF.3
TEHMA.3 Unmplym.3 WIC.3 Z.none.3

LINE 5: program.4 StartDate.4 ChldSup.4 EmpExch.4 FoodStm.4
FT_job.4 LivFmly.4
MedAsst.4 No_Incm.4 OthPers.4 PT_job.4 SoclSec.4 SSDI.4 SSI.4 TANF.4
TEHMA.4 Unmplym.4 WIC.4 Z.none.4

LINE 6: program.5 StartDate.5 ChldSup.5 EmpExch.5 FoodStm.5
FT_job.5 LivFmly.5
MedAsst.5 No_Incm.5 OthPers.5 PT_job.5 SoclSec.5 SSDI.5 SSI.5 TANF.5
TEHMA.5 Unmplym.5 WIC.5 Z.none.5

LINE 7: program.6 StartDate.6 ChldSup.6 EmpExch.6 FoodStm.6
FT_job.6 LivFmly.6
MedAsst.6 No_Incm.6 OthPers.6 PT_job.6 SoclSec.6 SSDI.6 SSI.6 TANF.6
TEHMA.6 Unmplym.6 WIC.6 Z.none.6

LINE 8: program.7 StartDate.7 ChldSup.7 EmpExch.7 FoodStm.7
FT_job.7 LivFmly.7
MedAsst.7 No_Incm.7 OthPers.7 PT_job.7 SoclSec.7 SSDI.7 SSI.7 TANF.7
TEHMA.7 Unmplym.7 WIC.7 Z.none.7

LINE 9: program.8 StartDate.8 ChldSup.8 EmpExch.8 FoodStm.8
FT_job.8 LivFmly.8
MedAsst.8 No_Incm.8 OthPers.8 PT_job.8 SoclSec.8 SSDI.8 SSI.8 TANF.8
TEHMA.8 Unmplym.8 WIC.8 Z.none.8

LINE 10: program.9 StartDate.9 ChldSup.9 EmpExch.9 FoodStm.9
FT_job.9 LivFmly.9
MedAsst.9 No_Incm.9 OthPers.9 PT_job.9 SoclSec.9 SSDI.9 SSI.9 TANF.9
TEHMA.9 Unmplym.9 WIC.9 Z.none.9

LINE 11: program.10 StartDate.10 ChldSup.10 EmpExch.10 FoodStm.10 FT_job.10
LivFmly.10 MedAsst.10 No_Incm.10 OthPers.10 PT_job.10 SoclSec.10
SSDI.10 SSI.10 TANF.10 TEHMA.10 Unmplym.10 WIC.10 Z.none.10

PcptID: 3959
program.1: Alumnae 30-OCT-2005 Y Y Y Y
program.2: CWF 01-OCT-2005 Y Y Y Y
program.3: .
program.4: .
program.5: .
program.6: .
program.7: .
program.8: .
program.9: .
program.10: .

PcptID: 3960
program.1: CWF 24-OCT-2005 -
program.2: .
program.3: .
program.4: .
program.5: .
program.6: .
program.7: .
program.8: .
program.9: .
program.10: .

PcptID: 3963
program.1: Upholstery(dis) 19-APR-2005 -
program.2: .
program.3: .
program.4: .
program.5: .
program.6: .
program.7: .
program.8: .
program.9: .
program.10: .

PcptID: 3964
program.1: Upholstery(dis) 06-SEP-2005 Y Y Y
program.2: .
program.3: .
program.4: .
program.5: .
program.6: .
program.7: .
program.8: .
program.9: .
program.10: .

PcptID: 3966
program.1: Alumnae 20-DEC-2005 -
program.2: CWF 01-OCT-2005 -
program.3: Upholstery(dis) 01-AUG-2005 -
program.4: .
program.5: .
program.6: .
program.7: .
program.8: .
program.9: .
program.10: .

Number of cases read: 5 Number of cases listed: 5
