|
I have a massive file, over 10 million cases, that contains data about mental health services recipients. The recipients are identified by a unique number (see variable in the middle, highlighted in yellow). Race and ethnicity information was appended to the original file (see 2 variables on the end). Recipients are tracked over several quarters. The race/eth codes are displayed for the first instance of the recipient number, but subsequent cases do not list the number. See example below. I could copy and paste the information, but for over 10 million cases, that might take awhile. Is there a code that I could write in syntax that would add the race and ethnicity data for each line?
Yr/qtr County Cat Sex Med A MCO Recip ID Begin date End date DOB Race Eth 20074 25 J 00 M Y 25 00000xx26 20071001 20071231 1963xx11 5 1 20081 25 J 00 M Y 25 00000xx26 20080101 20080331 1963xx11 . . 20082 25 J 00 M Y 25 00000xx26 20080401 20080630 1963xx11 . . 20083 25 J 00 M Y 25 00000xx26 20080701 20080930 1963xx11 . . 20084 25 J 00 M Y 25 00000xx26 20081001 20081231 1963xx11 . . 20074 39 D 00 F N 39 00000xx56 20071012 20071025 1964xx18 5 1 . 00000xx68 . . . 1 1 . 00000xx55 . . . 1 1 . 00000xx57 . . . 5 1 20074 02 J 00 F N 02 00000xx65 20071001 20071231 1949xx07 5 1 20081 02 J 00 F N 02 00000xx65 20080101 20080331 1949xx07 . . Vicki L. Stirkey Program Analyst 3 OMHSAS Office of Mental Health and Substance Abuse Services Division of Systems Management 717-705-8198 Fax: 717-772-6737
|
IF MISSING(Race) Race=LAG(Race). IF MISSING(Eth) Eth=LAG(Eth). HTH, Marta García-Granero -- For miscellaneous SPSS related statistical stuff, visit: http://gjyp.nl/marta/===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
In reply to this post by Stirkey, Vicki
Problem has been resolved. Thanks everyone for their input.
Vicki L. Stirkey Program Analyst 3 OMHSAS Office of Mental Health and Substance Abuse Services Division of Systems Management 717-705-8198 Fax: 717-772-6737
|
|
In reply to this post by Stirkey, Vicki
something like this untested syntax should do it assuming that the data
was entered correctly and there is only 1 case with the race and eth
variables and if one is missing the other is also. be sure to save the
results as a new file.
sort cases by recip_id (a) race (d) eth (d). do if recip_id eq lag(recip_id) and (missing(race) or Missing(eth)). compute race= lag(race). compute eth = lag(eth). end if. Art Kendall Social Research Consultants Stirkey, Vicki wrote:
Art Kendall
Social Research Consultants |
|
In reply to this post by Stirkey, Vicki
Marta’s response was fine except
that it does not check for missing data in the first record for an individual.
I would use a slight variation (note that your
sample suggests that records are sorted by ID and date): IF ($casenum>1 and RecipID=LAG(RecipID)
and MISSING(Race)) Race = LAG(Race) . IF ($casenum>1 and RecipID=LAG(RecipID)
and MISSING(Eth)) Eth = LAG(Eth) . Dennis Deck From: Stirkey, Vicki
[mailto:[hidden email]] I have a massive file, over 10 million cases, that contains
data about mental health services recipients. The recipients are identified by
a unique number (see variable in the middle, highlighted in yellow). Race and
ethnicity information was appended to the original file (see 2 variables on the
end). Recipients are tracked over several quarters. The race/eth codes are
displayed for the first instance of the recipient number, but subsequent cases
do not list the number. See example below. I could copy and paste the
information, but for over 10 million cases, that might take awhile. Is there a
code that I could write in syntax that would add the race and ethnicity data
for each line? Yr/qtr
County Cat
Sex Med
A MCO Recip ID
Begin
date
End date
DOB
Race Eth 20074 25
J 00
M
Y
25 00000xx26
20071001
20071231
1963xx11
5
1 20081 25
J 00
M
Y
25 00000xx26
20080101
20080331
1963xx11
.
. 20082 25
J 00
M
Y
25 00000xx26
20080401
20080630
1963xx11
.
. 20083 25
J 00
M
Y
25 00000xx26
20080701
20080930
1963xx11
.
. 20084 25
J 00
M
Y
25 00000xx26
20081001
20081231
1963xx11
.
. 20074 39
D 00
F
N
39 00000xx56
20071012
20071025
1964xx18
5
1 .
00000xx68
.
.
.
1
1 .
00000xx55
.
.
.
1
1 .
00000xx57
.
.
.
5
1 20074 02
J 00
F
N
02 00000xx65
20071001
20071231
1949xx07
5
1 20081 02
J 00
F
N
02
00000xx65
20080101
20080331
1949xx07
.
. Vicki L. Stirkey Program Analyst 3 OMHSAS Office of Mental Health and Substance Abuse Services Division of Systems Management 717-705-8198 Fax: 717-772-6737 |
|
Dennis Deck wrote:
> > Marta’s response was fine except that it does not check for missing > data in the first record for an individual. > Odds are good that there is some missing data in the file, in which > case you end up carrying forward a value from the last individual who > did have a value. > > I would use a slight variation (note that your sample suggests that > records are sorted by ID and date): > > IF ($casenum>1 and RecipID=LAG(RecipID) and MISSING(Race)) Race = > LAG(Race) . > > IF ($casenum>1 and RecipID=LAG(RecipID) and MISSING(Eth)) Eth = LAG(Eth) . > welcome, BTW, Richard). I was perfectly aware of that, but, given the sample dataset and the general information (data already sorted) on the dataset that Vicki provided, I assumed that no "first values" were missing for Race and Eth, and kept the response as simpler as possible (off line I got a response from Vicki stating the code had worked perfectly). Anyway, it would have been a good idea to add a comment (to my original message) that problems might arise if: - dataset wasn't properly sorted - some cases had missing values in their first records. BTW, Dennis, I think your modification would not solve for the second problem. Consider the last two records (RecipID=00000xx65), if Race is missing for the first, the value will be copied from the previous record (00000xx57). I like Art Kendall's suggestion better (he sent it to the list a couple of days ago): IF RecipID=LAG(RecipID) AND MISSING.... Regards, Marta > ------------------------------------------------------------------------ > > *From:* Stirkey, Vicki [mailto:[hidden email]] > *Sent:* Friday, April 24, 2009 6:38 AM > *Subject:* copying and pasting information to a large file > > I have a massive file, over 10 million cases, that contains data about > mental health services recipients. The recipients are identified by a > unique number (see variable in the middle, highlighted in yellow). > Race and ethnicity information was appended to the original file (see > 2 variables on the end). Recipients are tracked over several quarters. > The race/eth codes are displayed for the first instance of the > recipient number, but subsequent cases do not list the number. See > example below. I could copy and paste the information, but for over 10 > million cases, that might take awhile. Is there a code that I could > write in syntax that would add the race and ethnicity data for each line? > > *Yr/qtr County Cat Sex Med A MCO Recip ID Begin date End date DOB Race > Eth* > > 20074 25 J 00 M Y 25 00000xx26 20071001 20071231 1963xx11 5 1 > > 20081 25 J 00 M Y 25 00000xx26 20080101 20080331 1963xx11 . . > > 20082 25 J 00 M Y 25 00000xx26 20080401 20080630 1963xx11 . . > > 20083 25 J 00 M Y 25 00000xx26 20080701 20080930 1963xx11 . . > > 20084 25 J 00 M Y 25 00000xx26 20081001 20081231 1963xx11 . . > > 20074 39 D 00 F N 39 00000xx56 20071012 20071025 1964xx18 5 1 > > . 00000xx68 . . . 1 1 > > . 00000xx55 . . . 1 1 > > . 00000xx57 . . . 5 1 > > 20074 02 J 00 F N 02 00000xx65 20071001 20071231 1949xx07 5 1 > > 20081 02 J 00 F N 02 00000xx65 20080101 20080331 1949xx07 . . > > **Vicki L. Stirkey** > > **Program Analyst 3** > > **OMHSAS** > > **Office of Mental Health and Substance Abuse Services** > > **Division of Systems Management** > > **717-705-8198** > > **Fax: 717-772-6737** > -- For miscellaneous SPSS related statistical stuff, visit: http://gjyp.nl/marta/ ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
In reply to this post by Dennis Deck
At 02:14 PM 4/27/2009, Dennis Deck wrote, correctly,
Martas response was fine except that it does not check for missing data in the first record for an individual. and gives a solution. Dennis is right about the problem. I'd recommend solving it using AGGREGATE, which in addition to handling missing data, works properly if the data is present, but is in some record other than the first -- another highly possible occurrence: |-----------------------------|---------------------------| |Output Created |25-APR-2009 23:42:42 | |-----------------------------|---------------------------| Recip.ID Begin.date End.date DOB Race Eth 00000xx26 20071001 20071231 1963xx11 5 1 00000xx26 20080101 20080331 1963xx11 . . 00000xx26 20080401 20080630 1963xx11 . . 00000xx26 20080701 20080930 1963xx11 . . 00000xx26 20081001 20081231 1963xx11 . . 00000xx56 20071012 20071025 1964xx18 5 1 00000xx68 . . . 1 1 00000xx55 . . . 1 1 00000xx57 . . . 5 1 00000xx65 20071001 20071231 1949xx07 5 1 00000xx65 20080101 20080331 1949xx07 . . Number of cases read: 11 Number of cases listed: 11 AGGREGATE OUTFILE=* MODE=ADDVARS OVERWRITE=YES /BREAK=Recip.ID /Race =FIRST(Race) /Eth =FIRST(Eth). LIST. List |-----------------------------|---------------------------| |Output Created |25-APR-2009 23:44:48 | |-----------------------------|---------------------------| Recip.ID Begin.date End.date DOB Race Eth 00000xx26 20071001 20071231 1963xx11 5 1 00000xx26 20080101 20080331 1963xx11 5 1 00000xx26 20080401 20080630 1963xx11 5 1 00000xx26 20080701 20080930 1963xx11 5 1 00000xx26 20081001 20081231 1963xx11 5 1 00000xx56 20071012 20071025 1964xx18 5 1 00000xx68 . . . 1 1 00000xx55 . . . 1 1 00000xx57 . . . 5 1 00000xx65 20071001 20071231 1949xx07 5 1 00000xx65 20080101 20080331 1949xx07 5 1 Number of cases read: 11 Number of cases listed: 11 ============================= APPENDIX: Test data, and code ============================= DATA LIST LIST/ Recip.ID Begin.date End.date DOB Race Eth (A9, F8, F8, A8, F2, F2). BEGIN DATA 00000xx26 20071001 20071231 1963xx11 5 1 00000xx26 20080101 20080331 1963xx11 . . 00000xx26 20080401 20080630 1963xx11 . . 00000xx26 20080701 20080930 1963xx11 . . 00000xx26 20081001 20081231 1963xx11 . . 00000xx56 20071012 20071025 1964xx18 5 1 00000xx68 . . . 1 1 00000xx55 . . . 1 1 00000xx57 . . . 5 1 00000xx65 20071001 20071231 1949xx07 5 1 00000xx65 20080101 20080331 1949xx07 . . END DATA. LIST. AGGREGATE OUTFILE=* MODE=ADDVARS OVERWRITE=YES /BREAK=Recip.ID /Race =FIRST(Race) /Eth =FIRST(Eth). LIST. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
That would work - and would solve the
ordering problem if it is there - but you would need to incorporate results MATCH FILES File= OriginalFile /Table= AggFile /By Recip.ID /Map
. Dennis From: Richard Ristow
[mailto:[hidden email]] At 02:14 PM 4/27/2009, Dennis Deck wrote, correctly, Marta’s response was fine except
that it does not check for missing data in the first record for an
individual.
|
|
At 03:17 PM 4/29/2009, Dennis Deck wrote:
[Aggregate] would work - and would solve the ordering problem if it is there - but you would need to incorporate results back into the source file for most analyses, something like: AGGREGATE with MODE=ADDVARIABLES effectively does the aggregation and matching in a single step; see "Race" and "Eth", before and after, in the example code I posted. ("DOB" isn't 'spread' like that, because I forgot to put that on the AGGREGATE list.) Notice that the AGGREGATE specifies AGGREGATE OUTFILE=* MODE=ADDVARS OVERWRITE=YES That allows the values that are 'spread' over all cases to have the same names as the original values that were present in only one case. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
| Free forum by Nabble | Edit this page |
