copying and pasting information to a large file

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

copying and pasting information to a large file

Stirkey, Vicki

I have a massive file, over 10 million cases, that contains data about mental health services recipients. The recipients are identified by a unique number (see variable in the middle, highlighted in yellow). Race and ethnicity information was appended to the original file (see 2 variables on the end). Recipients are tracked over several quarters. The race/eth codes are displayed for the first instance of the recipient number, but subsequent cases do not list the number. See example below. I could copy and paste the information, but for over 10 million cases, that might take awhile. Is there a code that I could write in syntax that would add the race and ethnicity data for each line?

 

Yr/qtr       County   Cat         Sex         Med A     MCO        Recip ID                Begin date             End date               DOB                    Race      Eth

20074    25           J  00       M             Y              25           00000xx26            20071001             20071231             1963xx11               5              1

20081    25           J  00       M             Y              25           00000xx26            20080101             20080331             1963xx11               .               .

20082    25           J  00       M             Y              25           00000xx26            20080401             20080630             1963xx11               .               .

20083    25           J  00       M             Y              25           00000xx26            20080701             20080930             1963xx11               .               .

20084    25           J  00       M             Y              25           00000xx26            20081001             20081231             1963xx11               .               .

20074    39           D  00      F              N             39           00000xx56            20071012             20071025             1964xx18               5              1

.                                                                                               00000xx68            .               .               .               1              1

.                                                                                               00000xx55            .               .               .               1              1

.                                                                                               00000xx57            .               .               .               5              1

20074    02           J  00       F              N             02           00000xx65            20071001             20071231             1949xx07               5              1

20081    02           J  00       F              N             02           00000xx65            20080101             20080331             1949xx07               .               .

Vicki L. Stirkey

Program Analyst 3

OMHSAS

Office of Mental Health and Substance Abuse Services

Division of Systems Management

717-705-8198

Fax: 717-772-6737

 

Reply | Threaded
Open this post in threaded view
|

Re: copying and pasting information to a large file

Marta Garcia-Granero
Stirkey, Vicki wrote:

I have a massive file, over 10 million cases, that contains data about mental health services recipients. The recipients are identified by a unique number (see variable in the middle, highlighted in yellow). Race and ethnicity information was appended to the original file (see 2 variables on the end). Recipients are tracked over several quarters. The race/eth codes are displayed for the first instance of the recipient number, but subsequent cases do not list the number. See example below. I could copy and paste the information, but for over 10 million cases, that might take awhile. Is there a code that I could write in syntax that would add the race and ethnicity data for each line?

 

Yr/qtr       County   Cat         Sex         Med A     MCO        Recip ID                Begin date             End date               DOB                    Race      Eth

20074    25           J  00       M             Y              25           00000xx26            20071001             20071231             1963xx11               5              1

20081    25           J  00       M             Y              25           00000xx26            20080101             20080331             1963xx11               .               .

20082    25           J  00       M             Y              25           00000xx26            20080401             20080630             1963xx11               .               .

20083    25           J  00       M             Y              25           00000xx26            20080701             20080930             1963xx11               .               .

20084    25           J  00       M             Y              25           00000xx26            20081001             20081231             1963xx11               .               .

20074    39           D  00      F              N             39           00000xx56            20071012             20071025             1964xx18               5              1

.                                                                                               00000xx68            .               .               .               1              1

.                                                                                               00000xx55            .               .               .               1              1

.                                                                                               00000xx57            .               .               .               5              1

20074    02           J  00       F              N             02           00000xx65            20071001             20071231             1949xx07               5              1

20081    02           J  00       F              N             02           00000xx65            20080101             20080331             1949xx07               .               .


IF MISSING(Race) Race=LAG(Race).
IF MISSING(Eth) Eth=LAG(Eth).

HTH,
Marta García-Granero
--
For miscellaneous SPSS related statistical stuff, visit:
http://gjyp.nl/marta/
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Copying and pasting information to a large file

Stirkey, Vicki
In reply to this post by Stirkey, Vicki

Problem has been resolved. Thanks everyone for their input.

 

Vicki L. Stirkey

Program Analyst 3

OMHSAS

Office of Mental Health and Substance Abuse Services

Division of Systems Management

717-705-8198

Fax: 717-772-6737

 

Reply | Threaded
Open this post in threaded view
|

Re: copying and pasting information to a large file

Art Kendall
In reply to this post by Stirkey, Vicki
something like this untested syntax should do it assuming that the data was entered correctly and there is only 1 case with the race and eth variables and if one is missing the other is also. be sure to save the results as a new file.

sort cases by recip_id (a) race (d) eth (d).
do if recip_id eq lag(recip_id) and (missing(race) or Missing(eth)).
compute race= lag(race).
compute eth  = lag(eth).
end if.


Art Kendall
Social Research Consultants

Stirkey, Vicki wrote:

I have a massive file, over 10 million cases, that contains data about mental health services recipients. The recipients are identified by a unique number (see variable in the middle, highlighted in yellow). Race and ethnicity information was appended to the original file (see 2 variables on the end). Recipients are tracked over several quarters. The race/eth codes are displayed for the first instance of the recipient number, but subsequent cases do not list the number. See example below. I could copy and paste the information, but for over 10 million cases, that might take awhile. Is there a code that I could write in syntax that would add the race and ethnicity data for each line?

 

Yr/qtr       County   Cat         Sex         Med A     MCO        Recip ID                Begin date             End date               DOB                    Race      Eth

20074    25           J  00       M             Y              25           00000xx26            20071001             20071231             1963xx11               5              1

20081    25           J  00       M             Y              25           00000xx26            20080101             20080331             1963xx11               .               .

20082    25           J  00       M             Y              25           00000xx26            20080401             20080630             1963xx11               .               .

20083    25           J  00       M             Y              25           00000xx26            20080701             20080930             1963xx11               .               .

20084    25           J  00       M             Y              25           00000xx26            20081001             20081231             1963xx11               .               .

20074    39           D  00      F              N             39           00000xx56            20071012             20071025             1964xx18               5              1

.                                                                                               00000xx68            .               .               .               1              1

.                                                                                               00000xx55            .               .               .               1              1

.                                                                                               00000xx57            .               .               .               5              1

20074    02           J  00       F              N             02           00000xx65            20071001             20071231             1949xx07               5              1

20081    02           J  00       F              N             02           00000xx65            20080101             20080331             1949xx07               .               .

Vicki L. Stirkey

Program Analyst 3

OMHSAS

Office of Mental Health and Substance Abuse Services

Division of Systems Management

717-705-8198

Fax: 717-772-6737

 

Art Kendall
Social Research Consultants
Reply | Threaded
Open this post in threaded view
|

Re: copying and pasting information to a large file

Dennis Deck
In reply to this post by Stirkey, Vicki

Marta’s response was fine except that it does not check for missing data in the first record for an individual. 
Odds are good that there is some missing data in the file, in which case you end up carrying forward a value from the last individual who did have a value.

 

I would use a slight variation (note that your sample suggests that records are sorted by ID and date):

 

IF ($casenum>1 and RecipID=LAG(RecipID) and MISSING(Race)) Race = LAG(Race) .

IF ($casenum>1 and RecipID=LAG(RecipID) and MISSING(Eth))    Eth   = LAG(Eth) .

 

Dennis Deck
RMC Research

 


From: Stirkey, Vicki [mailto:[hidden email]]
Sent: Friday, April 24, 2009 6:38 AM
Subject: copying and pasting information to a large file

 

I have a massive file, over 10 million cases, that contains data about mental health services recipients. The recipients are identified by a unique number (see variable in the middle, highlighted in yellow). Race and ethnicity information was appended to the original file (see 2 variables on the end). Recipients are tracked over several quarters. The race/eth codes are displayed for the first instance of the recipient number, but subsequent cases do not list the number. See example below. I could copy and paste the information, but for over 10 million cases, that might take awhile. Is there a code that I could write in syntax that would add the race and ethnicity data for each line?

 

Yr/qtr       County   Cat         Sex         Med A     MCO        Recip ID                Begin date             End date               DOB                    Race      Eth

20074    25           J  00       M             Y              25           00000xx26            20071001             20071231             1963xx11               5              1

20081    25           J  00       M             Y              25           00000xx26            20080101             20080331             1963xx11               .               .

20082    25           J  00       M             Y              25           00000xx26            20080401             20080630             1963xx11               .               .

20083    25           J  00       M             Y              25           00000xx26            20080701             20080930             1963xx11               .               .

20084    25           J  00       M             Y              25           00000xx26            20081001             20081231             1963xx11               .               .

20074    39           D  00      F              N             39           00000xx56            20071012             20071025             1964xx18               5              1

.                                                                                               00000xx68            .               .               .               1              1

.                                                                                               00000xx55            .               .               .               1              1

.                                                                                               00000xx57            .               .               .               5              1

20074    02           J  00       F              N             02           00000xx65            20071001             20071231             1949xx07               5              1

20081    02           J  00       F              N             02           00000xx65            20080101             20080331             1949xx07               .               .

Vicki L. Stirkey

Program Analyst 3

OMHSAS

Office of Mental Health and Substance Abuse Services

Division of Systems Management

717-705-8198

Fax: 717-772-6737

 

Reply | Threaded
Open this post in threaded view
|

Re: copying and pasting information to a large file

Marta Garcia-Granero
Dennis Deck wrote:

>
> Marta’s response was fine except that it does not check for missing
> data in the first record for an individual.
> Odds are good that there is some missing data in the file, in which
> case you end up carrying forward a value from the last individual who
> did have a value.
>
> I would use a slight variation (note that your sample suggests that
> records are sorted by ID and date):
>
> IF ($casenum>1 and RecipID=LAG(RecipID) and MISSING(Race)) Race =
> LAG(Race) .
>
> IF ($casenum>1 and RecipID=LAG(RecipID) and MISSING(Eth)) Eth = LAG(Eth) .
>
Yes, I got the same comment from Richard Ristow (your mails are always
welcome, BTW, Richard). I was perfectly aware of that, but, given the
sample dataset and the general information (data already sorted) on the
dataset that Vicki provided, I assumed that no "first values" were
missing for Race and Eth, and kept the response as simpler as possible
(off line I got a response from Vicki stating the code had worked
perfectly). Anyway, it would have been a good idea to add a comment (to
my original message) that problems might arise if:

- dataset wasn't properly sorted
- some cases had missing values in their first records.

BTW, Dennis, I think your modification would not solve for the second
problem. Consider the last two records (RecipID=00000xx65), if Race is
missing for the first, the value will be copied from the previous record
(00000xx57). I like Art Kendall's suggestion better (he sent it to the
list a couple of days ago): IF RecipID=LAG(RecipID) AND MISSING....

Regards,
Marta

> ------------------------------------------------------------------------
>
> *From:* Stirkey, Vicki [mailto:[hidden email]]
> *Sent:* Friday, April 24, 2009 6:38 AM
> *Subject:* copying and pasting information to a large file
>
> I have a massive file, over 10 million cases, that contains data about
> mental health services recipients. The recipients are identified by a
> unique number (see variable in the middle, highlighted in yellow).
> Race and ethnicity information was appended to the original file (see
> 2 variables on the end). Recipients are tracked over several quarters.
> The race/eth codes are displayed for the first instance of the
> recipient number, but subsequent cases do not list the number. See
> example below. I could copy and paste the information, but for over 10
> million cases, that might take awhile. Is there a code that I could
> write in syntax that would add the race and ethnicity data for each line?
>
> *Yr/qtr County Cat Sex Med A MCO Recip ID Begin date End date DOB Race
> Eth*
>
> 20074 25 J 00 M Y 25 00000xx26 20071001 20071231 1963xx11 5 1
>
> 20081 25 J 00 M Y 25 00000xx26 20080101 20080331 1963xx11 . .
>
> 20082 25 J 00 M Y 25 00000xx26 20080401 20080630 1963xx11 . .
>
> 20083 25 J 00 M Y 25 00000xx26 20080701 20080930 1963xx11 . .
>
> 20084 25 J 00 M Y 25 00000xx26 20081001 20081231 1963xx11 . .
>
> 20074 39 D 00 F N 39 00000xx56 20071012 20071025 1964xx18 5 1
>
> . 00000xx68 . . . 1 1
>
> . 00000xx55 . . . 1 1
>
> . 00000xx57 . . . 5 1
>
> 20074 02 J 00 F N 02 00000xx65 20071001 20071231 1949xx07 5 1
>
> 20081 02 J 00 F N 02 00000xx65 20080101 20080331 1949xx07 . .
>
> **Vicki L. Stirkey**
>
> **Program Analyst 3**
>
> **OMHSAS**
>
> **Office of Mental Health and Substance Abuse Services**
>
> **Division of Systems Management**
>
> **717-705-8198**
>
> **Fax: 717-772-6737**
>


--
For miscellaneous SPSS related statistical stuff, visit:
http://gjyp.nl/marta/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: copying and pasting information to a large file

Richard Ristow
In reply to this post by Dennis Deck
At 02:14 PM 4/27/2009, Dennis Deck wrote, correctly,

Marta’s response was fine except that it does not check for missing data in the first record for an individual. 

and gives a solution.

Dennis is right about the problem. I'd recommend solving it using AGGREGATE, which in addition to handling missing data, works properly if the data is present, but is in some record other than the first -- another highly possible occurrence:

|-----------------------------|---------------------------|
|Output Created               |25-APR-2009 23:42:42       |
|-----------------------------|---------------------------|
Recip.ID  Begin.date End.date DOB      Race Eth

00000xx26  20071001  20071231 1963xx11   5    1
00000xx26  20080101  20080331 1963xx11   .    .
00000xx26  20080401  20080630 1963xx11   .    .
00000xx26  20080701  20080930 1963xx11   .    .
00000xx26  20081001  20081231 1963xx11   .    .
00000xx56  20071012  20071025 1964xx18   5    1
00000xx68         .         . .          1    1
00000xx55         .         . .          1    1
00000xx57         .         . .          5    1
00000xx65  20071001  20071231 1949xx07   5    1
00000xx65  20080101  20080331 1949xx07   .    .

Number of cases read:  11    Number of cases listed:  11

 
AGGREGATE OUTFILE=* MODE=ADDVARS OVERWRITE=YES
   /BREAK=Recip.ID
   /Race =FIRST(Race)
   /Eth  =FIRST(Eth).

LIST.
List
|-----------------------------|---------------------------|
|Output Created               |25-APR-2009 23:44:48       |
|-----------------------------|---------------------------|
Recip.ID  Begin.date End.date DOB      Race Eth

00000xx26  20071001  20071231 1963xx11   5    1
00000xx26  20080101  20080331 1963xx11   5    1
00000xx26  20080401  20080630 1963xx11   5    1
00000xx26  20080701  20080930 1963xx11   5    1
00000xx26  20081001  20081231 1963xx11   5    1
00000xx56  20071012  20071025 1964xx18   5    1
00000xx68         .         . .          1    1
00000xx55         .         . .          1    1
00000xx57         .         . .          5    1
00000xx65  20071001  20071231 1949xx07   5    1
00000xx65  20080101  20080331 1949xx07   5    1

Number of cases read:  11    Number of cases listed:  11
=============================
APPENDIX: Test data, and code
=============================
DATA LIST LIST/
     Recip.ID   Begin.date  End.date  DOB       Race  Eth
     (A9,       F8,         F8,       A8,       F2,   F2).
BEGIN DATA    
     00000xx26  20071001    20071231  1963xx11  5     1      
     00000xx26  20080101    20080331  1963xx11  .     .      
     00000xx26  20080401    20080630  1963xx11  .     .      
     00000xx26  20080701    20080930  1963xx11  .     .      
     00000xx26  20081001    20081231  1963xx11  .     .      
     00000xx56  20071012    20071025  1964xx18  5     1      
     00000xx68  .           .         .         1     1      
     00000xx55  .           .         .         1     1      
     00000xx57  .           .         .         5     1      
     00000xx65  20071001    20071231  1949xx07  5     1      
     00000xx65  20080101    20080331  1949xx07  .     .
END DATA.    

LIST.

AGGREGATE OUTFILE=* MODE=ADDVARS OVERWRITE=YES
   /BREAK=Recip.ID
   /Race =FIRST(Race)
   /Eth  =FIRST(Eth).

LIST.

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: copying and pasting information to a large file

Dennis Deck

That would work - and would solve the ordering problem if it is there - but you would need to incorporate results
back into the source file for most analyses,  something like:

 

MATCH FILES

   File= OriginalFile

  /Table= AggFile  

    /By Recip.ID  /Map .

 

 

Dennis


From: Richard Ristow [mailto:[hidden email]]
Sent: Wednesday, April 29, 2009 10:13 AM
To: Dennis Deck; [hidden email]
Subject: Re: copying and pasting information to a large file

 

At 02:14 PM 4/27/2009, Dennis Deck wrote, correctly,


Marta’s response was fine except that it does not check for missing data in the first record for an individual. 


and gives a solution.

Dennis is right about the problem. I'd recommend solving it using AGGREGATE, which in addition to handling missing data, works properly if the data is present, but is in some record other than the first -- another highly possible occurrence:

|-----------------------------|---------------------------|
|Output Created               |25-APR-2009 23:42:42       |
|-----------------------------|---------------------------|
Recip.ID  Begin.date End.date DOB      Race Eth

00000xx26  20071001  20071231 1963xx11   5    1
00000xx26  20080101  20080331 1963xx11   .    .
00000xx26  20080401  20080630 1963xx11   .    .
00000xx26  20080701  20080930 1963xx11   .    .
00000xx26  20081001  20081231 1963xx11   .    .
00000xx56  20071012  20071025 1964xx18   5    1
00000xx68         .         . .          1    1
00000xx55         .         . .          1    1
00000xx57         .         . .          5    1
00000xx65  20071001  20071231 1949xx07   5    1
00000xx65  20080101  20080331 1949xx07   .    .

Number of cases read:  11    Number of cases listed:  11

 
AGGREGATE OUTFILE=* MODE=ADDVARS OVERWRITE=YES
   /BREAK=Recip.ID
   /Race =FIRST(Race)
   /Eth  =FIRST(Eth).

LIST.
List
|-----------------------------|---------------------------|
|Output Created               |25-APR-2009 23:44:48       |
|-----------------------------|---------------------------|
Recip.ID  Begin.date End.date DOB      Race Eth

00000xx26  20071001  20071231 1963xx11   5    1
00000xx26  20080101  20080331 1963xx11   5    1
00000xx26  20080401  20080630 1963xx11   5    1
00000xx26  20080701  20080930 1963xx11   5    1
00000xx26  20081001  20081231 1963xx11   5    1
00000xx56  20071012  20071025 1964xx18   5    1
00000xx68         .         . .          1    1
00000xx55         .         . .          1    1
00000xx57         .         . .          5    1
00000xx65  20071001  20071231 1949xx07   5    1
00000xx65  20080101  20080331 1949xx07   5    1

Number of cases read:  11    Number of cases listed:  11
=============================
APPENDIX: Test data, and code
=============================
DATA LIST LIST/
     Recip.ID   Begin.date  End.date  DOB       Race  Eth
     (A9,       F8,         F8,       A8,       F2,   F2).
BEGIN DATA    
     00000xx26  20071001    20071231  1963xx11  5     1      
     00000xx26  20080101    20080331  1963xx11  .     .      
     00000xx26  20080401    20080630  1963xx11  .     .      
     00000xx26  20080701    20080930  1963xx11  .     .      
     00000xx26  20081001    20081231  1963xx11  .     .      
     00000xx56  20071012    20071025  1964xx18  5     1      
     00000xx68  .           .         .         1     1      
     00000xx55  .           .         .         1     1      
     00000xx57  .           .         .         5     1      
     00000xx65  20071001    20071231  1949xx07  5     1      
     00000xx65  20080101    20080331  1949xx07  .     .
END DATA.    

LIST.

AGGREGATE OUTFILE=* MODE=ADDVARS OVERWRITE=YES
   /BREAK=Recip.ID
   /Race =FIRST(Race)
   /Eth  =FIRST(Eth).

LIST.


Reply | Threaded
Open this post in threaded view
|

Re: copying and pasting information to a large file

Richard Ristow
At 03:17 PM 4/29/2009, Dennis Deck wrote:

[Aggregate] would work - and would solve the ordering problem if it is there - but you would need to incorporate results back into the source file for most analyses,   something like:
 
MATCH FILES
   File= OriginalFile
  /Table= AggFile 
    /By Recip.ID  /Map

AGGREGATE with MODE=ADDVARIABLES effectively does the aggregation and matching in a single step; see "Race" and "Eth", before and after, in the example code I posted. ("DOB" isn't 'spread' like that, because I forgot to put that on the AGGREGATE list.)

Notice that the AGGREGATE specifies

AGGREGATE OUTFILE=* MODE=ADDVARS OVERWRITE=YES

That allows the values that are 'spread' over all cases to have the same names as the original values that were present in only one case.
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD