Sequence over keys, and sum of prior records

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Sequence over keys, and sum of prior records

Rebecca Barber
I am trying to find a way to summarize millions of records, over relatively unique keys.  Let me provide an example:

The file has a 4 variable key for each individual.  These variables are numeric.  Each individual can then have 1 to many 10 sub rows based on a 5th field (date) that can be ordered.

I've summarized each user on the FIVE fields:
Institution ID Type Code StartDate 
So that I now have a count of records for each as well as a couple of other summarized variables.

What I need is, with each "Institution ID Type Code" is a cumulative sum of the number of records and the other variables, not including the current record.

So, for example:
Institution ID Type Code StartDate Count Sum
1              1   1      1      1/1/10     1        1
1              1   1      1      1/15/10   1        1
1              1   1      1      2/1/10     2        .5
1              1   1      1      3/1/10     1        0
1              1   1      1      3/15/10    1       1
1              2   1      1      1/13/10    2       2

The first 5 rows are a single individual, as can be seen in the unique 4 fields.  The data is then ordered by date.  My final result should look something like this:
Institution ID Type Code StartDate Count Sum Ratio
1              1   1      1      1/1/10     1        1     .
1              1   1      1      1/15/10   1        1      1
1              1   1      1      2/1/10     2        1.5   1
1              1   1      1      3/1/10     1        0     .875  /* Notice that this is the ratio of SUM/COUNT for the 3 rows
1              1   1      1      3/15/10    1       1     .7
1              2   1      1      1/13/10    2       2
1              2   1      1      2/15/10    1       1     1

Key elements:
- Lagged cumulative value over the unique indentifiers (could be merged if necessary)
- Raw values are acceptable rather than the ratio, like this:

Institution ID Type Code StartDate Count Sum TCount TSum
1              1   1      1      1/1/10     1        1     .
1              1   1      1      1/15/10   1        1      1         1
1              1   1      1      2/1/10     2        1.5   2         2
1              1   1      1      3/1/10     1        0      4        3.5
1              1   1      1      3/15/10    1       1       5        3.5
1              2   1      1      1/13/10    2       2
1              2   1      1      2/15/10    1       1       1         1

I am familiar with lag, but I need the cumulative sum of the records for this individual, lagged by 1, not just the values from the prior record.

Can anyone help?

Rebecca
Reply | Threaded
Open this post in threaded view
|

Re: Sequence over keys, and sum of prior records

Ruben Geert van den Berg
Dear Rebecca,

If I understood you correctly, the following code should work for you.

Best,

Ruben

new file.

Data list free/Institution ID Type Code          Count Sum(6f3.1) StartDate(adate10).
begin data
1              1   1      1        1       1     1/1/10   
1              1   1      1       1       1      1/15/10  
1              1   1      1        2       1.5   2/1/10   
1              1   1      1        1       0     3/1/10   
1              1   1      1        1      1      3/15/10  
1              2   1      1        2      2   1/13/10  
1              2   1      1        1      1      2/15/10  
end data.

sort cases id startdate.

if id=lag(id)and institution=lag(institution)and type=lag(type)and code=lag(code)ccount=sum(lag(ccount),lag(count)).
if id=lag(id)and institution=lag(institution)and type=lag(type)and code=lag(code)csum=sum(lag(csum),lag(sum)).
compute ratio=csum/ccount.
exe.


Date: Sun, 1 Jul 2012 21:22:42 +0000
From: [hidden email]
Subject: Sequence over keys, and sum of prior records
To: [hidden email]

I am trying to find a way to summarize millions of records, over relatively unique keys.  Let me provide an example:

The file has a 4 variable key for each individual.  These variables are numeric.  Each individual can then have 1 to many 10 sub rows based on a 5th field (date) that can be ordered.

I've summarized each user on the FIVE fields:
Institution ID Type Code StartDate 
So that I now have a count of records for each as well as a couple of other summarized variables.

What I need is, with each "Institution ID Type Code" is a cumulative sum of the number of records and the other variables, not including the current record.

So, for example:
Institution ID Type Code StartDate Count Sum
1              1   1      1      1/1/10     1        1
1              1   1      1      1/15/10   1        1
1              1   1      1      2/1/10     2        .5
1              1   1      1      3/1/10     1        0
1              1   1      1      3/15/10    1       1
1              2   1      1      1/13/10    2       2

The first 5 rows are a single individual, as can be seen in the unique 4 fields.  The data is then ordered by date.  My final result should look something like this:
Institution ID Type Code StartDate Count Sum Ratio
1              1   1      1      1/1/10     1        1     .
1              1   1      1      1/15/10   1        1      1
1              1   1      1      2/1/10     2        1.5   1
1              1   1      1      3/1/10     1        0     .875  /* Notice that this is the ratio of SUM/COUNT for the 3 rows
1              1   1      1      3/15/10    1       1     .7
1              2   1      1      1/13/10    2       2
1              2   1      1      2/15/10    1       1     1

Key elements:
- Lagged cumulative value over the unique indentifiers (could be merged if necessary)
- Raw values are acceptable rather than the ratio, like this:

Institution ID Type Code StartDate Count Sum TCount TSum
1              1   1      1      1/1/10     1        1     .
1              1   1      1      1/15/10   1        1      1         1
1              1   1      1      2/1/10     2        1.5   2         2
1              1   1      1      3/1/10     1        0      4        3.5
1              1   1      1      3/15/10    1       1       5        3.5
1              2   1      1      1/13/10    2       2
1              2   1      1      2/15/10    1       1       1         1

I am familiar with lag, but I need the cumulative sum of the records for this individual, lagged by 1, not just the values from the prior record.

Can anyone help?

Rebecca