I am trying to find a way to summarize millions of records, over relatively unique keys. Let me provide an example:
The file has a 4-variable key for each individual. These key variables are numeric. Each individual can then have from 1 to as many as 10 sub-rows, based on a 5th field (date) that can be ordered.
I've summarized each user on the FIVE fields:
Institution ID Type Code StartDate
So that I now have a count of records for each as well as a couple of other summarized variables.
What I need, within each "Institution ID Type Code" group, is a cumulative sum of the number of records and of the other variables, not including the current record.
So, for example:
Institution ID Type Code StartDate Count Sum
1 1 1 1 1/1/10 1 1
1 1 1 1 1/15/10 1 1
1 1 1 1 2/1/10 2 1.5
1 1 1 1 3/1/10 1 0
1 1 1 1 3/15/10 1 1
1 2 1 1 1/13/10 2 2
The first 5 rows are a single individual, as can be seen from the 4 key fields. The data is then ordered by date within each individual. My final result should look something like this:
Institution ID Type Code StartDate Count Sum Ratio
1 1 1 1 1/1/10 1 1 .
1 1 1 1 1/15/10 1 1 1
1 1 1 1 2/1/10 2 1.5 1
1 1 1 1 3/1/10 1 0 .875 /* Notice that this is the ratio of cumulative Sum to cumulative Count over the prior 3 rows: (1+1+1.5)/(1+1+2) = 3.5/4
1 1 1 1 3/15/10 1 1 .7
1 2 1 1 1/13/10 2 2
1 2 1 1 2/15/10 1 1 1
Key elements:
- Lagged cumulative value over the unique identifiers (could be merged if necessary)
- Raw values are acceptable rather than the ratio, like this:
Institution ID Type Code StartDate Count Sum TCount TSum
1 1 1 1 1/1/10 1 1 .
1 1 1 1 1/15/10 1 1 1 1
1 1 1 1 2/1/10 2 1.5 2 2
1 1 1 1 3/1/10 1 0 4 3.5
1 1 1 1 3/15/10 1 1 5 3.5
1 2 1 1 1/13/10 2 2
1 2 1 1 2/15/10 1 1 2 2
I am familiar with lag, but I need the cumulative sum of the records for this individual, lagged by 1, not just the values from the prior record.
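To state what I am after as a recurrence: within each Institution/ID/Type/Code group, ordered by StartDate, the lagged totals (TCount/TSum above) are missing on the group's first row and equal to the first row's Count and Sum on the second row; from then on

TCount(row n) = TCount(row n-1) + Count(row n-1)
TSum(row n)   = TSum(row n-1)   + Sum(row n-1)

and Ratio = TSum / TCount on each row.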
Can anyone help?
Rebecca
Dear Rebecca,
If I understood you correctly, the following code should work for you.

new file.
data list free / Institution ID Type Code Count Sum(6F3.1) StartDate(ADATE10).
begin data
1 1 1 1 1 1   1/1/10
1 1 1 1 1 1   1/15/10
1 1 1 1 2 1.5 2/1/10
1 1 1 1 1 0   3/1/10
1 1 1 1 1 1   3/15/10
1 2 1 1 2 2   1/13/10
1 2 1 1 1 1   2/15/10
end data.

* Keep each Institution/ID/Type/Code group together, ordered by date.
sort cases by institution id type code startdate.
* For every case after the first in its group, add the previous case's Count
  and Sum to the previous case's running totals, so the current case is never
  included in its own totals.
if (id = lag(id) and institution = lag(institution) and
    type = lag(type) and code = lag(code))
   ccount = sum(lag(ccount), lag(count)).
if (id = lag(id) and institution = lag(institution) and
    type = lag(type) and code = lag(code))
   csum = sum(lag(csum), lag(sum)).
compute ratio = csum / ccount.
execute.

Best,
Ruben
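A quick note on why the two IF commands work: SPSS executes transformations case by case, so by the time a given case is processed, ccount and csum have already been filled in for the previous case, and LAG simply picks those values up. The same lagged running totals can also be carried forward explicitly with LEAVE. Below is a minimal sketch of that approach, assuming the file has already been sorted as above; firstcase, runcount and runsum are made-up helper names, while tcount, tsum and ratio correspond to the TCount, TSum and Ratio columns in the question.

* Flag the first case of each Institution/ID/Type/Code group.
compute firstcase = ($casenum = 1 or
    institution ne lag(institution) or id ne lag(id) or
    type ne lag(type) or code ne lag(code)).
* On a group's first case the lagged totals stay missing and the running
  totals restart; otherwise report the totals accumulated so far, then add
  the current Count and Sum so that the next case will see them.
do if (firstcase = 1).
compute tcount = $sysmis.
compute tsum = $sysmis.
compute runcount = count.
compute runsum = sum.
else.
compute tcount = runcount.
compute tsum = runsum.
compute runcount = runcount + count.
compute runsum = runsum + sum.
end if.
* LEAVE keeps runcount and runsum from being reset to missing on every case.
leave runcount runsum.
compute ratio = tsum / tcount.
execute.

Either way, the essential requirement is that the file is sorted so that each Institution/ID/Type/Code group is contiguous and ordered by StartDate; the transformations themselves are a single pass through the data, which matters with millions of records.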