Using CSUM and LAG within groups and within time period

classic Classic list List threaded Threaded
14 messages Options
Reply | Threaded
Open this post in threaded view
|

Using CSUM and LAG within groups and within time period

Snuffy Dog
SPSS friends,
 
Given the following dataset (excluding the desired output column).
 
I want to calculate the cumulitive sum for each row using the Test_score variable and I want to do this within each id.  That bit I have managed to solve, however to make things more complicated I only want to use those Test_Score values for each case from those taken from the preceeding 12 months. The desired output column shows you the output that I am trying to calculate. In essesence I am using CSUM within a group defined by ID and also within a time frame of the last 12 months.
 
I can work out the CSUM bit but not the lag for the preceeding 12 months.
 
sort cases by ID (A) date (A).
split file by ID.
create cum = CSUM(Test_Score). *but only use values for the last 12 months in the csum
split file off.
 
ID Test_Score Date   Desired_output 
1 0.33 01.02.2007 0.33
1 0.50 20.03.2007 0.83
1 1.00 01.05.2008 1.00
1 1.00 15.05.2008 2.00
1 0.22 01.02.2008 2.22
1 0.22 20.03.2008 2.44
1 1.00 01.05.2008 3.44
1 1.00 15.05.2008 4.44
3 0.10 01.02.2006 0.10
3 0.33 20.03.2006 0.43
3 1.00 01.05.2006 1.43
3 0.98 15.05.2008 0.98
4 0.98 01.02.2008 0.98
4 0.77 02.03.2008 1.75
4 1.00 01.05.2010 1.00
 
Help - this is a tricky one (for me at least)
 
Many thanks,
 
Jack Thomas
Reply | Threaded
Open this post in threaded view
|

Re: Using CSUM and LAG within groups and within time period

Art Kendall
I tried to understand your post and tried some things.   I do not understand your meaning of what you want to include.
Copy the syntax below into a new instance of SPSS and see if it helps to clarify what you are asking.


data list list/
ID (f1)     score(f5.2)     MyDate(date10)       desire (f5.2).
begin data
1     0.33     01.02.2007     0.33
1     0.50     20.03.2007     0.83
1     1.00     01.05.2008     1.00
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
1     0.22     20.03.2008     2.44
1     1.00     01.05.2008     3.44
1     1.00     15.05.2008     4.44
3     0.10     01.02.2006     0.10
3     0.33     20.03.2006     0.43
3     1.00     01.05.2006     1.43
3     0.98     15.05.2008     0.98
4     0.98     01.02.2008     0.98
4     0.77     02.03.2008     1.75
4     1.00     01.05.2010     1.00
end data.
*sort cases by ID (A) mydate (A).
compute seq = $casenum.
formats seq (f3).
match files file=* /keep seq all.
split file by ID.
create cum = CSUM(score). /*but only use values for the last 12 months in the csum.
split file off.
aggregate outfile=* mode=addvariables overwrite = yes /break = id
 /earliest = min(mydate)
 /latest = max(mydate).
formats earliest latest(date10).
compute gap1 = datediff(latest, mydate, "months").
if id eq lag(id) gap2 = datediff(mydate, lag(mydate), "months").
formats gap1 gap2 (f4).
sort cases by seq(d).
compute gap3 = datediff(mydate, earliest, "months").
formats gap3 (f4).
sort cases by seq (a).
if gap1 le 12 use1=score.
if gap2 le 12 use2=score.
if gap3 le 12 use3=score.
split file by id.
create
   try1 = CSUM(use1)
  /try2 = CSUM(use2)
  /try3 = CSUM(use3).
split file off.
list.

Art Kendall
Social Research Consultants

On 12/4/2010 1:47 PM, Snuffy Dog wrote:
SPSS friends,
 
Given the following dataset (excluding the desired output column).
 
I want to calculate the cumulitive sum for each row using the Test_score variable and I want to do this within each id.  That bit I have managed to solve, however to make things more complicated I only want to use those Test_Score values for each case from those taken from the preceeding 12 months. The desired output column shows you the output that I am trying to calculate. In essesence I am using CSUM within a group defined by ID and also within a time frame of the last 12 months.
 
I can work out the CSUM bit but not the lag for the preceeding 12 months.
 
sort cases by ID (A) date (A).
split file by ID.
create cum = CSUM(Test_Score). *but only use values for the last 12 months in the csum
split file off.
 
ID Test_Score Date   Desired_output 
1 0.33 01.02.2007 0.33
1 0.50 20.03.2007 0.83
1 1.00 01.05.2008 1.00
1 1.00 15.05.2008 2.00
1 0.22 01.02.2008 2.22
1 0.22 20.03.2008 2.44
1 1.00 01.05.2008 3.44
1 1.00 15.05.2008 4.44
3 0.10 01.02.2006 0.10
3 0.33 20.03.2006 0.43
3 1.00 01.05.2006 1.43
3 0.98 15.05.2008 0.98
4 0.98 01.02.2008 0.98
4 0.77 02.03.2008 1.75
4 1.00 01.05.2010 1.00
 
Help - this is a tricky one (for me at least)
 
Many thanks,
 
Jack Thomas
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Art Kendall
Social Research Consultants
Reply | Threaded
Open this post in threaded view
|

Re: Using CSUM and LAG within groups and within time period

David Marso
Administrator
In reply to this post by Snuffy Dog
Hey Dawg,
Here is another way to do this.
Restructure the data from long to wide and then use vectors.
Finally restructure from wide to long.
Modify the index values appropriately (8 change to the largest number of values per id).
BTW:  Several of your data values are out of order, so your "desire" column is incorrect.
OR maybe you should have ID as 2.  Anyway, my solution should do the trick.
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
----------------------------------------------------------------------------------------------------
DATA LIST LIST/ ID (f1) score(f5.2) MyDate(date10) desire (f5.2).
BEGIN DATA
1     0.33     01.02.2007     0.33
1     0.50     20.03.2007     0.83
1     1.00     01.05.2008     1.00
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
1     0.22     20.03.2008     2.44
1     1.00     01.05.2008     3.44
1     1.00     15.05.2008     4.44
3     0.10     01.02.2006     0.10
3     0.33     20.03.2006     0.43
3     1.00     01.05.2006     1.43
3     0.98     15.05.2008     0.98
4     0.98     01.02.2008     0.98
4     0.77     02.03.2008     1.75
4     1.00     01.05.2010     1.00
END DATA.
SORT CASES BY id mydate.
IF $CASENUM=1 OR (ID NE lag(ID)) SEQ=1.
IF SYSMIS(SEQ) SEQ=LAG(SEQ)+1.
VECTOR DATES(8) SCORES(8).
COMPUTE DATES(SEQ)=mydate.
COMPUTE SCORES(SEQ)=score.
AGGREGATE OUTFILE *
        / BREAK ID
        / DATES1 TO DATES8 SCORES1 TO SCORES8
     =MAX(DATES1 TO DATES8 SCORES1 TO SCORES8).
FORMATS  DATES1 TO DATES8 (EDATE).

VECTOR DATES=DATES1 TO DATES8 .
VECTOR SCORES=SCORES1 TO SCORES8.
VECTOR CUMS(8).
LOOP #=1 TO 8.
+  COMPUTE CUMS(#)=0.
+  LOOP ##=# TO 1 BY -1.
+    IF CTIME.DAYS(DATES(#)-DATES(##)) LE 365.25 CUMS(#)=CUMS(#) + SCORES(##).
+  END LOOP.
END LOOP.

LOOP #=1 to 8.
+  COMPUTE SCORE=SCORES(#).
+  COMPUTE MYDATE=DATES(#).
+  COMPUTE CUM=CUMS(#).
+  DO IF NOT MISSING(SCORE).
+    XSAVE OUTFILE "TMP.SAV" / KEEP ID SCORE MYDATE CUM.
+  END IF.
END LOOP.
EXE.
GET FILE "TMP.SAV" .

------------------------------------------------------
Another version using CASESTOVARS and VARSTOCASES.
The core is identical.
DATA LIST LIST/ ID (f1) score(f5.2) MyDate(date10) desire (f5.2).
BEGIN DATA
1     0.33     01.02.2007     0.33
1     0.50     20.03.2007     0.83
1     1.00     01.05.2008     1.00
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
1     0.22     20.03.2008     2.44
1     1.00     01.05.2008     3.44
1     1.00     15.05.2008     4.44
3     0.10     01.02.2006     0.10
3     0.33     20.03.2006     0.43
3     1.00     01.05.2006     1.43
3     0.98     15.05.2008     0.98
4     0.98     01.02.2008     0.98
4     0.77     02.03.2008     1.75
4     1.00     01.05.2010     1.00
END DATA.
SORT CASES BY id mydate.
IF $CASENUM=1 OR (ID NE lag(ID)) SEQ=1.
IF SYSMIS(SEQ) SEQ=LAG(SEQ)+1.
FORMATS SEQ(F2.0).
CASESTOVARS /ID = id /INDEX = seq /GROUPBY = VARIABLE .
VECTOR DATES=MYDATE.1 TO MYDATE.8 .
VECTOR SCORES=SCORE.1 TO SCORE.8.
VECTOR CUMS(8).
LOOP #=1 TO 8.
+  COMPUTE CUMS(#)=0.
+  LOOP ##=# TO 1 BY -1.
+    IF CTIME.DAYS(DATES(#)-DATES(##)) LE 365.25 CUMS(#)=CUMS(#) + SCORES(##).
+  END LOOP.
END LOOP.
VARSTOCASES  
  /MAKE mydate FROM mydate.1 TO mydate.8
  /MAKE Score  FROM score.1  TO score.8
  /MAKE cum    FROM cums1    TO cums8
  /KEEP =  id
  /NULL = DROP.
SELECT IF NOT MISSING(SCORE).
LIST.

ID     MYDATE  SCORE      CUM

 1 01-FEB-07     .33      .33
 1 20-MAR-07     .50      .83
 1 01-FEB-08     .22     1.05
 1 20-MAR-08     .22      .44
 1 01-MAY-08    1.00     1.44
 1 01-MAY-08    1.00     2.44
 1 15-MAY-08    1.00     3.44
 1 15-MAY-08    1.00     4.44
 3 01-FEB-06     .10      .10
 3 20-MAR-06     .33      .43
 3 01-MAY-06    1.00     1.43
 3 15-MAY-08     .98      .98
 4 01-FEB-08     .98      .98
 4 02-MAR-08     .77     1.75
 4 01-MAY-10    1.00     1.00


Number of cases read:  15    Number of cases listed:  15



Snuffy Dog wrote
SPSS friends,

Given the following dataset (excluding the desired output column).

I want to calculate the cumulitive sum for each row using the Test_score
variable and I want to do this within each id.  That bit I have managed to
solve, however to make things more complicated I only want to use those
Test_Score values for each case from those taken from the preceeding 12
months. The desired output column shows you the output that I am trying to
calculate. In essesence I am using CSUM within a group defined by ID and
also within a time frame of the last 12 months.

I can work out the CSUM bit but not the lag for the preceeding 12 months.

sort cases by ID (A) date (A).
split file by ID.
create cum = CSUM(Test_Score). *but only use values for the last 12 months
in the csum
split file off.

     ID Test_Score Date   Desired_output  1 0.33 01.02.2007 0.33 1 0.50
20.03.2007 0.83 1 1.00 01.05.2008 1.00 1 1.00 15.05.2008 2.00 1 0.22
01.02.2008 2.22 1 0.22 20.03.2008 2.44 1 1.00 01.05.2008 3.44 1 1.00
15.05.2008 4.44 3 0.10 01.02.2006 0.10 3 0.33 20.03.2006 0.43 3 1.00
01.05.2006 1.43 3 0.98 15.05.2008 0.98 4 0.98 01.02.2008 0.98 4 0.77
02.03.2008 1.75 4 1.00 01.05.2010 1.00

Help - this is a tricky one (for me at least)

Many thanks,

Jack Thomas
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Using CSUM and LAG within groups and within time period

MaxJasper
In reply to this post by Snuffy Dog
Hi Jack,

Here is a simple solution. If you need more definite date differences then you can change 12 month to 365 days.

sc : sum of score for each id during last 12 months.

SORT CASES BY  id(A) date(A).
compute sc =score.
DO IF ($casenum =1 | id > lag(id)).
   compute sc = score.
ELSE if (id = lag(id) & (DATEDIFF(date,lag(date),'month') < 12)) .
   compute sc = sum(lag(sc),  score).
END IF.
EXECUTE.



Max.
Reply | Threaded
Open this post in threaded view
|

Re: Using CSUM and LAG within groups and within time period

MaxJasper
Then output result is as follows:
output of the syntax

Your desire values are different because you did not sort cases by id & date.
Reply | Threaded
Open this post in threaded view
|

Re: Using CSUM and LAG within groups and within time period

David Marso
Administrator
Nice try, but... Sorry, wrong answer!
OP req was that the cum be windowed within one year, not that consecutive values be within 1 year.
I suspect that the simplest solution is what I previously posted.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Using CSUM and LAG within groups and within time period

Snuffy Dog
In reply to this post by David Marso
Hi all,
 
Supplimental question.  David's solution works best however what I neglected to add was that the maximum number of rows per ID grouping variable may change and in some cases goes into the hundreds.  Currently the solution uses 8 vectors, is there an automatic way to work out the the maximum number of vectors required and write the program to dynamically reflect this?
 
Kind regard,
 
Jack

On Sun, Dec 5, 2010 at 12:22 PM, David Marso <[hidden email]> wrote:
Hey Dawg,
Here is another way to do this.
Restructure the data from long to wide and then use vectors.
Finally restructure from wide to long.
Modify the index values appropriately (8 change to the largest number of
values per id).
BTW:  Several of your data values are out of order, so your "desire" column
is incorrect.
OR maybe you should have ID as 2.  Anyway, my solution should do the trick.
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
----------------------------------------------------------------------------------------------------
DATA LIST LIST/ ID (f1) score(f5.2) MyDate(date10) desire (f5.2).
BEGIN DATA
1     0.33     01.02.2007     0.33
1     0.50     20.03.2007     0.83
1     1.00     01.05.2008     1.00
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
1     0.22     20.03.2008     2.44
1     1.00     01.05.2008     3.44
1     1.00     15.05.2008     4.44
3     0.10     01.02.2006     0.10
3     0.33     20.03.2006     0.43
3     1.00     01.05.2006     1.43
3     0.98     15.05.2008     0.98
4     0.98     01.02.2008     0.98
4     0.77     02.03.2008     1.75
4     1.00     01.05.2010     1.00
END DATA.
SORT CASES BY id mydate.
IF $CASENUM=1 OR (ID NE lag(ID)) SEQ=1.
IF SYSMIS(SEQ) SEQ=LAG(SEQ)+1.
VECTOR DATES(8) SCORES(8).
COMPUTE DATES(SEQ)=mydate.
COMPUTE SCORES(SEQ)=score.
AGGREGATE OUTFILE *
       / BREAK ID
       / DATES1 TO DATES8 SCORES1 TO SCORES8
    =MAX(DATES1 TO DATES8 SCORES1 TO SCORES8).
FORMATS  DATES1 TO DATES8 (EDATE).

VECTOR DATES=DATES1 TO DATES8 .
VECTOR SCORES=SCORES1 TO SCORES8.
VECTOR CUMS(8).
LOOP #=1 TO 8.
+  COMPUTE CUMS(#)=0.
+  LOOP ##=# TO 1 BY -1.
+    IF CTIME.DAYS(DATES(#)-DATES(##)) LE 365.25 CUMS(#)=CUMS(#) +
SCORES(##).
+  END LOOP.
END LOOP.

LOOP #=1 to 8.
+  COMPUTE SCORE=SCORES(#).
+  COMPUTE MYDATE=DATES(#).
+  COMPUTE CUM=CUMS(#).
+  DO IF NOT MISSING(SCORE).
+    XSAVE OUTFILE "TMP.SAV" / KEEP ID SCORE MYDATE CUM.
+  END IF.
END LOOP.
EXE.
GET FILE "TMP.SAV" .

------------------------------------------------------
Another version using CASESTOVARS and VARSTOCASES.
The core is identical.
DATA LIST LIST/ ID (f1) score(f5.2) MyDate(date10) desire (f5.2).
BEGIN DATA
1     0.33     01.02.2007     0.33
1     0.50     20.03.2007     0.83
1     1.00     01.05.2008     1.00
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
1     0.22     20.03.2008     2.44
1     1.00     01.05.2008     3.44
1     1.00     15.05.2008     4.44
3     0.10     01.02.2006     0.10
3     0.33     20.03.2006     0.43
3     1.00     01.05.2006     1.43
3     0.98     15.05.2008     0.98
4     0.98     01.02.2008     0.98
4     0.77     02.03.2008     1.75
4     1.00     01.05.2010     1.00
END DATA.
SORT CASES BY id mydate.
IF $CASENUM=1 OR (ID NE lag(ID)) SEQ=1.
IF SYSMIS(SEQ) SEQ=LAG(SEQ)+1.
FORMATS SEQ(F2.0).
CASESTOVARS /ID = id /INDEX = seq /GROUPBY = VARIABLE .
VECTOR DATES=MYDATE.1 TO MYDATE.8 .
VECTOR SCORES=SCORE.1 TO SCORE.8.
VECTOR CUMS(8).
LOOP #=1 TO 8.
+  COMPUTE CUMS(#)=0.
+  LOOP ##=# TO 1 BY -1.
+    IF CTIME.DAYS(DATES(#)-DATES(##)) LE 365.25 CUMS(#)=CUMS(#) +
SCORES(##).
+  END LOOP.
END LOOP.
VARSTOCASES
 /MAKE mydate FROM mydate.1 TO mydate.8
 /MAKE Score  FROM score.1  TO score.8
 /MAKE cum    FROM cums1    TO cums8
 /KEEP =  id
 /NULL = DROP.
SELECT IF NOT MISSING(SCORE).
LIST.

ID     MYDATE  SCORE      CUM

 1 01-FEB-07     .33      .33
 1 20-MAR-07     .50      .83
 1 01-FEB-08     .22     1.05
 1 20-MAR-08     .22      .44
 1 01-MAY-08    1.00     1.44
 1 01-MAY-08    1.00     2.44
 1 15-MAY-08    1.00     3.44
 1 15-MAY-08    1.00     4.44
 3 01-FEB-06     .10      .10
 3 20-MAR-06     .33      .43
 3 01-MAY-06    1.00     1.43
 3 15-MAY-08     .98      .98
 4 01-FEB-08     .98      .98
 4 02-MAR-08     .77     1.75
 4 01-MAY-10    1.00     1.00


Number of cases read:  15    Number of cases listed:  15




Snuffy Dog wrote:
>
> SPSS friends,
>
> Given the following dataset (excluding the desired output column).
>
> I want to calculate the cumulitive sum for each row using the Test_score
> variable and I want to do this within each id.  That bit I have managed to
> solve, however to make things more complicated I only want to use those
> Test_Score values for each case from those taken from the preceeding 12
> months. The desired output column shows you the output that I am trying to
> calculate. In essesence I am using CSUM within a group defined by ID and
> also within a time frame of the last 12 months.
>
> I can work out the CSUM bit but not the lag for the preceeding 12 months.
>
> sort cases by ID (A) date (A).
> split file by ID.
> create cum = CSUM(Test_Score). *but only use values for the last 12 months
> in the csum
> split file off.
>
>      ID Test_Score Date   Desired_output  1 0.33 01.02.2007 0.33 1 0.50
> 20.03.2007 0.83 1 1.00 01.05.2008 1.00 1 1.00 15.05.2008 2.00 1 0.22
> 01.02.2008 2.22 1 0.22 20.03.2008 2.44 1 1.00 01.05.2008 3.44 1 1.00
> 15.05.2008 4.44 3 0.10 01.02.2006 0.10 3 0.33 20.03.2006 0.43 3 1.00
> 01.05.2006 1.43 3 0.98 15.05.2008 0.98 4 0.98 01.02.2008 0.98 4 0.77
> 02.03.2008 1.75 4 1.00 01.05.2010 1.00
>
> Help - this is a tricky one (for me at least)
>
> Many thanks,
>
> Jack Thomas
>
>

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Using-CSUM-and-LAG-within-groups-and-within-time-period-tp3292384p3292649.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Reply | Threaded
Open this post in threaded view
|

Re: Using CSUM and LAG within groups and within time period

Jon K Peck
Here is a solution using programmability with no constraints on the number of records or long to wide conversion.
It uses the SPSSINC TRANS extension command available from SPSS Developer Central.
For simplicity, it is using a nominal 365-day year, but it could be made to use exact calendar differences with a little more work.

It wasn't clear to me whether to include a score from exactly 12 months ago.  In the test data, based on the desired output values, I excluded such a value, but by changing the "<" in the comparison to "<=" it would work the other way.

I can send the program off-line if the email messes up the indentation below.

data list free/ID(F2.0) Test_Score(F8.2) Date(EDATE10) Desired_output.(F8.2).
begin data
1         0.33         01.02.2007         0.33
1         0.50         20.03.2007         0.83
1         1.00         01.05.2008         1.00
1         1.00         15.05.2008         2.00
1         0.22         01.02.2008         2.22
1         0.22         20.03.2008         2.44
1         1.00         01.05.2008         3.44
1         1.00         15.05.2008         4.44
3         0.10         01.02.2006         0.10
3         0.33         20.03.2006         0.43
3         1.00         01.05.2006         1.43
3         0.98         15.05.2008         0.98
4         0.98         01.02.2008         0.98
4         0.77         02.03.2008         1.75
4         1.00         01.05.2010         1.00
end data.
exec.

* This program will be called for each case.
begin program.
import spss, sys
nominalyear = 365*24*60*60

def csum(ID, Test_Score, Date):
  global history, previd
  if not hasattr (sys.modules[__name__], "previd") or ID != previd:
    previd = ID
    history = set()
  history.add((Date, Test_Score))
  print history
  thecsum = 0
  for item in history:
    if Date - item[0] < nominalyear:
      thecsum += item[1]
  return thecsum
end program.

SPSSINC TRANS result=csum
/FORMULA csum(ID, Test_Score, Date).

HTH,


Jon Peck
Senior Software Engineer, IBM
[hidden email]
312-651-3435




From:        Snuffy Dog <[hidden email]>
To:        [hidden email]
Date:        12/05/2010 05:20 AM
Subject:        Re: [SPSSX-L] Using CSUM and LAG within groups and within time              period
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




Hi all,
 
Supplimental question.  David's solution works best however what I neglected to add was that the maximum number of rows per ID grouping variable may change and in some cases goes into the hundreds.  Currently the solution uses 8 vectors, is there an automatic way to work out the the maximum number of vectors required and write the program to dynamically reflect this?
 
Kind regard,
 
Jack

On Sun, Dec 5, 2010 at 12:22 PM, David Marso <david.marso@...> wrote:
Hey Dawg,
Here is another way to do this.
Restructure the data from long to wide and then use vectors.
Finally restructure from wide to long.
Modify the index values appropriately (8 change to the largest number of
values per id).
BTW:  Several of your data values are out of order, so your "desire" column
is incorrect.
OR maybe you should have ID as 2.  Anyway, my solution should do the trick.

1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22

----------------------------------------------------------------------------------------------------
DATA LIST LIST/ ID (f1) score(f5.2) MyDate(date10) desire (f5.2).
BEGIN DATA

1     0.33     01.02.2007     0.33
1     0.50     20.03.2007     0.83
1     1.00     01.05.2008     1.00
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
1     0.22     20.03.2008     2.44
1     1.00     01.05.2008     3.44
1     1.00     15.05.2008     4.44
3     0.10     01.02.2006     0.10
3     0.33     20.03.2006     0.43
3     1.00     01.05.2006     1.43
3     0.98     15.05.2008     0.98
4     0.98     01.02.2008     0.98
4     0.77     02.03.2008     1.75
4     1.00     01.05.2010     1.00

END DATA.
SORT CASES BY id mydate.
IF $CASENUM=1 OR (ID NE lag(ID)) SEQ=1.
IF SYSMIS(SEQ) SEQ=LAG(SEQ)+1.
VECTOR DATES(8) SCORES(8).
COMPUTE DATES(SEQ)=mydate.
COMPUTE SCORES(SEQ)=score.
AGGREGATE OUTFILE *
       / BREAK ID
       / DATES1 TO DATES8 SCORES1 TO SCORES8
    =MAX(DATES1 TO DATES8 SCORES1 TO SCORES8).
FORMATS  DATES1 TO DATES8 (EDATE).

VECTOR DATES=DATES1 TO DATES8 .
VECTOR SCORES=SCORES1 TO SCORES8.
VECTOR CUMS(8).
LOOP #=1 TO 8.
+  COMPUTE CUMS(#)=0.
+  LOOP ##=# TO 1 BY -1.
+    IF CTIME.DAYS(DATES(#)-DATES(##)) LE 365.25 CUMS(#)=CUMS(#) +
SCORES(##).
+  END LOOP.
END LOOP.

LOOP #=1 to 8.
+  COMPUTE SCORE=SCORES(#).
+  COMPUTE MYDATE=DATES(#).
+  COMPUTE CUM=CUMS(#).
+  DO IF NOT MISSING(SCORE).
+    XSAVE OUTFILE "TMP.SAV" / KEEP ID SCORE MYDATE CUM.
+  END IF.
END LOOP.
EXE.
GET FILE "TMP.SAV" .

------------------------------------------------------
Another version using CASESTOVARS and VARSTOCASES.
The core is identical.
DATA LIST LIST/ ID (f1) score(f5.2) MyDate(date10) desire (f5.2).
BEGIN DATA

1     0.33     01.02.2007     0.33
1     0.50     20.03.2007     0.83
1     1.00     01.05.2008     1.00
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
1     0.22     20.03.2008     2.44
1     1.00     01.05.2008     3.44
1     1.00     15.05.2008     4.44
3     0.10     01.02.2006     0.10
3     0.33     20.03.2006     0.43
3     1.00     01.05.2006     1.43
3     0.98     15.05.2008     0.98
4     0.98     01.02.2008     0.98
4     0.77     02.03.2008     1.75
4     1.00     01.05.2010     1.00

END DATA.
SORT CASES BY id mydate.
IF $CASENUM=1 OR (ID NE lag(ID)) SEQ=1.
IF SYSMIS(SEQ) SEQ=LAG(SEQ)+1.
FORMATS SEQ(F2.0).
CASESTOVARS /ID = id /INDEX = seq /GROUPBY = VARIABLE .
VECTOR DATES=MYDATE.1 TO MYDATE.8 .
VECTOR SCORES=SCORE.1 TO SCORE.8.
VECTOR CUMS(8).
LOOP #=1 TO 8.
+  COMPUTE CUMS(#)=0.
+  LOOP ##=# TO 1 BY -1.
+    IF CTIME.DAYS(DATES(#)-DATES(##)) LE 365.25 CUMS(#)=CUMS(#) +
SCORES(##).
+  END LOOP.
END LOOP.
VARSTOCASES
 /MAKE mydate FROM mydate.1 TO mydate.8
 /MAKE Score  FROM score.1  TO score.8
 /MAKE cum    FROM cums1    TO cums8
 /KEEP =  id
 /NULL = DROP.
SELECT IF NOT MISSING(SCORE).
LIST.

ID     MYDATE  SCORE      CUM

 1 01-FEB-07     .33      .33
 1 20-MAR-07     .50      .83
 1 01-FEB-08     .22     1.05
 1 20-MAR-08     .22      .44
 1 01-MAY-08    1.00     1.44
 1 01-MAY-08    1.00     2.44
 1 15-MAY-08    1.00     3.44
 1 15-MAY-08    1.00     4.44
 3 01-FEB-06     .10      .10
 3 20-MAR-06     .33      .43
 3 01-MAY-06    1.00     1.43
 3 15-MAY-08     .98      .98
 4 01-FEB-08     .98      .98
 4 02-MAR-08     .77     1.75
 4 01-MAY-10    1.00     1.00


Number of cases read:  15    Number of cases listed:  15





Snuffy Dog wrote:
>
> SPSS friends,
>
> Given the following dataset (excluding the desired output column).
>
> I want to calculate the cumulitive sum for each row using the Test_score
> variable and I want to do this within each id.  That bit I have managed to
> solve, however to make things more complicated I only want to use those
> Test_Score values for each case from those taken from the preceeding 12
> months. The desired output column shows you the output that I am trying to
> calculate. In essesence I am using CSUM within a group defined by ID and
> also within a time frame of the last 12 months.
>
> I can work out the CSUM bit but not the lag for the preceeding 12 months.
>
> sort cases by ID (A) date (A).
> split file by ID.
> create cum = CSUM(Test_Score). *but only use values for the last 12 months
> in the csum
> split file off.
>
>      ID Test_Score Date   Desired_output  1 0.33 01.02.2007 0.33 1 0.50
> 20.03.2007 0.83 1 1.00 01.05.2008 1.00 1 1.00 15.05.2008 2.00 1 0.22
> 01.02.2008 2.22 1 0.22 20.03.2008 2.44 1 1.00 01.05.2008 3.44 1 1.00
> 15.05.2008 4.44 3 0.10 01.02.2006 0.10 3 0.33 20.03.2006 0.43 3 1.00
> 01.05.2006 1.43 3 0.98 15.05.2008 0.98 4 0.98 01.02.2008 0.98 4 0.77
> 02.03.2008 1.75 4 1.00 01.05.2010 1.00
>
> Help - this is a tricky one (for me at least)
>
> Many thanks,
>
> Jack Thomas
>
>

--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/Using-CSUM-and-LAG-within-groups-and-within-time-period-tp3292384p3292649.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to

LISTSERV@... (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD


Reply | Threaded
Open this post in threaded view
|

Re: Using CSUM and LAG within groups and within time period

Snuffy Dog
Ok I worked out my Python integration was faulty and this is now fixed. 
 
Jon your solution works well except that is does not sum scores within the same ID that have the same date.  Note for example that I have two dates ( 01-05-2008 and 15-05-2008) which are duplicated over two rows within the ID=1 group.  The program is only taking one score per date where I need it to take a score from each date.
 
 
On Mon, Dec 6, 2010 at 5:13 AM, Snuffy Dog <[hidden email]> wrote:
Thanks Jon thats perfect, although it is casuing me some problems.  I got it to run once successfully however when I attempted to run this a second time I got an error message

>Error # 6890.  Command name: begin program
>Configuration file spssdxcfg.ini is invalid.
>Execution of this command stops.
Configration file spssdxcfg.ini is invalid because the LIB_NAME is NULL.
 
SPSSINC TRANS result=csum
 
>Error # 1.  Command name: SPSSINC
>The first word in the line is not recognized as an SPSS Statistics command.

/FORMULA csum(ID, Test_Score, Date).

 
 
 
 
 
On Mon, Dec 6, 2010 at 3:04 AM, Jon K Peck <[hidden email]> wrote:
Here is a solution using programmability with no constraints on the number of records or long to wide conversion.
It uses the SPSSINC TRANS extension command available from SPSS Developer Central.
For simplicity, it is using a nominal 365-day year, but it could be made to use exact calendar differences with a little more work.

It wasn't clear to me whether to include a score from exactly 12 months ago.  In the test data, based on the desired output values, I excluded such a value, but by changing the "<" in the comparison to "<=" it would work the other way.

I can send the program off-line if the email messes up the indentation below.

data list free/ID(F2.0) Test_Score(F8.2) Date(EDATE10) Desired_output.(F8.2).
begin data
1         0.33         01.02.2007         0.33
1         0.50         20.03.2007         0.83
1         1.00         01.05.2008         1.00
1         1.00         15.05.2008         2.00
1         0.22         01.02.2008         2.22
1         0.22         20.03.2008         2.44
1         1.00         01.05.2008         3.44
1         1.00         15.05.2008         4.44
3         0.10         01.02.2006         0.10
3         0.33         20.03.2006         0.43
3         1.00         01.05.2006         1.43
3         0.98         15.05.2008         0.98
4         0.98         01.02.2008         0.98
4         0.77         02.03.2008         1.75
4         1.00         01.05.2010         1.00
end data.
exec.

* This program will be called for each case.
begin program.
import spss, sys
nominalyear = 365*24*60*60

def csum(ID, Test_Score, Date):
  global history, previd
  if not hasattr (sys.modules[__name__], "previd") or ID != previd:
    previd = ID
    history = set()
  history.add((Date, Test_Score))
  print history
  thecsum = 0
  for item in history:
    if Date - item[0] < nominalyear:
      thecsum += item[1]
  return thecsum
end program.

SPSSINC TRANS result=csum
/FORMULA csum(ID, Test_Score, Date).

HTH,


Jon Peck
Senior Software Engineer, IBM
[hidden email]
312-651-3435




From:        Snuffy Dog <[hidden email]>
To:        [hidden email]
Date:        12/05/2010 05:20 AM
Subject:        Re: [SPSSX-L] Using CSUM and LAG within groups and within time              period
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




Hi all,
 
Supplimental question.  David's solution works best however what I neglected to add was that the maximum number of rows per ID grouping variable may change and in some cases goes into the hundreds.  Currently the solution uses 8 vectors, is there an automatic way to work out the the maximum number of vectors required and write the program to dynamically reflect this?
 
Kind regard,
 
Jack

On Sun, Dec 5, 2010 at 12:22 PM, David Marso <[hidden email]> wrote:
Hey Dawg,
Here is another way to do this.
Restructure the data from long to wide and then use vectors.
Finally restructure from wide to long.
Modify the index values appropriately (8 change to the largest number of
values per id).
BTW:  Several of your data values are out of order, so your "desire" column
is incorrect.
OR maybe you should have ID as 2.  Anyway, my solution should do the trick.

1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22

----------------------------------------------------------------------------------------------------
DATA LIST LIST/ ID (f1) score(f5.2) MyDate(date10) desire (f5.2).
BEGIN DATA

1     0.33     01.02.2007     0.33
1     0.50     20.03.2007     0.83
1     1.00     01.05.2008     1.00
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
1     0.22     20.03.2008     2.44
1     1.00     01.05.2008     3.44
1     1.00     15.05.2008     4.44
3     0.10     01.02.2006     0.10
3     0.33     20.03.2006     0.43
3     1.00     01.05.2006     1.43
3     0.98     15.05.2008     0.98
4     0.98     01.02.2008     0.98
4     0.77     02.03.2008     1.75
4     1.00     01.05.2010     1.00

END DATA.
SORT CASES BY id mydate.
IF $CASENUM=1 OR (ID NE lag(ID)) SEQ=1.
IF SYSMIS(SEQ) SEQ=LAG(SEQ)+1.
VECTOR DATES(8) SCORES(8).
COMPUTE DATES(SEQ)=mydate.
COMPUTE SCORES(SEQ)=score.
AGGREGATE OUTFILE *
       / BREAK ID
       / DATES1 TO DATES8 SCORES1 TO SCORES8
    =MAX(DATES1 TO DATES8 SCORES1 TO SCORES8).
FORMATS  DATES1 TO DATES8 (EDATE).

VECTOR DATES=DATES1 TO DATES8 .
VECTOR SCORES=SCORES1 TO SCORES8.
VECTOR CUMS(8).
LOOP #=1 TO 8.
+  COMPUTE CUMS(#)=0.
+  LOOP ##=# TO 1 BY -1.
+    IF CTIME.DAYS(DATES(#)-DATES(##)) LE 365.25 CUMS(#)=CUMS(#) +
SCORES(##).
+  END LOOP.
END LOOP.

LOOP #=1 to 8.
+  COMPUTE SCORE=SCORES(#).
+  COMPUTE MYDATE=DATES(#).
+  COMPUTE CUM=CUMS(#).
+  DO IF NOT MISSING(SCORE).
+    XSAVE OUTFILE "TMP.SAV" / KEEP ID SCORE MYDATE CUM.
+  END IF.
END LOOP.
EXE.
GET FILE "TMP.SAV" .

------------------------------------------------------
Another version using CASESTOVARS and VARSTOCASES.
The core is identical.
DATA LIST LIST/ ID (f1) score(f5.2) MyDate(date10) desire (f5.2).
BEGIN DATA

1     0.33     01.02.2007     0.33
1     0.50     20.03.2007     0.83
1     1.00     01.05.2008     1.00
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
1     0.22     20.03.2008     2.44
1     1.00     01.05.2008     3.44
1     1.00     15.05.2008     4.44
3     0.10     01.02.2006     0.10
3     0.33     20.03.2006     0.43
3     1.00     01.05.2006     1.43
3     0.98     15.05.2008     0.98
4     0.98     01.02.2008     0.98
4     0.77     02.03.2008     1.75
4     1.00     01.05.2010     1.00

END DATA.
SORT CASES BY id mydate.
IF $CASENUM=1 OR (ID NE lag(ID)) SEQ=1.
IF SYSMIS(SEQ) SEQ=LAG(SEQ)+1.
FORMATS SEQ(F2.0).
CASESTOVARS /ID = id /INDEX = seq /GROUPBY = VARIABLE .
VECTOR DATES=MYDATE.1 TO MYDATE.8 .
VECTOR SCORES=SCORE.1 TO SCORE.8.
VECTOR CUMS(8).
LOOP #=1 TO 8.
+  COMPUTE CUMS(#)=0.
+  LOOP ##=# TO 1 BY -1.
+    IF CTIME.DAYS(DATES(#)-DATES(##)) LE 365.25 CUMS(#)=CUMS(#) +
SCORES(##).
+  END LOOP.
END LOOP.
VARSTOCASES
 /MAKE mydate FROM mydate.1 TO mydate.8
 /MAKE Score  FROM score.1  TO score.8
 /MAKE cum    FROM cums1    TO cums8
 /KEEP =  id
 /NULL = DROP.
SELECT IF NOT MISSING(SCORE).
LIST.

ID     MYDATE  SCORE      CUM

 1 01-FEB-07     .33      .33
 1 20-MAR-07     .50      .83
 1 01-FEB-08     .22     1.05
 1 20-MAR-08     .22      .44
 1 01-MAY-08    1.00     1.44
 1 01-MAY-08    1.00     2.44
 1 15-MAY-08    1.00     3.44
 1 15-MAY-08    1.00     4.44
 3 01-FEB-06     .10      .10
 3 20-MAR-06     .33      .43
 3 01-MAY-06    1.00     1.43
 3 15-MAY-08     .98      .98
 4 01-FEB-08     .98      .98
 4 02-MAR-08     .77     1.75
 4 01-MAY-10    1.00     1.00


Number of cases read:  15    Number of cases listed:  15





Snuffy Dog wrote:
>
> SPSS friends,
>
> Given the following dataset (excluding the desired output column).
>
> I want to calculate the cumulitive sum for each row using the Test_score
> variable and I want to do this within each id.  That bit I have managed to
> solve, however to make things more complicated I only want to use those
> Test_Score values for each case from those taken from the preceeding 12
> months. The desired output column shows you the output that I am trying to
> calculate. In essesence I am using CSUM within a group defined by ID and
> also within a time frame of the last 12 months.
>
> I can work out the CSUM bit but not the lag for the preceeding 12 months.
>
> sort cases by ID (A) date (A).
> split file by ID.
> create cum = CSUM(Test_Score). *but only use values for the last 12 months
> in the csum
> split file off.
>
>      ID Test_Score Date   Desired_output  1 0.33 01.02.2007 0.33 1 0.50
> 20.03.2007 0.83 1 1.00 01.05.2008 1.00 1 1.00 15.05.2008 2.00 1 0.22
> 01.02.2008 2.22 1 0.22 20.03.2008 2.44 1 1.00 01.05.2008 3.44 1 1.00
> 15.05.2008 4.44 3 0.10 01.02.2006 0.10 3 0.33 20.03.2006 0.43 3 1.00
> 01.05.2006 1.43 3 0.98 15.05.2008 0.98 4 0.98 01.02.2008 0.98 4 0.77
> 02.03.2008 1.75 4 1.00 01.05.2010 1.00
>
> Help - this is a tricky one (for me at least)
>
> Many thanks,
>
> Jack Thomas
>
>

--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/Using-CSUM-and-LAG-within-groups-and-within-time-period-tp3292384p3292649.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to

[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD




Reply | Threaded
Open this post in threaded view
|

Re: Using CSUM and LAG within groups and within time period

Snuffy Dog
Thank you Jon that worked perfectly.
 
I did work on a numpty solution to the duplicate date problem by creating a dequence counter within each group and then adding this as a time seconds value to the date variable so that each date was unquie by at least 1 second but your solution is so much better -
 
Thank you very much for your help, I am both grateful and impressed.
 
Jack

On Mon, Dec 6, 2010 at 8:17 AM, Jon K Peck <[hidden email]> wrote:
I didn't realize that you could have more than one instance on the same date.  This variation should deal with that.
begin program.
import spss, sys
nominalyear = 365*24*60*60

def csum(ID, Test_Score, Date):
  global history, previd
  if not hasattr (sys.modules[__name__], "previd") or ID != previd:
    previd = ID
    history = []
  history.append((Date, Test_Score))
  thecsum = 0
  for item in history:
    if Date - item[0] < nominalyear:
      thecsum += item[1]
  return thecsum
end program.

SPSSINC TRANS result=csum
/FORMULA csum(ID, Test_Score, Date).

p.s. I saw your note that you got programmability working.


Jon Peck
Senior Software Engineer, IBM
[hidden email]
312-651-3435




From:        Snuffy Dog <[hidden email]>
To:        [hidden email]
Date:        12/05/2010 02:06 PM
Subject:        Re: [SPSSX-L] Using CSUM and LAG within groups and within time              period
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




Ok I worked out my Python integration was faulty and this is now fixed. 
 
Jon your solution works well except that is does not sum scores within the same ID that have the same date.  Note for example that I have two dates ( 01-05-2008 and 15-05-2008) which are duplicated over two rows within the ID=1 group.  The program is only taking one score per date where I need it to take a score from each date.
 
 
On Mon, Dec 6, 2010 at 5:13 AM, Snuffy Dog <[hidden email]> wrote:
Thanks Jon thats perfect, although it is casuing me some problems.  I got it to run once successfully however when I attempted to run this a second time I got an error message

>Error # 6890.  Command name: begin program
>Configuration file spssdxcfg.ini is invalid.
>Execution of this command stops.
Configration file spssdxcfg.ini is invalid because the LIB_NAME is NULL.
 
SPSSINC TRANS result=csum
 
>Error # 1.  Command name: SPSSINC
>The first word in the line is not recognized as an SPSS Statistics command.

/FORMULA csum(ID, Test_Score, Date).

 
 
 
 
 
On Mon, Dec 6, 2010 at 3:04 AM, Jon K Peck <[hidden email]> wrote:
Here is a solution using programmability with no constraints on the number of records or long to wide conversion.
It uses the SPSSINC TRANS extension command available from SPSS Developer Central.

For simplicity, it is using a nominal 365-day year, but it could be made to use exact calendar differences with a little more work.


It wasn't clear to me whether to include a score from exactly 12 months ago.  In the test data, based on the desired output values, I excluded such a value, but by changing the "<" in the comparison to "<=" it would work the other way.


I can send the program off-line if the email messes up the indentation below.


data list free/ID(F2.0) Test_Score(F8.2) Date(EDATE10) Desired_output.(F8.2).

begin data

1         0.33         01.02.2007         0.33
1         0.50         20.03.2007         0.83

1         1.00         01.05.2008         1.00

1         1.00         15.05.2008         2.00

1         0.22         01.02.2008         2.22

1         0.22         20.03.2008         2.44

1         1.00         01.05.2008         3.44

1         1.00         15.05.2008         4.44

3         0.10         01.02.2006         0.10

3         0.33         20.03.2006         0.43

3         1.00         01.05.2006         1.43

3         0.98         15.05.2008         0.98

4         0.98         01.02.2008         0.98

4         0.77         02.03.2008         1.75

4         1.00         01.05.2010         1.00

end data.
exec.


* This program will be called for each case.

begin program.

import spss, sys

nominalyear = 365*24*60*60


def csum(ID, Test_Score, Date):

  global history, previd

  if not hasattr (sys.modules[__name__], "previd") or ID != previd:

    previd = ID

    history = set()

  history.add((Date, Test_Score))

  print history

  thecsum = 0

  for item in history:

    if Date - item[0] < nominalyear:

      thecsum += item[1]

  return thecsum

end program.


SPSSINC TRANS result=csum

/FORMULA csum(ID, Test_Score, Date).


HTH,



Jon Peck
Senior Software Engineer, IBM

[hidden email]
312-651-3435




From:        
Snuffy Dog <[hidden email]>
To:        
[hidden email]
Date:        
12/05/2010 05:20 AM
Subject:        
Re: [SPSSX-L] Using CSUM and LAG within groups and within time              period
Sent by:        
"SPSSX(r) Discussion" <[hidden email]>






Hi all,
 
Supplimental question.  David's solution works best however what I neglected to add was that the maximum number of rows per ID grouping variable may change and in some cases goes into the hundreds.  Currently the solution uses 8 vectors, is there an automatic way to work out the the maximum number of vectors required and write the program to dynamically reflect this?
 
Kind regard,
 
Jack

On Sun, Dec 5, 2010 at 12:22 PM, David Marso <
[hidden email]> wrote:
Hey Dawg,
Here is another way to do this.
Restructure the data from long to wide and then use vectors.
Finally restructure from wide to long.
Modify the index values appropriately (8 change to the largest number of
values per id).
BTW:  Several of your data values are out of order, so your "desire" column
is incorrect.
OR maybe you should have ID as 2.  Anyway, my solution should do the trick.
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
----------------------------------------------------------------------------------------------------
DATA LIST LIST/ ID (f1) score(f5.2) MyDate(date10) desire (f5.2).
BEGIN DATA
1     0.33     01.02.2007     0.33
1     0.50     20.03.2007     0.83
1     1.00     01.05.2008     1.00
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
1     0.22     20.03.2008     2.44
1     1.00     01.05.2008     3.44
1     1.00     15.05.2008     4.44
3     0.10     01.02.2006     0.10
3     0.33     20.03.2006     0.43
3     1.00     01.05.2006     1.43
3     0.98     15.05.2008     0.98
4     0.98     01.02.2008     0.98
4     0.77     02.03.2008     1.75
4     1.00     01.05.2010     1.00
END DATA.
SORT CASES BY id mydate.
IF $CASENUM=1 OR (ID NE lag(ID)) SEQ=1.
IF SYSMIS(SEQ) SEQ=LAG(SEQ)+1.
VECTOR DATES(8) SCORES(8).
COMPUTE DATES(SEQ)=mydate.
COMPUTE SCORES(SEQ)=score.
AGGREGATE OUTFILE *
       / BREAK ID
       / DATES1 TO DATES8 SCORES1 TO SCORES8
    =MAX(DATES1 TO DATES8 SCORES1 TO SCORES8).
FORMATS  DATES1 TO DATES8 (EDATE).

VECTOR DATES=DATES1 TO DATES8 .
VECTOR SCORES=SCORES1 TO SCORES8.
VECTOR CUMS(8).
LOOP #=1 TO 8.
+  COMPUTE CUMS(#)=0.
+  LOOP ##=# TO 1 BY -1.
+    IF CTIME.DAYS(DATES(#)-DATES(##)) LE 365.25 CUMS(#)=CUMS(#) +
SCORES(##).
+  END LOOP.
END LOOP.

LOOP #=1 to 8.
+  COMPUTE SCORE=SCORES(#).
+  COMPUTE MYDATE=DATES(#).
+  COMPUTE CUM=CUMS(#).
+  DO IF NOT MISSING(SCORE).
+    XSAVE OUTFILE "TMP.SAV" / KEEP ID SCORE MYDATE CUM.
+  END IF.
END LOOP.
EXE.
GET FILE "TMP.SAV" .

------------------------------------------------------
Another version using CASESTOVARS and VARSTOCASES.
The core is identical.
DATA LIST LIST/ ID (f1) score(f5.2) MyDate(date10) desire (f5.2).
BEGIN DATA
1     0.33     01.02.2007     0.33
1     0.50     20.03.2007     0.83
1     1.00     01.05.2008     1.00
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
1     0.22     20.03.2008     2.44
1     1.00     01.05.2008     3.44
1     1.00     15.05.2008     4.44
3     0.10     01.02.2006     0.10
3     0.33     20.03.2006     0.43
3     1.00     01.05.2006     1.43
3     0.98     15.05.2008     0.98
4     0.98     01.02.2008     0.98
4     0.77     02.03.2008     1.75
4     1.00     01.05.2010     1.00
END DATA.
SORT CASES BY id mydate.
IF $CASENUM=1 OR (ID NE lag(ID)) SEQ=1.
IF SYSMIS(SEQ) SEQ=LAG(SEQ)+1.
FORMATS SEQ(F2.0).
CASESTOVARS /ID = id /INDEX = seq /GROUPBY = VARIABLE .
VECTOR DATES=MYDATE.1 TO MYDATE.8 .
VECTOR SCORES=SCORE.1 TO SCORE.8.
VECTOR CUMS(8).
LOOP #=1 TO 8.
+  COMPUTE CUMS(#)=0.
+  LOOP ##=# TO 1 BY -1.
+    IF CTIME.DAYS(DATES(#)-DATES(##)) LE 365.25 CUMS(#)=CUMS(#) +
SCORES(##).
+  END LOOP.
END LOOP.
VARSTOCASES
 /MAKE mydate FROM mydate.1 TO mydate.8
 /MAKE Score  FROM score.1  TO score.8
 /MAKE cum    FROM cums1    TO cums8
 /KEEP =  id
 /NULL = DROP.
SELECT IF NOT MISSING(SCORE).
LIST.

ID     MYDATE  SCORE      CUM

 1 01-FEB-07     .33      .33
 1 20-MAR-07     .50      .83
 1 01-FEB-08     .22     1.05
 1 20-MAR-08     .22      .44
 1 01-MAY-08    1.00     1.44
 1 01-MAY-08    1.00     2.44
 1 15-MAY-08    1.00     3.44
 1 15-MAY-08    1.00     4.44
 3 01-FEB-06     .10      .10
 3 20-MAR-06     .33      .43
 3 01-MAY-06    1.00     1.43
 3 15-MAY-08     .98      .98
 4 01-FEB-08     .98      .98
 4 02-MAR-08     .77     1.75
 4 01-MAY-10    1.00     1.00


Number of cases read:  15    Number of cases listed:  15




Snuffy Dog wrote:
>
> SPSS friends,
>
> Given the following dataset (excluding the desired output column).
>
> I want to calculate the cumulitive sum for each row using the Test_score
> variable and I want to do this within each id.  That bit I have managed to
> solve, however to make things more complicated I only want to use those
> Test_Score values for each case from those taken from the preceeding 12
> months. The desired output column shows you the output that I am trying to
> calculate. In essesence I am using CSUM within a group defined by ID and
> also within a time frame of the last 12 months.
>
> I can work out the CSUM bit but not the lag for the preceeding 12 months.
>
> sort cases by ID (A) date (A).
> split file by ID.
> create cum = CSUM(Test_Score). *but only use values for the last 12 months
> in the csum
> split file off.
>
>      ID Test_Score Date   Desired_output  1 0.33 01.02.2007 0.33 1 0.50
> 20.03.2007 0.83 1 1.00 01.05.2008 1.00 1 1.00 15.05.2008 2.00 1 0.22
> 01.02.2008 2.22 1 0.22 20.03.2008 2.44 1 1.00 01.05.2008 3.44 1 1.00
> 15.05.2008 4.44 3 0.10 01.02.2006 0.10 3 0.33 20.03.2006 0.43 3 1.00
> 01.05.2006 1.43 3 0.98 15.05.2008 0.98 4 0.98 01.02.2008 0.98 4 0.77
> 02.03.2008 1.75 4 1.00 01.05.2010 1.00
>
> Help - this is a tricky one (for me at least)
>
> Many thanks,
>
> Jack Thomas
>
>

--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/Using-CSUM-and-LAG-within-groups-and-within-time-period-tp3292384p3292649.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to

[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD




Reply | Threaded
Open this post in threaded view
|

Re: Using CSUM and LAG within groups and within time period

Snuffy Dog
Oh bummer - when using the example data file for this problem the solution worked fine but when using my whole data set I encounter the following error:
 
Error: The formula references an undefined variable or could not be evaluated:
name 'Date' is not defined
 
Thoughts?

On Mon, Dec 6, 2010 at 8:22 AM, Snuffy Dog <[hidden email]> wrote:
Thank you Jon that worked perfectly.
 
I did work on a numpty solution to the duplicate date problem by creating a dequence counter within each group and then adding this as a time seconds value to the date variable so that each date was unquie by at least 1 second but your solution is so much better -
 
Thank you very much for your help, I am both grateful and impressed.
 
Jack

On Mon, Dec 6, 2010 at 8:17 AM, Jon K Peck <[hidden email]> wrote:
I didn't realize that you could have more than one instance on the same date.  This variation should deal with that.
begin program.
import spss, sys
nominalyear = 365*24*60*60

def csum(ID, Test_Score, Date):
  global history, previd
  if not hasattr (sys.modules[__name__], "previd") or ID != previd:
    previd = ID
    history = []
  history.append((Date, Test_Score))
  thecsum = 0
  for item in history:
    if Date - item[0] < nominalyear:
      thecsum += item[1]
  return thecsum
end program.

SPSSINC TRANS result=csum
/FORMULA csum(ID, Test_Score, Date).

p.s. I saw your note that you got programmability working.


Jon Peck
Senior Software Engineer, IBM
[hidden email]
312-651-3435




From:        Snuffy Dog <[hidden email]>
To:        [hidden email]
Date:        12/05/2010 02:06 PM
Subject:        Re: [SPSSX-L] Using CSUM and LAG within groups and within time              period
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




Ok I worked out my Python integration was faulty and this is now fixed. 
 
Jon your solution works well except that is does not sum scores within the same ID that have the same date.  Note for example that I have two dates ( 01-05-2008 and 15-05-2008) which are duplicated over two rows within the ID=1 group.  The program is only taking one score per date where I need it to take a score from each date.
 
 
On Mon, Dec 6, 2010 at 5:13 AM, Snuffy Dog <[hidden email]> wrote:
Thanks Jon thats perfect, although it is casuing me some problems.  I got it to run once successfully however when I attempted to run this a second time I got an error message

>Error # 6890.  Command name: begin program
>Configuration file spssdxcfg.ini is invalid.
>Execution of this command stops.
Configration file spssdxcfg.ini is invalid because the LIB_NAME is NULL.
 
SPSSINC TRANS result=csum
 
>Error # 1.  Command name: SPSSINC
>The first word in the line is not recognized as an SPSS Statistics command.

/FORMULA csum(ID, Test_Score, Date).

 
 
 
 
 
On Mon, Dec 6, 2010 at 3:04 AM, Jon K Peck <[hidden email]> wrote:
Here is a solution using programmability with no constraints on the number of records or long to wide conversion.
It uses the SPSSINC TRANS extension command available from SPSS Developer Central.

For simplicity, it is using a nominal 365-day year, but it could be made to use exact calendar differences with a little more work.


It wasn't clear to me whether to include a score from exactly 12 months ago.  In the test data, based on the desired output values, I excluded such a value, but by changing the "<" in the comparison to "<=" it would work the other way.


I can send the program off-line if the email messes up the indentation below.


data list free/ID(F2.0) Test_Score(F8.2) Date(EDATE10) Desired_output.(F8.2).

begin data

1         0.33         01.02.2007         0.33
1         0.50         20.03.2007         0.83

1         1.00         01.05.2008         1.00

1         1.00         15.05.2008         2.00

1         0.22         01.02.2008         2.22

1         0.22         20.03.2008         2.44

1         1.00         01.05.2008         3.44

1         1.00         15.05.2008         4.44

3         0.10         01.02.2006         0.10

3         0.33         20.03.2006         0.43

3         1.00         01.05.2006         1.43

3         0.98         15.05.2008         0.98

4         0.98         01.02.2008         0.98

4         0.77         02.03.2008         1.75

4         1.00         01.05.2010         1.00

end data.
exec.


* This program will be called for each case.

begin program.

import spss, sys

nominalyear = 365*24*60*60


def csum(ID, Test_Score, Date):

  global history, previd

  if not hasattr (sys.modules[__name__], "previd") or ID != previd:

    previd = ID

    history = set()

  history.add((Date, Test_Score))

  print history

  thecsum = 0

  for item in history:

    if Date - item[0] < nominalyear:

      thecsum += item[1]

  return thecsum

end program.


SPSSINC TRANS result=csum

/FORMULA csum(ID, Test_Score, Date).


HTH,



Jon Peck
Senior Software Engineer, IBM

[hidden email]
312-651-3435




From:        
Snuffy Dog <[hidden email]>
To:        
[hidden email]
Date:        
12/05/2010 05:20 AM
Subject:        
Re: [SPSSX-L] Using CSUM and LAG within groups and within time              period
Sent by:        
"SPSSX(r) Discussion" <[hidden email]>






Hi all,
 
Supplimental question.  David's solution works best however what I neglected to add was that the maximum number of rows per ID grouping variable may change and in some cases goes into the hundreds.  Currently the solution uses 8 vectors, is there an automatic way to work out the the maximum number of vectors required and write the program to dynamically reflect this?
 
Kind regard,
 
Jack

On Sun, Dec 5, 2010 at 12:22 PM, David Marso <
[hidden email]> wrote:
Hey Dawg,
Here is another way to do this.
Restructure the data from long to wide and then use vectors.
Finally restructure from wide to long.
Modify the index values appropriately (8 change to the largest number of
values per id).
BTW:  Several of your data values are out of order, so your "desire" column
is incorrect.
OR maybe you should have ID as 2.  Anyway, my solution should do the trick.
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
----------------------------------------------------------------------------------------------------
DATA LIST LIST/ ID (f1) score(f5.2) MyDate(date10) desire (f5.2).
BEGIN DATA
1     0.33     01.02.2007     0.33
1     0.50     20.03.2007     0.83
1     1.00     01.05.2008     1.00
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
1     0.22     20.03.2008     2.44
1     1.00     01.05.2008     3.44
1     1.00     15.05.2008     4.44
3     0.10     01.02.2006     0.10
3     0.33     20.03.2006     0.43
3     1.00     01.05.2006     1.43
3     0.98     15.05.2008     0.98
4     0.98     01.02.2008     0.98
4     0.77     02.03.2008     1.75
4     1.00     01.05.2010     1.00
END DATA.
SORT CASES BY id mydate.
IF $CASENUM=1 OR (ID NE lag(ID)) SEQ=1.
IF SYSMIS(SEQ) SEQ=LAG(SEQ)+1.
VECTOR DATES(8) SCORES(8).
COMPUTE DATES(SEQ)=mydate.
COMPUTE SCORES(SEQ)=score.
AGGREGATE OUTFILE *
       / BREAK ID
       / DATES1 TO DATES8 SCORES1 TO SCORES8
    =MAX(DATES1 TO DATES8 SCORES1 TO SCORES8).
FORMATS  DATES1 TO DATES8 (EDATE).

VECTOR DATES=DATES1 TO DATES8 .
VECTOR SCORES=SCORES1 TO SCORES8.
VECTOR CUMS(8).
LOOP #=1 TO 8.
+  COMPUTE CUMS(#)=0.
+  LOOP ##=# TO 1 BY -1.
+    IF CTIME.DAYS(DATES(#)-DATES(##)) LE 365.25 CUMS(#)=CUMS(#) +
SCORES(##).
+  END LOOP.
END LOOP.

LOOP #=1 to 8.
+  COMPUTE SCORE=SCORES(#).
+  COMPUTE MYDATE=DATES(#).
+  COMPUTE CUM=CUMS(#).
+  DO IF NOT MISSING(SCORE).
+    XSAVE OUTFILE "TMP.SAV" / KEEP ID SCORE MYDATE CUM.
+  END IF.
END LOOP.
EXE.
GET FILE "TMP.SAV" .

------------------------------------------------------
Another version using CASESTOVARS and VARSTOCASES.
The core is identical.
DATA LIST LIST/ ID (f1) score(f5.2) MyDate(date10) desire (f5.2).
BEGIN DATA
1     0.33     01.02.2007     0.33
1     0.50     20.03.2007     0.83
1     1.00     01.05.2008     1.00
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
1     0.22     20.03.2008     2.44
1     1.00     01.05.2008     3.44
1     1.00     15.05.2008     4.44
3     0.10     01.02.2006     0.10
3     0.33     20.03.2006     0.43
3     1.00     01.05.2006     1.43
3     0.98     15.05.2008     0.98
4     0.98     01.02.2008     0.98
4     0.77     02.03.2008     1.75
4     1.00     01.05.2010     1.00
END DATA.
SORT CASES BY id mydate.
IF $CASENUM=1 OR (ID NE lag(ID)) SEQ=1.
IF SYSMIS(SEQ) SEQ=LAG(SEQ)+1.
FORMATS SEQ(F2.0).
CASESTOVARS /ID = id /INDEX = seq /GROUPBY = VARIABLE .
VECTOR DATES=MYDATE.1 TO MYDATE.8 .
VECTOR SCORES=SCORE.1 TO SCORE.8.
VECTOR CUMS(8).
LOOP #=1 TO 8.
+  COMPUTE CUMS(#)=0.
+  LOOP ##=# TO 1 BY -1.
+    IF CTIME.DAYS(DATES(#)-DATES(##)) LE 365.25 CUMS(#)=CUMS(#) +
SCORES(##).
+  END LOOP.
END LOOP.
VARSTOCASES
 /MAKE mydate FROM mydate.1 TO mydate.8
 /MAKE Score  FROM score.1  TO score.8
 /MAKE cum    FROM cums1    TO cums8
 /KEEP =  id
 /NULL = DROP.
SELECT IF NOT MISSING(SCORE).
LIST.

ID     MYDATE  SCORE      CUM

 1 01-FEB-07     .33      .33
 1 20-MAR-07     .50      .83
 1 01-FEB-08     .22     1.05
 1 20-MAR-08     .22      .44
 1 01-MAY-08    1.00     1.44
 1 01-MAY-08    1.00     2.44
 1 15-MAY-08    1.00     3.44
 1 15-MAY-08    1.00     4.44
 3 01-FEB-06     .10      .10
 3 20-MAR-06     .33      .43
 3 01-MAY-06    1.00     1.43
 3 15-MAY-08     .98      .98
 4 01-FEB-08     .98      .98
 4 02-MAR-08     .77     1.75
 4 01-MAY-10    1.00     1.00


Number of cases read:  15    Number of cases listed:  15




Snuffy Dog wrote:
>
> SPSS friends,
>
> Given the following dataset (excluding the desired output column).
>
> I want to calculate the cumulitive sum for each row using the Test_score
> variable and I want to do this within each id.  That bit I have managed to
> solve, however to make things more complicated I only want to use those
> Test_Score values for each case from those taken from the preceeding 12
> months. The desired output column shows you the output that I am trying to
> calculate. In essesence I am using CSUM within a group defined by ID and
> also within a time frame of the last 12 months.
>
> I can work out the CSUM bit but not the lag for the preceeding 12 months.
>
> sort cases by ID (A) date (A).
> split file by ID.
> create cum = CSUM(Test_Score). *but only use values for the last 12 months
> in the csum
> split file off.
>
>      ID Test_Score Date   Desired_output  1 0.33 01.02.2007 0.33 1 0.50
> 20.03.2007 0.83 1 1.00 01.05.2008 1.00 1 1.00 15.05.2008 2.00 1 0.22
> 01.02.2008 2.22 1 0.22 20.03.2008 2.44 1 1.00 01.05.2008 3.44 1 1.00
> 15.05.2008 4.44 3 0.10 01.02.2006 0.10 3 0.33 20.03.2006 0.43 3 1.00
> 01.05.2006 1.43 3 0.98 15.05.2008 0.98 4 0.98 01.02.2008 0.98 4 0.77
> 02.03.2008 1.75 4 1.00 01.05.2010 1.00
>
> Help - this is a tricky one (for me at least)
>
> Many thanks,
>
> Jack Thomas
>
>

--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/Using-CSUM-and-LAG-within-groups-and-within-time-period-tp3292384p3292649.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to

[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD





Reply | Threaded
Open this post in threaded view
|

Re: Using CSUM and LAG within groups and within time period

Snuffy Dog
Really? Ok I made the variable names match on case and now it works
 
Thank You so Much!!!!!!!!!

On Mon, Dec 6, 2010 at 9:58 AM, Jon K Peck <[hidden email]> wrote:
Does the case match?  While variable name case does not matter in Statistics, it does matter here.

Jon Peck
Senior Software Engineer, IBM
[hidden email]
312-651-3435




From:        Snuffy Dog <[hidden email]>
To:        [hidden email]
Date:        12/05/2010 02:45 PM
Subject:        Re: [SPSSX-L] Using CSUM and LAG within groups and within time              period
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




Oh bummer - when using the example data file for this problem the solution worked fine but when using my whole data set I encounter the following error:
 
Error: The formula references an undefined variable or could not be evaluated:
name 'Date' is not defined

 
Thoughts?

On Mon, Dec 6, 2010 at 8:22 AM, Snuffy Dog <[hidden email]> wrote:
Thank you Jon that worked perfectly.
 
I did work on a numpty solution to the duplicate date problem by creating a dequence counter within each group and then adding this as a time seconds value to the date variable so that each date was unquie by at least 1 second but your solution is so much better -
 
Thank you very much for your help, I am both grateful and impressed.
 
Jack

On Mon, Dec 6, 2010 at 8:17 AM, Jon K Peck <[hidden email]> wrote:
I didn't realize that you could have more than one instance on the same date.  This variation should deal with that.
begin program.
import spss, sys

nominalyear = 365*24*60*60


def csum(ID, Test_Score, Date):

  global history, previd

  if not hasattr (sys.modules[__name__], "previd") or ID != previd:

    previd = ID

    history = []
  history.append((Date, Test_Score))

  thecsum = 0
  for item in history:

    if Date - item[0] < nominalyear:

      thecsum += item[1]

  return thecsum

end program.


SPSSINC TRANS result=csum

/FORMULA csum(ID, Test_Score, Date).


p.s. I saw your note that you got programmability working.


Jon Peck
Senior Software Engineer, IBM

[hidden email]
312-651-3435




From:        
Snuffy Dog <[hidden email]>
To:        
[hidden email]
Date:        12/05/2010 02:06 PM
Subject:        Re: [SPSSX-L] Using CSUM and LAG within groups and within time              period
Sent by:        
"SPSSX(r) Discussion" <[hidden email]>




Ok I worked out my Python integration was faulty and this is now fixed. 
 
Jon your solution works well except that is does not sum scores within the same ID that have the same date.  Note for example that I have two dates ( 01-05-2008 and 15-05-2008) which are duplicated over two rows within the ID=1 group.  The program is only taking one score per date where I need it to take a score from each date.
 
 
On Mon, Dec 6, 2010 at 5:13 AM, Snuffy Dog <
[hidden email]> wrote:
Thanks Jon thats perfect, although it is casuing me some problems.  I got it to run once successfully however when I attempted to run this a second time I got an error message

>Error # 6890.  Command name: begin program
>Configuration file spssdxcfg.ini is invalid.
>Execution of this command stops.
Configration file spssdxcfg.ini is invalid because the LIB_NAME is NULL.
 
SPSSINC TRANS result=csum
 
>Error # 1.  Command name: SPSSINC
>The first word in the line is not recognized as an SPSS Statistics command.
/FORMULA csum(ID, Test_Score, Date).

 
 
 
 
 
On Mon, Dec 6, 2010 at 3:04 AM, Jon K Peck <
[hidden email]> wrote:
Here is a solution using programmability with no constraints on the number of records or long to wide conversion.

It uses the SPSSINC TRANS extension command available from SPSS Developer Central.

For simplicity, it is using a nominal 365-day year, but it could be made to use exact calendar differences with a little more work.


It wasn't clear to me whether to include a score from exactly 12 months ago.  In the test data, based on the desired output values, I excluded such a value, but by changing the "<" in the comparison to "<=" it would work the other way.


I can send the program off-line if the email messes up the indentation below.


data list free/ID(F2.0) Test_Score(F8.2) Date(EDATE10) Desired_output.(F8.2).

begin data

1         0.33         01.02.2007         0.33

1         0.50         20.03.2007         0.83

1         1.00         01.05.2008         1.00

1         1.00         15.05.2008         2.00

1         0.22         01.02.2008         2.22

1         0.22         20.03.2008         2.44

1         1.00         01.05.2008         3.44

1         1.00         15.05.2008         4.44

3         0.10         01.02.2006         0.10

3         0.33         20.03.2006         0.43

3         1.00         01.05.2006         1.43

3         0.98         15.05.2008         0.98

4         0.98         01.02.2008         0.98

4         0.77         02.03.2008         1.75

4         1.00         01.05.2010         1.00

end data.

exec.


* This program will be called for each case.

begin program.

import spss, sys

nominalyear = 365*24*60*60


def csum(ID, Test_Score, Date):

  global history, previd

  if not hasattr (sys.modules[__name__], "previd") or ID != previd:

    previd = ID

    history = set()

  history.add((Date, Test_Score))

  print history

  thecsum = 0

  for item in history:

    if Date - item[0] < nominalyear:

      thecsum += item[1]

  return thecsum

end program.


SPSSINC TRANS result=csum

/FORMULA csum(ID, Test_Score, Date).


HTH,



Jon Peck
Senior Software Engineer, IBM

[hidden email]
312-651-3435




From:        
Snuffy Dog <[hidden email]>
To:        
[hidden email]
Date:        
12/05/2010 05:20 AM
Subject:        
Re: [SPSSX-L] Using CSUM and LAG within groups and within time              period
Sent by:        
"SPSSX(r) Discussion" <[hidden email]>






Hi all,
 
Supplimental question.  David's solution works best however what I neglected to add was that the maximum number of rows per ID grouping variable may change and in some cases goes into the hundreds.  Currently the solution uses 8 vectors, is there an automatic way to work out the the maximum number of vectors required and write the program to dynamically reflect this?
 
Kind regard,
 
Jack

On Sun, Dec 5, 2010 at 12:22 PM, David Marso <
[hidden email]> wrote:
Hey Dawg,
Here is another way to do this.
Restructure the data from long to wide and then use vectors.
Finally restructure from wide to long.
Modify the index values appropriately (8 change to the largest number of
values per id).
BTW:  Several of your data values are out of order, so your "desire" column
is incorrect.
OR maybe you should have ID as 2.  Anyway, my solution should do the trick.
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
----------------------------------------------------------------------------------------------------
DATA LIST LIST/ ID (f1) score(f5.2) MyDate(date10) desire (f5.2).
BEGIN DATA
1     0.33     01.02.2007     0.33
1     0.50     20.03.2007     0.83
1     1.00     01.05.2008     1.00
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
1     0.22     20.03.2008     2.44
1     1.00     01.05.2008     3.44
1     1.00     15.05.2008     4.44
3     0.10     01.02.2006     0.10
3     0.33     20.03.2006     0.43
3     1.00     01.05.2006     1.43
3     0.98     15.05.2008     0.98
4     0.98     01.02.2008     0.98
4     0.77     02.03.2008     1.75
4     1.00     01.05.2010     1.00
END DATA.
SORT CASES BY id mydate.
IF $CASENUM=1 OR (ID NE lag(ID)) SEQ=1.
IF SYSMIS(SEQ) SEQ=LAG(SEQ)+1.
VECTOR DATES(8) SCORES(8).
COMPUTE DATES(SEQ)=mydate.
COMPUTE SCORES(SEQ)=score.
AGGREGATE OUTFILE *
       / BREAK ID
       / DATES1 TO DATES8 SCORES1 TO SCORES8
    =MAX(DATES1 TO DATES8 SCORES1 TO SCORES8).
FORMATS  DATES1 TO DATES8 (EDATE).

VECTOR DATES=DATES1 TO DATES8 .
VECTOR SCORES=SCORES1 TO SCORES8.
VECTOR CUMS(8).
LOOP #=1 TO 8.
+  COMPUTE CUMS(#)=0.
+  LOOP ##=# TO 1 BY -1.
+    IF CTIME.DAYS(DATES(#)-DATES(##)) LE 365.25 CUMS(#)=CUMS(#) +
SCORES(##).
+  END LOOP.
END LOOP.

LOOP #=1 to 8.
+  COMPUTE SCORE=SCORES(#).
+  COMPUTE MYDATE=DATES(#).
+  COMPUTE CUM=CUMS(#).
+  DO IF NOT MISSING(SCORE).
+    XSAVE OUTFILE "TMP.SAV" / KEEP ID SCORE MYDATE CUM.
+  END IF.
END LOOP.
EXE.
GET FILE "TMP.SAV" .

------------------------------------------------------
Another version using CASESTOVARS and VARSTOCASES.
The core is identical.
DATA LIST LIST/ ID (f1) score(f5.2) MyDate(date10) desire (f5.2).
BEGIN DATA
1     0.33     01.02.2007     0.33
1     0.50     20.03.2007     0.83
1     1.00     01.05.2008     1.00
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
1     0.22     20.03.2008     2.44
1     1.00     01.05.2008     3.44
1     1.00     15.05.2008     4.44
3     0.10     01.02.2006     0.10
3     0.33     20.03.2006     0.43
3     1.00     01.05.2006     1.43
3     0.98     15.05.2008     0.98
4     0.98     01.02.2008     0.98
4     0.77     02.03.2008     1.75
4     1.00     01.05.2010     1.00
END DATA.
SORT CASES BY id mydate.
IF $CASENUM=1 OR (ID NE lag(ID)) SEQ=1.
IF SYSMIS(SEQ) SEQ=LAG(SEQ)+1.
FORMATS SEQ(F2.0).
CASESTOVARS /ID = id /INDEX = seq /GROUPBY = VARIABLE .
VECTOR DATES=MYDATE.1 TO MYDATE.8 .
VECTOR SCORES=SCORE.1 TO SCORE.8.
VECTOR CUMS(8).
LOOP #=1 TO 8.
+  COMPUTE CUMS(#)=0.
+  LOOP ##=# TO 1 BY -1.
+    IF CTIME.DAYS(DATES(#)-DATES(##)) LE 365.25 CUMS(#)=CUMS(#) +
SCORES(##).
+  END LOOP.
END LOOP.
VARSTOCASES
 /MAKE mydate FROM mydate.1 TO mydate.8
 /MAKE Score  FROM score.1  TO score.8
 /MAKE cum    FROM cums1    TO cums8
 /KEEP =  id
 /NULL = DROP.
SELECT IF NOT MISSING(SCORE).
LIST.

ID     MYDATE  SCORE      CUM

 1 01-FEB-07     .33      .33
 1 20-MAR-07     .50      .83
 1 01-FEB-08     .22     1.05
 1 20-MAR-08     .22      .44
 1 01-MAY-08    1.00     1.44
 1 01-MAY-08    1.00     2.44
 1 15-MAY-08    1.00     3.44
 1 15-MAY-08    1.00     4.44
 3 01-FEB-06     .10      .10
 3 20-MAR-06     .33      .43
 3 01-MAY-06    1.00     1.43
 3 15-MAY-08     .98      .98
 4 01-FEB-08     .98      .98
 4 02-MAR-08     .77     1.75
 4 01-MAY-10    1.00     1.00


Number of cases read:  15    Number of cases listed:  15




Snuffy Dog wrote:
>
> SPSS friends,
>
> Given the following dataset (excluding the desired output column).
>
> I want to calculate the cumulitive sum for each row using the Test_score
> variable and I want to do this within each id.  That bit I have managed to
> solve, however to make things more complicated I only want to use those
> Test_Score values for each case from those taken from the preceeding 12
> months. The desired output column shows you the output that I am trying to
> calculate. In essesence I am using CSUM within a group defined by ID and
> also within a time frame of the last 12 months.
>
> I can work out the CSUM bit but not the lag for the preceeding 12 months.
>
> sort cases by ID (A) date (A).
> split file by ID.
> create cum = CSUM(Test_Score). *but only use values for the last 12 months
> in the csum
> split file off.
>
>      ID Test_Score Date   Desired_output  1 0.33 01.02.2007 0.33 1 0.50
> 20.03.2007 0.83 1 1.00 01.05.2008 1.00 1 1.00 15.05.2008 2.00 1 0.22
> 01.02.2008 2.22 1 0.22 20.03.2008 2.44 1 1.00 01.05.2008 3.44 1 1.00
> 15.05.2008 4.44 3 0.10 01.02.2006 0.10 3 0.33 20.03.2006 0.43 3 1.00
> 01.05.2006 1.43 3 0.98 15.05.2008 0.98 4 0.98 01.02.2008 0.98 4 0.77
> 02.03.2008 1.75 4 1.00 01.05.2010 1.00
>
> Help - this is a tricky one (for me at least)
>
> Many thanks,
>
> Jack Thomas
>
>

--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/Using-CSUM-and-LAG-within-groups-and-within-time-period-tp3292384p3292649.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to

[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD





Reply | Threaded
Open this post in threaded view
|

Re: Using CSUM and LAG within groups and within time period

hillel vardi
In reply to this post by Snuffy Dog
Shalom

Here is another way to handle that challenge .

The program need only one pass of the data and loop only on the necessary  part of the vector .

Hillel Vardi
BGU


DATA LIST LIST/ ID (f1) score(f5.2) Date(date10) desire (f5.2).
BEGIN DATA
1     0.33     01.02.2007     0.33
1     0.50     20.03.2007     0.83
1     1.00     01.05.2008     1.00
1     1.00     15.05.2008     2.00
1     0.22     01.02.2008     2.22
1     0.22     20.03.2008     2.44
1     1.00     01.05.2008     3.44
1     1.00     15.05.2008     4.44
3     0.10     01.02.2006     0.10
3     0.33     20.03.2006     0.43
3     1.00     01.05.2006     1.43
3     0.98     15.05.2008     0.98
4     0.98     01.02.2008     0.98
4     0.77     02.03.2008     1.75
4     1.00     01.05.2010     1.00
END DATA.
SORT CASES BY  id(A) date(A).
add files      file = * / by id/first=first.
if                    first eq 1  seq=0.
if first eq 1 run=1 .
compute    seq=sum(seq,1).
leave      seq run .
VECTOR     #DATES(8) #SCORES(8).
compute    #DATES(seq) =date .
compute    #SCORES(seq) =SCORE .
loop       ii=run to seq .
do if        DATEDIFF(#DATES(seq) ,#DATES(ii),"days")    gt 365    .
compute    run=sum(run,1) .
else .
compute    sumscore=sum(sumscore,#SCORES(ii) ) .
end if .
end loop .
EXECUTE.
add files  file =* / keep=ID score Date  seq  sumscore   .



SPSS friends,

>
>  Given the following dataset (excluding the desired output column).
>
>  I want to calculate the cumulitive sum for each row using the Test_score
>  variable and I want to do this within each id.  That bit I have managed to
>  solve, however to make things more complicated I only want to use those
>  Test_Score values for each case from those taken from the preceeding 12
>  months. The desired output column shows you the output that I am trying to
>  calculate. In essesence I am using CSUM within a group defined by ID and
>  also within a time frame of the last 12 months.
>
>  I can work out the CSUM bit but not the lag for the preceeding 12 months.
>
>  sort cases by ID (A) date (A).
>  split file by ID.
>  create cum = CSUM(Test_Score). *but only use values for the last 12 months
>  in the csum
>  split file off.
>
>        ID Test_Score Date   Desired_output  1 0.33 01.02.2007 0.33 1 0.50
>  20.03.2007 0.83 1 1.00 01.05.2008 1.00 1 1.00 15.05.2008 2.00 1 0.22
>  01.02.2008 2.22 1 0.22 20.03.2008 2.44 1 1.00 01.05.2008 3.44 1 1.00
>  15.05.2008 4.44 3 0.10 01.02.2006 0.10 3 0.33 20.03.2006 0.43 3 1.00
>  01.05.2006 1.43 3 0.98 15.05.2008 0.98 4 0.98 01.02.2008 0.98 4 0.77
>  02.03.2008 1.75 4 1.00 01.05.2010 1.00
>
>  Help - this is a tricky one (for me at least)
>
>  Many thanks,
>
>  Jack Thomas
>
>

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Using CSUM and LAG within groups and within time period

Richard Ristow
In reply to this post by Snuffy Dog
At 01:47 PM 12/4/2010, Snuffy Dog wrote:

>I want to calculate the cumulative sum of the Test_score variable
>within each id.  However, I only want to use those Test_Score values
>for each case from those taken from the preceeding 12 months.
>
>Help - this is a tricky one (for me at least)

As you've seen, it's a fairly tricky one altogether, though you've
also received some good solutions.

As an exercise and illustration, here's a solution using long-form
data organization and native SPSS code. Like Jon Peck's Python
solution, it should have no constraints on the number of records per
ID. It performs the summing loop once -- I think the Python code runs
the loop for each case -- and so *might* have speed advantages on
very large files. In this solution, 'year' is taken as calendar year.
|-----------------------------|---------------------------|
|Output Created               |07-DEC-2010 22:13:02       |
|-----------------------------|---------------------------|
[TestData]
ID Test_Score       Date

  1      .33   01.02.2007
  1      .50   20.03.2007
  1      .22   01.02.2008
  1      .22   20.03.2008
  1     1.00   01.05.2008
  1     1.00   01.05.2008
  1     1.00   15.05.2008
  1     1.00   15.05.2008
  3      .10   01.02.2006
  3      .33   20.03.2006
  3     1.00   01.05.2006
  3      .98   15.05.2008
  4      .98   01.02.2008
  4      .77   02.03.2008
  4     1.00   01.05.2010

Number of cases read:  15    Number of cases listed:  15


DATASET COPY     CumNow.
DATASET ACTIVATE CumNow  WINDOW=Front.
split file by ID.
CREATE    ToDate = CSUM(Test_Score).

Create
|-----------------------------|---------------------------|
|Output Created               |07-DEC-2010 22:13:03       |
|-----------------------------|---------------------------|
[CumNow]
Created Series
|--|-|------|--------------------|----------|----------------|
|ID| |Series|Case Number of      |N of Valid|Creating        |
|  | |Name  |Non-Missing Values  |Cases     |Function        |
|  | |      |---------------|----|          |                |
|  | |      |First          |Last|          |                |
|--|-|------|--------------------|----------|----------------|
|1 |1|ToDate| 1             | 8  |8         |CSUM(Test_Score)|
|3 |1|ToDate| 9             |12  |4         |CSUM(Test_Score)|
|4 |1|ToDate|13             |15  |3         |CSUM(Test_Score)|
|--|-|------|--------------------|----------|----------------|


FORMATS   ToDate (F5.2).
VAR LABEL ToDate 'Total score thru current date'.
split file off.

DATASET COPY     CumPast.
DATASET ACTIVATE CumPast WINDOW=Front.
COMPUTE   Date  = DATESUM(Date, 1, "years").
VAR LABEL ToDate 'Total score thru one year ago'.
EXECUTE /* needed because of a glitch in dataset handling */.

ADD FILES
   /FILE=CumPast /RENAME=(ToDate   Test_Score
                         =To1yrAgo drop1     )
   /FILE=CumNow
   /BY ID Date
   /DROP=drop1
   /KEEP=ID Test_Score Date
            ToDate To1yrAgo.


IF (MISSING(To1yrAgo) AND ID EQ LAG(ID))
            To1yrAgo=LAG(To1yrAgo).
RECODE     To1yrAgo (MISSING=0).

EXECUTE /* so the LAG calculation is completed before the selection */.
SELECT IF NOT MISSING(Test_Score).

COMPUTE InPastYr = ToDate - To1yrAgo.
FORMATS InPastYr (F5.2).
LIST.

List
|-----------------------------|---------------------------|
|Output Created               |07-DEC-2010 22:13:06       |
|-----------------------------|---------------------------|
ID Test_Score       Date ToDate To1yrAgo InPastYr

  1      .33   01.02.2007    .33     .00      .33
  1      .50   20.03.2007    .83     .00      .83
  1      .22   01.02.2008   1.05     .33      .72
  1      .22   20.03.2008   1.27     .83      .44
  1     1.00   01.05.2008   2.27     .83     1.44
  1     1.00   01.05.2008   3.27     .83     2.44
  1     1.00   15.05.2008   4.27     .83     3.44
  1     1.00   15.05.2008   5.27     .83     4.44
  3      .10   01.02.2006    .10     .00      .10
  3      .33   20.03.2006    .43     .00      .43
  3     1.00   01.05.2006   1.43     .00     1.43
  3      .98   15.05.2008   2.41    1.43      .98
  4      .98   01.02.2008    .98     .00      .98
  4      .77   02.03.2008   1.75     .00     1.75
  4     1.00   01.05.2010   2.75    1.75     1.00

Number of cases read:  15    Number of cases listed:  15
=============================
APPENDIX: Test data, and code
=============================
DATA LIST LIST/
     ID Test_Score Date #Desired
     (F2,F5.2, EDATE10, F5.2).
BEGIN DATA
     1 0.33 01.02.2007 0.33
     1 0.50 20.03.2007 0.83
     1 1.00 01.05.2008 1.00
     1 1.00 15.05.2008 2.00
     1 0.22 01.02.2008 2.22
     1 0.22 20.03.2008 2.44
     1 1.00 01.05.2008 3.44
     1 1.00 15.05.2008 4.44
     3 0.10 01.02.2006 0.10
     3 0.33 20.03.2006 0.43
     3 1.00 01.05.2006 1.43
     3 0.98 15.05.2008 0.98
     4 0.98 01.02.2008 0.98
     4 0.77 02.03.2008 1.75
     4 1.00 01.05.2010 1.00
END DATA.
DATASET NAME TestData WINDOW=FRONT.
SORT CASES BY ID Date.
LIST.

DATASET COPY     CumNow.
DATASET ACTIVATE CumNow  WINDOW=Front.

split file by ID.
CREATE    ToDate = CSUM(Test_Score).
FORMATS   ToDate (F5.2).
VAR LABEL ToDate 'Total score thru current date'.
split file off.

DATASET COPY     CumPast.
DATASET ACTIVATE CumPast WINDOW=Front.
COMPUTE   Date  = DATESUM(Date, 1, "years").
VAR LABEL ToDate 'Total score thru one year ago'.
EXECUTE /* needed because of a glitch in dataset handling */.

ADD FILES
   /FILE=CumPast /RENAME=(ToDate   Test_Score
                         =To1yrAgo drop1     )
   /FILE=CumNow
   /BY ID Date
   /DROP=drop1
   /KEEP=ID Test_Score Date
            ToDate To1yrAgo.


IF (MISSING(To1yrAgo) AND ID EQ LAG(ID))
            To1yrAgo=LAG(To1yrAgo).
RECODE     To1yrAgo (MISSING=0).

EXECUTE /* so the LAG calculation is completed before the selection */.
SELECT IF NOT MISSING(Test_Score).

COMPUTE InPastYr = ToDate - To1yrAgo.
FORMATS InPastYr (F5.2).
LIST.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD