Computing variables based on multiple rows in a tall-format file

classic Classic list List threaded Threaded
24 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Re: Computing variables based on multiple rows in a tall-format file

David Marso
Administrator
"Another somewhat mysterious aspect of David's method is that LAG(SEQ) can be used on the right side of the COMPUTE statement that is bringing SEQ into existence as a variable.  That this can be done may not be intuitively obvious!  ;-) "

MAGIC!!!  Obvious is BORING!  I really don't expect novices to get it and could really care less if they do!


Bruce Weaver wrote
I mostly agree with Art on this one, partly because there are some aspects of David's one-liner that are a bit mysterious, especially to novices, I expect.  E.g., on Row 1, LAG(SEQ) and LAG(ID) both return SYSMIS.  The reason why David's code works is that those SYSMIS values are appearing as arguments to the SUM function, and SUM returns a valid result if at least one argument has a valid value--see the demo below.  

Another somewhat mysterious aspect of David's method is that LAG(SEQ) can be used on the right side of the COMPUTE statement that is bringing SEQ into existence as a variable.  That this can be done may not be intuitively obvious!  ;-)

Re Richard's DO-IF, I prefer to have both conditions that result in SEQ = 1 on a single conditional statement with an OR.  And perhaps two separate IF statements would be even easier for novice users to understand.  See below.  


NEW FILE.
DATASET CLOSE all.
DATA LIST FREE / ID.
BEGIN DATA
1 1 1 1 1 1 2 2 2 3 3 3 3 3 3 3 4 4 4 5 5 5 5
END DATA.

* David's one-liner.

COMPUTE SEQ1=SUM(1,LAG(SEQ1)*(LAG(ID) EQ ID)).

* Variation on Richard's DO-IF.
* The two conditions for which SEQ2 = 1 are combined with OR.

DO IF  ($CASENUM EQ 1) OR (ID NE LAG(ID)) .
.  COMPUTE SEQ2 = 1.
ELSE.
.  COMPUTE SEQ2 = LAG(SEQ2) + 1.
END IF.

* Two IF statements.
* This is somewhat less efficient in machine time than DO-IF,
* but possibly easier for the novice user to understand.

IF  ($CASENUM EQ 1) OR (ID NE LAG(ID)) SEQ3 = 1.
IF MISSING(SEQ3) SEQ3 = LAG(SEQ3) + 1.

FORMATS ID SEQ1 to SEQ3(f5.0).
LIST.

* The slightly mysterious thing about David's one-liner
* is that the result is not SYSMIS on Row 1, because
* on Row 1, LAG(SEQ1) and LAG(ID) both return SYSMIS.

COMPUTE LagID = LAG(ID).
COMPUTE LagSEQ1 = LAG(SEQ1).
FORMATS LagID LagSEQ1(F5.0).
LIST.

Output:
   ID  SEQ1  SEQ2  SEQ3 LagID LagSEQ1
 
    1     1     1     1     .      .
    1     2     2     2     1      1
    1     3     3     3     1      2
    1     4     4     4     1      3
etc.

* The reason David's code does not return SYSMIS as a
* result is that those missing values appear within
* the SUM function:  SUM will return a valid result
* if at lest one of the arguments is valid.  If you
* compute a sum using plus signs, on the other hand,
* all variables must be valid.

NEW FILE.
DATASET CLOSE all.
DATA LIST LIST / V1 to V3 (3f1).
BEGIN DATA
1 2 3
1 2 .
1 . .
. . .
END DATA.

COMPUTE SumViaSUM = SUM(V1 to V3).
COMPUTE SumViaPlus = V1 + V2 + V3.
FORMATS SumViaSUM SumViaPlus (F5.0).
LIST.

Output:
V1 V2 V3 SumViaSUM SumViaPlus
 
 1  2  3       6          6
 1  2  .       3          .
 1  .  .       1          .
 .  .  .       .          .
 
Number of cases read:  4    Number of cases listed:  4


David Marso wrote
Assumptions of my one liner.
Something is equal to something else or it isn't (true=1 false=0).
Multiplication of X by 0 = 0, by 1 = X
0+1 = 1, X+1 = X+1.
Anybody having a problem with this might ponder their choice of careers or majors?


Art Kendall wrote
<Don flame shields!>
        Your solution does require fewer lines and characters in
        the syntax, and most likely fewer internal operations.
        However, I deny your assertion that the one line of code is
        "reasonably intuitive".
        It would help beginners to see and understand Rich's solution
        and then see that it can be expressed more compactly. 
       
        Your solution is "reasonably intuitive" only for people with at
        least a moderate amount of experience at some computer
        languages.
       
        However, Rich's do repeat syntax is much easier for people to
        read.
        It is my impression that many posts from this list are from
        beginners.
        It is also my impression that people searching the archives are
        beginners.
       
       
        The easier it is for people to read and understand the syntax
        the easier it is to communicate the process to other people,
        e.g., classmates working as peer reviewers, on the job QA
        reviewers, triers-of-fact, archive users, process maintainers
        and up-daters.
       
        Soapbox: Efficiency in terms of cognitive load trumps saving
        storage space for syntax and often trumps saving small amounts
        of processing time.  Marginal labor cost is a greater
        consideration than marginal cost of machine resources.
       
        A rhetorical question:  How much computer time would it take to
        do David's solution compared to the computer time to do Rich's
        solution?  How many times of running the same code would it take
        to actually get a measurable difference in computer time, e.g.,
        10 seconds?
       
        <remove flame shield.>
       
     
      Art Kendall
Social Research Consultants
      On 12/5/2013 6:54 AM, David Marso [via SPSSX Discussion] wrote:
   
     I don't see why people use that ponderous DO IF
      $CASENUM = 1 .... ELSE blah blah blah (7 lines more or less)
      approach when a counter can be build with !!!!!
     
     
      COMPUTE SEQ=SUM(1,LAG(SEQ)*(LAG(ID) EQ ID)).
     
     
      DATA LIST FREE / ID.
     
      BEGIN DATA
     
      1 1 1 1 1 1 2 2 2 3 3 3 3 3 3 3 4 4 4 5 5 5 5
     
      END DATA.
     
      COMPUTE SEQ=SUM(1,LAG(SEQ)*(LAG(ID) EQ ID)).
     
      LIST.
     
     
            ID      SEQ
       
     
          1.00     1.00
          1.00     2.00
          1.00     3.00
          1.00     4.00
          1.00     5.00
          1.00     6.00
          2.00     1.00
          2.00     2.00
          2.00     3.00
          3.00     1.00
          3.00     2.00
          3.00     3.00
          3.00     4.00
          3.00     5.00
          3.00     6.00
          3.00     7.00
          4.00     1.00
          4.00     2.00
          4.00     3.00
          5.00     1.00
          5.00     2.00
          5.00     3.00
          5.00     4.00
       
     
       
     
      Number of cases read:  23    Number of cases listed:  23
     
     
       
          Michael
            Cohn wrote
          I think this
            solves all my problems! Many thanks to Andy and Richard for
            their help. I wasn't familiar with the LAG and AGGREGATE
            functions in SPSS but now I know what to start learning
            about.
           
           
            - Michael
           
           
            ----------------------------------
           
            Michael A. Cohn, PhD
           
            [hidden
              email]
            Osher Center for Integrative Medicine
           
            University of California, San Francisco
           
           
            From: "Richard Ristow [via SPSSX Discussion]" < [hidden
              email] <mailto: [hidden
              email] >>
           
            Date: Wednesday, December 4, 2013 at 14:56
           
            To: Michael Cohn < [hidden
              email] <mailto: [hidden
              email] >>
           
            Subject: Re: Computing variables based on multiple rows in a
            tall-format file
           
           
            At 03:23 PM 12/4/2013, Michael Cohn wrote:
           
           
            >My data file is currently in "tall" format (one row per
            measurement
           
            >per participant). Is there a way to generate variables
            in each
           
            >record that are based on information in that user's
            other records?
           
           
            You'll get many answers; the fact is, all of these are quite
            easy.
           
            The code I'm posting (not tested) assumes variables
           
           
            PcptID  -- Participant identifier
           
            Date    -- Date stamp
           
            Outcome -- Outcome value
           
           
            neither of the first two are ever missing; and your file is
            sorted in
           
            ascending order on the first two.
           
           
            >* A sequential index variable based on the record's
            datestamp (i.e.,
           
            >number a participant's responses 1, 2, 3... in
            chronological order).
           
           
            Various ways; here's a simple one, using transformation
            language:
           
           
            NUMERIC VisitSeq (F4).
           
           
            DO IF    $CASENUM EQ 1.
           
            .  COMPUTE VisitSeq = 1.
           
            ELSE IF  PcptID NE LAG(PcptID).
           
            .  COMPUTE VisitSeq = 1.
           
            ELSE.
           
            .  COMPUTE VisitSeq = LAG(VisitSeq) + 1.
           
            END IF.
           
           
           
            >* The length of time between the record and the earliest
            record for that
           
            >participant
           
            >* The difference between the outcome variable in the
            record and the minimum
           
            >value ever recorded for that participant.
           
           
            In both cases, start by putting the minimum value for the
            participant
           
            in every record for that participant, and then it's easy:
           
           
            AGGREGATE OUTFILE=* MODE=ADDVARIABLES
           
                /BREAK=PcptID
           
                /Earliest 'Date of earliest record for participant' =
            MIN(Date)
           
                /MinOut   'Lowest outcome value for participant'    =
            MIN(Outcome).
           
           
            >It's easy to do these using a spreadsheet or a python
            script ...
           
           
            Actually, I think it's probably easier in long ('tall') form
            in
           
            native SPSS than in either of those two.
           
           
            -Best of luck,
           
              Richard
           
           
            =====================
           
            To manage your subscription to SPSSX-L, send a message to
           
            [hidden
            email]</user/SendEmail.jtp?type=node&node=5723438&i=0>
            (not to SPSSX-L), with no body text except the
           
            command. To leave the list, send the command
           
            SIGNOFF SPSSX-L
           
            For a list of commands to manage subscriptions, send the
            command
           
            INFO REFCARD
           
           
           
            ________________________________
           
            If you reply to this email, your message will be added to
            the discussion below:
           
            http://spssx-discussion.1045642.n5.nabble.com/Computing-variables-based-on-multiple-rows-in-a-tall-format-file-tp5723431p5723438.html 
            To unsubscribe from Computing variables based on multiple
            rows in a tall-format file, click here< http://spssx-discussion.1045642.n5.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=5723431&code=Y29obm1Ab2NpbS51Y3NmLmVkdXw1NzIzNDMxfC03OTA5NjYwNDY=> .
           
            NAML< http://spssx-discussion.1045642.n5.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml> 
       
     
       Please
        reply to the list and not to my personal email.
       
        Those desiring my consulting or training services please feel
        free to email me.
       
        ---
       
        "Nolite dare sanctum canibus neque mittatis margaritas vestras
        ante porcos ne forte conculcent eas pedibus suis."
       
        Cum es damnatorum possederunt porcos iens ut salire off
        sanguinum cliff in abyssum?"
     
     
     
     
        If you reply to this email, your
          message will be added to the discussion below:
        http://spssx-discussion.1045642.n5.nabble.com/Computing-variables-based-on-multiple-rows-in-a-tall-format-file-tp5723431p5723446.html 
     
     
        To start a new topic under SPSSX Discussion, email
        [hidden email] 
        To unsubscribe from SPSSX Discussion, click
          here .
        NAML
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Computing variables based on multiple rows in a tall-format file

Maguin, Eugene
One of the things that David's solutions use is what I'd call "nesting" and "recursion". I'm curious if spss has limits on either of those elements. For nesting, I think the question is how many sets of parentheses can be accommodated and for recursion how many times can the same function be called.
Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of David Marso
Sent: Saturday, December 07, 2013 5:14 PM
To: [hidden email]
Subject: Re: Computing variables based on multiple rows in a tall-format file

"Another somewhat mysterious aspect of David's method is that LAG(SEQ) can be used on the right side of the COMPUTE statement that is bringing SEQ into existence as a variable.  That this can be done may not be intuitively obvious!  ;-) "

MAGIC!!!  Obvious is BORING!  I really don't expect novices to get it and could really care less if they do!



Bruce Weaver wrote
> I
/
> mostly
/

>  agree with Art on this one, partly because there are some aspects of
> David's one-liner that are a bit mysterious, especially to novices, I
> expect.  E.g., on Row 1, LAG(SEQ) and LAG(ID) both return SYSMIS.  The
> reason why David's code works is that those SYSMIS values are
> appearing as arguments to the SUM function, and SUM returns a valid
> result if at least one argument has a valid value--see the demo below.
>
> Another somewhat mysterious aspect of David's method is that LAG(SEQ)
> can be used on the right side of the COMPUTE statement that is
> bringing SEQ into existence as a variable.  That this can be done may
> not be intuitively obvious!  ;-)
>
> Re Richard's DO-IF, I prefer to have both conditions that result in
> SEQ =
> 1 on a single conditional statement with an OR.  And perhaps two
> separate IF statements would be even easier for novice users to
> understand.  See below.
>
>
> NEW FILE.
> DATASET CLOSE all.
> DATA LIST FREE / ID.
> BEGIN DATA
> 1 1 1 1 1 1 2 2 2 3 3 3 3 3 3 3 4 4 4 5 5 5 5 END DATA.
>
> * David's one-liner.
>
> COMPUTE SEQ1=SUM(1,LAG(SEQ1)*(LAG(ID) EQ ID)).
>
> * Variation on Richard's DO-IF.
> * The two conditions for which SEQ2 = 1 are combined with OR.
>
> DO IF  ($CASENUM EQ 1) OR (ID NE LAG(ID)) .
> .  COMPUTE SEQ2 = 1.
> ELSE.
> .  COMPUTE SEQ2 = LAG(SEQ2) + 1.
> END IF.
>
> * Two IF statements.
> * This is somewhat less efficient in machine time than DO-IF,
> * but possibly easier for the novice user to understand.
>
> IF  ($CASENUM EQ 1) OR (ID NE LAG(ID)) SEQ3 = 1.
> IF MISSING(SEQ3) SEQ3 = LAG(SEQ3) + 1.
>
> FORMATS ID SEQ1 to SEQ3(f5.0).
> LIST.
>
> * The slightly mysterious thing about David's one-liner
> * is that the result is not SYSMIS on Row 1, because
> * on Row 1, LAG(SEQ1) and LAG(ID) both return SYSMIS.
>
> COMPUTE LagID = LAG(ID).
> COMPUTE LagSEQ1 = LAG(SEQ1).
> FORMATS LagID LagSEQ1(F5.0).
> LIST.
>
> Output:
>    ID  SEQ1  SEQ2  SEQ3 LagID LagSEQ1
>
>     1     1     1     1     .      .
>     1     2     2     2     1      1
>     1     3     3     3     1      2
>     1     4     4     4     1      3
> etc.
>
> * The reason David's code does not return SYSMIS as a
> * result is that those missing values appear within
> * the SUM function:  SUM will return a valid result
> * if at lest one of the arguments is valid.  If you
> * compute a sum using plus signs, on the other hand,
> * all variables must be valid.
>
> NEW FILE.
> DATASET CLOSE all.
> DATA LIST LIST / V1 to V3 (3f1).
> BEGIN DATA
> 1 2 3
> 1 2 .
> 1 . .
> . . .
> END DATA.
>
> COMPUTE SumViaSUM = SUM(V1 to V3).
> COMPUTE SumViaPlus = V1 + V2 + V3.
> FORMATS SumViaSUM SumViaPlus (F5.0).
> LIST.
>
> Output:
> V1 V2 V3 SumViaSUM SumViaPlus
>
>  1  2  3       6          6
>  1  2  .       3          .
>  1  .  .       1          .
>  .  .  .       .          .
>
> Number of cases read:  4    Number of cases listed:  4
>
> David Marso wrote
>> Assumptions of my one liner.
>> Something is equal to something else or it isn't (true=1 false=0).
>> Multiplication of X by 0 = 0, by 1 = X
>> 0+1 = 1, X+1 = X+1.
>> Anybody having a problem with this might ponder their choice of
>> careers or majors?
>>
>> Art Kendall wrote
>>> &lt;Don flame shields!&gt;
>>>         Your solution does require fewer lines and characters in
>>>         the syntax, and most likely fewer internal operations.
>>>         However, I deny your assertion that the one line of code is
>>>         "reasonably intuitive".
>>>         It would help beginners to see and understand Rich's solution
>>>         and then see that it can be expressed more compactly.&nbsp;
>>>
>>>         Your solution is "reasonably intuitive" only for people with at
>>>         least a moderate amount of experience at some computer
>>>         languages.
>>>
>>>         However, Rich's do repeat syntax is much easier for people to
>>>         read.
>>>         It is my impression that many posts from this list are from
>>>         beginners.
>>>         It is also my impression that people searching the archives are
>>>         beginners.
>>>
>>>
>>>         The easier it is for people to read and understand the syntax
>>>         the easier it is to communicate the process to other people,
>>>         e.g., classmates working as peer reviewers, on the job QA
>>>         reviewers, triers-of-fact, archive users, process maintainers
>>>         and up-daters.
>>>
>>>         Soapbox: Efficiency in terms of cognitive load trumps saving
>>>         storage space for syntax and often trumps saving small amounts
>>>         of processing time.&nbsp; Marginal labor cost is a greater
>>>         consideration than marginal cost of machine resources.
>>>
>>>         A rhetorical question:&nbsp; How much computer time would it
>>> take to
>>>         do David's solution compared to the computer time to do Rich's
>>>         solution?&nbsp; How many times of running the same code
>>> would it take
>>>         to actually get a measurable difference in computer time, e.g.,
>>>         10 seconds?
>>>
>>>         &lt;remove flame shield.&gt;
>>>
>>>
>>>       Art Kendall
>>> Social Research Consultants
>>>       On 12/5/2013 6:54 AM, David Marso [via SPSSX Discussion] wrote:
>>>
>>>      I don't see why people use that ponderous DO IF
>>>       $CASENUM = 1 .... ELSE blah blah blah (7 lines more or less)
>>>       approach when a counter can be build with !!!!!
>>>
>>>
>>>       COMPUTE SEQ=SUM(1,LAG(SEQ)*(LAG(ID) EQ ID)).
>>>
>>>
>>>       DATA LIST FREE / ID.
>>>
>>>       BEGIN DATA
>>>
>>>       1 1 1 1 1 1 2 2 2 3 3 3 3 3 3 3 4 4 4 5 5 5 5
>>>
>>>       END DATA.
>>>
>>>       COMPUTE SEQ=SUM(1,LAG(SEQ)*(LAG(ID) EQ ID)).
>>>
>>>       LIST.
>>>
>>>
>>>       &nbsp; &nbsp; &nbsp; ID &nbsp; &nbsp; &nbsp;SEQ
>>>       &nbsp;
>>>
>>>       &nbsp; &nbsp; 1.00 &nbsp; &nbsp; 1.00
>>>       &nbsp; &nbsp; 1.00 &nbsp; &nbsp; 2.00
>>>       &nbsp; &nbsp; 1.00 &nbsp; &nbsp; 3.00
>>>       &nbsp; &nbsp; 1.00 &nbsp; &nbsp; 4.00
>>>       &nbsp; &nbsp; 1.00 &nbsp; &nbsp; 5.00
>>>       &nbsp; &nbsp; 1.00 &nbsp; &nbsp; 6.00
>>>       &nbsp; &nbsp; 2.00 &nbsp; &nbsp; 1.00
>>>       &nbsp; &nbsp; 2.00 &nbsp; &nbsp; 2.00
>>>       &nbsp; &nbsp; 2.00 &nbsp; &nbsp; 3.00
>>>       &nbsp; &nbsp; 3.00 &nbsp; &nbsp; 1.00
>>>       &nbsp; &nbsp; 3.00 &nbsp; &nbsp; 2.00
>>>       &nbsp; &nbsp; 3.00 &nbsp; &nbsp; 3.00
>>>       &nbsp; &nbsp; 3.00 &nbsp; &nbsp; 4.00
>>>       &nbsp; &nbsp; 3.00 &nbsp; &nbsp; 5.00
>>>       &nbsp; &nbsp; 3.00 &nbsp; &nbsp; 6.00
>>>       &nbsp; &nbsp; 3.00 &nbsp; &nbsp; 7.00
>>>       &nbsp; &nbsp; 4.00 &nbsp; &nbsp; 1.00
>>>       &nbsp; &nbsp; 4.00 &nbsp; &nbsp; 2.00
>>>       &nbsp; &nbsp; 4.00 &nbsp; &nbsp; 3.00
>>>       &nbsp; &nbsp; 5.00 &nbsp; &nbsp; 1.00
>>>       &nbsp; &nbsp; 5.00 &nbsp; &nbsp; 2.00
>>>       &nbsp; &nbsp; 5.00 &nbsp; &nbsp; 3.00
>>>       &nbsp; &nbsp; 5.00 &nbsp; &nbsp; 4.00
>>>       &nbsp;
>>>
>>>       &nbsp;
>>>
>>>       Number of cases read: &nbsp;23 &nbsp; &nbsp;Number of cases
>>> listed: &nbsp;23
>>>
>>>
>>>
>>>           Michael
>>>             Cohn wrote
>>>           I think this
>>>             solves all my problems! Many thanks to Andy and Richard for
>>>             their help. I wasn't familiar with the LAG and AGGREGATE
>>>             functions in SPSS but now I know what to start learning
>>>             about.
>>>
>>>
>>>             - Michael
>>>
>>>
>>>             ----------------------------------
>>>
>>>             Michael A. Cohn, PhD
>>>
>>>             [hidden
>>>               email]
>>>             Osher Center for Integrative Medicine
>>>
>>>             University of California, San Francisco
>>>
>>>
>>>             From: "Richard Ristow [via SPSSX Discussion]" &lt; [hidden
>>>               email] &lt;mailto: [hidden
>>>               email] &gt;&gt;
>>>
>>>             Date: Wednesday, December 4, 2013 at 14:56
>>>
>>>             To: Michael Cohn &lt; [hidden
>>>               email] &lt;mailto: [hidden
>>>               email] &gt;&gt;
>>>
>>>             Subject: Re: Computing variables based on multiple rows in a
>>>             tall-format file
>>>
>>>
>>>             At 03:23 PM 12/4/2013, Michael Cohn wrote:
>>>
>>>
>>>             &gt;My data file is currently in "tall" format (one row per
>>>             measurement
>>>
>>>             &gt;per participant). Is there a way to generate variables
>>>             in each
>>>
>>>             &gt;record that are based on information in that user's
>>>             other records?
>>>
>>>
>>>             You'll get many answers; the fact is, all of these are quite
>>>             easy.
>>>
>>>             The code I'm posting (not tested) assumes variables
>>>
>>>
>>>             PcptID &nbsp;-- Participant identifier
>>>
>>>             Date &nbsp; &nbsp;-- Date stamp
>>>
>>>             Outcome -- Outcome value
>>>
>>>
>>>             neither of the first two are ever missing; and your file is
>>>             sorted in
>>>
>>>             ascending order on the first two.
>>>
>>>
>>>             &gt;* A sequential index variable based on the record's
>>>             datestamp (i.e.,
>>>
>>>             &gt;number a participant's responses 1, 2, 3... in
>>>             chronological order).
>>>
>>>
>>>             Various ways; here's a simple one, using transformation
>>>             language:
>>>
>>>
>>>             NUMERIC VisitSeq (F4).
>>>
>>>
>>>             DO IF &nbsp; &nbsp;$CASENUM EQ 1.
>>>
>>>             . &nbsp;COMPUTE VisitSeq = 1.
>>>
>>>             ELSE IF &nbsp;PcptID NE LAG(PcptID).
>>>
>>>             . &nbsp;COMPUTE VisitSeq = 1.
>>>
>>>             ELSE.
>>>
>>>             . &nbsp;COMPUTE VisitSeq = LAG(VisitSeq) + 1.
>>>
>>>             END IF.
>>>
>>>
>>>
>>>             &gt;* The length of time between the record and the earliest
>>>             record for that
>>>
>>>             &gt;participant
>>>
>>>             &gt;* The difference between the outcome variable in the
>>>             record and the minimum
>>>
>>>             &gt;value ever recorded for that participant.
>>>
>>>
>>>             In both cases, start by putting the minimum value for the
>>>             participant
>>>
>>>             in every record for that participant, and then it's easy:
>>>
>>>
>>>             AGGREGATE OUTFILE=* MODE=ADDVARIABLES
>>>
>>>             &nbsp; &nbsp; /BREAK=PcptID
>>>
>>>             &nbsp; &nbsp; /Earliest 'Date of earliest record for
>>> participant' =
>>>             MIN(Date)
>>>
>>>             &nbsp; &nbsp; /MinOut &nbsp; 'Lowest outcome value for
>>> participant' &nbsp; &nbsp;=
>>>             MIN(Outcome).
>>>
>>>
>>>             &gt;It's easy to do these using a spreadsheet or a python
>>>             script ...
>>>
>>>
>>>             Actually, I think it's probably easier in long ('tall') form
>>>             in
>>>
>>>             native SPSS than in either of those two.
>>>
>>>
>>>             -Best of luck,
>>>
>>>             &nbsp; Richard
>>>
>>>
>>>             =====================
>>>
>>>             To manage your subscription to SPSSX-L, send a message
>>> to
>>>
>>>             [hidden
>>>
>>> email]&lt;/user/SendEmail.jtp?type=node&amp;node=5723438&amp;i=0&gt;
>>>             (not to SPSSX-L), with no body text except the
>>>
>>>             command. To leave the list, send the command
>>>
>>>             SIGNOFF SPSSX-L
>>>
>>>             For a list of commands to manage subscriptions, send the
>>>             command
>>>
>>>             INFO REFCARD
>>>
>>>
>>>
>>>             ________________________________
>>>
>>>             If you reply to this email, your message will be added to
>>>             the discussion below:
>>>
>>>
>>> http://spssx-discussion.1045642.n5.nabble.com/Computing-variables-based-on-multiple-rows-in-a-tall-format-file-tp5723431p5723438.html
>>>             To unsubscribe from Computing variables based on multiple
>>>             rows in a tall-format file, click here&lt;
>>> http://spssx-discussion.1045642.n5.nabble.com/template/NamlServlet.j
>>> tp?macro=unsubscribe_by_code&amp;node=5723431&amp;code=Y29obm1Ab2Npb
>>> S51Y3NmLmVkdXw1NzIzNDMxfC03OTA5NjYwNDY=&gt;
>>> .
>>>
>>>             NAML&lt;
>>> http://spssx-discussion.1045642.n5.nabble.com/template/NamlServlet.j
>>> tp?macro=macro_viewer&amp;id=instant_html%21nabble%3Aemail.naml&amp;
>>> base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.
>>> NabbleNamespace-nabble.view.web.template.NodeNamespace&amp;breadcrum
>>> bs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%
>>> 3Aemail.naml-send_instant_email%21nabble%3Aemail.naml&gt;
>>>
>>>
>>>        Please
>>>         reply to the list and not to my personal email.
>>>
>>>         Those desiring my consulting or training services please feel
>>>         free to email me.
>>>
>>>         ---
>>>
>>>         "Nolite dare sanctum canibus neque mittatis margaritas vestras
>>>         ante porcos ne forte conculcent eas pedibus suis."
>>>
>>>         Cum es damnatorum possederunt porcos iens ut salire off
>>>         sanguinum cliff in abyssum?"
>>>
>>>
>>>
>>>
>>>         If you reply to this email, your
>>>           message will be added to the discussion below:
>>>
>>> http://spssx-discussion.1045642.n5.nabble.com/Computing-variables-ba
>>> sed-on-multiple-rows-in-a-tall-format-file-tp5723431p5723446.html
>>>
>>>
>>>         To start a new topic under SPSSX Discussion, email
>>>

>>> ml-node+s1045642n1068821h68@.nabble

>>>
>>>         To unsubscribe from SPSSX Discussion, click
>>>           here .
>>>         NAML





-----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Computing-variables-based-on-multiple-rows-in-a-tall-format-file-tp5723431p5723500.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Computing variables based on multiple rows in a tall-format file

David Marso
Administrator
I'm pretty sure the notion of recursion doesn't apply.
I can't think of where recursion would fit into standard SPSS syntax since aside from python SPSS doesn't really have functions except in certain uses of macro (but then I think of recursion as a primarily mathematical concept and MACRO doesn't know anything about math).  I would be very interested in seeing a recursive method implemented in SPSS particularly in a macro.
An example of recursion is say the Fibonacci sequence 1 1 2 3 5 8 13 21 ..
such that Fib(n)=Fib(n-1)+Fib(n-2)
Fib(1)=Fib(2)=1
or Factorial(N)=Factorial(N-1)*N

I do however use nested functions to a great degree in my coding.
I have yet to encounter any sort of misbehavior in standard syntax.
MATRIX does have a nesting limit of 6 .

If you blow past that you are rewarded with the following useless directive (useless because technical support will likely not be of any help whatsoever since the message is in all likelihood on the money ;-)
>Error # 12540
>Parser stack overflow.  The possible reason: nesting level is more than 6.  A
>program error has occurred: Please note the circumstances under which this
>error occurred, attempting to replicate it if possible, and then notify
>Technical Support.
>Execution of this command stops.
 
The MATRIX statement skipped.
This resulted from the following monstrosity which I doubt I would ever push
MATRIX.
COMPUTE X=T({1,1,1,1,1;1,2,3,4,5;6,7,8,9,11}).
COMPUTE Boogie1=DET(DET(INV(DET(INV(T(X)*X))*T(X)*X))*T(X)*X).
PRINT Boogie1.
COMPUTE Boogie2=DET(DET(DET(INV(DET(INV(T(X)*X))*T(X)*X))*T(X)*X)*T(X)*X).
PRINT Boogie2.
END MATRIX.


Maguin, Eugene wrote
One of the things that David's solutions use is what I'd call "nesting" and "recursion". I'm curious if spss has limits on either of those elements. For nesting, I think the question is how many sets of parentheses can be accommodated and for recursion how many times can the same function be called.
Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of David Marso
Sent: Saturday, December 07, 2013 5:14 PM
To: [hidden email]
Subject: Re: Computing variables based on multiple rows in a tall-format file

"Another somewhat mysterious aspect of David's method is that LAG(SEQ) can be used on the right side of the COMPUTE statement that is bringing SEQ into existence as a variable.  That this can be done may not be intuitively obvious!  ;-) "

MAGIC!!!  Obvious is BORING!  I really don't expect novices to get it and could really care less if they do!



Bruce Weaver wrote
> I
/
> mostly
/
>  agree with Art on this one, partly because there are some aspects of
> David's one-liner that are a bit mysterious, especially to novices, I
> expect.  E.g., on Row 1, LAG(SEQ) and LAG(ID) both return SYSMIS.  The
> reason why David's code works is that those SYSMIS values are
> appearing as arguments to the SUM function, and SUM returns a valid
> result if at least one argument has a valid value--see the demo below.
>
> Another somewhat mysterious aspect of David's method is that LAG(SEQ)
> can be used on the right side of the COMPUTE statement that is
> bringing SEQ into existence as a variable.  That this can be done may
> not be intuitively obvious!  ;-)
>
> Re Richard's DO-IF, I prefer to have both conditions that result in
> SEQ =
> 1 on a single conditional statement with an OR.  And perhaps two
> separate IF statements would be even easier for novice users to
> understand.  See below.
>
>
> NEW FILE.
> DATASET CLOSE all.
> DATA LIST FREE / ID.
> BEGIN DATA
> 1 1 1 1 1 1 2 2 2 3 3 3 3 3 3 3 4 4 4 5 5 5 5 END DATA.
>
> * David's one-liner.
>
> COMPUTE SEQ1=SUM(1,LAG(SEQ1)*(LAG(ID) EQ ID)).
>
> * Variation on Richard's DO-IF.
> * The two conditions for which SEQ2 = 1 are combined with OR.
>
> DO IF  ($CASENUM EQ 1) OR (ID NE LAG(ID)) .
> .  COMPUTE SEQ2 = 1.
> ELSE.
> .  COMPUTE SEQ2 = LAG(SEQ2) + 1.
> END IF.
>
> * Two IF statements.
> * This is somewhat less efficient in machine time than DO-IF,
> * but possibly easier for the novice user to understand.
>
> IF  ($CASENUM EQ 1) OR (ID NE LAG(ID)) SEQ3 = 1.
> IF MISSING(SEQ3) SEQ3 = LAG(SEQ3) + 1.
>
> FORMATS ID SEQ1 to SEQ3(f5.0).
> LIST.
>
> * The slightly mysterious thing about David's one-liner
> * is that the result is not SYSMIS on Row 1, because
> * on Row 1, LAG(SEQ1) and LAG(ID) both return SYSMIS.
>
> COMPUTE LagID = LAG(ID).
> COMPUTE LagSEQ1 = LAG(SEQ1).
> FORMATS LagID LagSEQ1(F5.0).
> LIST.
>
> Output:
>    ID  SEQ1  SEQ2  SEQ3 LagID LagSEQ1
>
>     1     1     1     1     .      .
>     1     2     2     2     1      1
>     1     3     3     3     1      2
>     1     4     4     4     1      3
> etc.
>
> * The reason David's code does not return SYSMIS as a
> * result is that those missing values appear within
> * the SUM function:  SUM will return a valid result
> * if at lest one of the arguments is valid.  If you
> * compute a sum using plus signs, on the other hand,
> * all variables must be valid.
>
> NEW FILE.
> DATASET CLOSE all.
> DATA LIST LIST / V1 to V3 (3f1).
> BEGIN DATA
> 1 2 3
> 1 2 .
> 1 . .
> . . .
> END DATA.
>
> COMPUTE SumViaSUM = SUM(V1 to V3).
> COMPUTE SumViaPlus = V1 + V2 + V3.
> FORMATS SumViaSUM SumViaPlus (F5.0).
> LIST.
>
> Output:
> V1 V2 V3 SumViaSUM SumViaPlus
>
>  1  2  3       6          6
>  1  2  .       3          .
>  1  .  .       1          .
>  .  .  .       .          .
>
> Number of cases read:  4    Number of cases listed:  4
>
> David Marso wrote
>> Assumptions of my one liner.
>> Something is equal to something else or it isn't (true=1 false=0).
>> Multiplication of X by 0 = 0, by 1 = X
>> 0+1 = 1, X+1 = X+1.
>> Anybody having a problem with this might ponder their choice of
>> careers or majors?
>>
>> Art Kendall wrote
>>> <Don flame shields!>
>>>         Your solution does require fewer lines and characters in
>>>         the syntax, and most likely fewer internal operations.
>>>         However, I deny your assertion that the one line of code is
>>>         "reasonably intuitive".
>>>         It would help beginners to see and understand Rich's solution
>>>         and then see that it can be expressed more compactly. 
>>>
>>>         Your solution is "reasonably intuitive" only for people with at
>>>         least a moderate amount of experience at some computer
>>>         languages.
>>>
>>>         However, Rich's do repeat syntax is much easier for people to
>>>         read.
>>>         It is my impression that many posts from this list are from
>>>         beginners.
>>>         It is also my impression that people searching the archives are
>>>         beginners.
>>>
>>>
>>>         The easier it is for people to read and understand the syntax
>>>         the easier it is to communicate the process to other people,
>>>         e.g., classmates working as peer reviewers, on the job QA
>>>         reviewers, triers-of-fact, archive users, process maintainers
>>>         and up-daters.
>>>
>>>         Soapbox: Efficiency in terms of cognitive load trumps saving
>>>         storage space for syntax and often trumps saving small amounts
>>>         of processing time.  Marginal labor cost is a greater
>>>         consideration than marginal cost of machine resources.
>>>
>>>         A rhetorical question:  How much computer time would it
>>> take to
>>>         do David's solution compared to the computer time to do Rich's
>>>         solution?  How many times of running the same code
>>> would it take
>>>         to actually get a measurable difference in computer time, e.g.,
>>>         10 seconds?
>>>
>>>         <remove flame shield.>
>>>
>>>
>>>       Art Kendall
>>> Social Research Consultants
>>>       On 12/5/2013 6:54 AM, David Marso [via SPSSX Discussion] wrote:
>>>
>>>      I don't see why people use that ponderous DO IF
>>>       $CASENUM = 1 .... ELSE blah blah blah (7 lines more or less)
>>>       approach when a counter can be build with !!!!!
>>>
>>>
>>>       COMPUTE SEQ=SUM(1,LAG(SEQ)*(LAG(ID) EQ ID)).
>>>
>>>
>>>       DATA LIST FREE / ID.
>>>
>>>       BEGIN DATA
>>>
>>>       1 1 1 1 1 1 2 2 2 3 3 3 3 3 3 3 4 4 4 5 5 5 5
>>>
>>>       END DATA.
>>>
>>>       COMPUTE SEQ=SUM(1,LAG(SEQ)*(LAG(ID) EQ ID)).
>>>
>>>       LIST.
>>>
>>>
>>>             ID      SEQ
>>>        
>>>
>>>           1.00     1.00
>>>           1.00     2.00
>>>           1.00     3.00
>>>           1.00     4.00
>>>           1.00     5.00
>>>           1.00     6.00
>>>           2.00     1.00
>>>           2.00     2.00
>>>           2.00     3.00
>>>           3.00     1.00
>>>           3.00     2.00
>>>           3.00     3.00
>>>           3.00     4.00
>>>           3.00     5.00
>>>           3.00     6.00
>>>           3.00     7.00
>>>           4.00     1.00
>>>           4.00     2.00
>>>           4.00     3.00
>>>           5.00     1.00
>>>           5.00     2.00
>>>           5.00     3.00
>>>           5.00     4.00
>>>        
>>>
>>>        
>>>
>>>       Number of cases read:  23    Number of cases
>>> listed:  23
>>>
>>>
>>>
>>>           Michael
>>>             Cohn wrote
>>>           I think this
>>>             solves all my problems! Many thanks to Andy and Richard for
>>>             their help. I wasn't familiar with the LAG and AGGREGATE
>>>             functions in SPSS but now I know what to start learning
>>>             about.
>>>
>>>
>>>             - Michael
>>>
>>>
>>>             ----------------------------------
>>>
>>>             Michael A. Cohn, PhD
>>>
>>>             [hidden
>>>               email]
>>>             Osher Center for Integrative Medicine
>>>
>>>             University of California, San Francisco
>>>
>>>
>>>             From: "Richard Ristow [via SPSSX Discussion]" < [hidden
>>>               email] <mailto: [hidden
>>>               email] >>
>>>
>>>             Date: Wednesday, December 4, 2013 at 14:56
>>>
>>>             To: Michael Cohn < [hidden
>>>               email] <mailto: [hidden
>>>               email] >>
>>>
>>>             Subject: Re: Computing variables based on multiple rows in a
>>>             tall-format file
>>>
>>>
>>>             At 03:23 PM 12/4/2013, Michael Cohn wrote:
>>>
>>>
>>>             >My data file is currently in "tall" format (one row per
>>>             measurement
>>>
>>>             >per participant). Is there a way to generate variables
>>>             in each
>>>
>>>             >record that are based on information in that user's
>>>             other records?
>>>
>>>
>>>             You'll get many answers; the fact is, all of these are quite
>>>             easy.
>>>
>>>             The code I'm posting (not tested) assumes variables
>>>
>>>
>>>             PcptID  -- Participant identifier
>>>
>>>             Date    -- Date stamp
>>>
>>>             Outcome -- Outcome value
>>>
>>>
>>>             neither of the first two are ever missing; and your file is
>>>             sorted in
>>>
>>>             ascending order on the first two.
>>>
>>>
>>>             >* A sequential index variable based on the record's
>>>             datestamp (i.e.,
>>>
>>>             >number a participant's responses 1, 2, 3... in
>>>             chronological order).
>>>
>>>
>>>             Various ways; here's a simple one, using transformation
>>>             language:
>>>
>>>
>>>             NUMERIC VisitSeq (F4).
>>>
>>>
>>>             DO IF    $CASENUM EQ 1.
>>>
>>>             .  COMPUTE VisitSeq = 1.
>>>
>>>             ELSE IF  PcptID NE LAG(PcptID).
>>>
>>>             .  COMPUTE VisitSeq = 1.
>>>
>>>             ELSE.
>>>
>>>             .  COMPUTE VisitSeq = LAG(VisitSeq) + 1.
>>>
>>>             END IF.
>>>
>>>
>>>
>>>             >* The length of time between the record and the earliest
>>>             record for that
>>>
>>>             >participant
>>>
>>>             >* The difference between the outcome variable in the
>>>             record and the minimum
>>>
>>>             >value ever recorded for that participant.
>>>
>>>
>>>             In both cases, start by putting the minimum value for the
>>>             participant
>>>
>>>             in every record for that participant, and then it's easy:
>>>
>>>
>>>             AGGREGATE OUTFILE=* MODE=ADDVARIABLES
>>>
>>>                 /BREAK=PcptID
>>>
>>>                 /Earliest 'Date of earliest record for
>>> participant' =
>>>             MIN(Date)
>>>
>>>                 /MinOut   'Lowest outcome value for
>>> participant'    =
>>>             MIN(Outcome).
>>>
>>>
>>>             >It's easy to do these using a spreadsheet or a python
>>>             script ...
>>>
>>>
>>>             Actually, I think it's probably easier in long ('tall') form
>>>             in
>>>
>>>             native SPSS than in either of those two.
>>>
>>>
>>>             -Best of luck,
>>>
>>>               Richard
>>>
>>>
>>>             =====================
>>>
>>>             To manage your subscription to SPSSX-L, send a message
>>> to
>>>
>>>             [hidden
>>>
>>> email]</user/SendEmail.jtp?type=node&node=5723438&i=0>
>>>             (not to SPSSX-L), with no body text except the
>>>
>>>             command. To leave the list, send the command
>>>
>>>             SIGNOFF SPSSX-L
>>>
>>>             For a list of commands to manage subscriptions, send the
>>>             command
>>>
>>>             INFO REFCARD
>>>
>>>
>>>
>>>             ________________________________
>>>
>>>             If you reply to this email, your message will be added to
>>>             the discussion below:
>>>
>>>
>>> http://spssx-discussion.1045642.n5.nabble.com/Computing-variables-based-on-multiple-rows-in-a-tall-format-file-tp5723431p5723438.html
>>>             To unsubscribe from Computing variables based on multiple
>>>             rows in a tall-format file, click here<
>>> http://spssx-discussion.1045642.n5.nabble.com/template/NamlServlet.j
>>> tp?macro=unsubscribe_by_code&node=5723431&code=Y29obm1Ab2Npb
>>> S51Y3NmLmVkdXw1NzIzNDMxfC03OTA5NjYwNDY=>
>>> .
>>>
>>>             NAML<
>>> http://spssx-discussion.1045642.n5.nabble.com/template/NamlServlet.j
>>> tp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&
>>> base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.
>>> NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrum
>>> bs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%
>>> 3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>>>
>>>
>>>        Please
>>>         reply to the list and not to my personal email.
>>>
>>>         Those desiring my consulting or training services please feel
>>>         free to email me.
>>>
>>>         ---
>>>
>>>         "Nolite dare sanctum canibus neque mittatis margaritas vestras
>>>         ante porcos ne forte conculcent eas pedibus suis."
>>>
>>>         Cum es damnatorum possederunt porcos iens ut salire off
>>>         sanguinum cliff in abyssum?"
>>>
>>>
>>>
>>>
>>>         If you reply to this email, your
>>>           message will be added to the discussion below:
>>>
>>> http://spssx-discussion.1045642.n5.nabble.com/Computing-variables-ba
>>> sed-on-multiple-rows-in-a-tall-format-file-tp5723431p5723446.html
>>>
>>>
>>>         To start a new topic under SPSSX Discussion, email
>>>

>>> ml-node+s1045642n1068821h68@.nabble

>>>
>>>         To unsubscribe from SPSSX Discussion, click
>>>           here .
>>>         NAML





-----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Computing-variables-based-on-multiple-rows-in-a-tall-format-file-tp5723431p5723500.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Recursion with Macro ;-)

Richard Ristow
In reply to this post by Maguin, Eugene
Dear Gene,

At 10:06 AM 12/9/2013, you wrote (in thread "Computing variables
based on multiple rows in a tall-format file"):

>One of the things that David's solutions use is what I'd call
>"nesting" and "recursion".

You did spark quite a discussion, with some real illumination on the
working of macros -- I hope you've been enjoying it.

I would never have imagined that the macro system's memory management
was sophisticated enough to allow true recursive macros. Nor that
they'd be as useful as it looks like they are.

Happy New Year!

-Best wishes,
  Richard

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
12