Computing variables based on multiple rows in a tall-format file


Computing variables based on multiple rows in a tall-format file

Michael Cohn
I'm analyzing a repeated measures dataset in which each participant was measured between 1 and 8 times, at essentially random intervals. My main analysis is a linear mixed model, so my data file is currently in "tall" format (one row per measurement per participant).

Is there a way to generate variables in each record that are based on information in that user's other records? For example:

* A sequential index variable based on the record's datestamp (i.e., number a participant's responses 1, 2, 3... in chronological order).

* The length of time between the record and the earliest record for that participant

* The difference between the outcome variable in the record and the minimum value ever recorded for that participant.

It's easy to do these using a spreadsheet or a python script, or by switching to a wide-format file and back. But those are cumbersome and error-prone, and I'd have to do it repeatedly, since we periodically add new records to the dataset. Can SPSS do this kind of thing natively, in the current file format?
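(For comparison, here is roughly what those three derived variables look like in a plain Python script; the tuples and column positions are toy placeholders, not my actual file:)

```python
from datetime import date

# Toy long-format records: (participant id, datestamp, outcome).
rows = [
    (1, date(2013, 1, 5),  5.0),
    (1, date(2013, 1, 1),  3.0),
    (1, date(2013, 2, 1),  4.0),
    (2, date(2013, 3, 1),  7.0),
    (2, date(2013, 3, 10), 6.0),
]
rows.sort(key=lambda r: (r[0], r[1]))  # sort by id, then date

# First pass: earliest date and minimum outcome per participant.
first_date, min_out = {}, {}
for pid, d, out in rows:
    first_date[pid] = min(first_date.get(pid, d), d)
    min_out[pid] = min(min_out.get(pid, out), out)

# Second pass: the three derived variables for each record.
seq_counter, derived = {}, []
for pid, d, out in rows:
    seq_counter[pid] = seq_counter.get(pid, 0) + 1  # 1, 2, 3... per participant
    derived.append((
        seq_counter[pid],                # sequential index in chronological order
        (d - first_date[pid]).days,      # days since participant's earliest record
        out - min_out[pid],              # outcome minus participant's minimum
    ))

print(derived)
# -> [(1, 0, 0.0), (2, 4, 2.0), (3, 31, 1.0), (1, 0, 1.0), (2, 9, 0.0)]
```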

Many thanks,

- Michael

Re: Computing variables based on multiple rows in a tall-format file

Andy W
Sequential index: this can be done using lags, with something like the code below (where "Id" is a variable that uniquely identifies each participant and "Time" is the variable that orders each participant's observations in time).

SORT CASES by Id Time.
DO IF $casenum = 1 OR Id <> LAG(Id).
  COMPUTE SeqInd = 1.
ELSE.
  COMPUTE SeqInd = LAG(SeqInd).
END IF.

(see http://andrewpwheeler.wordpress.com/2013/02/18/using-sequential-case-processing-for-data-management-in-spss/ for a related write-up)

Length of time since the earliest record & difference from the minimum value: see the AGGREGATE command; for both, you would calculate the MIN using Id as the break group. (Then just compute a second variable for the differences.) Taking group MEAN differences is a common procedure as well.

Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

Re: Computing variables based on multiple rows in a tall-format file

Andy W
Whoops, the sequential index should be as below (I forgot the plus 1):

SORT CASES by Id Time.
DO IF $casenum = 1 OR Id <> LAG(Id).
  COMPUTE SeqInd = 1.
ELSE.
  COMPUTE SeqInd = LAG(SeqInd) + 1.
END IF.
Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

Re: Computing variables based on multiple rows in a tall-format file

Richard Ristow
In reply to this post by Michael Cohn
At 03:23 PM 12/4/2013, Michael Cohn wrote:

>My data file is currently in "tall" format (one row per measurement
>per participant). Is there a way to generate variables in each
>record that are based on information in that user's other records?

You'll get many answers; the fact is, all of these are quite easy.
The code I'm posting (not tested) assumes variables

PcptID  -- Participant identifier
Date    -- Date stamp
Outcome -- Outcome value

and that neither of the first two is ever missing, and that your file is
sorted in ascending order on the first two.

>* A sequential index variable based on the record's datestamp (i.e.,
>number a participant's responses 1, 2, 3... in chronological order).

Various ways; here's a simple one, using transformation language:

NUMERIC VisitSeq (F4).

DO IF    $CASENUM EQ 1.
.  COMPUTE VisitSeq = 1.
ELSE IF  PcptID NE LAG(PcptID).
.  COMPUTE VisitSeq = 1.
ELSE.
.  COMPUTE VisitSeq = LAG(VisitSeq) + 1.
END IF.


>* The length of time between the record and the earliest record for that
>participant
>* The difference between the outcome variable in the record and the minimum
>value ever recorded for that participant.

In both cases, start by putting the minimum value for the participant
in every record for that participant, and then it's easy:

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
    /BREAK=PcptID
    /Earliest 'Date of earliest record for participant' = MIN(Date)
    /MinOut   'Lowest outcome value for participant'    = MIN(Outcome).

>It's easy to do these using a spreadsheet or a python script ...

Actually, I think it's probably easier in long ('tall') form in
native SPSS than in either of those two.

-Best of luck,
  Richard

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Computing variables based on multiple rows in a tall-format file

Michael Cohn
I think this solves all my problems! Many thanks to Andy and Richard for their help. I wasn't familiar with the LAG and AGGREGATE functions in SPSS but now I know what to start learning about. 

- Michael

----------------------------------
Michael A. Cohn, PhD
Osher Center for Integrative Medicine
University of California, San Francisco


Re: Computing variables based on multiple rows in a tall-format file

David Marso
Administrator
I don't see why people use that ponderous DO IF $CASENUM = 1 .... ELSE blah blah blah (7 lines more or less) approach when a counter can be built with ONE LINE OF REASONABLY INTUITIVE CODE!!!!!

COMPUTE SEQ=SUM(1,LAG(SEQ)*(LAG(ID) EQ ID)).

DATA LIST FREE / ID.
BEGIN DATA
1 1 1 1 1 1 2 2 2 3 3 3 3 3 3 3 4 4 4 5 5 5 5
END DATA.
COMPUTE SEQ=SUM(1,LAG(SEQ)*(LAG(ID) EQ ID)).
LIST.

      ID      SEQ
 
    1.00     1.00
    1.00     2.00
    1.00     3.00
    1.00     4.00
    1.00     5.00
    1.00     6.00
    2.00     1.00
    2.00     2.00
    2.00     3.00
    3.00     1.00
    3.00     2.00
    3.00     3.00
    3.00     4.00
    3.00     5.00
    3.00     6.00
    3.00     7.00
    4.00     1.00
    4.00     2.00
    4.00     3.00
    5.00     1.00
    5.00     2.00
    5.00     3.00
    5.00     4.00
 
 
Number of cases read:  23    Number of cases listed:  23
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: Computing variables based on multiple rows in a tall-format file

Kirill Orlov
David, bravo! A sweet little line.
I too can't resist the beauty of concise coding.
However, we must remember that fewer lines isn't always = faster code.



Re: Computing variables based on multiple rows in a tall-format file

Art Kendall
In reply to this post by David Marso
<Don flame shields!>
Your solution does require fewer lines and characters in the syntax, and most likely fewer internal operations.
However, I deny your assertion that the one line of code is "reasonably intuitive".
It would help beginners to see and understand Rich's solution and then see that it can be expressed more compactly. 

Your solution is "reasonably intuitive" only for people with at least a moderate amount of experience at some computer languages.

However, Rich's DO IF syntax is much easier for people to read.
It is my impression that many posts from this list are from beginners.
It is also my impression that people searching the archives are beginners.


The easier it is for people to read and understand the syntax, the easier it is to communicate the process to other people, e.g., classmates working as peer reviewers, on-the-job QA reviewers, triers-of-fact, archive users, process maintainers and updaters.

Soapbox: Efficiency in terms of cognitive load trumps saving storage space for syntax and often trumps saving small amounts of processing time.  Marginal labor cost is a greater consideration than marginal cost of machine resources.

A rhetorical question: how much computer time would David's solution take compared to Rich's solution? How many runs of the same code would it take to get a measurable difference in computer time, e.g., 10 seconds?

<remove flame shield.>

Art Kendall
Social Research Consultants

Re: Computing variables based on multiple rows in a tall-format file

David Marso
Administrator
Assumptions of my one-liner:
Something is equal to something else or it isn't (true = 1, false = 0).
Multiplication of X by 0 = 0; by 1 = X.
0 + 1 = 1; X + 1 = X + 1.
Anybody having a problem with this might ponder their choice of careers or majors?
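Spelled out in plain Python (a sketch with made-up IDs; None stands in for SYSMIS on the first row), the logic traces like this:

```python
# David's one-liner relies on EQ yielding 1/0 and SUM ignoring missing
# arguments. A literal translation, treating the LAGs of row 1 as missing:
ids = [1, 1, 1, 2, 2, 2, 2, 3]

seq = []
for i, cur_id in enumerate(ids):
    lag_seq = seq[i - 1] if i > 0 else None   # LAG(SEQ)
    lag_id  = ids[i - 1] if i > 0 else None   # LAG(ID)
    same = int(lag_id == cur_id) if lag_id is not None else 0  # (LAG(ID) EQ ID)
    # SUM(1, missing) = 1; otherwise 1 + LAG(SEQ)*same.
    seq.append(1 if lag_seq is None else 1 + lag_seq * same)

print(seq)  # -> [1, 2, 3, 1, 2, 3, 4, 1]
```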



Re: Computing variables based on multiple rows in a tall-format file

Bruce Weaver
Administrator
I mostly agree with Art on this one, partly because some aspects of David's one-liner are a bit mysterious, especially to novices, I expect.  E.g., on Row 1, LAG(SEQ) and LAG(ID) both return SYSMIS.  The reason David's code works is that those SYSMIS values appear as arguments to the SUM function, and SUM returns a valid result if at least one argument has a valid value--see the demo below.

Another somewhat mysterious aspect of David's method is that LAG(SEQ) can be used on the right side of the COMPUTE statement that is bringing SEQ into existence as a variable.  That this can be done may not be intuitively obvious!  ;-)

Re Richard's DO-IF, I prefer to have both conditions that result in SEQ = 1 on a single conditional statement with an OR.  And perhaps two separate IF statements would be even easier for novice users to understand.  See below.  


NEW FILE.
DATASET CLOSE all.
DATA LIST FREE / ID.
BEGIN DATA
1 1 1 1 1 1 2 2 2 3 3 3 3 3 3 3 4 4 4 5 5 5 5
END DATA.

* David's one-liner.

COMPUTE SEQ1=SUM(1,LAG(SEQ1)*(LAG(ID) EQ ID)).

* Variation on Richard's DO-IF.
* The two conditions for which SEQ2 = 1 are combined with OR.

DO IF  ($CASENUM EQ 1) OR (ID NE LAG(ID)) .
.  COMPUTE SEQ2 = 1.
ELSE.
.  COMPUTE SEQ2 = LAG(SEQ2) + 1.
END IF.

* Two IF statements.
* This is somewhat less efficient in machine time than DO-IF,
* but possibly easier for the novice user to understand.

IF  ($CASENUM EQ 1) OR (ID NE LAG(ID)) SEQ3 = 1.
IF MISSING(SEQ3) SEQ3 = LAG(SEQ3) + 1.

FORMATS ID SEQ1 to SEQ3(f5.0).
LIST.

* The slightly mysterious thing about David's one-liner
* is that the result is not SYSMIS on Row 1, even though
* on Row 1, LAG(SEQ1) and LAG(ID) both return SYSMIS.

COMPUTE LagID = LAG(ID).
COMPUTE LagSEQ1 = LAG(SEQ1).
FORMATS LagID LagSEQ1(F5.0).
LIST.

Output:
   ID  SEQ1  SEQ2  SEQ3 LagID LagSEQ1
 
    1     1     1     1     .      .
    1     2     2     2     1      1
    1     3     3     3     1      2
    1     4     4     4     1      3
etc.

* The reason David's code does not return SYSMIS as a
* result is that those missing values appear within
* the SUM function:  SUM will return a valid result
* if at least one of the arguments is valid.  If you
* compute a sum using plus signs, on the other hand,
* all variables must be valid.

NEW FILE.
DATASET CLOSE all.
DATA LIST LIST / V1 to V3 (3f1).
BEGIN DATA
1 2 3
1 2 .
1 . .
. . .
END DATA.

COMPUTE SumViaSUM = SUM(V1 to V3).
COMPUTE SumViaPlus = V1 + V2 + V3.
FORMATS SumViaSUM SumViaPlus (F5.0).
LIST.

Output:
V1 V2 V3 SumViaSUM SumViaPlus
 
 1  2  3       6          6
 1  2  .       3          .
 1  .  .       1          .
 .  .  .       .          .
 
Number of cases read:  4    Number of cases listed:  4
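For anyone reproducing this check in Python instead (which the original poster mentioned as an alternative), pandas draws the same SUM-versus-plus distinction. This is a minimal sketch by way of comparison, not code from the thread; note that `min_count=1` is needed because, unlike SPSS SUM, pandas' `sum` otherwise returns 0 for an all-missing row:

```python
import numpy as np
import pandas as pd

# Same demo data as the SPSS job above: three variables with
# progressively more missing values per row.
df = pd.DataFrame({"V1": [1, 1, 1, np.nan],
                   "V2": [2, 2, np.nan, np.nan],
                   "V3": [3, np.nan, np.nan, np.nan]})

# Like SPSS SUM(V1 TO V3): valid if at least one argument is valid.
df["SumViaSUM"] = df[["V1", "V2", "V3"]].sum(axis=1, min_count=1)

# Like V1 + V2 + V3: any missing argument makes the result missing.
df["SumViaPlus"] = df["V1"] + df["V2"] + df["V3"]

print(df)
```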


David Marso wrote
Assumptions of my one liner.
Something is equal to something else or it isn't (true=1 false=0).
Multiplication of X by 0 = 0, by 1 = X
0+1 = 1, X+1 = X+1.
Anybody having a problem with this might ponder their choice of careers or majors?


Art Kendall wrote
<Don flame shields!>
Your solution does require fewer lines and characters in the syntax, and most likely fewer internal operations. However, I deny your assertion that the one line of code is "reasonably intuitive". It would help beginners to see and understand Rich's solution and then see that it can be expressed more compactly.

Your solution is "reasonably intuitive" only for people with at least a moderate amount of experience with some computer languages.

However, Rich's DO IF syntax is much easier for people to read. It is my impression that many posts to this list are from beginners, and that people searching the archives are beginners as well.

The easier it is for people to read and understand the syntax, the easier it is to communicate the process to other people, e.g., classmates working as peer reviewers, on-the-job QA reviewers, triers-of-fact, archive users, and process maintainers and updaters.

Soapbox: Efficiency in terms of cognitive load trumps saving storage space for syntax, and often trumps saving small amounts of processing time. Marginal labor cost is a greater consideration than marginal cost of machine resources.

A rhetorical question: How much computer time would it take to run David's solution compared to Rich's? How many runs of the same code would it take to get a measurable difference in computer time, e.g., 10 seconds?

<remove flame shield.>

Art Kendall
Social Research Consultants
On 12/5/2013 6:54 AM, David Marso [via SPSSX Discussion] wrote:

I don't see why people use that ponderous DO IF $CASENUM = 1 .... ELSE blah blah blah (7 lines more or less) approach when a counter can be built with ONE LINE OF REASONABLY INTUITIVE CODE!!!!!

COMPUTE SEQ=SUM(1,LAG(SEQ)*(LAG(ID) EQ ID)).

DATA LIST FREE / ID.
BEGIN DATA
1 1 1 1 1 1 2 2 2 3 3 3 3 3 3 3 4 4 4 5 5 5 5
END DATA.
COMPUTE SEQ=SUM(1,LAG(SEQ)*(LAG(ID) EQ ID)).
LIST.

    ID      SEQ
  1.00     1.00
  1.00     2.00
  1.00     3.00
  1.00     4.00
  1.00     5.00
  1.00     6.00
  2.00     1.00
  2.00     2.00
  2.00     3.00
  3.00     1.00
  3.00     2.00
  3.00     3.00
  3.00     4.00
  3.00     5.00
  3.00     6.00
  3.00     7.00
  4.00     1.00
  4.00     2.00
  4.00     3.00
  5.00     1.00
  5.00     2.00
  5.00     3.00
  5.00     4.00

Number of cases read:  23    Number of cases listed:  23
Michael Cohn wrote
I think this solves all my problems! Many thanks to Andy and Richard for their help. I wasn't familiar with the LAG and AGGREGATE functions in SPSS but now I know what to start learning about.

- Michael

----------------------------------
Michael A. Cohn, PhD
[hidden email]
Osher Center for Integrative Medicine
University of California, San Francisco

From: "Richard Ristow [via SPSSX Discussion]" <[hidden email]>
Date: Wednesday, December 4, 2013 at 14:56
To: Michael Cohn <[hidden email]>
Subject: Re: Computing variables based on multiple rows in a tall-format file
           
At 03:23 PM 12/4/2013, Michael Cohn wrote:

>My data file is currently in "tall" format (one row per measurement
>per participant). Is there a way to generate variables in each
>record that are based on information in that user's other records?

You'll get many answers; the fact is, all of these are quite easy.
The code I'm posting (not tested) assumes variables

PcptID  -- Participant identifier
Date    -- Date stamp
Outcome -- Outcome value

neither of the first two is ever missing; and your file is sorted in
ascending order on the first two.

>* A sequential index variable based on the record's datestamp (i.e.,
>number a participant's responses 1, 2, 3... in chronological order).

Various ways; here's a simple one, using transformation language:

NUMERIC VisitSeq (F4).
DO IF    $CASENUM EQ 1.
.  COMPUTE VisitSeq = 1.
ELSE IF  PcptID NE LAG(PcptID).
.  COMPUTE VisitSeq = 1.
ELSE.
.  COMPUTE VisitSeq = LAG(VisitSeq) + 1.
END IF.

>* The length of time between the record and the earliest record for
>that participant
>* The difference between the outcome variable in the record and the
>minimum value ever recorded for that participant.

In both cases, start by putting the minimum value for the participant
in every record for that participant, and then it's easy:

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
    /BREAK=PcptID
    /Earliest 'Date of earliest record for participant' = MIN(Date)
    /MinOut   'Lowest outcome value for participant'    = MIN(Outcome).

>It's easy to do these using a spreadsheet or a python script ...

Actually, I think it's probably easier in long ('tall') form in
native SPSS than in either of those two.
           
           
            -Best of luck,
           
              Richard
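Richard's two building blocks translate almost line for line to pandas, for anyone who does go the Python-script route. A sketch (not from the thread) using his assumed variable names and made-up toy data:

```python
import pandas as pd

# Toy data: two participants measured at irregular dates.
df = pd.DataFrame({
    "PcptID":  [1, 1, 1, 2, 2],
    "Date":    pd.to_datetime(["2013-01-05", "2013-01-01", "2013-02-01",
                               "2013-03-10", "2013-03-01"]),
    "Outcome": [5.0, 7.0, 4.0, 2.0, 6.0],
})
df = df.sort_values(["PcptID", "Date"]).reset_index(drop=True)

# The DO IF / LAG counter: sequential index within participant.
df["VisitSeq"] = df.groupby("PcptID").cumcount() + 1

# AGGREGATE MODE=ADDVARIABLES: group minimum broadcast back onto
# every row of the participant.
df["Earliest"] = df.groupby("PcptID")["Date"].transform("min")
df["MinOut"]   = df.groupby("PcptID")["Outcome"].transform("min")

# The two derived quantities the original poster asked for.
df["DaysSinceFirst"] = (df["Date"] - df["Earliest"]).dt.days
df["OutcomeVsMin"]   = df["Outcome"] - df["MinOut"]

print(df)
```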
           
           
=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
           
           
           
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: Computing variables based on multiple rows in a tall-format file

Richard Ristow
In reply to this post by David Marso
Let me weigh in, since this began with a comment on my style.

At 06:54 AM 12/5/2013, David Marso wrote:
>I don't see why people use that ponderous DO IF $CASENUM = 1 ....
>ELSE blah blah blah (7 lines more or less) approach

I freely admit that my code is far from compact; possibly, I use more
lines of code for a solution than would any other regular poster.

Here and elsewhere, I write code to be readable, before almost any
other consideration. That's important for a list posting, when the
main purpose of the code is to *be* read; but I've settled on the
practice for production code, as well.

Practically all code is read sometime, by its author if by no one
else; and it is essential that the reader have a clear sense what the
*program* does. It distracts from this if the reader needs to pause
even a moment to make out what a *line* does, just as an obscure
sentence in English can distract from comprehension of the text.

David Marso wrote,

>a counter can be build with ONE LINE OF REASONABLY INTUITIVE CODE!!!!!
>
>COMPUTE SEQ=SUM(1,LAG(SEQ)*(LAG(ID) EQ ID)).

For me, describing a coding technique as "reasonably intuitive" is an
immediate flag that the reader must pause to understand it; as, in
fact, I would have, if I encountered that line in code I was reading.
(And that would very much apply even if I'd written the code. Passage
of time would leave the code, but lose what I was thinking when I wrote it.)

At 08:51 AM 12/5/2013, David Marso wrote:
>Assumptions of my one liner.
>Something is equal to something else or it isn't (true=1 false=0).
>Multiplication of X by 0 = 0, by 1 = X
>0+1 = 1, X+1 = X+1.
>Anybody having a problem with this might ponder their choice of
>careers or majors?

I'm afraid I'm not sympathetic to this. It's the "emperor's new
clothes" argument: "if you don't see it, that just proves you're
stupid." If there's any rule for clear writing, it's that you don't
get to blame the reader for not understanding you.

There's real programming satisfaction in writing solutions like
David's, what I describe as 'cute' solutions. I admire people who can
write them, like David; or like Mel, the Real Programmer
(http://www.catb.org/jargon/html/story-of-mel.html). But I've come to
resist them, for the reasons I've given.

Finally: But, don't speed and compactness matter? Compactness matters
*occasionally*, like when you're putting a spacecraft guidance system
into a rugged but very small computer. As for speed, I think
clearly-written code is rarely much slower; and if it needs speed
improvements, having the logic clear guides you to which high-use
portions need to be optimized.


Re: Computing variables based on multiple rows in a tall-format file

David Marso
Administrator
We could also go with:

SPLIT FILE BY ID.
COMPUTE x=1.
CREATE cum = CSUM(x).

Is that to the point enough?

I must confess to an inclination towards the somewhat 'occult' aspects of SPSS:
SUM(something,nothing)=something, etc.
I hope that my 'cute' solutions inspire more RTFM and exploration by whoever might read them.
I still abhor the DO IF blah blah blah END IF pattern!!!
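The same trick has a direct pandas analogue, for readers keeping score in Python: a cumulative sum of a constant 1 within each ID group is a within-group counter. A sketch (not from the thread) with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 1, 2, 2, 3]})

# SPLIT FILE BY ID + CREATE cum = CSUM(x), with x a column of ones:
# cumulative-sum the ones within each ID group.
df["cum"] = df.assign(x=1).groupby("ID")["x"].cumsum()

print(df["cum"].tolist())  # [1, 2, 3, 1, 2, 1]
```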


Re: Computing variables based on multiple rows in a tall-format file

Bruce Weaver
Administrator
And I will confess that some of David's solutions have indeed sent me to the FM!  

One more method, while we're at it.  

COMPUTE Case = $casenum.
RANK VARIABLES=Case (A) BY ID
  /RANK
  /PRINT=YES
  /TIES=MEAN.

Output (using the same data as before):

 Case    ID RCase
 
    1     1     1
    2     1     2
    3     1     3
    4     1     4
    5     1     5
    6     1     6

    7     2     1
    8     2     2
    9     2     3

   10     3     1
   11     3     2
   12     3     3
   13     3     4
   14     3     5
   15     3     6
   16     3     7

   17     4     1
   18     4     2
   19     4     3

   20     5     1
   21     5     2
   22     5     3
   23     5     4
 
Number of cases read:  23    Number of cases listed:  23
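For completeness, the RANK-based approach also maps onto pandas: rank the running case number in ascending order within each ID. A sketch (again not from the thread), using the same ID pattern as the demo data:

```python
import pandas as pd

ids = [1] * 6 + [2] * 3 + [3] * 7 + [4] * 3 + [5] * 4
df = pd.DataFrame({"ID": ids})
df["Case"] = range(1, len(df) + 1)   # analogue of $casenum

# RANK VARIABLES=Case (A) BY ID /TIES=MEAN: ascending average rank
# of Case within each ID group (no ties here, so ranks are integers).
df["RCase"] = df.groupby("ID")["Case"].rank(method="average").astype(int)

print(df.head(9))
```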


Re: Computing variables based on multiple rows in a tall-format file

David Marso
Administrator
Note that the RANK command can be stated as simply:

RANK SEQ BY ID .

(Ascending (A), RANK, PRINT=YES, and TIES=MEAN are ALL defaults.)
My tendency is to specify only the non-defaults.

Curious Bruce, which of my many 'cute' postings over the decades have driven you to the manual?

FWIW: In all of my production code there are copious comments (in a recent project of about 4000 lines, about 800-900 are comments).
I deliberately leave out any "WTF is going on" type comments in NG postings, in the hope that the consumer, after scratching his/her head for a few minutes, will make an effort to self-educate (the FM, the help system, etc.). TINSTAAFL!!!

Re: Computing variables based on multiple rows in a tall-format file

Bruce Weaver
Administrator
"Curious Bruce, which of my many 'cute' postings over the decades have driven you to the manual?"

I suppose I'm thinking back several years to a time when comp.soft-sys.stat.spss was still an active group without much SPAM (yes, that long ago!).  Back then, I didn't know much about syntax generally, and I knew next to nothing about the macro language.  (Looking at some of the syntax I wrote back then would no doubt have me doing the palm-to-forehead maneuver.)  I can't point to any specific posts right now, but I do remember that some things you (or Neila Nessa) posted were impenetrable gibberish to me at that point.  Not even STRONG COFFEE would have helped.  But I chalk that up to my relative ignorance at the time.  I don't think it is necessary (or desirable) to have every post in a forum such as this be understandable by complete novices.  If that were a requirement, intermediate and advanced users would not have much opportunity to learn anything.  I hope we continue seeing a mix of posts to this list that includes some things that make my head spin!

By the way, I don't think any of the other posters in this thread were suggesting that every post needs to be understandable by complete novices.  (Just thought I'd throw that in before someone corrects me.)  ;-)


David Marso wrote
Note that the RANK command can be stated as simply:

RANK SEQ BY ID .

(Ascending (A), RANK , PRINT=YES,TIES=MEAN are ALL  default).
My tendency is to specify only the non default .

Curious Bruce, which of my many 'cute' postings over the decades have driven you to the manual?

FWIW:  In all of my production code there are copious comments
(in a recent project of about 4000 lines about 800-900 are comments).  
I deliberately leave out any WTF is going on type comments in NG postings in the hope that the
consumer after scratching his/her head for a few minutes will make an effort to self-educate (FM, help system etc...). TINSTAAFL!!!
---
Bruce Weaver wrote
And I will confess that some of David's solutions have indeed sent me to the FM!  

One more method, while we're at it.  

COMPUTE Case = $casenum.
RANK VARIABLES=Case (A) BY ID
  /RANK
  /PRINT=YES
  /TIES=MEAN.

Output (using the same data as before):

 Case    ID RCase
 
    1     1     1
    2     1     2
    3     1     3
    4     1     4
    5     1     5
    6     1     6

    7     2     1
    8     2     2
    9     2     3

   10     3     1
   11     3     2
   12     3     3
   13     3     4
   14     3     5
   15     3     6
   16     3     7

   17     4     1
   18     4     2
   19     4     3

   20     5     1
   21     5     2
   22     5     3
   23     5     4
 
Number of cases read:  23    Number of cases listed:  23

David Marso wrote
We could also go with:

SPLIT FILE BY ID.
COMPUTE x=1.
CREATE cum = CSUM(x).

Is that to the point enough?

I must confess to  an inclination towards the somewhat  'occult' aspects of SPSS.
SUM(something,nothing)=something etc...
I hope that my 'cute' solutions inspire more RTFM and exploration by whoever might read them.
I still abhor the DO IF blah blah blah END IF pattern!!!

Richard Ristow wrote
Let me weigh in, since this began with a comment on my style.

At 06:54 AM 12/5/2013, David Marso wrote:
>I don't see why people use that ponderous DO IF $CASENUM = 1 ....
>ELSE blah blah blah (7 lines more or less) approach

I freely admit that my code is far from compact; possibly, I use more
lines of code for a solution than would any other regular poster.

Here and elsewhere, I write code to be readable, before almost any
other consideration. That's important for a list posting, when the
main purpose of the code is to *be* read; but I've settled on the
practice for production code, as well.

Practically all code is read sometime, by its author if by no one
else; and it is essential that the reader have a clear sense what the
*program* does. It distracts from this if the reader needs to pause
even a moment to make out what a *line* does, just as an obscure
sentence in English can distract from comprehension of the text.

David Marso wrote,

>a counter can be built with ONE LINE OF REASONABLY INTUITIVE CODE!!!!!
>
>COMPUTE SEQ=SUM(1,LAG(SEQ)*(LAG(ID) EQ ID)).

For me, describing a coding technique as "reasonably intuitive" is an
immediate flag that the reader must pause to understand it; as, in
fact, I would have, if I encountered that line in code I was reading.
(And that would very much apply even if I'd written the code. Passage
of time would leave the code, but lose what I was thinking when I wrote it.)

At 08:51 AM 12/5/2013, David Marso wrote:
>Assumptions of my one liner.
>Something is equal to something else or it isn't (true=1 false=0).
>Multiplication of X by 0 = 0, by 1 = X
>0+1 = 1, X+1 = X+1.
>Anybody having a problem with this might ponder their choice of
>careers or majors?

I'm afraid I'm not sympathetic to this. It's the "emperor's new
clothes" argument: "if you don't see it, that just proves you're
stupid." If there's any rule for clear writing, it's that you don't
get to blame the reader for not understanding you.

There's real programming satisfaction in writing solutions like
David's, what I describe as 'cute' solutions. I admire people who can
write them, like David; or like Mel, the Real Programmer
(http://www.catb.org/jargon/html/story-of-mel.html). But I've come to
resist them, for the reasons I've given.

Finally: But, don't speed and compactness matter? Compactness matters
*occasionally*, like when you're putting a spacecraft guidance system
into a rugged but very small computer. As for speed, I think
clearly-written code is rarely much slower; and if it needs speed
improvements, having the logic clear guides you to which high-use
portions need to be optimized.
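For what it's worth, the arithmetic in David's one-liner can be traced mechanically. Here is a rough Python rendering (the translation and names are mine; it leans on SPSS's documented behavior that LAG is missing on the first case and that SUM ignores missing arguments):

```python
# COMPUTE SEQ=SUM(1,LAG(SEQ)*(LAG(ID) EQ ID)) rendered in Python.
# Within an ID, (LAG(ID) EQ ID) is 1, so SEQ = previous SEQ + 1;
# at an ID change it is 0, so SEQ resets to 1. On the first case the
# lagged term is missing, and SUM(1, missing) is simply 1.
def seq_one_liner(ids):
    seqs = []
    for i, cur in enumerate(ids):
        if i == 0:
            carried = 0  # SUM drops the missing LAG term
        else:
            same = 1 if ids[i - 1] == cur else 0
            carried = seqs[i - 1] * same
        seqs.append(1 + carried)
    return seqs

print(seq_one_liner([1, 1, 2, 2, 2, 3]))  # [1, 2, 1, 2, 3, 1]
```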

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: Computing variables based on multiple rows in a tall-format file

David Marso
Administrator
Here is an example of a simple problem with 5 different solutions.
I suspect any of them might be chosen by any particular coder for various reasons (experience, readability, explicitness, etc.).
I tend to avoid versions 1 and 2 and go more for versions 3-5.
Version 3 comes in particularly handy in the MATRIX language, as the conditions and outcomes can be built as arrays and the function is simply a vector operation. Sets of such can be set up as matrices.
Let's see what I can dig out of my head spinner collection for you Bruce ;-)

DATA LIST FREE / A B.
BEGIN DATA
1 1 1 2 2 1 2 2
END DATA.

DO IF A=1 AND B=1.
+  COMPUTE C=1.
ELSE IF A=1 AND B=2.
+  COMPUTE C=2.
ELSE IF A=2 AND B=1.
+  COMPUTE C=3.
ELSE IF A=2 AND B=2.
+  COMPUTE C=4.
END IF.

IF A=1 AND B=1 C1=1.
IF A=1 AND B=2 C1=2.
IF A=2 AND B=1 C1=3.
IF A=2 AND B=2 C1=4.

DO IF A=1.
+  RECODE B (1=1)(2=2) INTO C2.
ELSE IF A=2.
+  RECODE B (1=3)(2=4) INTO C2.
END IF .


COMPUTE C3=SUM((A=1 AND B=1)*1,
               (A=1 AND B=2)*2,
               (A=2 AND B=1)*3,
               (A=2 AND B=2)*4).

COMPUTE C4=SUM((A=1)*B,
               (A=2 AND B=1)*3,
               (A=2 AND B=2)*4).

COMPUTE C5=(A-1)*2 + B.

FORMATS ALL (F1.0).
LIST.

 
A B C C1 C2 C3 C4 C5
 
1 1 1  1  1  1  1  1
1 2 2  2  2  2  2  2
2 1 3  3  3  3  3  3
2 2 4  4  4  4  4  4
 
 
Number of cases read:  4    Number of cases listed:  4
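As a quick cross-check outside SPSS, the boolean-sum version (C3) and the arithmetic version (C5) can be compared in a few lines of Python (in Python, as in SPSS, a true comparison multiplies as 1 and a false one as 0; the function names are mine):

```python
# C3-style: sum of (condition)*value terms; exactly one condition is true.
def c3(a, b):
    return ((a == 1 and b == 1) * 1 +
            (a == 1 and b == 2) * 2 +
            (a == 2 and b == 1) * 3 +
            (a == 2 and b == 2) * 4)

# C5-style: positional arithmetic on the category codes.
def c5(a, b):
    return (a - 1) * 2 + b

pairs = [(1, 1), (1, 2), (2, 1), (2, 2)]
print([c3(a, b) for a, b in pairs])                 # [1, 2, 3, 4]
print(all(c3(a, b) == c5(a, b) for a, b in pairs))  # True
```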

Bruce Weaver wrote
"Curious Bruce, which of my many 'cute' postings over the decades have driven you to the manual?"

I suppose I'm thinking back several years to a time when comp.soft-sys.stat.spss was still an active group without much SPAM (yes, that long ago!).  Back then, I didn't know much about syntax generally, and I knew next to nothing about the macro language.  (Looking at some of the syntax I wrote back then would no doubt have me doing the palm-to-forehead maneuver.)  I can't point to any specific posts right now, but I do remember that some things you (or Neila Nessa) posted were impenetrable gibberish to me at that point.  Not even STRONG COFFEE would have helped.  But I chalk that up to my relative ignorance at the time.  I don't think it is necessary (or desirable) to have every post in a forum such as this be understandable by complete novices.  If that was a requirement, intermediate and advanced users would not have much opportunity to learn anything.  I hope we continue seeing a mix of posts to this list that includes some things that make my head spin!

By the way, I don't think any of the other posters in this thread were suggesting that every post needs to be understandable by complete novices.  (Just thought I'd throw that in before someone corrects me.)  ;-)


David Marso wrote
Note that the RANK command can be stated as simply:

RANK SEQ BY ID .

(Ascending (A), RANK, PRINT=YES, and TIES=MEAN are ALL defaults.)
My tendency is to specify only the non-defaults.

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: Computing variables based on multiple rows in a tall-format file

Bruce Weaver
Administrator
I like that C3 computation, but might line things up this way to make it a bit more readable:

COMPUTE C3=SUM(
  (A=1 AND B=1)*1,
  (A=1 AND B=2)*2,
  (A=2 AND B=1)*3,
  (A=2 AND B=2)*4).

Usually, I'd follow Art's advice and use EQ rather than =; but in this case, I think it's actually quite a bit easier to read with =. Here is the EQ version for comparison:

COMPUTE C3=SUM(
  (A EQ 1 AND B EQ 1)*1,
  (A EQ 1 AND B EQ 2)*2,
  (A EQ 2 AND B EQ 1)*3,
  (A EQ 2 AND B EQ 2)*4).


I have also used that C5 computation before.  It's not nearly as transparent, but it is very scalable.  I'd be most inclined to use it when the number of combinations of A and B is large, because the other methods result in too much syntax in that case.  E.g., if A and B both range from 1 to 5:

COMPUTE C5=(A-1)*5 + B.
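A quick way to convince yourself that the scaled formula stays one-to-one (a Python check of mine; the 5 is just the number of levels of B):

```python
# (A-1)*5 + B maps the 25 combinations of A, B in 1..5 onto 1..25
# with no collisions; it is a base-5 style positional encoding.
codes = [(a - 1) * 5 + b for a in range(1, 6) for b in range(1, 6)]
print(sorted(codes) == list(range(1, 26)))  # True
```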




Re: Computing variables based on multiple rows in a tall-format file

Art Kendall
In reply to this post by Bruce Weaver
That was not what I meant to imply. The use of 'flame shield' was intended as humor.

However, I try to guess from the question posed how much of a beginner the OP is. I also keep in mind building the archives.

The mix of approaches across several posts helps to get the message across that there can be many ways of doing the same thing.

I should, but do not always, post more than one solution. However, members of this list do often reply with a redraft of syntax.

That being said, I try to push for
-- readability, as it helps all of the people concerned.
-- considering any analysis a process that goes through drafts, like any other writing.
-- teamwork in learning and doing analysis.
-- completing metadata (variable view) and checking it with the team for readability and shared understanding before any processing.
-- review for QA and learning processes.
-- scientists sharing their data and analysis.
Art Kendall
Social Research Consultants
On 12/6/2013 7:06 PM, Bruce Weaver [via SPSSX Discussion] wrote:
By the way, I don't think any of the other posters in this thread were suggesting that every post needs to be understandable by complete novices.  (Just thought I'd throw that in before someone corrects me.)  ;-)


Re: Computing variables based on multiple rows in a tall-format file

Art Kendall
In reply to this post by Bruce Weaver
I have been using SPSS in consulting and evaluations for the Congress for many years.  I still learn from things that David, Jon, Bruce, Rich, and Andy post to this list.
Art Kendall
Social Research Consultants

Re: Computing variables based on multiple rows in a tall-format file

Art Kendall
In reply to this post by David Marso
Of course, more drafts can be produced by using EQ and by using parentheses to clarify the logic:
DO IF A EQ 1 AND B EQ 1.
DO IF (A=1 AND B=1).
DO IF (A EQ 1 AND B EQ 1).
Art Kendall
Social Research Consultants

Art Kendall
Social Research Consultants