"Non-exist" data in regression analysis

Johnson Lau
Dear all,

I'm dealing with a data structure as follows and don't know how to run a
multiple regression/logistic regression with it.

There are 100 students taking 2 examinations. 50 of them took both exams
and 50 of them took only exam 2 (i.e. 50 students took exam 1 and all 100
students took exam 2). I would like to predict the score on exam 2 (an
interval scale) from the score on exam 1 and whether a student took exam
1. An example is shown below:

variable name:    tookexam1    scoreexam1   scoreexam2   passexam2
case 1            1            50           60           1
case 2            0             .           70           1
case 3            0             .           40           0
case 4            1            40           40           0
.....

Therefore, I'd like to predict scoreexam2 with tookexam1 and scoreexam1,
where tookexam1 is valid for all cases (either 0 or 1) and scoreexam1 is
valid if and only if tookexam1=1.

Similarly, I'd like to predict passexam2 (a dichotomous variable) with
tookexam1 and scoreexam1.

With bivariate correlation, I have found that tookexam1 and scoreexam1 are
correlated with scoreexam2. However, whether I run a multiple regression
or a logistic regression, cases with a missing value in scoreexam1 are
excluded. In addition, I'd also like to run ANCOVA with more predictors on
a similar data structure.
I know that pairwise exclusion is allowed in multiple linear regression,
but I'm not sure whether this is a correct choice. Moreover, a pairwise
option is not available in the LOGISTIC REGRESSION and UNIANOVA commands.

Would anyone kindly suggest a solution for this? Thanks a lot.

Regards,

Johnson Lau

Re: "Non-exist" data in regression analysis

Richard Ristow
At 11:25 PM 1/18/2007, Johnson Lau wrote:

>There are 100 students taking 2 examinations. 50 students took exam 1
>and all 100 students took exam 2. I would like to predict the score on
>exam 2 (an interval scale) from the score on exam 1 and whether a
>student took exam 1. An example is shown below:
>
>variable name:  tookexam1  scoreexam1  scoreexam2  passexam2
>case 1             1         50          60           1
>case 2             0          .          70           1
>case 3             0          .          40           0
>case 4             1         40          40           0
>.....
>
>Therefore, I'd like to predict scoreexam2 with tookexam1 and
>scoreexam1, where tookexam1 is valid for all cases (either 0 or 1) and
>scoreexam1 is valid if and only if tookexam1=1. However, whether I run
>a multiple regression or a logistic regression, cases with a missing
>value in scoreexam1 are excluded.

Here's something to try (methodologists on the list, please comment):
Try a model in which simply taking exam 1 contributes a certain amount
the exam 2 score; every point scored on exam 2 scores a certain amount
more. Then you get something like this <WRR: not kept separately>:


TEMPORARY.
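* For the next procedure only, treat a missing exam 1 score as 0.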
RECODE ScoreExam1 (MISSING = 0).

REGRESSION
     /DESCRIPTIVES = MEAN STDDEV CORR
     /DEPENDENT      ScoreExam2
     /METHOD  =ENTER TookExam1 ScoreExam1.

The trouble with that is, TookExam1 and ScoreExam1 will then have a
ridiculously high correlation. A solution is to treat a 'normal' Exam 1
score as somewhere in the midrange, say 50. Like this:

TEMPORARY.
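* Center exam 1 scores at 50; non-takers then get 0, the midrange value.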
COMPUTE ScoreExam1 = ScoreExam1 - 50.
RECODE  ScoreExam1 (MISSING = 0).

REGRESSION
     /DESCRIPTIVES = MEAN STDDEV CORR
     /DEPENDENT      ScoreExam2
     /METHOD  =ENTER TookExam1 ScoreExam1.

I ran both of these, but they don't show anything of interest. With four
cases and two independent variables, they hardly could. It's not
legitimate to run the regression at all with that little data. (At least,
the syntax has been checked.) But you might want to try something similar
with your real dataset.


===================
APPENDIX: Test data
===================
NEW FILE.
DATA LIST LIST
   /CaseWord (A4)
    CaseNum  TookExam1 ScoreExam1 ScoreExam2 PassExam2 (5F2).

* variable name: tookexam1  scoreexam1  scoreexam2  passexam2.
BEGIN DATA
case 1             1         50          60           1
case 2             0          .          70           1
case 3             0          .          40           0
case 4             1         40          40           0
END DATA.
LIST.

Re: "Non-exist" data in regression analysis

Hector Maletta
        Some comments on the question and on Richard's response:
        1. Of course SPSS excludes cases listwise from a regression analysis
involving a list of variables: any case with a missing value in at least
one variable is excluded.
        2. For clarity's sake, I suggest Richard mistyped "2" for "1" the
second time he mentions an exam in his sentence: "Try a model in which
simply taking exam 1 contributes a certain amount [to] the exam 2 score;
every point scored on exam 2 scores a certain amount more." It should read:
"Try a model in which simply taking exam 1 contributes a certain amount [to]
the exam 2 score; every point scored on exam 1 scores a certain amount
more."
        3. The equation for Richard's first proposal would be scoreexam2 =
b0 + b1*tookexam1 + b2*scoreexam1. Assigning a zero value for scoreexam1
to those students not taking exam 1 serves the purpose, because those
students would no longer be excluded from the analysis. By the way, make
sure 0 is not a user-missing value for either variable.
        4. The two resulting independent variables, tookexam1 and
scoreexam1, would certainly be correlated, because all those with
tookexam1=0 will have scoreexam1=0 as well, but I do not think the LINEAR
correlation would be high enough to cause collinearity and singularity,
thus impeding regression.
        5. Changing the origin of scoreexam1 would not modify this: the
degree (as opposed to the sign) of linear correlation between two variables
is invariant to changes in the position of the origin or in the unit of
measurement of either variable.
        In conclusion, Richard's first proposal is a fine response to your
question. You may recode the system-missing scores the way he proposes and
estimate your two-variable equation as he suggests (just check for
collinearity, to be safe).
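A minimal sketch of that check, using Richard's example variables (COLLIN
and TOL are standard keywords of REGRESSION's /STATISTICS subcommand,
printing condition indices and tolerance/VIF):

TEMPORARY.
RECODE ScoreExam1 (MISSING = 0).

REGRESSION
     /STATISTICS = COEFF OUTS R ANOVA COLLIN TOL
     /DEPENDENT    ScoreExam2
     /METHOD  =ENTER TookExam1 ScoreExam1.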
        Now, why and whether you should want the equation in that form is a
completely different problem. Is there any empirical or theoretical reason
to expect that taking exam 1 would in itself influence the score of exam
2? Is it absolutely necessary that the effect of having taken exam 1 be
controlled for the score obtained on exam 1? Why do the two questions have
to be tackled together?
        You may alternatively choose to split the problem in two: (a) test
the effect of having or not having taken exam 1, using the whole set of
100 cases and two variables, tookexam1 and scoreexam2; (b) test the effect
of score 1 on score 2, using only the 50 cases who took both exams and two
variables, scoreexam1 and scoreexam2 (a sketch follows below). There is no
information loss when you split the problem: there is no possible
variation in the correlation (or lack thereof) between taking the first
exam and having a score from it, and no possible effect of having (or not
having) taken exam 1 on the score obtained on that same exam, except the
obvious fact that only those taking the exam get a score.
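A minimal sketch of that split, using the example variable names (SPSS's
listwise deletion automatically restricts run (b) to the 50 students who
took both exams):

* (a) Effect of taking exam 1, on all 100 cases.
REGRESSION
     /DEPENDENT ScoreExam2
     /METHOD  =ENTER TookExam1.

* (b) Effect of the exam 1 score, on those who took both exams.
REGRESSION
     /DEPENDENT ScoreExam2
     /METHOD  =ENTER ScoreExam1.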

        Hector



Re: "Non-exist" data in regression analysis

Richard Ristow
A comment on one point; I had to think to remember why this one works.

At 06:48 PM 1/19/2007, Hector Maletta wrote:

[...]

>         3. The equation for Richard's first proposal would be
> scoreexam2 = b0 + b1*tookexam1 + b2*scoreexam1. Assigning a zero
> value for scoreexam1 to those students not taking exam 1 serves the
> purpose, because those students would no longer be excluded from the
> analysis. By the way, make sure 0 is not a user-missing value for
> either variable.
>         4. The two resulting independent variables, tookexam1 and
> scoreexam1, would certainly be correlated, because all those with
> tookexam1=0 will have scoreexam1=0 as well, but I do not think the
> LINEAR correlation would be high enough to cause collinearity and
> singularity, thus impeding regression.

They certainly couldn't be completely collinear, because 'tookexam1' is
dichotomous and 'scoreexam1' is multi-valued. But if the variance of
'scoreexam1' among those who took exam 1 is small compared to its mean,
they could be correlated enough that it would be hard to tell how much
each contributed in a regression. This could show up as very weak
t-statistics for both variables, in the face of a good R**2 and
F-statistic for the regression as a whole.

>         5. Changing the origin of scoreexam1 would not modify this:
> the degree (as opposed to the sign) of linear correlation between two
> variables is invariant to changes in the position of the origin or in
> the unit of measurement of either variable.

This is the one I had to think twice about. Hector's absolutely right
about what he says. But the change I suggested, from

RECODE ScoreExam1 (MISSING = 0).

to

COMPUTE ScoreExam1 = ScoreExam1 - 50.
RECODE  ScoreExam1 (MISSING = 0).

is not just a shift in origin; it changes how the values of ScoreExam1
are distributed. With the COMPUTE, students who didn't take exam 1 are
assigned a value near the middle of the range of those who did. Without
it, those students are assigned value 0, which may well be an outlier.
The former can give a much lower correlation with TookExam1, the
indicator variable separating the two groups.

An alternative, with the same effect as the above, is to skip the
COMPUTE and instead use

RECODE ScoreExam1 (MISSING = 50).

Finally, all three of these formulations give the same final regression
estimate, by which I mean the same set of predicted values. I won't
prove that here.
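For anyone who wants to verify this on the appendix test data, here's a
sketch (the S* and PRED* names are just illustrative):

* Three codings of the exam 1 score.
COMPUTE S0  = ScoreExam1.
RECODE  S0  (MISSING = 0).
COMPUTE S50 = ScoreExam1 - 50.
RECODE  S50 (MISSING = 0).
COMPUTE SM  = ScoreExam1.
RECODE  SM  (MISSING = 50).

* Each run saves its predicted values; all three should match.
REGRESSION /DEPENDENT ScoreExam2
     /METHOD  =ENTER TookExam1 S0   /SAVE PRED(PRED0).
REGRESSION /DEPENDENT ScoreExam2
     /METHOD  =ENTER TookExam1 S50  /SAVE PRED(PRED50A).
REGRESSION /DEPENDENT ScoreExam2
     /METHOD  =ENTER TookExam1 SM   /SAVE PRED(PRED50B).
LIST PRED0 PRED50A PRED50B.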

Re: "Non-exist" data in regression analysis

Johnson Lau
In reply to this post by Hector Maletta
Thanks for your comments.

The example is just for illustration; my real question is more
complicated. I'm analysing some election data. Two general elections
were held, in 2000 and 2005. For all candidates in 2005 I have their
percentage of votes, and they can be divided into 4 groups: 1. did not
participate in the 2000 election; 2. participated in the 2000 election
but lost; 3. participated in the 2000 election and won; 4. participated
in the 2000 election and won because there was no competitor
("duly-elected"). For groups 2 and 3, I also have their percentage of
votes in 2000. For group 1, the vote percentage for 2000 does not exist,
since they did not participate in that election. For group 4, the vote
percentage for 2000 does not exist, since there was no competitor and
voters did not have to vote.

The data structure would be like this:
variable name: result2000  percent2000  result2005  percent2005
   case 1           1            .            2           .37
   case 2           2           .39           2           .39
   case 3           3           .58           3           .69
   case 4           4            .            3           .70
   case 5           3           .70           4            .
   case 6           2           .18           2           .21
   case 7           1            .            3           .66
   case 8           4            .            4            .
   .....

For result2000 and result2005: 1 = did not participate, 2 = lost,
3 = won, 4 = duly-elected. "percent" is missing if and only if
"result" = 1 or 4.

I'd like to predict percent2005 from result2000 and percent2000, with
dummy variables assigned for result2000. Similarly, logistic regression
will be used to predict result2005 from result2000 and percent2000.

I'd like to include both percent2000 and result2000 in the same model,
since I expect some interaction between them (so I might include an
interaction term). I'd also like to include more independent variables,
such as gender, age and political affiliation, in a single model. I might
have to drop percent2000 if I can't figure out a solution, even though
the r between percent2000 and percent2005 is larger than 0.5.
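To make that concrete, here is a rough sketch of the kind of syntax I
have in mind, following Richard's earlier recoding idea (the dummy and
filled-variable names are made up, and 0.5 is just an illustrative
midrange fill value):

* Dummies for result2000; group 1 (did not run in 2000) is the reference.
COMPUTE lost2000 = (result2000 = 2).
COMPUTE won2000  = (result2000 = 3).
COMPUTE duly2000 = (result2000 = 4).

* Fill the non-existing 2000 percentages with a midrange value.
COMPUTE pct2000f = percent2000.
RECODE  pct2000f (MISSING = 0.5).

* One possible interaction term.
COMPUTE won_x_pct = won2000 * pct2000f.

REGRESSION
     /DEPENDENT percent2005
     /METHOD  =ENTER lost2000 won2000 duly2000 pct2000f won_x_pct.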

For Richard's suggestion, does it mean I can simply substitute any
single number for all the missing values?

Thanks again!


Re: "Non-exist" data in regression analysis

Hector Maletta
        J.Lau wrote:
        "For Richard's suggestion, does it mean I can simply substitute any
single number for all missing values?"

        I think Richard is not suggesting exactly that. Either you
substitute zero for the missing values, or you rescale the score by
subtracting (for example) 50 and then substitute zero, which amounts to
substituting 50 on the original scale.

        Hector


Re: "Non-exist" data in regression analysis

Richard Ristow
In reply to this post by Johnson Lau
At 12:15 PM 1/20/2007, Johnson Lau wrote:

>For Richard's suggestion, does it mean I can simply substitute any
>single number for all the missing values?

As a matter of mathematics it means exactly that: you substitute a
number, and it doesn't matter which one. That is true, however, ONLY IF
your model also includes the indicator ('dummy') variable that
distinguishes cases that initially had missing values from cases that
have valid ones. (See the Appendix.)

As a matter of estimation, it's better if the indicator variable and the
score variable (with a number substituted for the missing values) aren't
too highly correlated. (Not being 'too highly correlated' is desirable
for all independent variables in a regression, of course.) That's why I
suggest that the single number filled in for the missing values not be
too far from the mean score of the cases where the score isn't missing.
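If you want the substituted number to be exactly the mean of the valid
scores, here's a sketch (AGGREGATE with MODE=ADDVARIABLES and no BREAK
puts the overall mean on every case; MeanS1 is an illustrative name; note
this fills every missing score, so it assumes missing means 'did not take
exam 1'):

AGGREGATE
     /OUTFILE=* MODE=ADDVARIABLES
     /MeanS1 = MEAN(ScoreExam1).
IF (MISSING(ScoreExam1)) ScoreExam1 = MeanS1.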

Finally, this applies only in the 'missing' group - in your original
parlance, those who didn't take exam 1. If anyone took exam 1 but you
don't have the score, either leave the score missing, or do careful
missing-value imputation.

(What I'm proposing for you is *not* missing-value imputation. It does
not, that is, try to estimate a 'true value' for those who didn't take
exam 1. It simply structures the model so it's both reasonable and
estimable.)

========
APPENDIX: "As a matter of mathematics, it doesn't matter what value you
substitute."

Warning! Linear algebra alert! If you have a known hypersensitivity to
linear algebra, or linear algebra is incompatible with your metabolism,
DO NOT PROCEED!
........
If you have n observations, then the observed values of the dependent
variable constitute a set of n numbers; this may be taken as a point,
or vector, in an n-dimensional vector space.

Further, each independent variable, whose observed values also form a
set of n numbers, may be taken as a point or vector in the same vector
space. (This counts the 'constant' as one of the variables. That's
mathematically correct, although the constant is usually considered
separately for estimation and statistical inference.)

Suppose there are k independent variables. Treating them as vectors,
and following the convention of using capital letters to denote
vectors, the regression model is

Y'=SUM(i=1,k)(a(i)*X(i))

where the a(i) are scalars - the 'regression coefficients' we all know
and love.

That is, the set of possible estimators Y' is a linear subspace of the
vector space, the linear subspace spanned by vectors (independent
variables) X(1) to X(k).

The regression problem, using the least-squares criterion as we do, is
then choosing the a(i) to minimize |Y-Y'|, where Y is the dependent
variable.

I say that two formulations of the regression problem are
mathematically equivalent if they have the same set of possible
estimators Y'; that is, the independent variables in the two
formulations span the same vector subspace.

Consider two models:

Y'=a*S+b*I
and
Y'=a'*S'+b'*I.

In these equations,
. I is the indicator variable for the set of participants who DO NOT
have a valid score
. S is the score, with value s substituted for participants who don't
have scores; S' is the score, with a different value s' substituted.

Then, S-S'=(s-s')*I; S=S'+(s-s')*I

If an estimator Y0 is attainable with the first model, so that one can
write,

Y0=a*S+b*I

then

Y0=a*(S'+(s-s')*I)   + b*I
Y0=a*S' + a*(s-s')*I + b*I
Y0=a*S' +(a*(s-s')+b)*I

and Y0 is attainable with the second model.

Re: "Non-exist" data in regression analysis

Johnson Lau
Thanks a lot for your great suggestion. It gives good predictions for
my data. As you said, no matter what number is substituted for the
missing values, the predicted values are just the same. However, with a
different number substituted, the standardized betas change, and the
p-value for the dummy variable changes as well. How should I interpret
the effect size in this case? Thanks a lot!

Regards,

Johnson Lau


--
Johnson Lau
Research Assistant
School of Public Health
The Chinese University of Hong Kong
Tel: (852) 2252 8705
Fax: (852) 2145 8517

Re: "Non-exist" data in regression analysis

Richard Ristow
At 03:30 AM 1/21/2007, Johnson Lau wrote:

>Thanks a lot for your great suggestion. It gives good predictions for
>my data. As you said, no matter what number is substituted for the
>missing values, the predicted values are just the same.

Excellent. I'm glad it's useful!

>However, with a different number substituted, the standardized betas
>change, and the p-value for the dummy variable changes as well. How
>should I interpret the effect size in this case? Thanks a lot!

Short answer: choose the substituted number to minimize the correlation
between the score (with the number substituted) and the dummy variable.
I'm not doing the arithmetic here, but I think substituting the mean of
the valid scores will give zero correlation. As I've written, my choice
would be a round number reasonably near the mean.

This isn't fudging. It's making a legitimate choice in the
representation of your data, for greater precision in the estimate.
.......................................
Now, the detailed (long-winded?) answer:

>As you said, no matter what number is substituted for the missing
>values, the predicted values are just the same.

Yes, that's right. Follow the proof I sent, if you want: you will get
the same predictor with any value of the substituted number.

>[But] the standardized betas change

Again, yes; you can see from the same proof that changing the
substituted number will change the coefficients, unstandardized and
standardized. (The proof I gave is in terms of the unstandardized
coefficients, the more direct approach.)

>the p-value for the dummy variable changes as well

That also is to be expected. I'd be surprised if the p-value associated
with the 'score' variable doesn't also change.

Broadly speaking, correlation between independent variables tends to
suppress statistical significance for the effects of both variables.

Think of it like this: the significance test is a measure of the
evidence that the variable has an effect on the dependent variable.
Normally, you think of testing whether the effect is there or whether it
isn't. But with correlated variables, there's not just the question of
whether an effect is present; there's also the question of which
variable is (more) responsible for it. Your test may fail to reach
significance, not because the procedure can't tell 'whether', but
because it can't tell 'which'.

A common clue is that the F-test for inclusion of the variables as a
group is strongly significant, while the t-tests for the individual
variables, with the other present, are not. (This effect exists with
any correlation between the variables, but it's usually a problem only
with higher correlations - what would people estimate? I'd say, around
0.5 or higher.)

Conclusion, as above: choose your representation (your substituted
number) to keep the correlation of the two variables low. Trust the
results you get.
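A quick sketch of that check on the exam example (50 stands for whatever
round number you choose; TEMPORARY keeps the recode from permanently
altering the data):

TEMPORARY.
RECODE ScoreExam1 (MISSING = 50).
CORRELATIONS /VARIABLES = TookExam1 ScoreExam1.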

-With best wishes, and good luck to you,
  Richard