Dear all,
I'm dealing with a data structure as follows and don't know how to run a multiple regression/logistic regression with it. There are 100 students taking 2 examinations. 50 of them took both exams and 50 took only exam 2 (i.e. 50 students took exam 1 and all 100 students took exam 2). I would like to predict the score of exam 2 (an interval scale) from the score of exam 1 and whether a student took exam 1. An example is shown below:

variable name: tookexam1 scoreexam1 scoreexam2 passexam2
case 1              1         50        60        1
case 2              0          .        70        1
case 3              0          .        40        0
case 4              1         40        40        0
.....

Therefore, I'd like to predict scoreexam2 with tookexam1 and scoreexam1, where tookexam1 is valid for all cases (either 0 or 1) and scoreexam1 is valid if and only if tookexam1=1. Similarly, I'd like to predict passexam2 (a dichotomous variable) with tookexam1 and scoreexam1.

With bivariate correlation, I have found that tookexam1 and scoreexam1 are correlated with scoreexam2. However, whether I run a multiple regression or a logistic regression, cases with a missing value in scoreexam1 are excluded.

In addition to this, I'd also like to run ANCOVA with more predictors on a similar data structure. I know that pairwise exclusion is allowed in multiple linear regression, but I'm not sure it is a correct choice. Moreover, a pairwise option is not available in the logistic regression and UNIANOVA commands.

Would anyone kindly suggest a solution for this? Thanks a lot.

Regards,

Johnson Lau
At 11:25 PM 1/18/2007, Johnson Lau wrote:
>There are 100 students taking 2 examinations. 50 students took exam 1
>and all students took exam 2. I would like to predict the score of
>exam2 (an interval scale) with score of exam1 and whether a student
>took exam 1.
>
>Therefore, i'd like to predict scoreexam2 with tookexam1 and
>scoreexam1, where tookexam1 is valid for all cases (either 0 or 1),
>scoreexam1 is valid if and only if tookexam1=1. However, no matter I
>run a multiple regression or logistic regression, cases with missing
>value in scoreexam1 will be excluded.

Here's something to try (methodologists on the list, please comment): Try a model in which simply taking exam 1 contributes a certain amount the exam 2 score; every point scored on exam 2 scores a certain amount more. Then you get something like this <WRR: not kept separately>:

TEMPORARY.
RECODE ScoreExam1 (MISSING = 0).

REGRESSION
  /DESCRIPTIVES = MEAN STDDEV CORR
  /DEPENDENT ScoreExam2
  /METHOD = ENTER TookExam1 ScoreExam1.

The trouble with that is, TookExam1 and ScoreExam1 will then have a ridiculously high correlation. A solution to that is to take a 'normal' Exam1 score as somewhere in the midrange, say 50. Like this:

TEMPORARY.
COMPUTE ScoreExam1 = ScoreExam1 - 50.
RECODE ScoreExam1 (MISSING = 0).

REGRESSION
  /DESCRIPTIVES = MEAN STDDEV CORR
  /DEPENDENT ScoreExam2
  /METHOD = ENTER TookExam1 ScoreExam1.

I ran both of these, but they don't show anything of interest. With four cases and two independent variables, they hardly could, though. It's not legitimate to run the regression at all with that little data. (At least, the syntax is checked.) But you might want to try something similar with your real dataset.

===================
APPENDIX: Test data
===================
NEW FILE.
DATA LIST LIST
  /CaseWord (A4)
   CaseNum TookExam1 ScoreExam1 ScoreExam2 PassExam2 (5F2).

* variable name: tookexam1 scoreexam1 scoreexam2 passexam2.
BEGIN DATA
case 1 1 50 60 1
case 2 0 . 70 1
case 3 0 . 40 0
case 4 1 40 40 0
END DATA.
LIST.
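Richard's syntax covers only the linear REGRESSION; the original question also asked about a logistic model for passexam2. The same fill-plus-indicator device ought to carry over unchanged, since the logistic model has the same linear predictor in TookExam1 and ScoreExam1. A minimal sketch, assuming the variable names from the appendix (the fill value 0 is as arbitrary here as in the linear case):

TEMPORARY.
RECODE ScoreExam1 (MISSING = 0).

LOGISTIC REGRESSION VARIABLES PassExam2
  /METHOD = ENTER TookExam1 ScoreExam1.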
Some comments on the question and on Richard's response:
1. Of course SPSS excludes cases listwise from a regression analysis involving a list of variables. Any case with a missing value in at least one variable is excluded.

2. For clarity's sake I suggest Richard mistyped "2" for "1" the second time he mentions an exam in his sentence: "Try a model in which simply taking exam 1 contributes a certain amount [to] the exam 2 score; every point scored on exam 2 scores a certain amount more." It should read: "Try a model in which simply taking exam 1 contributes a certain amount [to] the exam 2 score; every point scored on exam 1 scores a certain amount more."

4. The equation for Richard's first proposal would be Score Exam 2 = b0 + b1[tookexam1] + b2[score exam 1]. Assigning a zero value for Score Exam 1 to those students not taking Exam 1 would serve this purpose, because those students would no longer be excluded from the analysis. By the way, make sure the 0 is not a user-missing value for either variable.

5. The two resulting independent variables, tookexam1 and scoreexam1, would certainly be correlated, because all those with tookexam1=0 will have scoreexam1=0 as well, but I do not think the LINEAR correlation would be so high as to cause collinearity and singularity, thus impeding regression.

6. Changing the origin of scoreexam1 would not modify this: the degree (as opposed to the sign) of linear correlation between two variables is invariant to changes in the position of the origin or changes in the unit of measurement of either variable.

As a conclusion, Richard's first proposal is fine as a response to your question. You may recode the system-missing scores the way he proposes and estimate your two-variable equation as he suggests (just check for collinearity, just in case).

Now, why and whether you should want the equation in that form is a completely different problem. Is there any empirical or theoretical reason to expect that taking exam 1 would in itself influence the score of exam 2? Is it absolutely necessary that the effect of having taken exam 1 be controlled by the score obtained at exam 1? Why do the two questions have to be tackled together?

You may alternatively choose to split the problem in two: (a) test the effect of having or not having taken exam 1, using the whole complement of 100 cases with two variables: tookexam1 and scoreexam2; (b) test the effect of score 1 on score 2, using only the 50 cases who took both exams, with two variables: scoreexam1 and scoreexam2. There is no information loss when you split the problem, as there is no possible variation in the correlation (or lack thereof) between taking the first exam and having a score from it, and no possible effect of having (or not having) taken exam 1 on the score obtained at the same exam 1, except the obvious fact that only those taking the exam get a score.

Hector
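Hector's two-part alternative is straightforward to run as two separate analyses. A minimal sketch in SPSS syntax, assuming the variable names from Richard's test data (a t-test for part (a) and a temporary case selection for part (b) are one reasonable rendering, not the only one):

* (a) Effect of having taken exam 1, all 100 cases.
T-TEST GROUPS = TookExam1(0,1)
  /VARIABLES = ScoreExam2.

* (b) Effect of the exam 1 score, only the 50 who took both exams.
TEMPORARY.
SELECT IF (TookExam1 = 1).
REGRESSION
  /DEPENDENT ScoreExam2
  /METHOD = ENTER ScoreExam1.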
A comment on one point. I had to think, to remember why this one works.
At 06:48 PM 1/19/2007, Hector Maletta wrote:
[...]
> 4. The equation for Richard's first proposal would be Score Exam 2 =
> b0 + b1[tookexam1] + b2[score exam 1]. Assigning a zero value for
> Score Exam 1 to those students not taking Exam 1 would serve this
> purpose, because those students would no longer be excluded from the
> analysis. By the way, make sure the 0 is not a user missing value for
> either variable.
> 5. The two resulting independent variables, tookexam1 and
> scoreexam1, would certainly be correlated, because all those with
> tookexam1=0 will have scoreexam1=0 as well, but I do not think the
> LINEAR correlation would be as high as to cause collinearity and
> singularity, thus impeding regression.

They certainly couldn't be completely collinear, because 'tookexam1' is dichotomous and 'scoreexam1' is multi-valued. But if the variance of 'scoreexam1', among those who took exam 1, is small compared to its mean, they could be correlated enough that it would be hard to tell how much each was influential in a regression. This could be reflected in very weak t-statistics for both variables, in the face of a good R**2 and F-statistics for the regression as a whole.

> 6. Changing the origin of scorexam1 would not modify this:
> the degree (as opposed to the sign) of linear correlation between two
> variables is invariant to changes in the position of the origin or
> changes in the unit of measurement of either variable.

This is the one I had to think twice about. Hector's absolutely right about what he says. But the change I suggested, from

RECODE ScoreExam1 (MISSING = 0).

to

COMPUTE ScoreExam1 = ScoreExam1 - 50.
RECODE ScoreExam1 (MISSING = 0).

is not just a shift in origin; it changes how the values of ScoreExam1 are distributed. With the COMPUTE, students who didn't take exam 1 are assigned a value near the middle of the range of those who did. Without it, those students are assigned value 0, which may well be an outlier. The former can give a much lower correlation with TookExam1, the indicator variable separating the two groups.

An alternative, with the same effect as the above, is to skip the COMPUTE and instead use

RECODE ScoreExam1 (MISSING = 50).

Finally, all these three formulations give the same final regression estimate, by which I mean the same set of predicted values. I won't prove that here.
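Richard's closing claim - the same set of predicted values under any substituted number - can be checked empirically on a real dataset. A minimal sketch, again assuming the test-data variable names (PRE_1 and PRE_2 are the default names REGRESSION assigns to saved predicted values; the maximum absolute difference should come out at rounding error):

* Two fills of the same score: 0 versus the midrange 50.
COMPUTE Score1Zero = ScoreExam1.
RECODE Score1Zero (MISSING = 0).
COMPUTE Score1Mid = ScoreExam1.
RECODE Score1Mid (MISSING = 50).
EXECUTE.

REGRESSION
  /DEPENDENT ScoreExam2
  /METHOD = ENTER TookExam1 Score1Zero
  /SAVE PRED.
REGRESSION
  /DEPENDENT ScoreExam2
  /METHOD = ENTER TookExam1 Score1Mid
  /SAVE PRED.

* PredDiff should be zero (up to rounding) for every case.
COMPUTE PredDiff = PRE_1 - PRE_2.
DESCRIPTIVES VARIABLES = PredDiff.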
In reply to this post by Hector Maletta
Thanks for your comments.
The example is just for illustration, since my real question is more complicated. I'm analysing some election data. Two general elections were held in 2000 and 2005. For all candidates in 2005 I have their percentage of votes, and they can be divided into 4 groups:

1. Did not participate in the 2000 election;
2. Participated in the 2000 election but lost;
3. Participated in the 2000 election and won;
4. Participated in the 2000 election and won because there was no competitor ("duly-elected").

For groups 2 and 3, I also have their percentage of votes in 2000. For group 1, the vote percentage for 2000 does not exist since they did not participate in that election. For group 4, the vote percentage for 2000 does not exist since there was no competitor and voters didn't have to vote. The data structure would be like this:

variable name: result2000 percent2000 result2005 percent2005
case 1             1          .           2         .37
case 2             2         .39          2         .39
case 3             3         .58          3         .69
case 4             4          .           3         .70
case 5             3         .70          4          .
case 6             2         .18          2         .21
case 7             1          .           3         .66
case 8             4          .           4          .
...........

For result2000 and result2005: 1=did not participate, 2=lost, 3=won, 4=duly-elected. "percent" is missing if and only if "result"=1 or 4.

I'd like to predict percent2005 with result2000 and percent2000. Dummy variables will be assigned for result2000. Similarly, logistic regression will be used to predict result2005 with result2000 and percent2000. I'd like to include both percent2000 and result2000 in the same model since I expect some interaction between them (therefore I might include an interaction term for them). Also, I'd like to include more independent variables such as gender, age, and political affiliation in a single model. I might have to drop percent2000 if I can't figure out a solution, even though the r between percent2000 and percent2005 is larger than 0.5.

For Richard's suggestion, does it mean I can simply substitute any single number for all missing values?

Thanks again!
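Applied to the election data, Richard's device needs one dummy per result2000 group plus the 2000 vote share with a constant filled in where it does not exist; the fill is absorbed because the group dummies cover the two 'missing' groups (1 and 4), exactly as in the exam example. A minimal sketch, assuming the variable names from the post (group 1 as reference; the fill value .40 and the particular interaction term are arbitrary illustrations):

* Dummies for result2000, with group 1 (did not participate) as reference.
COMPUTE lost2000 = (result2000 = 2).
COMPUTE won2000  = (result2000 = 3).
COMPUTE duly2000 = (result2000 = 4).

* Fill the non-existent 2000 vote shares with a midrange constant.
COMPUTE pct2000f = percent2000.
IF (MISSING(percent2000)) pct2000f = .40.

* One possible interaction: prior vote share among the 2000 winners.
COMPUTE wonXpct = won2000 * pct2000f.
EXECUTE.

REGRESSION
  /DEPENDENT percent2005
  /METHOD = ENTER lost2000 won2000 duly2000 pct2000f wonXpct.

Note that candidates duly-elected in 2005 still drop out of this run, because for them percent2005 - the dependent variable itself - does not exist; the fill device only rescues missing predictors.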
J.Lau wrote:
"For Richard's suggestion, does it mean I can simply substitute any single number for all missing values?" I think Richard is not suggesting exactly that. Either you substitute zero for the missing values, or you rescale the score by adding (for example) 50 and then substitute 50 for the missing values. Hector -----Mensaje original----- De: SPSSX(r) Discussion [mailto:[hidden email]] En nombre de Johnson Lau Enviado el: 20 January 2007 14:15 Para: [hidden email] Asunto: Re: "Non-exist" data in regression analysis Thanks for your comments. The example is just for illustration since my real question is more complicated. I'm analysing some election data. Two general elections were held at 2000 and 2005. For all candidates in 2005 I got their percentage of votes, and they could be divided into 4 groups: 1. Not participated the 2000 election; 2. Participated 2000 election but lost; 3. Participated 2000 election and won; 4. Participated 2000 election and won due to no competitor ("duly-elected"). For group 2 and 3, I also got their percentage of votes in 2000. For group 1, vote percentage for 2000 is not existing since they did not participated in that election. For group 4, vote percentage for 2000 is not existing since there was no competitor and voters didn't have to vote. The data structure would be like this: variable name: result2000 percent2000 result2005 percent2005 case 1 1 . 2 .37 case 2 2 .39 2 .39 case 3 3 .58 3 .69 case 4 4 . 3 .70 case 5 3 .70 4 . case 6 2 .18 2 .21 case 7 1 . 3 .66 case 8 4 . 4 . ........... for result2000 and result2005, 1=not participated, 2=lost, 3=won, 4=duly-elected. "percent" is missing if and only if "result"=1 or 4. I'd like to predict percent2005 with result2000 and percent2000. Dummy variables will be assigned for result2000. Similarly, logistic regression will be used to predict result2005 with result2000 and percent2000. I'd like to include both percent2000 and result2000 in the same model since I expect some interaction between them (therefore I might include an interaction term for them). Also, I'd like to include more independent variable such as gender, age, political affiliation into a single model. I might have to drop percent2000 if I can't figure out a solution, despite that the r between percent2000 and percent2004 is larger than 0.5 For Richard's suggestion, does it mean I can simpily substitute any single number for all missing values? Thanks again! 2007/1/20, Hector Maletta <[hidden email]>: > Some comments on the question and on Richard's response: > 1. Of course SPSS excludes cases listwise from a regression analysis > involving a list of variables. Any cases with a missing value in at least > one variable would be excluded. > 2. For clarity's sake I suggest Richard mistyped "2" for "1" the > second time he mentions an exam in his sentence: " Try a model in which > simply taking exam 1 contributes a certain amount [to] the exam 2 score; > every point scored on exam 2 scores a certain amount more." It should read: > "Try a model in which simply taking exam 1 contributes a certain amount [to] > the exam 2 score; every point scored on exam 1 scores a certain amount > more." > 4. The equation for Richard's first proposal would be Score Exam 2 = > b0 + b1[tookexam1] + b2[score exam 1]. Assigning a zero value for Score Exam > 1 to those students not taking Exam 1 would serve this purpose, because > those students would no longer be excluded from the analysis. By the way, > make sure the 0 is not a user missing value for either variable. > 5. 
In reply to this post by Johnson Lau
At 12:15 PM 1/20/2007, Johnson Lau wrote:
>For Richard's suggestion, does it mean I can simply substitute any
>single number for all missing values?

As a matter of mathematics it means exactly that; you substitute a number, but it doesn't matter which one. That is true, however, ONLY IF your model also includes the indicator ('dummy') variable that distinguishes cases that initially had missing values from cases that have valid ones. (See Appendix.)

As a matter of estimation, it's better if the indicator variable and the score variable (with a number substituted for the missing values) aren't too highly correlated. (Not being 'too highly correlated' is desirable for all independent variables in a regression, of course.) That's why I suggest that the single number that's filled in for missing values not be too far from the mean score of the cases where the score isn't missing.

Finally, this applies only in the 'missing' group - in your original parlance, those who didn't take exam 1. If anyone took exam 1 but you don't have the score, either leave the score missing, or do careful missing-value imputation.

(What I'm proposing for you is *not* missing-value imputation. It does not, that is, try to estimate a 'true value' for those who didn't take exam 1. It simply structures the model so it's both reasonable and estimable.)

========
APPENDIX: "As a matter of mathematics, it doesn't matter what value you substitute."

Warning! Linear algebra alert! If you have a known hypersensitivity to linear algebra, or linear algebra is incompatible with your metabolism, DO NOT PROCEED!
........

If you have n observations, then the observed values of the dependent variable constitute a set of n numbers; this may be taken as a point, or vector, in an n-dimensional vector space.

Further, each independent variable, whose observed values also form a set of n numbers, may be taken as a point or vector in the same vector space. (This counts the 'constant' as one of the variables. That's mathematically correct, although the constant is usually considered separately for estimation and statistical inference.)

Suppose there are k independent variables. Treating them as vectors, and following the convention of using capital letters to denote vectors, the regression model is

   Y' = SUM(i=1,k)(a(i)*X(i))

where the a(i) are scalars - the 'regression coefficients' we all know and love.

That is, the set of possible estimators Y' is a linear subspace of the vector space: the linear subspace spanned by the vectors (independent variables) X(1) to X(k).

The regression problem, using the least-squares criterion as we do, is then choosing the a(i) to minimize |Y-Y'|, where Y is the dependent variable.

I say that two formulations of the regression problem are mathematically equivalent if they have the same set of possible estimators Y'; that is, if the independent variables in the two formulations span the same vector subspace.

Consider two models:

   Y' = a*S + b*I
and
   Y' = a'*S' + b'*I.

In these equations,
. I is the indicator variable for the set of participants who DO NOT have a valid score
. S is the score, with value s substituted for participants who don't have scores; S' is the score, with a different value s' substituted.

Then S - S' = (s-s')*I, so S = S' + (s-s')*I.

If an estimator Y0 is attainable with the first model, so that one can write

   Y0 = a*S + b*I

then

   Y0 = a*(S' + (s-s')*I) + b*I
      = a*S' + a*(s-s')*I + b*I
      = a*S' + (a*(s-s') + b)*I

and Y0 is attainable with the second model.
Thanks a lot for your great suggestion. It gives good predictions for my data. As you said, no matter what number is substituted for the missing values, the predicted values are just the same.

However, with a different number substituted, the standardized betas change, and the p-value for the dummy variable changes as well. How should I interpret the effect size in this case? Thanks a lot!

Regards,

Johnson Lau
--
Johnson Lau
Research Assistant
School of Public Health
The Chinese University of Hong Kong
Tel: (852) 2252 8705
Fax: (852) 2145 8517
At 03:30 AM 1/21/2007, Johnson Lau wrote:
>Thanks a lot for your great suggestion. It gives good predictions for
>my data. As you said, no matter what number is substituted for the
>missing values, the predicted values are just the same.

Excellent. I'm glad it's useful!

>However, with a different number substituted, the standardized betas
>change, and the p-value for the dummy variable changes as well. How
>should I interpret the effect size in this case? Thanks a lot!

Short answer: choose the substituted number to minimize the correlation between the score (with the number substituted) and the dummy variable. I'm not doing the arithmetic, but I think substituting the mean of the valid scores will give zero correlation. As I've written, my choice would be a round number reasonably near the mean.

This isn't fudging. It's making a legitimate choice in the representation of your data, for greater precision in the estimate.

.......................................
Now, the detailed (long-winded?) answer:

>As you said, no matter what number is substituted for the missing
>values, the predicted values are just the same.

Yes, that's right. That is - follow the proof I sent, if you want - you will get the same predictor with any value of the substituted number.

>[But] the standardized betas change

Again, yes; you can see from the same proof that changing the substituted number will change the coefficients, unstandardized and standardized. (The proof I gave is in terms of the unstandardized coefficients, the more direct approach.)

>the p-value for the dummy variable changes as well

That also is to be expected. I'd be surprised if the p-value associated with the 'score' variable doesn't also change. Broadly speaking, correlation between independent variables tends to suppress statistical significance for the effects of both variables.

Think of it like this: the significance test is a measure of the evidence that the variable has an effect on the dependent variable. Normally, you think of testing whether the effect is there or whether it isn't. But with correlated variables, there's not just the question whether an effect is present; there's the question which variable is (more) responsible for it. Your test may fail of significance, not because the procedure can't tell 'whether', but because it can't tell 'which'.

A common clue is that the F-test for inclusion of the variables as a group is strongly significant, while the t-tests for the individual variables, each with the other present, are not. (This effect exists with any correlation between the variables, but it's usually a problem only with higher correlations - what would people estimate? I'd say, around 0.5 or higher.)

Conclusion, as above: choose your representation (your substituted number) to keep the correlation of the two variables low. Trust the results you get.

-With best wishes, and good luck to you,
Richard
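Richard's short answer - substitute something near the mean of the valid scores - can be automated instead of picked by hand. A minimal sketch, assuming the exam-example variable names; AGGREGATE with MODE=ADDVARIABLES attaches the mean of the valid scores to every case (an assumption: check that your SPSS version supports this mode):

* Overall mean of the valid exam 1 scores, attached to every case.
COMPUTE one = 1.
AGGREGATE OUTFILE = * MODE = ADDVARIABLES
  /BREAK = one
  /Mean1 = MEAN(ScoreExam1).

* Fill the non-takers with that mean; valid scores are untouched.
COMPUTE Score1Fill = ScoreExam1.
IF (MISSING(ScoreExam1)) Score1Fill = Mean1.
EXECUTE.

REGRESSION
  /DEPENDENT ScoreExam2
  /METHOD = ENTER TookExam1 Score1Fill.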