Hi, all,
I do not know how to deal with collinearity when building a multivariate linear regression (MLR) model. In my case, there are more than 70 independent variables (IVs).

I first tried to use VIF to detect collinearity as follows:
(1) build an MLR model ("Enter" method) to obtain the VIF values for the IVs
(2) find the IV with the largest VIF value (larger than 10) and delete it
(3) rebuild an MLR model with the remaining IVs
(4) repeat steps (2) and (3) until all the remaining IVs have a VIF value less than 10

After that, I used the forward or backward method to select a subset of the remaining IVs to build the MLR model. However, the final model has a very low R square (below 0.1).

I also tried another method to build the MLR model. This time, I used the condition number to detect collinearity. The goal is to obtain an MLR model that satisfies: (a) its R square is as large as possible; and (b) its condition number (CN) is less than 30. I used the following method (I am not sure whether it is right):
(1) build the MLR model with all the IVs (using the forward or backward method)
(2) examine the condition number (CN) of the MLR model: if CN < 30, then OK; otherwise delete an IV from the MLR model and go to (3)
(3) rebuild an MLR model with the remaining IVs
(4) repeat steps (2) and (3)

The problem is which criterion should be used to delete an IV in step (2) when CN > 30. I tried two methods: (a) delete the IV with the largest VIF value; or (b) delete an IV at random. I found that (b) may result in a better MLR model.

My purpose is to obtain an MLR model that satisfies: (a) its R square is as large as possible; and (b) its condition number (CN) is less than 30. How can I do it? Thank you very much.

Yuming Zhou

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L. For a list of commands to manage subscriptions, send the command INFO REFCARD.
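The VIF-pruning loop in steps (1)-(4) above can be sketched outside SPSS as well. A minimal numpy version, on hypothetical near-collinear data (the variable names and the construction of x3 are made up for illustration):

```python
import numpy as np

def vif(X):
    """VIF for each column of X: 1 / (1 - R^2) from regressing
    that column on all the others (with an intercept)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out[j] = 1.0 / (1.0 - r2)
    return out

def prune_by_vif(X, names, threshold=10.0):
    """Repeatedly drop the column with the largest VIF until all
    remaining VIFs fall below the threshold (the loop in the post)."""
    X, names = X.copy(), list(names)
    while X.shape[1] > 1:
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] < threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return X, names

# Hypothetical data: x3 is nearly x1 + x2, so one of the three goes
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 200))
x3 = x1 + x2 + rng.normal(scale=0.01, size=200)
X = np.column_stack([x1, x2, x3])
Xr, kept = prune_by_vif(X, ["x1", "x2", "x3"])
```

Note that this only addresses detection; as the replies below point out, which variable the loop happens to drop first is essentially arbitrary among a collinear set, which is part of why the pruned model can end up with a poor R square.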
You will not find a clear procedure that tells you how to do it. The two methods complement each other. However, I usually prefer the condition index criterion, and if you also output the variance proportions it will improve your selection of the variable having the largest impact on cleaning up the regression function. I have not used SPSS for collinearity diagnostics, but I assume you can get both the condition index and the variance proportion for each variable.

Fermin Ornelas, Ph.D.
Management Analyst III, AZ DES
1789 W. Jefferson Street
Phoenix, AZ 85032
Tel: (602) 542-5639
E-mail: [hidden email]
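The condition indices and variance proportions Fermin refers to can be computed directly from the scaled design matrix. A numpy sketch following the Belsley-Kuh-Welsch convention (hypothetical data; SPSS's REGRESSION collinearity diagnostics output is of this general form):

```python
import numpy as np

def collinearity_diagnostics(X):
    """Condition indices and variance-decomposition proportions,
    Belsley-Kuh-Welsch style: scale each column to unit length,
    take the SVD, and split each coefficient's variance across
    the singular values.  The condition number is cond_idx[-1]."""
    Z = X / np.linalg.norm(X, axis=0)              # unit-length columns
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    cond_idx = s[0] / s                            # ascending severity
    phi = (Vt.T ** 2) / (s ** 2)                   # variable x dimension
    props = phi / phi.sum(axis=1, keepdims=True)   # each row sums to 1
    return cond_idx, props

# Hypothetical near-collinear data: x3 is almost x1 + x2
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 200))
x3 = x1 + x2 + rng.normal(scale=0.01, size=200)
cond_idx, props = collinearity_diagnostics(np.column_stack([x1, x2, x3]))
```

A dimension with a condition index above roughly 30 in which two or more variables carry a variance proportion above 0.5 points at the variables actually involved in that collinearity; this is a more targeted deletion rule than "largest VIF" or "pick one at random".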
In reply to this post by zhou yuming
At 10:14 PM 10/30/2007, zhou yuming wrote:
>I do not know how to deal with collinearity when building a
>multivariate linear regression (MLR) model.
>
>In my case, there are more than 70 independent variables (IVs).

Your analysis sounds like it's in trouble, collinearity or not.

First: that many independents eats sample size for breakfast. By the rule of ten observations per independent variable, you need at least 700; but with the complexity you've got, I'd want more like ten times that many - say, 10,000, to use round numbers.

Second, you have a big multiple-comparison problem. You expect (in the technical sense) 3.5 coefficients significant at p<.05, supposing no association whatever between the dependent and the independents; and about an even chance of at least one significant at p<.01, on the same assumption. One reason you need such a large sample size is to have some statistical power left after correcting for multiple comparisons.

Third, what are you going to say about your results? It can be mighty hard, sometimes near impossible, to make a coherent discussion of the results of an estimation like that. Indeed, what does your description of your model look like?

And fourth, do you know your data well to start with? I suppose the 70 variables group into subject areas. In such a case, collinearities within a subject area are common; collinearities between subject areas are rarer and generally deserve discussion.

Anyway, I'd look to reduce the dimensionality drastically, by reducing each subject area to a summary variable or two. Depending on your problem and your tastes, that could be anything from selecting the most illuminating variables, through simple averaging, to factor analysis within the subject areas.

And, the very best of luck to you,
Richard
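The "simple averaging" end of that suggestion is easy to sketch: collapse each subject area to the mean of its standardized members. The area names, variable names, and grouping below are hypothetical:

```python
import numpy as np

def summarize_groups(X, names, groups):
    """Collapse each subject area to one summary variable:
    z-score each member column, then average the z-scores,
    so no single variable dominates through its scale."""
    cols, out_names = [], []
    for area, members in groups.items():
        idx = [names.index(m) for m in members]
        block = X[:, idx]
        z = (block - block.mean(axis=0)) / block.std(axis=0)
        cols.append(z.mean(axis=1))
        out_names.append(area)
    return np.column_stack(cols), out_names

# Hypothetical example: six IVs in two subject areas
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
names = ["inc", "wealth", "debt", "age", "tenure", "exper"]
groups = {"finance": ["inc", "wealth", "debt"],
          "seniority": ["age", "tenure", "exper"]}
S, snames = summarize_groups(X, names, groups)
```

Regressing on the two summary columns instead of the six originals removes the within-area collinearity by construction, at the cost of no longer estimating separate coefficients per variable.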
Just an observation on this reply. If you are referring to multiple-comparison tests for the means or the medians, that type of testing is not relevant here. Having said that, your observation regarding the sample size is valid. For someone looking at 70 predictor variables there should be a large number of observations. In my own experience, for this number of variables my development samples were 20,000 or more observations.

Fermin Ornelas, Ph.D.
At 12:51 PM 11/2/2007, Ornelas, Fermin wrote:
>If you are referring to multiple comparison tests for the means or the
>medians, this type of testing is not relevant here.

In that narrow sense, it is not relevant. But in a broad sense, it is very relevant: running many significance tests of any kind raises the risk of false 'significant' results. In this case, I'm thinking of the t-tests for the regression coefficients. Again, with 70 coefficients and no actual association at all, you expect 3.5 significant at p<.05, with about an even chance of at least one significant at p<.01.

I'm not an expert, and I gather that the Bonferroni correction can be too conservative; but applying it, i.e. dividing the criterion p-value by 70, you should trust only those t-tests for which the reported p < .05/70, i.e. about 0.0007.
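The arithmetic in the paragraphs above, written out as a stdlib-only check (assuming 70 independent tests under the null):

```python
k = 70  # number of coefficient t-tests

# Expected count of false 'significant' results at p < .05
expected_at_05 = k * 0.05

# Probability of at least one false positive at p < .01
p_any_at_01 = 1 - (1 - 0.01) ** k

# Bonferroni-adjusted per-test threshold for an overall .05 level
bonferroni_cut = 0.05 / k

print(expected_at_05)   # 3.5 expected by chance alone
print(p_any_at_01)      # roughly an even chance
print(bonferroni_cut)   # about 0.0007
```

The independence assumption is of course violated with collinear predictors, but the order of magnitude of the problem is the same.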
Hi
I think you have to try to reduce your variables. Two ways come to mind:
- factor analysis (PCA), in order to run the regression on the factors and then relate it back to the variables; given the weight of each component and the loading of each variable on the component, you should be able to derive something like the "importance" of each variable with respect to the dependent variable;
- create a single variable from the variables that are collinear, i.e. create a score derived from these variables.

Maybe this approach is not "orthodox", more "qualitative"; it depends on your goal (to measure exactly, or to understand connections and relative weights?).

Bye
Rita
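Rita's first suggestion (regress on principal components rather than the raw, collinear IVs) can be sketched with numpy's SVD; the data below are hypothetical, and in SPSS the analogous scores can be saved from the FACTOR procedure:

```python
import numpy as np

def pca_scores(X, n_components):
    """Principal-component scores via SVD of the centered data.
    Regressing on the scores instead of the raw IVs removes
    collinearity, since the scores are mutually orthogonal."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T     # component scores (n x c)
    loadings = Vt[:n_components].T        # variable loadings (k x c)
    explained = s[:n_components] ** 2 / (s ** 2).sum()
    return scores, loadings, explained

# Hypothetical: three nearly collinear IVs collapse onto ~2 components
rng = np.random.default_rng(2)
x1, x2 = rng.normal(size=(2, 200))
x3 = x1 + x2 + rng.normal(scale=0.05, size=200)
X = np.column_stack([x1, x2, x3])
scores, loadings, explained = pca_scores(X, 2)
```

The loadings map each retained component back to the original variables, which is what Rita means by recovering the relative "importance" of the variables from a regression on the components.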
In reply to this post by zhou yuming
Hi, all,
Thanks for your replies. It seems that another regression technique, PLS (partial least squares), may be very suitable for this case (i.e. (1) collinearity, and (2) many IVs with relatively few data points).

Best regards
Yuming Zhou
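PLS is not part of the base SPSS REGRESSION procedure (I believe it is available only as an add-on command), but the core idea fits in a short numpy sketch: single-response PLS (PLS1, NIPALS-style) extracts orthogonal latent components chosen to maximize covariance with the DV, which is exactly what makes it robust to collinear IVs. All data here are hypothetical:

```python
import numpy as np

def pls1(X, y, n_components):
    """Minimal PLS1 (NIPALS) sketch: extract latent components that
    maximize covariance with y, then return the implied regression
    coefficients (slope vector, intercept) on the original scale."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    W, P, q = [], [], []
    Xr, yr = Xc.copy(), yc.copy()
    for _ in range(n_components):
        w = Xr.T @ yr                    # covariance direction
        w /= np.linalg.norm(w)
        t = Xr @ w                       # component score
        tt = t @ t
        p = Xr.T @ t / tt                # X loading
        qk = yr @ t / tt                 # y loading
        Xr -= np.outer(t, p)             # deflate
        yr -= qk * t
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, y.mean() - X.mean(axis=0) @ beta

# Hypothetical: collinear IVs, signal lives in a 2-dim subspace
rng = np.random.default_rng(3)
x1, x2 = rng.normal(size=(2, 100))
x3 = x1 + x2 + rng.normal(scale=0.01, size=100)
X = np.column_stack([x1, x2, x3])
y = 2 * x1 - x2 + rng.normal(scale=0.1, size=100)
beta, b0 = pls1(X, y, 2)
yhat = X @ beta + b0
r2 = 1 - ((y - yhat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
```

With two components, the fit recovers nearly all of the signal even though the three IVs are almost perfectly collinear; an OLS fit on the same X would have wildly unstable coefficients.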
In reply to this post by Richard Ristow
I am not sure I follow your argument, but when this data problem occurs the standard errors are inflated, expected signs are not consistent, and removing variables from the regression function causes significant changes in the values of the regression coefficients and their signs. In more severe cases, dependence among the predictors causes X'X not to be of full rank, and you cannot get its inverse. Under this scenario most software packages will give you a warning regarding the validity of the parameter estimates, with the parameter estimates missing for the perfectly collinear predictors.

Regarding Bonferroni's procedure: this testing is used for multiple comparisons, and one usually establishes a low alpha value, say .20, that gets adjusted by the number of multiple comparisons among the means or medians. For example, for nonparametric methods, if I am comparing 3 means then the adjusted alpha becomes .2/(3*2) = .033. This value is used to calculate the critical value.

Fermin Ornelas, Ph.D.
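The rank-deficiency point in the first paragraph is easy to demonstrate (numpy rather than SPSS; hypothetical data):

```python
import numpy as np

# Perfect collinearity in action: x3 is an exact linear combination
# of x1 and x2, so X'X is rank-deficient and has no ordinary inverse.
rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=(2, 50))
x3 = x1 + x2                              # exact dependence
X = np.column_stack([np.ones(50), x1, x2, x3])

XtX = X.T @ X
rank = np.linalg.matrix_rank(XtX)         # 3, although XtX is 4 x 4
cond = np.linalg.cond(XtX)                # enormous
```

This is why packages drop, or refuse to estimate, one of the perfectly collinear predictors: the normal equations then have infinitely many solutions, and any reported coefficients for those columns would be arbitrary.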