|
At my work I appear to have taken on the role of the one-eyed man in the valley of the statistically blind.

I keep getting asked to explain why centering helps with problems of collinearity in multiple regression, but my maths isn't really up to it. I have done a few toy regressions with two IVs by hand, so I can see what centering does in the matrix algebra, but I still don't really understand why it helps.

Is it something to do with the change in the determinant of the X'X matrix when you use centered variables?

Sorry if this is a bit unclear; I am at the very edge of my maths understanding.

Thanks

- Mike
|
I've not heard of using centering to address problems of collinearity except perhaps when interaction effects are added to a model. Do you have a reference for this? I was taught that the principal problem with collinearity is that it produces very unstable results: you will get widely different coefficients from different samples drawn from the same population.
|
In reply to this post by Mike Ford
Collinearity means that some independent variable is an exact (or nearly exact) linear function of one or more of the other variables in the set of independent variables. Simply changing the point of origin, i.e. subtracting the mean so that the new mean is zero, cannot change that linear dependency in the least.

For example, if your units are cities and one of your variables (say, distance in kilometres from New York) was collinear with other variables before centering, it will be collinear after centering as well. The new variable will be "distance in km from New York minus average distance in km from New York", and it will have the same linear correlations as before. You could also change kilometres into miles or light-years, and it would be the same, because centering (adding or subtracting a constant) and rescaling (multiplying by a constant) are linear transformations.

If you apply a nonlinear transformation instead (say, take the logarithm of the distance, or suchlike), the collinearity may disappear when you substitute the new variable, but you need a solid theoretical reason to think that your dependent variable depends on log distance rather than on distance itself; otherwise your solution is just an ad hoc accommodation to a quirk in your data.

Hector
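To make this concrete, here is an illustrative sketch in Python/numpy (not SPSS syntax; the variables and sample size are invented for the example). It simply verifies that centering, or changing units, leaves the correlation between two predictors untouched.

# Illustrative sketch (Python/numpy): centering and rescaling leave the
# correlation between two predictors unchanged.
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(50, 10, size=200)              # e.g. "distance in km"
x2 = 0.8 * x1 + rng.normal(0, 5, size=200)     # a predictor correlated with x1

r_raw = np.corrcoef(x1, x2)[0, 1]                                # original correlation
r_centered = np.corrcoef(x1 - x1.mean(), x2 - x2.mean())[0, 1]   # after centering
r_rescaled = np.corrcoef(x1 / 1.609, x2)[0, 1]                   # km -> miles

print(round(r_raw, 6), round(r_centered, 6), round(r_rescaled, 6))
# All three values are identical: linear transformations do not touch collinearity.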
|
In reply to this post by Mike Ford
When you estimate a model with collinear variables, the standard errors are inflated and the parameter estimates become unstable (they may even switch signs from sample to sample). The individual test statistics are usually nonsignificant, yet the overall F-test in the ANOVA table may say otherwise, and your R-square may also be high. When the collinearity is severe, the X'X matrix is (nearly) rank-deficient because one or more of its columns are (almost) linearly dependent. That is, in your data set the contribution of two or more variables to the behavior of the dependent variable is redundant. If the collinearity is severe, most software packages will issue a warning about the reliability of your estimates.

I have not tried to center the data, but I assume that this is an effort to remove linear dependence among your variables. There is also ridge regression. In my own experience there is not much you can do to deal with this problem. If for some research reason you have to keep a variable in the model, then the objective is to find a set of variables that minimizes the problem.

There is another issue to consider: the purpose of the model. If you intend to predict, the model will hold reasonably well. But for hypothesis testing, when collinearity is severe all bets are off, for the reasons mentioned in my first paragraph.

Fermin Ornelas
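The inflation described here is usually quantified by the variance inflation factor, VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The sketch below (Python/numpy, invented data; just an illustration, not part of the original thread) computes it directly.

# Illustrative sketch: variance inflation factors for a set of predictors.
# VIF_j = 1 / (1 - R_j^2), where R_j^2 is from regressing x_j on the others.
import numpy as np

def vif(X):
    """X: (n, p) array of predictors. Returns one VIF per column."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        Z = np.column_stack([np.ones(n), others])        # include an intercept
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)    # nearly collinear with x1
x3 = rng.normal(size=100)                    # unrelated
print(vif(np.column_stack([x1, x2, x3])))    # large VIFs for x1 and x2; about 1 for x3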
|
In reply to this post by ViAnn Beadle
ViAnn:
When you have PERFECT collinearity (an exact linear relationship among independent variables) you do not get any results at all: the covariance matrix is singular, its determinant is zero, and there is no regression solution. When you have APPROXIMATE collinearity instead, the matrix is nearly singular (the determinant is close to zero, but not exactly zero). In that case the results are unstable: a slight change in the data may cause a large change in the results, i.e. in the regression coefficients. Of course, the first case is also unstable in the sense that another sample may not give you an exactly zero determinant, but that is beside the point, since a determinant of exactly zero means you have no regression results at all, neither stable nor unstable.

Hector
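The "slight change in the data, large change in the coefficients" behaviour is easy to demonstrate numerically. This is an illustrative sketch in Python/numpy with invented data (the noise scales and seed are arbitrary choices): two almost-collinear predictors, a tiny perturbation of the response, and a large swing in the individual fitted coefficients.

# Illustrative sketch: with nearly collinear predictors, X'X is close to
# singular and a tiny change in y produces a large change in the coefficients.
import numpy as np

rng = np.random.default_rng(42)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.001, size=n)          # almost a copy of x1
X = np.column_stack([np.ones(n), x1, x2])
y = 1 + 2 * x1 + rng.normal(scale=0.5, size=n)

print(np.linalg.cond(X.T @ X))                     # enormous condition number: nearly singular

b_orig, *_ = np.linalg.lstsq(X, y, rcond=None)
b_pert, *_ = np.linalg.lstsq(X, y + rng.normal(scale=0.01, size=n), rcond=None)
print(b_orig[1:])   # coefficients on x1 and x2
print(b_pert[1:])   # very different individual values, although their sum barely moves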
|
I don't think the OP was discussing PERFECT collinearity here. If there were, there would be no point in entering all the perfectly correlated variables into the equation, since one is an exact proxy for the other.
|
In reply to this post by Ornelas, Fermin
Hi all,
I think various contributors have made excellent points. I'm just going to chip in my two cents here.

1. Yes, centering does not change how the IVs are correlated with each other.
2. However, if you have higher-order terms, centering may help, because it removes the part of the multicollinearity that is caused by the measurement scales of the component IVs.

For example, X1 and X2 are two IVs with some correlation. Centering both X1 and X2 does not change how they are correlated; graphically, nothing changes. However, if you introduce a higher-order term such as X1*X2 or X1^2, that term will most likely be strongly correlated with X1 or X2. In that case, centering the component variables before forming the product may help.

The following reference may be useful: Aiken and West, 1991. Multiple Regression: Testing and Interpreting Interactions.

Hope that helps.

Best,
dg
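To see point 2 concretely, here is an illustrative sketch (Python/numpy, invented data, not SPSS syntax): the correlation between a positive-valued X1 and its square is close to 1, but after centering X1 the correlation between X1c and X1c^2 collapses.

# Illustrative sketch: centering a component variable before squaring it
# removes the "nonessential" collinearity due to the measurement scale.
import numpy as np

rng = np.random.default_rng(7)
x1 = rng.normal(50, 10, size=500)        # a predictor far from zero, e.g. a test score

print(np.corrcoef(x1, x1**2)[0, 1])      # close to 1: x1 and x1^2 are nearly collinear

x1c = x1 - x1.mean()                     # centered version
print(np.corrcoef(x1c, x1c**2)[0, 1])    # near 0 for a roughly symmetric x1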
|
In reply to this post by Mike Ford
A simple way to show this is with a simulation. I simulated the data as follows. I let X1 be a normally distributed variable with mean 50 and standard deviation 10 (rounded to make it easier to work with). Then I squared X1 (call it X2; it is correlated .99 with X1) and created a new variable Y as a function of X1 and X2 plus a normally distributed error term, so that the overall R-squared would be about .8.

Now if you create X1c (centered) and X2c = X1c squared, you will find that X2c has a different correlation with Y. More to the point, if you run the regression of Y on X1 and X2 versus the regression on X1c and X2c, you will see a decrease in the standard errors for two of the predictors, even though the overall R-squared is unchanged. The inflated standard errors with the non-centered variables are evidence of multicollinearity.

Paul R. Swank, Ph.D.
Professor
Director of Research
Children's Learning Institute
University of Texas Health Science Center-Houston
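This simulation is easy to reproduce. The sketch below (Python/numpy; the seed, sample size, coefficients and error scale are my own choices, not the ones used above) fits Y on X1 and X1^2 and then on the centered versions: R-squared is identical in both fits, the standard error of the linear term drops sharply after centering, and the quadratic term's standard error is unchanged.

# Illustrative sketch of the simulation: Y depends on X1 and X2 = X1^2;
# compare OLS standard errors before and after centering X1.
import numpy as np

def ols(X, y):
    """Coefficients, standard errors and R-squared for y on X (X includes the constant)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return beta, se, r2

rng = np.random.default_rng(0)
n = 200
x1 = np.round(rng.normal(50, 10, size=n))                  # rounded, as described above
x2 = x1 ** 2
y = 0.5 * x1 + 0.02 * x2 + rng.normal(scale=8, size=n)     # error scale is arbitrary

x1c = x1 - x1.mean()
x2c = x1c ** 2

for name, X in [("raw", np.column_stack([np.ones(n), x1, x2])),
                ("centered", np.column_stack([np.ones(n), x1c, x2c]))]:
    beta, se, r2 = ols(X, y)
    print(name, "R2 =", round(r2, 3), "SE =", np.round(se, 4))
# R2 is identical; the standard error of the linear term is far smaller in the
# centered fit, while the quadratic term's standard error is unchanged.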
|
In reply to this post by David Gomulya
In addition, centering is often used when including a variable and its square as predictors. Centering in this circumstance will reduce their correlation. The regression coefficient for the quadratic term will not change, but the transformation will change the coefficient of the linear term.

David Greenberg, Sociology Department, NYU
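Both claims (the quadratic coefficient is unchanged by centering, while the intercept and the linear coefficient change) can be checked in a few lines. An illustrative sketch in Python/numpy with invented data:

# Illustrative sketch: centering x changes the linear coefficient but leaves
# the quadratic coefficient (and the fitted values) unchanged.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(50, 10, size=300)
y = 2 + 0.4 * x + 0.03 * x**2 + rng.normal(scale=5, size=300)
xc = x - x.mean()

b_raw, *_ = np.linalg.lstsq(np.column_stack([np.ones_like(x), x, x**2]), y, rcond=None)
b_cen, *_ = np.linalg.lstsq(np.column_stack([np.ones_like(x), xc, xc**2]), y, rcond=None)

print(b_raw)   # [intercept, linear, quadratic]
print(b_cen)   # same quadratic coefficient; different intercept and linear term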
|
In reply to this post by Hector Maletta
With perfect collinearity there are in fact multiple solutions that include all the perfectly related predictors. The Regression procedure will not give such a solution, because it uses the covariance/correlation matrix to compute the solution. However, with CATREG, which uses the backfitting algorithm*, you will get a solution including all predictors when there is perfect collinearity, but such a solution is not unique. For example, if predictors x5 and x6 have correlation 1, and when you exclude x5 from the analysis x6 has beta .66, CATREG can give a solution including both x5 and x6 with beta x5 = 0.32 and beta x6 = 0.34, or x5 = 0.78 and x6 = -0.12, or any other combination of betas for x5 and x6 that sums to the beta obtained for x5 or x6 when the other is excluded. All these solutions are equivalent in terms of model fit. But for the purpose of parsimonious predictor selection, the optimal solution would be beta .66 for x5 or x6 and zero for the other, which is what you get with the Regression procedure, which will exclude one of x5 and x6.

So, when applying CATREG for predictor selection, if a low-tolerance warning is issued but there are no zero betas, it is advisable to use the option of saving the transformed variables and then run the Regression procedure on them to check whether one or more predictors are excluded. On the other hand, for prediction purposes, including all predictors can be desirable, but with perfect or high collinearity regularized regression (Ridge, Lasso) is more appropriate.

*With backfitting, coefficients are found iteratively for one predictor at a time, removing the influence of the other predictors from the dependent variable when updating the coefficient for a particular predictor; thus the solution is computed from the data itself, not from the covariance/correlation matrix.

Regards,
Anita van der Kooij
Data Theory Group
Leiden University
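For readers who have not met backfitting before: the footnote describes cycling through the predictors, each time re-estimating one coefficient from the partial residual (the dependent variable with the other predictors' contributions removed). A minimal sketch of that general idea in Python/numpy follows; it is an illustration only, not the actual CATREG code, and the data are invented.

# Minimal backfitting sketch for linear regression (illustration only, not CATREG).
# Cycle through predictors; update each coefficient from the partial residual.
import numpy as np

def backfit(X, y, start=None, n_iter=100):
    n, p = X.shape
    Xc = X - X.mean(axis=0)                      # work with centered variables
    yc = y - y.mean()
    beta = np.zeros(p) if start is None else np.array(start, dtype=float)
    for _ in range(n_iter):
        for j in range(p):
            partial = yc - Xc @ beta + Xc[:, j] * beta[j]   # y minus the other predictors
            beta[j] = (Xc[:, j] @ partial) / (Xc[:, j] @ Xc[:, j])
    return beta

rng = np.random.default_rng(5)
x5 = rng.normal(size=100)
x6 = x5.copy()                                   # perfectly collinear pair
y = 0.66 * x5 + rng.normal(scale=0.3, size=100)
X = np.column_stack([x5, x6])

print(backfit(X, y))                             # e.g. about (.66, 0) starting from zeros
print(backfit(X, y, start=[0.5, 0.5]))           # a different, equally well-fitting split
# The individual betas depend on the starting values, but they always sum to about .66.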
|
In reply to this post by ViAnn Beadle
Greetings All.
What I would like to be able to do is draw multiple samples from an SPSS data file and save each sample to a new file. For drawing one sample, I have this syntax:

DATASET COPY DASample1.
DATASET ACTIVATE DASample1.
FILTER OFF.
USE ALL.
SAMPLE 100 from 6493.
DATASET ACTIVATE DataSet1.
EXECUTE.

How do I get SPSS to rename DASample1 to DASample1 ... DASamplen, so that I get n samples in separate files? Can this be done with DO IF?

Thanks in advance for any advice.

Mike
|
In reply to this post by Mike Ford
Thank you for the responses. I should have made clear that I was not talking about perfect collinearity.

Having seen centering suggested in a number of sources as a way to reduce moderate collinearity, I have used it, and I thought it worked because of the change in the condition indices of models when I use centered variables. For example, with this made-up toy data:

 y    x1   x2
 68   4.1  73.6
 71   4.6  17.4
 62   3.8  45.5
 75   4.4  75.0
 58   3.2  45.3
 60   3.1  25.4
 67   3.8   8.8
 68   4.1  11.0
 71   4.3  23.7
 69   3.7  18.0
 68   3.5  14.7
 67   3.2  47.8
 63   3.7  22.2
 62   3.3  16.5
 60   3.4  22.7
 63   4.0  36.2
 65   4.1  59.2
 67   3.8  43.6
 63   3.4  27.2
 61   3.6  45.5

the condition index of the model is 21.7, which I would worry about if I got it with real data. However, if I center the DV and IVs and run the regression with no constant, the condition index for the model on the centered data is 1.2. Obviously the beta, SE of beta, etc. are the same for the IVs in both cases.

So, as I understand it, the condition indices are saying that the model with the centered data is better than the model with the original data. Given the responses to my posting, this change in the condition indices confuses me even more now. It must relate to a change in the cross products and sums of squares, but I don't really understand how.

Thank you!
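For anyone who wants to reproduce the two numbers, the sketch below (Python/numpy, not SPSS output) follows the usual Belsley-Kuh-Welsch recipe, which as far as I can tell is what SPSS reports: scale each column of the design matrix, including the constant, to unit length and take the eigenvalues of the uncentered cross-products matrix. It should reproduce both the 21.7 and the 1.2.

# Illustrative sketch: condition indices from the scaled, uncentered
# cross-products matrix, with and without centering.
import numpy as np

x1 = np.array([4.1,4.6,3.8,4.4,3.2,3.1,3.8,4.1,4.3,3.7,3.5,3.2,3.7,3.3,3.4,4.0,4.1,3.8,3.4,3.6])
x2 = np.array([73.6,17.4,45.5,75.0,45.3,25.4,8.8,11.0,23.7,18.0,14.7,47.8,22.2,16.5,22.7,36.2,59.2,43.6,27.2,45.5])

def condition_indices(X):
    Xs = X / np.sqrt((X ** 2).sum(axis=0))      # scale each column to unit length
    eig = np.linalg.eigvalsh(Xs.T @ Xs)         # eigenvalues of the scaled X'X
    return np.sqrt(eig.max() / eig)

# Raw variables with a constant term
X_raw = np.column_stack([np.ones_like(x1), x1, x2])
print(np.round(condition_indices(X_raw), 1))    # largest index about 21.7

# Centered IVs, regression through the origin (no constant column)
X_cen = np.column_stack([x1 - x1.mean(), x2 - x2.mean()])
print(np.round(condition_indices(X_cen), 1))    # largest index about 1.2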
|
In reply to this post by Kooij, A.J. van der
If you really have perfect collinearity and just want predictions or a parsimonious model (sort of), you can use Partial Least Squares, which works even with perfect collinearity - even with more variables than cases! Of course, you won't get significance tests for individual variables with that approach.
If you have SPSS 16, there is a dialog box and an enhanced PLS module that will be available from SPSS Developer Central (www.spss.com/devcentral) very shortly (maybe today). If you have SPSS 15, there is already a more basic PLS module that I wrote that handles only one dependent variable; you can download it from Developer Central now. These modules require some extra downloadable numerical libraries detailed in the documentation, and they illustrate the ease and power of developing statistical methods within SPSS via Python programmability.

With SPSS 16, if you are an R person, you can also download partial least squares modules from the R repository and use them within SPSS, taking advantage of the R plug-in now available on Developer Central.

You can find a simple example of PLS using the SPSS 15 module in my PowerPoint presentation on Developer Central called "Programmability in SPSS 14, 15, and 16".

Regards,
Jon Peck
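The same idea is available outside SPSS as well. As a hedged illustration only (this is scikit-learn in Python, not the SPSS PLS module discussed above, and the data are invented), PLS happily fits a model with a perfectly duplicated predictor and with more predictors than cases:

# Illustrative sketch: partial least squares copes with perfect collinearity
# and with p > n (scikit-learn's PLSRegression, not the SPSS PLS module).
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(11)
n = 30
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1, rng.normal(size=(n, 48))])   # duplicated column; 50 predictors, 30 cases
y = 2 * x1 + rng.normal(scale=0.5, size=n)

pls = PLSRegression(n_components=2)
pls.fit(X, y)

yhat = pls.predict(X).ravel()
r2 = 1 - ((y - yhat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(r2)                        # fit is fine despite the singular X'X
print(pls.predict(X[:3]))        # predictions work; no individual significance tests, though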
|
In reply to this post by Mike Ford
When you fit a model without an intercept, the collinearity diagnostics will give you a lower condition index. To me this is not a good idea, since most often the final model will include an intercept in the regression function.

I may reply to the second part later... but brushing up on linear algebra will help you understand how X'X and X'y are affected.
|
In reply to this post by Mike Ford
The condition indices are computed from the eigenvalues: each one is the square root of the ratio of the largest eigenvalue to the eigenvalue of a dimension (with your toy data, the condition index of dimension 2 = SQRT(2.829/.167) = 4.118 and that of dimension 3 = SQRT(2.829/.006) = 21.726).

The idea underlying the condition index is that if all predictors are uncorrelated, all eigenvalues are equal, and thus all condition indices are equal. The sum of the eigenvalues is fixed, so if there are some high eigenvalues, there will also be some low eigenvalues. High eigenvalues indicate that variables are correlated; the higher the first few eigenvalues are, the more the variables are correlated. Thus, when multicollinearity is high, there are high eigenvalues, the lowest eigenvalue will be close to zero, and the corresponding dimension will have a high condition index. When there is no multicollinearity, all condition indices are 1.

BUT "all eigenvalues equal when the variables are completely uncorrelated" only holds if the variables are centered. So inspecting condition indices to detect possible collinearity only makes sense with centered predictors. The sizes of the eigenvalues depend strongly on the means of the predictors. For example, if you compute x1 as x1 + 10, the highest condition index increases from 21.726 to 80.067.

The correlation between x1 and x2 is only .20, which is not very high, so there is no collinearity problem, as is also indicated by the high tolerance and low VIF; these two collinearity diagnostics are not influenced by the means of the predictors. Even if there is no collinearity at all, the condition index can indicate otherwise. For example, use x1 and x3 (below) as predictors: they have correlation zero, but the highest condition index is 43.882.

x3
50.62816
49.45849
46.73164
55.35264
47.29943
49.93272
51.49518
50.22482
51.78362
54.31597
54.81773
56.31553
48.34303
50.35235
47.63079
46.14870
47.53538
51.71940
50.65978
47.25467

Regards,
Anita van der Kooij
Data Theory Group
Leiden University
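These numerical claims can be checked with the toy data. A hedged sketch (Python/numpy, using the same unit-length column scaling as the earlier sketch): shifting x1 by a constant changes the largest condition index, while the correlation between the predictors, and hence tolerance and VIF, is untouched.

# Illustrative sketch: shifting a predictor's mean changes the condition index
# of the uncentered diagnostics, but not the correlation (and hence not VIF).
import numpy as np

x1 = np.array([4.1,4.6,3.8,4.4,3.2,3.1,3.8,4.1,4.3,3.7,3.5,3.2,3.7,3.3,3.4,4.0,4.1,3.8,3.4,3.6])
x2 = np.array([73.6,17.4,45.5,75.0,45.3,25.4,8.8,11.0,23.7,18.0,14.7,47.8,22.2,16.5,22.7,36.2,59.2,43.6,27.2,45.5])

def max_condition_index(*cols):
    X = np.column_stack([np.ones_like(cols[0])] + list(cols))
    Xs = X / np.sqrt((X ** 2).sum(axis=0))          # unit-length columns
    eig = np.linalg.eigvalsh(Xs.T @ Xs)
    return np.sqrt(eig.max() / eig.min())

print(max_condition_index(x1, x2))          # about 21.7
print(max_condition_index(x1 + 10, x2))     # much larger, about 80

r = np.corrcoef(x1, x2)[0, 1]               # unchanged by the shift (about .20)
print(r, 1 / (1 - r ** 2))                  # with two predictors, VIF = 1/(1 - r^2): small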
|
In reply to this post by Kooij, A.J. van der
Anita, it's with great trepidation that I even consider disagreeing with you; but I'm not sure about some of the points in this post.

At 06:20 PM 9/18/2007, Kooij, A.J. van der wrote:

>With perfect collinearity there are in fact multiple solutions that
>include all the perfectly related predictors.

There certainly are multiple solutions; a whole vector subspace of them, whose dimensionality is given by the number k of linearly independent sets of coefficients for which the predicted values are identically 0. But need those solutions include all the perfectly related predictors? In fact, shouldn't you ordinarily be able to find a solution in which the coefficients of k of them are 0?

>The Regression procedure will not give such a solution, because it
>uses the covariance/correlation matrix to compute the solution.

REGRESSION won't give that space of solutions, but does the reason have to do with using the variance/covariance matrix? If the covariance matrix is singular, good old INV(X'*X)(X'Y) won't work, directly or in its numerically stable reformulations; but isn't it quite easy to identify the space of solutions, by reducing the variance/covariance matrix? REGRESSION just doesn't use the algorithms that do this.

>However, with CATREG, which uses the backfitting algorithm*, you will
>get a solution including all predictors when there is perfect
>collinearity, but such a solution is not unique: for example if
>predictors x5 and x6 have correlation 1, and you exclude x5 from the
>analysis and x6 has beta .66, CATREG can give a solution including
>both x5 and x6, resulting in beta x5 0.32 and beta x6 0.34, or x5 0.78
>and x6 -0.12, or any other combination of beta's for x5 and x6 that
>sum to the beta for x5 or x6 when excluding the other. All these
>solutions are equivalent in terms of model fit. But for the purpose of
>parsimonious predictor selection the optimal solution would be x5 or
>x6 beta .66 and the other zero, which is what you get with the
>Regression procedure, that will exclude one of x5 and x6.

Exactly so, on all points.

When x5 and x6 are only highly correlated, AND when there's reason to believe they are truly measuring different entities, a common solution is to transform to reduce the correlation. There are more precise methods, but (if x5 and x6 are about on the same scale) simple transforms, like replacing x5 and x6 by their mean and their difference, often work well.
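The closing suggestion (replace two highly correlated, same-scale predictors by their mean and their difference) is easy to check numerically. An illustrative sketch in Python/numpy with invented data:

# Illustrative sketch: replacing two highly correlated predictors (on the same
# scale) by their mean and their difference greatly reduces the correlation.
import numpy as np

rng = np.random.default_rng(8)
x5 = rng.normal(100, 15, size=500)
x6 = x5 + rng.normal(scale=5, size=500)       # highly correlated with x5

m = (x5 + x6) / 2
d = x5 - x6

print(np.corrcoef(x5, x6)[0, 1])   # close to 1
print(np.corrcoef(m, d)[0, 1])     # much smaller in magnitude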
|
>Anita, it's with great trepidation that I even consider disagreeing
>with you;

Very nicely put, thank you! I will print your post with this line highlighted and put it on some walls.

>but I'm not sure about some of the points in this post.
>At 06:20 PM 9/18/2007, Kooij, A.J. van der wrote:
>>With perfect collinearity there are in fact multiple solutions that
>>include all the perfectly related predictors.
>There certainly are multiple solutions; a whole vector subspace of
>them, whose dimensionality is given by the number k of linearly
>independent sets of coefficients for which the predicted values are
>identically 0.
>But need those solutions include all the perfectly related predictors?

No.

>In fact, shouldn't you ordinarily be able to find a solution in which
>the coefficients of k of them are 0?

Yes. But the current version of CATREG doesn't. I posted the note because users of CATREG are not always aware of this.

>>The Regression procedure will not give such a solution, because it
>>uses the covariance/correlation matrix to compute the solution.
>REGRESSION won't give that space of solutions, but does the reason
>have to do with using the variance/covariance matrix? If the
>covariance matrix is singular, good old INV(X'*X)(X'Y) won't work,
>directly or in its numerically stable reformulations; but isn't it
>quite easy to identify the space of solutions, by reducing the
>variance/covariance matrix?

Yes, but as you say, REGRESSION doesn't. This requires a regularized regression procedure (ridge regression or the Lasso, for example).

>REGRESSION just doesn't use the algorithms that do this.

Neither does CATREG. But we are currently working on incorporating ridge regression, the Lasso, and the Elastic Net into CATREG.

Regards,
Anita van der Kooij
Data Theory Group
Leiden University
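As a closing illustration of why regularization helps here: with a ridge penalty k, the normal equations become (X'X + kI)b = X'y, which has a unique solution even when X'X is singular, and the weight of a duplicated predictor is simply split between the two copies. A hedged numpy sketch (invented data; not CATREG or any SPSS procedure):

# Illustrative sketch: ridge regression has a unique solution even with
# perfectly collinear predictors; it splits the weight between the copies.
import numpy as np

rng = np.random.default_rng(21)
n = 100
x5 = rng.normal(size=n)
x6 = x5.copy()                                   # correlation exactly 1
y = 0.66 * x5 + rng.normal(scale=0.3, size=n)

X = np.column_stack([x5, x6])                    # no intercept, for simplicity
k = 0.1                                          # ridge penalty (arbitrary choice)
beta = np.linalg.solve(X.T @ X + k * np.eye(2), X.T @ y)
print(beta)            # roughly (0.33, 0.33): the .66 effect shared equally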