Regression, centering and collinearity


Regression, centering and collinearity

Mike Ford
At my work I appear to have taken on the role of the one-eyed man in the
valley of the statistically blind.

I keep getting asked to explain why centering helps with problems of
collinearity in multiple regression. However, my maths isn't really up to it.

I have done a few toy regressions with two IVs by hand, so I can see what
centering does in the matrix algebra. But I still don't really understand why it
helps.

Is it something to do with the change in the determinant of the X'X matrix
when you use centered variables?

Sorry if this is a bit unclear; I am on the very edge of my maths understanding.

Thanks

- Mike

Re: Regression, centering and collinearity

ViAnn Beadle
I've not heard of using centering to address problems of collinearity, except
perhaps in the case of interaction effects added to models. Do you have a
reference for this? I was taught that the principal problem with
collinearity is that it produces really unstable results--you'll get widely
different coefficients from different samples drawn from a population.


Re: Regression, centering and collinearity

Hector Maletta
In reply to this post by Mike Ford
Collinearity means that some independent variable is an exact (or nearly
exact) linear function of one or more of the other variables in the set of
independent variables. If you simply change the point of origin, i.e.
subtract the mean so that the mean equals zero, this cannot change that
linear dependency in the least.

For example, if your units are cities and one of your variables
(say, distance in kilometres from New York) had collinearity with other
variables before centering, it will have collinearity after centering as
well. The new variable will be "distance in km from New York minus average
distance in km from New York", and it will have the same linear correlations
as before. You could also change kilometres into miles or light-years, and it
would be the same.

This is because centering is a linear transformation (adding or
subtracting a constant). If you do a nonlinear transformation instead (say, take the
logarithm of the distance, or suchlike), the collinearity may disappear when
you substitute the new variable, but you must have some solid theoretical
reason to think that your dependent variable depends on the log of distance and
not on distance itself; otherwise your solution is just an ad hoc accommodation
to a quirk in your data.
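
A quick way to see this in syntax (a sketch only; x1 and x2 stand in for any two predictors in the active file):

* Add the grand means as new variables, then center (x1 and x2 are placeholder names).
AGGREGATE /OUTFILE=* MODE=ADDVARIABLES /mx1 mx2 = MEAN(x1 x2).
COMPUTE x1c = x1 - mx1.
COMPUTE x2c = x2 - mx2.
EXECUTE.
* The centered variables correlate with each other exactly as the originals do.
CORRELATIONS /VARIABLES=x1 x2 x1c x2c.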

         Hector


Re: Regression, centering and collinearity

Ornelas, Fermin
In reply to this post by Mike Ford
When you estimate a model with collinear variables, the standard errors
are inflated and the parameter estimates become unstable, i.e. they may switch
signs. The individual test statistics are usually nonsignificant, but
the overall F-test in the ANOVA table may say otherwise. In addition,
your R-square may also be high.

When the collinearity is severe, the X'X matrix is singular or nearly so,
because one or more of its columns are (nearly) linearly dependent.
That is, in your data set the contribution of two or more variables to
the behavior of the dependent variable is redundant. If the collinear
relation is severe, most software packages will issue a warning about
the reliability of your estimates.

I have not tried centering the data, but I assume it is an effort
to remove linear dependence among your variables. There is also ridge
regression. In my own experience there is not much you can do to deal
with this problem. If for some research reason you have to keep a
variable in the model, then the objective is to find a set of variables
that minimizes the problem. There is also another issue to consider: the
purpose of the model. If you intend to predict, the model will hold
reasonably well. But for hypothesis testing, when collinearity is severe
all bets are off, for the reasons mentioned in my first paragraph.

Fermin Ornelas


Re: Regression, centering and collinearity

Hector Maletta
In reply to this post by ViAnn Beadle
ViAnn:
When you have PERFECT collinearity (an exact linear relationship
between independent variables) you do not get any results: the covariance
matrix is singular, its determinant is zero, and there is no regression
solution. When you have APPROXIMATE collinearity instead, the matrix is
nearly singular (its determinant is close to zero, but not exactly zero). In
that case the results are unstable: a slight change in the data may cause a
large change in the results, i.e. in the regression coefficients.
Of course, the first case is also unstable in the sense that
another sample may not give you an exactly zero determinant, but that is beside
the point, since a perfect zero means you have no regression results at all,
neither stable nor unstable.
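
A small demonstration of the perfect case (a sketch only; y, x1 and x2 are placeholder names): build an exactly collinear predictor and look at what REGRESSION does with it.

COMPUTE x3sum = x1 + x2.
EXECUTE.
* x3sum is an exact linear combination of x1 and x2, so its tolerance is zero;
* REGRESSION will typically warn and refuse to enter it into the equation.
REGRESSION
  /STATISTICS COEFF R TOL COLLIN
  /DEPENDENT y
  /METHOD=ENTER x1 x2 x3sum.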

         Hector


Re: Regression, centering and collinearity

ViAnn Beadle
I don't think the OP was discussing PERFECT collinearity here. If there is
perfect collinearity, there is no point in entering all of the perfectly correlated
variables into the equation, since one is an exact proxy for the other.


Re: Regression, centering and collinearity

David Gomulya
In reply to this post by Ornelas, Fermin
Hi all,

I think various contributors have made excellent points. I'm just going to chip in my two cents here.

1. Yes, centering does not change how the IVs are correlated with each other.
2. However, if you have higher-order terms, then centering may help, because it removes the part of the multicollinearity that is caused by the measurement scales of the component IVs.

For example, X1 and X2 are two IVs with some correlation. Centering both X1 and X2 does not change how they are correlated; graphically, there is no change. However, if you introduce a higher-order term, e.g. X1*X2 or X1^2, that term will most likely be correlated with X1 or X2. In that case, centering may help.
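
A small sketch of that point in syntax (y, x1 and x2 are placeholder names; the exact correlations depend, of course, on your data):

AGGREGATE /OUTFILE=* MODE=ADDVARIABLES /mx1 mx2 = MEAN(x1 x2).
COMPUTE x1c   = x1 - mx1.
COMPUTE x2c   = x2 - mx2.
COMPUTE x1x2  = x1 * x2.
COMPUTE x1x2c = x1c * x2c.
EXECUTE.
* The raw product is usually strongly correlated with x1 and x2;
* the product of the centered variables usually much less so.
CORRELATIONS /VARIABLES=x1 x2 x1x2 x1x2c.
* Collinearity diagnostics for the interaction model with centered components.
REGRESSION
  /STATISTICS COEFF R TOL COLLIN
  /DEPENDENT y
  /METHOD=ENTER x1c x2c x1x2c.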

The following reference may be useful:
Aiken and West, 1991. Multiple Regression: Testing and Interpreting Interactions.

Hope that helps.
Best,
dg



Re: Regression, centering and collinearity

Swank, Paul R
In reply to this post by Mike Ford
A simple way to show this is with a simulation. I simulated the data
as follows. I let X1 be a normally distributed variable with mean 50 and
standard deviation 10 (rounded to make it easier to work with).
Then I squared X1 (call it X2, which is correlated .99 with X1) and
created a new variable Y as a function of X1 and X2 plus a normally
distributed error term, so that the overall R squared would be
about .8. Now if you create X1c (centered) and X2c = X1c squared, you
will find that X2c has a different correlation with Y. More
to the point, if you regress Y on X1 and X2 versus on X1c
and X2c, you will see a decrease in the standard errors for two of the
predictors even though the overall R squared is unchanged. Inflated
standard errors with the non-centered variables are evidence of
multicollinearity.
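
A rough sketch of that simulation in syntax (the sample size, coefficients and error SD below are my own arbitrary choices, not Paul's, so the R squared will only be roughly in the region he describes):

* Generate x1 ~ N(50, 10) (rounded), x2 = x1 squared, and y as a noisy function of both.
INPUT PROGRAM.
LOOP #i = 1 TO 500.
  COMPUTE x1 = RND(RV.NORMAL(50, 10)).
  COMPUTE x2 = x1**2.
  COMPUTE y  = 2*x1 + 0.05*x2 + RV.NORMAL(0, 35).
  END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.
* Center x1 and form the square of the centered variable.
AGGREGATE /OUTFILE=* MODE=ADDVARIABLES /mx1 = MEAN(x1).
COMPUTE x1c = x1 - mx1.
COMPUTE x2c = x1c**2.
EXECUTE.
* Same overall R squared in both runs; compare the standard errors and the COLLIN output.
REGRESSION /STATISTICS COEFF R TOL COLLIN /DEPENDENT y /METHOD=ENTER x1 x2.
REGRESSION /STATISTICS COEFF R TOL COLLIN /DEPENDENT y /METHOD=ENTER x1c x2c.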

Paul R. Swank, Ph.D., Professor
Director of Research
Children's Learning Institute
University of Texas Health Science Center-Houston



Re: Regression, centering and collinearity

David Greenberg
In reply to this post by David Gomulya
In addition, centering is often used when including a variable and its square as predictors. Centering in this circumstance will reduce their correlation. The regression coefficient for the quadratic term will not change, but the transformation will change the coefficient of the linear term.

David Greenberg, Sociology Department, NYU
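
One way to see why: writing m for the mean of x and b1, b2 for the coefficients in the uncentered model, b1*x + b2*x^2 = (b1 + 2*m*b2)*(x - m) + b2*(x - m)^2 + (b1*m + b2*m^2). The squared term keeps the same coefficient b2, the linear term's coefficient becomes b1 + 2*m*b2, and the constant absorbs the remaining piece.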


Regression and collinearity: A note on CATREG

Kooij, A.J. van der
In reply to this post by Hector Maletta
 
With perfect collinearity there are in fact multiple solutions that include all the perfectly related predictors. The Regression procedure will not give such a solution, because it uses the covariance/correlation matrix to compute the solution. However, CATREG, which uses the backfitting algorithm*, will give a solution including all predictors when there is perfect collinearity, but such a solution is not unique. For example, if predictors x5 and x6 have correlation 1, and excluding x5 from the analysis gives x6 a beta of .66, CATREG can give a solution including both x5 and x6 with beta .32 for x5 and .34 for x6, or .78 for x5 and -.12 for x6, or any other combination of betas for x5 and x6 that sum to the beta obtained for either one when the other is excluded. All these solutions are equivalent in terms of model fit. But for the purpose of parsimonious predictor selection the optimal solution would be beta .66 for x5 or x6 and zero for the other, which is what you get with the Regression procedure, which will exclude one of x5 and x6. So, when applying CATREG for predictor selection, if a low-tolerance warning is issued but there are no zero betas, it is advisable to use the option of saving the transformed variables and then run the Regression procedure on them to check whether one or more predictors are excluded.

On the other hand, for prediction purposes including all predictors could be desirable, but with perfect or high collinearity, regularized regression (Ridge, Lasso) is more appropriate.

*With backfitting, coefficients are found iteratively for one predictor at a time, removing the influence of the other predictors from the dependent variable when updating the coefficient for a particular predictor; thus the solution is computed from the data itself, not from the covariance/correlation matrix.
Regards,
Anita van der Kooij
Data Theory Group
Leiden University


Drawing multiple samples

Mike Marshall-2
In reply to this post by ViAnn Beadle
Greetings All.
What I would like to be able to do is draw multiple samples from an SPSS
data file and save each sample to a new file.
For drawing one sample, I have this syntax:

DATASET COPY DASample1.
DATASET ACTIVATE DASample1.
FILTER OFF.
USE ALL.
SAMPLE 100 from  6493.
DATASET ACTIVATE DataSet1.
EXECUTE .

How do I get SPSS to name the copies DASample1 ... DASamplen, so
that I get n samples in separate files? Can this be done with DO IF?

Thanks in advance for any advice.
Mike
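
One possible approach is a macro loop around DATASET COPY (a sketch only; the sample size and the 6493 case count come from the syntax above, the dataset names follow it, and the choice of 5 samples is arbitrary):

DEFINE !drawsamples (n = !TOKENS(1)).
!DO !i = 1 !TO !n
DATASET ACTIVATE DataSet1.
DATASET COPY !CONCAT(DASample, !i).
DATASET ACTIVATE !CONCAT(DASample, !i).
SAMPLE 100 FROM 6493.
EXECUTE.
!DOEND
!ENDDEFINE.

!drawsamples n = 5.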

Re: Regression, centering and collinearity

Mike Ford
In reply to this post by Mike Ford
Thank you for the responses. I should have made clear that I was not talking
about perfect collinearity.

Having seen centering suggested as a method to reduce moderate collinearity
in a number of sources, I have used it and thought it worked because of the
change in the condition indices of models when I use centered variables. For
example, with this made-up toy data...

y x1 x2
68 4.1 73.6
71 4.6 17.4
62 3.8 45.5
75 4.4 75.0
58 3.2 45.3
60 3.1 25.4
67 3.8 8.8
68 4.1 11.0
71 4.3 23.7
69 3.7 18.0
68 3.5 14.7
67 3.2 47.8
63 3.7 22.2
62 3.3 16.5
60 3.4 22.7
63 4 36.2
65 4.1 59.2
67 3.8 43.6
63 3.4 27.2
61 3.6 45.5

The condition index of the model = 21.7, which I would worry about if I saw it
with real data. However, if I center the DV and IVs and do the regression
(with no constant), the condition index for the model of the centered data = 1.2.

Obviously the beta and SE(beta) values etc. are the same for the IVs in both cases.

So, to my understanding, the condition indices are saying that the model with
the centered data is better than the model with the original data. Given the
responses to my posting, this change in the condition indices confuses me even
more now.

It must relate to a change in the cross products and sums of squares, but I
don't really understand how.

Thank you!
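
For reference, a sketch of the two runs described above (using the toy variables y, x1 and x2; /ORIGIN suppresses the constant in the centered run):

* Uncentered model with a constant.
REGRESSION
  /STATISTICS COEFF R TOL COLLIN
  /DEPENDENT y
  /METHOD=ENTER x1 x2.
* Center the DV and IVs, then fit through the origin.
AGGREGATE /OUTFILE=* MODE=ADDVARIABLES /my mx1 mx2 = MEAN(y x1 x2).
COMPUTE yc  = y  - my.
COMPUTE x1c = x1 - mx1.
COMPUTE x2c = x2 - mx2.
EXECUTE.
REGRESSION
  /ORIGIN
  /STATISTICS COEFF R TOL COLLIN
  /DEPENDENT yc
  /METHOD=ENTER x1c x2c.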

Re: Regression and collinearity: A note on CATREG

Peck, Jon
In reply to this post by Kooij, A.J. van der
If you really have perfect collinearity and just want predictions or a parsimonious model (sort of), you can use Partial Least Squares, which works even with perfect collinearity - even with more variables than cases!  Of course, you won't get significance tests for individual variables with that approach.

If you have SPSS 16, there is a dialog box and an enhanced PLS module that will be available from SPSS Developer Central (www.spss.com/devcentral) very shortly (maybe today).  If you have SPSS 15, there is already a more basic PLS module that I wrote that handles only one dependent variable. You can download it from Developer Central now.

These modules require some extra downloadable numerical libraries detailed in the documentation and illustrate the ease and power of developing statistical methods within SPSS via Python programmability.

With SPSS 16, if you are an R person, you can also download partial least squares modules from the R repository and use them in R within SPSS, taking advantage of the R plug-in now available on Developer Central.

You can find a simple example of PLS using the SPSS 15 module in my PowerPoint presentation on Developer Central, Programmability in SPSS 14, 15, and 16.

Regards,
Jon Peck


Re: Regression, centering and collinearity

Ornelas, Fermin
In reply to this post by Mike Ford
When you fit a model without an intercept, the collinearity diagnostics
will give you a lower condition index. To me this is not a good idea,
since most often the final model will include an intercept in the
regression function.

I may reply to the second part later... but brushing up on linear
algebra will help you understand how X'X and X'y are affected.
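
(Concretely: with an intercept in the model, X contains a column of ones, and a predictor whose mean is large relative to its spread is nearly proportional to that column, which is what inflates the condition index. After centering, each predictor sums to zero and is therefore orthogonal to the constant column, so that source of near-dependence disappears even though the correlations among the predictors themselves are unchanged.)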


Re: Regression, centering and collinearity

Kooij, A.J. van der
In reply to this post by Mike Ford
The condition indices are computed from the eigenvalues: each is the square root of the ratio of the largest eigenvalue to the eigenvalue of that dimension (with your toy data, the ci for dimension 2 = SQRT(2.829/.167) = 4.118 and the ci for dimension 3 = SQRT(2.829/.006) = 21.726).

The idea underlying the condition index is that if all predictors are uncorrelated, all eigenvalues are equal, and thus all condition indices are equal. The sum of the eigenvalues is fixed, so if there are some high eigenvalues, there will also be some low eigenvalues. High eigenvalues indicate that variables are correlated; the higher the first few eigenvalues are, the more the variables are correlated. Thus, when multicollinearity is high, there are high eigenvalues, the lowest eigenvalue will be close to zero, and that dimension will have a high condition index. When there is no multicollinearity, all condition indices are 1.

BUT "all eigenvalues are equal if the variables are completely uncorrelated" only holds if the variables are centered. So inspecting condition indices to detect possible collinearity only makes sense with centered predictors.

The sizes of the eigenvalues depend strongly on the means of the predictors. For example, if you recompute x1 as x1 + 10, the highest ci increases from 21.726 to 80.067. The correlation between x1 and x2 is only .20, which is not very high, so there is no collinearity problem, as is also indicated by the high tolerance and low VIF; these two collinearity diagnostics are not influenced by the means of the predictors.

Even if there is no collinearity at all, the condition index can indicate otherwise. For example, use x1 and x3 (below) as predictors: they have correlation zero, but the highest condition index is 43.882.
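
A quick way to see both effects in the Regression procedure itself (a sketch; it assumes y, x1 and x2 from the toy data plus the x3 column listed below are in the active file):

* Shift the mean of x1 by a constant and watch the highest condition index jump.
COMPUTE x1p10 = x1 + 10.
EXECUTE.
REGRESSION
  /STATISTICS COEFF R TOL COLLIN
  /DEPENDENT y
  /METHOD=ENTER x1p10 x2.
* x1 and x3 are essentially uncorrelated, yet the uncentered condition index is large.
REGRESSION
  /STATISTICS COEFF R TOL COLLIN
  /DEPENDENT y
  /METHOD=ENTER x1 x3.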

 

Regards,
Anita van der Kooij
Data Theory Group
Leiden University

x3
50.62816
49.45849
46.73164
55.35264
47.29943
49.93272
51.49518
50.22482
51.78362
54.31597
54.81773
56.31553
48.34303
50.35235
47.63079
46.14870
47.53538
51.71940
50.65978
47.25467


Re: Regression and collinearity: A note on CATREG

Richard Ristow
In reply to this post by Kooij, A.J. van der
Anita, it's with great trepidation that I even consider disagreeing
with you; but I'm not sure about some of the points in this post.

At 06:20 PM 9/18/2007, Kooij, A.J. van der wrote:

>With perfect collinearity there are in fact multiple solutions that
>include all the perfectly related predictors.

There certainly are multiple solutions; a whole vector subspace of
them, whose dimensionality is given by the number k of linearly
independent sets of coefficients for which the predicted values are
identically 0.

But need those solutions include all the perfectly related predictors?
In fact, shouldn't you ordinarily be able to find a solution in which
the coefficients of k of them are 0?

>The Regression procedure will not give such a solution, because it
>uses the covariance/correlation matrix to compute the solution.

REGRESSION won't give that space of solutions, but does the reason have
to do with using the variance/covariance matrix? If the covariance
matrix is singular, good old INV(X'*X)(X'Y) won't work, directly or in
its numerically stable reformulations; but isn't it quite easy to
identify the space of solutions, by reducing the variance/covariance
matrix? REGRESSION just doesn't use the algorithms that do this.

>However, with CATREG, that uses the backfitting algorithm*, you will
>get a solution including all predictors when there is perfect
>collinearity, but such a solution is not unique: for example if
>predictors x5 and x6 have correlation 1, and you exclude x5 from the
>analysis and x6 has beta .66, CATREG can give a solution including
>both x5 and x6, resulting in beta x5 0.32 and beta x6 0.34, or x5 0.78
>and x6 -0.12, or any other combination of beta's for x5 and x6 that
>sum to the beta for x5 or x6 when excluding the other. All these
>solutions are equivalent in terms of model fit. But for the purpose of
>parsimonious predictor selection the optimal solution would be x5 or
>x6 beta .66 and the other zero, wich is what you get with the
>Regression procedure, that will exlude one of x5 and x6.

Exactly so, on all points.

When x5 and x6 are only highly correlated, AND when there's reason to
believe they are truly measuring different entities, a common solution
is to transform them so as to reduce the correlation. There are more precise
methods, but (if x5 and x6 are on about the same scale) simple
transforms, like replacing x5 and x6 by their mean and their difference,
often work well.
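
A minimal sketch of that mean/difference reparameterization (x5, x6 and the dependent variable y are placeholder names here):

COMPUTE xmean = (x5 + x6) / 2.
COMPUTE xdiff = x5 - x6.
EXECUTE.
* xmean and xdiff carry the same information as x5 and x6 but are typically far
* less correlated (exactly uncorrelated if x5 and x6 have equal variances).
REGRESSION
  /STATISTICS COEFF R TOL COLLIN
  /DEPENDENT y
  /METHOD=ENTER xmean xdiff.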

Re: Regression and collinearity: A note on CATREG

Kooij, A.J. van der
>Anita, it's with great trepidation that I even consider disagreeing
>with you;

Very nicely put, thank you! I will print your post with this line
highlighted and put it on some walls.

>but I'm not sure about some of the points in this post.

At 06:20 PM 9/18/2007, Kooij, A.J. van der wrote:

>>With perfect collinearity there are in fact multiple solutions that
>>include all the perfectly related predictors.

>There certainly are multiple solutions; a whole vector subspace of
>them, whose dimensionality is given by the number k of linearly
>independent sets of coefficients for which the predicted values are
>identically 0.
>But need those solutions include all the perfectly related predictors?

No.

>In fact, shouldn't you ordinarily be able to find a solution in which
>the coefficients of k of them are 0?

Yes. But the current version of CATREG doesn't. I posted the note because
users of CATREG are not always aware of this.

>>The Regression procedure will not give such a solution, because it
>>uses the covariance/correlation matrix to compute the solution.

>REGRESSION won't give that space of solutions, but does the reason
>have to do with using the variance/covariance matrix? If the
>covariance matrix is singular, good old INV(X'*X)(X'Y) won't work,
>directly or in its numerically stable reformulations; but isn't it
>quite easy to identify the space of solutions, by reducing the
>variance/covariance matrix?

Yes, but as you say, REGRESSION doesn't. This requires a regularized
regression procedure (ridge regression or the Lasso, for example).

>REGRESSION just doesn't use the algorithms that do this.

Neither does CATREG. But we are currently working on incorporating
ridge regression, the Lasso, and the Elastic Net into CATREG.

Regards,
Anita van der Kooij
Data Theory Group
Leiden University
