How to deal with collinearity? Thanks


How to deal with collinearity? Thanks

zhou yuming
Hi, all,

I do not know how to deal with collinearity when building a multivariate
linear regression (MLR) model.

In my case, there are more than 70 independent variables (IVs). I first
tried to use VIF to detect collinearity as follows:
(1) build an MLR model ("Enter" method) to obtain the VIF values for the IVs
(2) find the IV with the largest VIF value (if larger than 10) and delete it
(3) rebuild the MLR model with the remaining IVs
(4) repeat steps (2) and (3) until all the remaining IVs have a VIF value
less than 10
After that, I used the forward or backward method to select a subset of the
remaining IVs to build the MLR model. However, the final model has a
very low R square (below 0.1).
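
For readers who want to try steps (1)-(4) outside SPSS, here is a minimal Python/NumPy sketch of the same elimination loop. It is only an illustration: the function names `vif` and `drop_by_vif`, the synthetic data, and the shortcut of reading the VIFs off the diagonal of the inverse correlation matrix are assumptions of this sketch, not anything SPSS-specific.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    # The VIFs are the diagonal of the inverse correlation matrix of the IVs.
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def drop_by_vif(X, names, threshold=10.0):
    """Drop the IV with the largest VIF, one at a time, until all VIFs <= threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return X, names

# Synthetic illustration (made-up data): x3 is almost exactly x1 + x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + 0.01 * rng.normal(size=200)
x4 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3, x4])

X_kept, kept = drop_by_vif(X, ["x1", "x2", "x3", "x4"])
print(kept)
print(vif(X_kept).round(2))
```

Note that the loop looks only at the IVs and never at the dependent variable, so nothing in it protects R square.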

I also tried another method to build the MLR model. This time, I used the
condition number to detect collinearity. The purpose is to obtain an MLR
model that satisfies: (a) its R square is as large as possible; and
(b) its condition number (CN) is less than 30. I used the following method (I
am not sure whether it is right):
(1) build the MLR model with all the IVs (using the forward or backward
method)
(2) examine the condition number (CN) of the MLR model:
         if CN < 30, then OK; otherwise delete an IV from the MLR model and
         go to (3)
(3) rebuild the MLR model with the remaining IVs
(4) repeat steps (2) and (3)
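
A sketch of how the condition number checked in step (2) might be computed. It assumes the common convention of scaling each column of the design matrix (intercept included) to unit length, without centering, and taking the ratio of the largest to the smallest singular value. Whether this matches the CN a particular package prints is an assumption; `condition_number` and the data are made up for illustration.

```python
import numpy as np

def condition_number(X, add_intercept=True):
    """Largest / smallest singular value of the column-scaled design matrix."""
    X = np.asarray(X, dtype=float)
    if add_intercept:
        X = np.column_stack([np.ones(len(X)), X])
    # Scale every column to unit Euclidean length (no centering).
    Xs = X / np.linalg.norm(X, axis=0)
    s = np.linalg.svd(Xs, compute_uv=False)
    return s.max() / s.min()

# Made-up data: x3 is almost exactly x1 + x2, x4 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + 0.01 * rng.normal(size=200)
x4 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3, x4])

cn_all = condition_number(X)                 # all four IVs
cn_reduced = condition_number(X[:, [0, 1, 3]])  # x3 removed
print(round(cn_all, 1), round(cn_reduced, 1))
```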


The problem is which criterion should be used to delete an IV in step (2)
when CN > 30. I tried two methods: (a) delete the IV with the largest VIF
value; or (b) delete an IV randomly. I found that (b) may result in a better
MLR model.

My purpose is to obtain an MLR model that satisfies: (a) its R square is as
large as possible; and (b) its condition number (CN) is less than 30. How can
I do it? Thank you very much.


Yuming Zhou

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: How to deal with collinearity? Thanks

Ornelas, Fermin-2
You will not find a clear procedure that tells you how to do it. Both
methods complement each other. However, I usually prefer the condition
index criterion, and if you also output the variance proportions, they
will help you select the variable whose removal has the largest impact on
cleaning up the regression function. I have not used SPSS for collinearity
diagnostics, but I assume you can get both the condition index and the
variance proportion for each variable.
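
To make the idea concrete, here is a rough NumPy sketch of condition indices together with variance-decomposition proportions, computed from the singular value decomposition of the column-scaled design matrix. The function name `collinearity_diagnostics` and the data are hypothetical, and the scaling convention (unit-length columns, intercept included, no centering) is an assumption of the sketch. The usual reading is: find a dimension with a high condition index and see which coefficients have most of their variance tied to it.

```python
import numpy as np

def collinearity_diagnostics(X):
    """Return (condition indices, variance proportions).

    Rows of the proportions table are dimensions (singular values),
    columns are coefficients, with the intercept in column 0.
    """
    X = np.asarray(X, dtype=float)
    X = np.column_stack([np.ones(len(X)), X])
    Xs = X / np.linalg.norm(X, axis=0)           # unit-length columns
    _, d, vt = np.linalg.svd(Xs, full_matrices=False)
    cond_idx = d.max() / d                        # ascending: last is largest
    phi = vt.T ** 2 / d ** 2                      # phi[k, j] = v_kj^2 / d_j^2
    props = phi / phi.sum(axis=1, keepdims=True)  # share of var(b_k) per dimension
    return cond_idx, props.T

# Made-up data: x3 is almost exactly x1 + x2, x4 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + 0.01 * rng.normal(size=200)
x4 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3, x4])

cond_idx, props = collinearity_diagnostics(X)
print(cond_idx.round(1))
print(props.round(3))   # columns: intercept, x1, x2, x3, x4
```

In this toy case the last row (highest condition index) should carry nearly all of the variance of x1, x2 and x3, which points at those three as the collinear set.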

Fermin Ornelas, Ph.D.
Management Analyst III, AZ DES
1789 W. Jefferson Street
Phoenix, AZ 85032
Tel: (602) 542-5639
E-mail: [hidden email]


NOTICE: This e-mail (and any attachments) may contain PRIVILEGED OR
CONFIDENTIAL information and is intended only for the use of the
specific individual(s) to whom it is addressed.  It may contain
information that is privileged and confidential under state and federal
law.  This information may be used or disclosed only in accordance with
law, and you may be subject to penalties under law for improper use or
further disclosure of the information in this e-mail and its
attachments. If you have received this e-mail in error, please
immediately notify the person named above by reply e-mail, and then
delete the original e-mail.  Thank you.


Re: How to deal with collinearity? Thanks

Richard Ristow
In reply to this post by zhou yuming
At 10:14 PM 10/30/2007, zhou yuming wrote:

>I do not know how to deal with collinearity when building a
>multivariate linear regression (MLR) model.
>
>In my case, there are more than 70 independent variables (IVs).

Your analysis sounds like it's in trouble, collinearity or not.

First: that many independents eat sample size for breakfast. By the
rule of ten observations per independent variable, you need at least
700; but with the complexity you've got, I'd want more like ten times
that many - say, 10,000, to use round numbers.

Second, you have a big multiple-comparison problem. You expect (in the
technical sense) 3.5 coefficients significant at p<.05, supposing no
association whatever between the dependent and the independents; and
about an even chance of at least one significant at p<.01 (1 - .99^70
is about .5, if the 70 tests were independent), on the same assumption.
One reason you need such a large sample size is to have some statistical
power left after correcting for multiple comparisons.
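
The two figures in this paragraph can be checked with a few lines of arithmetic. The sketch below adds one assumption that the paragraph does not state explicitly: the 70 tests are treated as independent, which correlated predictors will not satisfy exactly.

```python
# Arithmetic check under the null of no association at all,
# treating the 70 coefficient tests as independent (an assumption).
n_tests = 70

# Expected number of coefficients "significant" at p < .05.
expected_at_05 = n_tests * 0.05

# Probability that at least one coefficient is "significant" at p < .01.
p_any_at_01 = 1 - (1 - 0.01) ** n_tests

print(expected_at_05)
print(round(p_any_at_01, 3))
```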

Third, what are you going to say about your results? It can be mighty
hard, sometimes near impossible, to make a coherent discussion of the
results of an estimation like that. Indeed, what does your description
of your model look like?

And fourth, do you know your data well to start with? I suppose the 70
variables group into subject areas. In such a case, collinearities
within a subject area are common; collinearities between subject areas
are rarer and generally deserve discussion. Anyway, I'd look to reduce
the dimensionality drastically, by reducing each subject area to a
summary variable or two. Depending on your problem and your tastes,
that could be anything from selecting the most illuminating variables,
through simple averaging, to factor analysis within the subject areas.
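
As an illustration of the "simple averaging" option, the sketch below standardizes each variable and averages the z-scores within each subject area. Everything in it is hypothetical: the function `area_scores`, the area names, the column groupings, and the simulated data.

```python
import numpy as np

def area_scores(X, areas):
    """Average the z-scores of the columns listed for each subject area."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    names = list(areas)
    S = np.column_stack([Z[:, areas[a]].mean(axis=1) for a in names])
    return names, S

# Made-up data: 4 noisy measures of one trait and 3 of another.
rng = np.random.default_rng(0)
n = 300
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
X = np.column_stack(
    [f1 + 0.3 * rng.normal(size=n) for _ in range(4)]
    + [f2 + 0.3 * rng.normal(size=n) for _ in range(3)]
)
areas = {"area_a": [0, 1, 2, 3], "area_b": [4, 5, 6]}  # hypothetical grouping

names, S = area_scores(X, areas)
print(names, S.shape)
print(np.corrcoef(S, rowvar=False).round(2))
```

The seven correlated columns collapse to two summary variables, which can then enter the regression in place of the originals.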

And, the very best of luck to you,
Richard

Re: How to deal with collinearity? Thanks

Ornelas, Fermin-2
Just an observation on this reply. If you are referring to multiple
comparison tests for the means or the medians, this type of testing is
not relevant here. Having said that, your observation regarding the
sample size is valid. Anyone looking at 70 predictor variables should
have a large number of observations. In my own experience, for this
number of variables my development samples had 20,000 or more
observations.

Fermin Ornelas, Ph.D.
Management Analyst III, AZ DES
1789 W. Jefferson Street
Phoenix, AZ 85032
Tel: (602) 542-5639
E-mail: [hidden email]

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Richard Ristow
Sent: Thursday, November 01, 2007 1:49 PM
To: [hidden email]
Subject: Re: How to deal with collinearity? Thanks


Re: How to deal with collinearity? Thanks

Richard Ristow
At 12:51 PM 11/2/2007, Ornelas, Fermin wrote:

>If you are referring to multiple comparison tests for the means or the
>medians, this type of testing is not relevant here.

In that narrow sense, it is not relevant. But in a broad sense, it is
very relevant: running many significance tests of any kind raises the
risk of false 'significant' results. In this case, I'm thinking of the
t-tests for the regression coefficients. Again, with 70 coefficients
and no actual association at all, you expect 3.5 significant at p<.05,
with about an even chance of at least one significant at p<.01 (if the
tests were independent).

I'm not an expert, and I gather that the BONFERRONI correction can be
too conservative; but applying it, i.e. dividing the criterion p-value
by 70, you should trust only t-tests for which reported p<.05/70=0.0007.
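
A small sketch of that correction applied to a list of 70 p-values. The p-values are invented for illustration. Since the Bonferroni correction can be conservative, the sketch also shows Holm's step-down procedure as one commonly used, less conservative alternative; Holm is not mentioned elsewhere in this thread and is included only for comparison.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject where p <= alpha / m, with m the number of tests."""
    threshold = alpha / len(p_values)
    return threshold, [p <= threshold for p in p_values]

def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare the k-th smallest p with alpha / (m - k)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject

# Invented p-values for 70 coefficient t-tests.
p_values = [0.0001, 0.0007, 0.00073, 0.03] + [0.5] * 66

threshold, bonf = bonferroni_reject(p_values)
holm = holm_reject(p_values)
print(round(threshold, 6))
print(sum(bonf), sum(holm))
```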


Re: How to deal with collinearity? Thanks

Rita Clivio
Hi

I think you have to try to reduce the number of variables. Two ways I have
in mind:
- factor analysis (PCA), in order to run the regression on the factors and
then trace it back to the variables; I think that, having the weight of each
component and the score (loading) of each variable on that component, you
could get something similar to the "importance" of each variable for the
dependent variable;
- create a single variable from the variables which are collinear,
i.e. create a score derived from these variables.
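
The first option above can be sketched as a principal components regression in NumPy. This is an illustration under assumptions, not a recipe: the function name `pcr_fit`, the choice of two components, and the simulated data are all made up. The vector `beta_std` combines each component's regression weight with the variables' loadings, which is one way to get the "importance" of each original variable.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Regress y on the first principal components of the standardized IVs."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    V = vt[:n_components].T                  # loadings, one column per component
    T = Z @ V                                # component scores
    gamma, *_ = np.linalg.lstsq(T, y - y.mean(), rcond=None)
    beta_std = V @ gamma                     # implied weights on standardized IVs
    return beta_std, gamma, V

# Made-up data: x3 is close to x1 + x2, and y depends on x1 + x2.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + rng.normal(size=n)

beta_std, gamma, V = pcr_fit(X, y, n_components=2)
print(gamma.round(3))      # weight of each component
print(beta_std.round(3))   # implied weight of each original variable
```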

Maybe this approach is not "orthodox" and is more "qualitative"; it depends
on your goal (to measure exactly, or to understand connections and relative
weights?)

Bye

Rita


2007/11/3, Richard Ristow <[hidden email]>:

Re: How to deal with collinearity? Thanks

zhou yuming
In reply to this post by zhou yuming
Hi, all,

Thanks for your reply.

It seems that another regression technique, PLS (partial least squares),
is very suitable for this case (i.e. (1) collinearity, and (2) many IVs
with relatively few data points).
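
For anyone curious what PLS does mechanically, here is a minimal single-response PLS (PLS1) sketch in plain NumPy. The function name `pls1_fit`, the choice of two components, and the simulated data (40 cases, 10 correlated IVs) are assumptions for illustration; a real analysis would choose the number of components by cross-validation and would normally use an established implementation.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS1 by successive deflation; returns (coefficients, intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                 # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                    # component scores
        tt = t @ t
        p = Xk.T @ t / tt             # X loadings
        c = yk @ t / tt               # regression of y on the scores
        Xk = Xk - np.outer(t, p)      # deflate X
        yk = yk - c * t               # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, y_mean - x_mean @ beta

# Made-up data: 10 collinear IVs driven by 2 latent factors, only 40 cases.
rng = np.random.default_rng(1)
n, p = 40, 10
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, p)) + 0.1 * rng.normal(size=(n, p))
y = latent[:, 0] - latent[:, 1] + 0.1 * rng.normal(size=n)

beta, intercept = pls1_fit(X, y, n_components=2)
resid = y - (intercept + X @ beta)
rss = float(resid @ resid)
tss = float(((y - y.mean()) ** 2).sum())
print(round(1 - rss / tss, 3))   # in-sample R square
```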


Best regards

Yuming Zhou




Re: How to deal with collinearity? Thanks

Ornelas, Fermin-2
In reply to this post by Richard Ristow
I am not sure I follow your argument, but when this data problem occurs
the standard errors are inflated, the signs of the coefficients may not
match expectations, and removing variables from the regression function
causes large changes in the values of the regression coefficients and
their signs. In more severe cases, dependence among the predictors causes
X'X not to be of full rank, and you cannot get its inverse. Under this
scenario most software packages will give you a warning regarding the
validity of the parameter estimates, with the parameter estimates missing
for the perfectly collinear predictors.
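
The rank-deficient case is easy to reproduce on made-up data. In the sketch below one predictor is an exact linear combination of two others, so the design matrix X (and therefore X'X) has rank 3 rather than 4. How a package reacts is package-specific; NumPy's `lstsq`, used here only as an example, reports the reduced rank and returns a minimum-norm solution instead of omitting a coefficient.

```python
import numpy as np

# Made-up data: x3 is an exact linear combination of x1 and x2.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2 * x1 - x2
X = np.column_stack([np.ones(n), x1, x2, x3])
y = 1 + x1 + x2 + rng.normal(size=n)

# X has 4 columns but only 3 are linearly independent,
# so X'X (which has the same rank as X) is singular.
n_columns = X.shape[1]
rank = int(np.linalg.matrix_rank(X))
print(n_columns, rank)

# lstsq still returns an answer, but flags the rank deficiency.
coef, _, rank_ls, _ = np.linalg.lstsq(X, y, rcond=None)
print(int(rank_ls), coef.round(3))
```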

Regarding Bonferroni's procedure, this testing is used for multiple
comparisons, and one usually establishes an overall alpha value, say .20,
that gets adjusted by the number of multiple comparisons among the means
or medians.
For example, for nonparametric methods, if I am comparing 3 means then
the adjusted alpha becomes .2/(3*2) = .033. This value is used
to calculate the critical value.

Fermin Ornelas, Ph.D.
Management Analyst III, AZ DES
1789 W. Jefferson Street
Phoenix, AZ 85032
Tel: (602) 542-5639
E-mail: [hidden email]
