ridge regression multicollinearity


ridge regression multicollinearity

Anter
Hello,

I have a problem with multicollinearity in a multiple regression analysis. Two of my predictors and the outcome are correlated at .8, VIFs are around 4-6, tolerances are at .2-.3, and the condition index is 23. My dataset has 72 cases and 5 continuous predictors (13 variables once the dummy-coded categorical controls, such as age, tenure, etc., are included).
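(In case the relationship between these diagnostics is useful: writing R_j^2 for the R-squared obtained by regressing predictor j on the remaining predictors, VIF and tolerance are simply reciprocals of one another,

    VIF_j = 1 / (1 - R_j^2),    Tolerance_j = 1 - R_j^2 = 1 / VIF_j,

so a VIF of 4-6 corresponds to a tolerance of roughly .17-.25, consistent with the values above.)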

I am running SPSS 17. I understand that ridge regression, run with the CATREG command, can be used to deal with the multicollinearity. However, I do not know which options I should choose (I have read about the MULTIPLYING discretization option, but what else?), how to get p values for the predictors' weights, and, most importantly, how to interpret the results.

I would appreciate it if anyone could suggest a way to handle this or point me to more information about the procedure (I have tried googling, and this is as far as I got).


Andra Toader
E-mail: [hidden email]


Re: ridge regression multicollinearity

Bruce Weaver
Administrator
In addition to the problems you mention, you are over-fitting your model (i.e., you have too many variables for the amount of data).  For a good overview of over-fitting, check out Mike Babyak's nice article.  

   http://www.psychosomaticmedicine.org/content/66/3/411.short
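With 72 cases and 13 variables in the model, that works out to roughly 72/13 ≈ 5.5 cases per predictor, well below the 10-15 cases per predictor often quoted as a rough minimum (the exact threshold is debatable, but the general point stands).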

Have to get to a meeting, so no time to address the other problems right now!

HTH.


--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: ridge regression multicollinearity

Poes, Matthew Joseph
In reply to this post by Anter

This topic could get dense, and I'm sure a lot of people have strong opinions on it one way or another. My experience is that with many models and smaller samples, it simply doesn't make a huge difference precisely how you do this, as long as you follow certain protocols. First, you need to know what the variables are (continuous, categorical, ordinal) and pick the correct way to handle each. If you just want to run a ridge regression with all continuous linear variables, then you want LEVEL=NUME, i.e., the variable level is numeric. Then you have the issue of how to discretize the variables in the transformation. For a numeric variable you use either RANKING or MULTIPLYING. MULTIPLYING is the equivalent of a normal linear regression; RANKING is similar to the non-parametric ranking procedures, and the same thought process applies: if the values of your variables are non-normally distributed, a ranking procedure will give you a more normal distribution. You also pick how to handle missing data; I tend to go with listwise deletion.

The key to making this a ridge regression is the regularization step, which is what deals with the multicollinearity. In my code below this is REGULARIZATION=RIDGE; the parameters after it are the default values. To generate this I literally used the point-and-click interface and pasted the resulting syntax, so it shows the standard setup you would get.

Now you have another problem: your model is arguably over-fitted, and that creates issues of its own. I would argue you should try to estimate fewer variables in your model, given your sample of only 72, but if theory dictates that all the variables are crucial, then I strongly encourage you to add the bootstrap estimates.

In terms of interpreting the output, I would look at the website linked below, as anything I would say would likely just repeat what is there. Basically, remember what you are trying to do: adjust the model via a constant so that the multicollinearity is reduced while the R-squared stays roughly the same. I tried to run a sample of data to show you, but unfortunately the variables weren't correlated enough for ridge regression to be appropriate, and those that were didn't change in the way you would want in a ridge regression.

http://www.coe.fau.edu/faculty/morris/STA7114%20Files/Lab%203/Instructions/ridge_regression.htm
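To make the "constant" concrete: in ordinary ridge regression (with standardized predictors; as I understand it, CATREG applies the same idea to the optimally scaled, transformed variables), you minimize a penalized least-squares criterion rather than the usual one,

    minimize  || y - X b ||^2  +  lambda * SUM_j b_j^2 ,

which has the closed-form solution

    b_ridge = (X'X + lambda * I)^(-1) X'y .

So the penalty lambda is literally a constant added to the diagonal of X'X. That added constant is what stabilizes the coefficients of highly correlated predictors: for small lambda it shrinks the weights and reduces their variance while leaving the overall fit (R-squared) nearly unchanged, which is the trade-off you look at in the regularization plot.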

 

 

CATREG VARIABLES=famrewwk sestch sesstr sesneg seshrsh seseffi fambldg famhmwk
  /ANALYSIS=famrewwk(LEVEL=NUME) WITH sestch(LEVEL=NUME) sesstr(LEVEL=NUME) sesneg(LEVEL=NUME)
    seshrsh(LEVEL=NUME) seseffi(LEVEL=NUME) fambldg(LEVEL=NUME) famhmwk(LEVEL=NUME)
  /DISCRETIZATION=famrewwk(MULTIPLYING) sestch(MULTIPLYING) sesstr(MULTIPLYING) sesneg(MULTIPLYING)
    seshrsh(MULTIPLYING) seseffi(MULTIPLYING) fambldg(MULTIPLYING) famhmwk(MULTIPLYING)
  /MISSING=famrewwk(LISTWISE) sestch(LISTWISE) sesstr(LISTWISE) sesneg(LISTWISE) seshrsh(LISTWISE)
    seseffi(LISTWISE) fambldg(LISTWISE) famhmwk(LISTWISE)
  /MAXITER=100
  /CRITITER=.00001
  /PRINT=R COEFF OCORR CORR ANOVA DESCRIP(sestch) REGU
  /INITIAL=RANDOM
  /PLOT=REGU
  /REGULARIZATION=RIDGE(0.0,1.0,0.02)(DataSet2)
  /RESAMPLE=BOOTSTRAP(500).
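For what it's worth, if I'm reading that syntax correctly: RIDGE(0.0,1.0,0.02) asks CATREG to fit the model for ridge penalties running from 0 to 1 in steps of 0.02 (with (DataSet2) naming a dataset to hold the regularized results), /PLOT=REGU requests the regularization plot of the coefficient paths across those penalty values, and /RESAMPLE=BOOTSTRAP(500) adds 500 bootstrap resamples so you get some indication of how stable the weights are. You would pick the penalty at which the coefficients have settled down but the R-squared has not dropped much.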

 

Matthew J Poes
Research Data Specialist
Center for Prevention Research and Development
University of Illinois
510 Devonshire Dr.
Champaign, IL 61820
Phone: 217-265-4576
email: [hidden email]



Re: ridge regression multicollinearity

Anter
Thank you for the suggestions!

Indeed, I found the website when searching for interpretation ideas.

I have now managed to run the model with the parameters set as you advised. It is now difficult to understand what is happening, since my knowledge of this procedure is less than sufficient. Basically, I get different weights for the predictors, and now they are all significant in the model, whereas before only the two highly correlated ones were. I do not understand whether the multicollinearity indices are supposed to change as well: the tolerances are still quite low, and the correlations between the transformed variables are roughly the same or slightly larger (between the highly correlated variables).

Yes, I had considered that it could be a problem of overfitting; the dataset is well below the advised number of cases per variable. I will exclude the controls, and hopefully 5 predictors will be OK. The problem is that I had planned to use the data for a mediation analysis, which now seems less feasible. Perhaps the results would have been biased anyway.

Regards,


Andra Toader
E-mail: [hidden email]

