Hi guys,
I'm trying to run a negative binomial regression (the dependent variable is count data), and I have the following questions:

1. I need to test an interaction term X*M, where X is my independent variable and M is a moderator variable. Should I standardize or center X and M to avoid multicollinearity? And would only these variables be standardized/centered, or also the rest of the variables used in the analysis (i.e., the control variables)?

2. How can I check that there is no multicollinearity when using a negative binomial regression?

3. Initially, I logged some of my independent variables, so I have, for example, "X_logged" (X being the variable mentioned above). Do I center/standardize these transformed variables (now in log) to calculate the interaction term? Or should I standardize/center the original one, that is, "X" (without the log)?

Sorry if this is too basic, I'm starting my PhD... THANKS for your help!
Embedded replies. Gene Maguin
-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Student073
Sent: Saturday, July 20, 2013 7:37 PM
To: [hidden email]
Subject: NEGATIVE BINOMIAL -- Interaction terms and check multicollinearity

Hi guys, I'm trying to run a negative binomial regression analysis (the dependent variable is count data), and I have the following questions:

>>1. I need to test an interaction term of X*M, being X my independent variable and M a moderator variable. Should I standardize or center the variables X and M to avoid multicollinearity? And, would these variables be the only ones standardized/centered or also the rest of variables used in the analysis (i.e. control variables)?

Centering does not reduce collinearity (do a search on this topic in either the SPSS archive or on the web and you'll find demonstrations and proofs). Given an interaction, centering reduces the correlations between the component variables and the product term, true, but it does not reduce the standard errors for the component variables or the product term, which is what you would want.

>>2. How can I check there's no multicollinearity using a negative binomial regression?

Multicollinearity involves the predictor variables, so you should begin by looking at the correlations/associations among them. Begin with the largest correlations/associations and work your way down. You'll have one of two situations for each pair of variables: the two variables are logically-semantically-conceptually unrelated, or they are related. For the subsets of variables that are related, can you justify combining them by building a composite variable? If you find more than two variables that are inter-related, you should be able to fit a one-factor model. For variables that are unrelated, the question is whether to keep them in the analysis model or omit them; the decision should depend on your hypothesis and lit review.

Part 2 is the analysis. Collinearity drives up the standard errors (SEs). So as variables are entered, watch how the standard errors of already-entered variables change. I don't know that there is standard guidance about how much change is too much change. Keep in mind that as you enter interactions the SEs will increase, maybe dramatically.

>>3. Initially, I logged some of my independent variables, so I have for example "X_logged" (being X the variable mentioned before). Do I center/standardize these transformed variables (now in log) to calculate the interaction term? Or should I standardize/center the original one? That is, "X" (without the log).

Sounds like you haven't tried the center then log route. Do it. You'll have your answer.

>>Sorry if this is too basic, I'm starting my PhD... THANKS for your help!

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/NEGATIVE-BINOMIAL-Interaction-terms-and-check-multicollinearity-tp5721288.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command INFO REFCARD
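One way to read the "center then log" remark is that centering first leaves some values at or below zero, where the logarithm is undefined. The sketch below illustrates that reading only; the data values and the helper names (`center`, `safe_log`) are hypothetical and not from the thread.

```python
import math


def center(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def safe_log(values):
    """Natural log where it is defined (value > 0), otherwise None."""
    return [math.log(v) if v > 0 else None for v in values]


# Hypothetical positive predictor values (assumption, for illustration only).
x = [1.0, 2.0, 4.0, 8.0, 16.0]

# Route A: log first, then center the logged variable.
log_then_center = center([math.log(v) for v in x])

# Route B: center first, then try to log. Below-mean values become negative.
center_then_log = safe_log(center(x))

print("log then center:", [round(v, 3) for v in log_then_center])
print("center then log:", center_then_log)
print("undefined after centering first:", center_then_log.count(None))
```

Under these assumptions, route A gives a centered variable on the log scale, while route B loses every case whose raw value is at or below the mean.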
Thanks! Your clear answer helped me a lot.
I have one more related question: if some of my independent variables are highly correlated (0.7 or 0.8, for example), but the VIF is lower than 5 (so multicollinearity does not seem to be a problem), should I keep all of these correlated variables, or drop some of them from the analysis?

I know you mentioned I could build a composite variable... but I wonder if, given that I have no multicollinearity according to the VIF, I could leave them all in.

I've read on the web that high correlation does not imply multicollinearity directly (although it suggests it is possible); is this correct?

Again, many thanks.
You may find some of the posts in this old Medstats discussion helpful:
https://groups.google.com/forum/#!topic/MedStats/wU-l9ycS350 See my post, and posts by Scott Millis and Peter Flom, for example.
--
Bruce Weaver bweaver@lakeheadu.ca http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." PLEASE NOTE THE FOLLOWING: 1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. 2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
In reply to this post by Student073
Embedded. Gene Maguin
-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Student073
Sent: Tuesday, July 23, 2013 6:52 AM
To: [hidden email]
Subject: Re: NEGATIVE BINOMIAL -- Interaction terms and check multicollinearity

Thanks! Your clear answer helped me a lot.

>>I have one more related question: If I have some independent variables highly correlated (0.7, 0.8, for example), but then I check the VIF and I find this is lower than 5 (therefore, it seems multicollinearity is not a problem), should I leave all these correlated variables, or drop some of them from the analysis?

How are you calculating your VIF? Perhaps I'm wrong, but I thought that multiple regression was the only command that yielded VIF and tolerance data. Could you be running multiple regressions to evaluate collinearity even though you have a dichotomous DV?

My opinion is that the answer is not a statistics question per se; rather, it is a theory-hypothesis question. If your theory specifies that A, B, and C are each significant predictors of Y, then my opinion is that you should test that model. If, on the other hand, A, B, and C are plausibly related measures or variant measures of a construct, I think you gain more by testing and building a composite. I think you said you are working on your dissertation, so I'd also say that this is a chair/committee discussion question.

>>I know you mentioned I could build a composite variable... but I wonder if, given that I have no multicollinearity according to the VIF, I could leave them all.

Yes, you absolutely could.

>>I've read on the web that high correlation does not imply multicollinearity directly (although it suggests it is possible); is this correct?

To my knowledge this is true. The IVs' correlations with the DV also matter.

>>Again, many thanks.

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/NEGATIVE-BINOMIAL-Interaction-terms-and-check-multicollinearity-tp5721288p5721307.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.
In reply to this post by Student073
I don't know where you get this rule of thumb, but are you really, really totally indifferent to the notion that errors are "only" increased by a factor that is less than 5-fold? - I don't like to *see* increases of 2-fold, though they aren't always possible to avoid.

The only *strict* definition of multicollinearity is the one that says the VIF is infinite, because the set of variables has redundancy. Otherwise, you are dealing with someone else's experience of what has been pragmatically useful in their area.

In clinical research, scales that are correlated at 0.7 are usually at the limit of their reliability. That means that the only way you will see a higher correlation is when they also have error that is correlated -- such as, by being sub-scales of one set of ratings. So, I keep in mind that it may be an illusion that they are different in *any* interesting respect.

Keep them separate, or make a composite? Your circumstance is your own. Combining them means that you simplify the narrative that you are telling from your data, and you increase the power of the test of *this* hypothesis, the one represented by the latent factor of whatever the two (or more) scores represent.

"Correlated error" brings up the prospect that you might find value in using a variable that represents the *difference* between two highly correlated scores -- partly because taking the difference removes some of the error. This is a topic that falls under the heading of "suppressor variables," about which much has been written.

--
Rich Ulrich

> Date: Tue, 23 Jul 2013 03:51:46 -0700
> From: [hidden email]
> Subject: Re: NEGATIVE BINOMIAL -- Interaction terms and check multicollinearity
> To: [hidden email]
>
> Thanks! Your clear answer helped me a lot.
>
> I have one more related question: If I have some independent variables
> highly correlated (0.7, 0.8, for example), but then I check the VIF and I
> find this is lower than 5 (therefore, it seems multicollinearity is not a
> problem), should I leave all these correlated variables, or drop some of
> them from the analysis?
>
> I know you mentioned I could build a composite variable... but I wonder if,
> given that I have no multicollinearity according to the VIF, I could leave
> them all.
>
> I've read on the web that high correlation does not imply multicollinearity
> directly (although it suggests it is possible); is this correct?
>
> Again, many thanks.
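For reference when reading VIF cut-offs such as 2 or 5: the VIF is defined on the variance scale, so the standard error of a coefficient is inflated by the square root of the VIF relative to the case where that predictor is uncorrelated with the others. A minimal sketch (the function name `se_inflation` is hypothetical):

```python
import math


def se_inflation(vif):
    """Factor by which a coefficient's standard error grows for a given VIF."""
    return math.sqrt(vif)


for vif in (2.0, 4.0, 5.0):
    print(f"VIF = {vif:.1f} -> SE multiplied by {se_inflation(vif):.3f}")
```

So a VIF of 5 corresponds to standard errors a little more than twice as large, and a VIF of 4 to exactly twice as large.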
In reply to this post by Maguin, Eugene
Gene, I'm not sure if you meant to imply that it would be wrong to use the REGRESSION procedure to compute tolerance and VIF when the outcome variable is dichotomous. Given that tolerance and VIF are computed using only the explanatory variables (i.e., it doesn't matter what you use as the DV--see the examples below), I would argue that it is fine to use REGRESSION for that purpose, regardless of the nature of the outcome variable.

* Demonstration that VIF and Tolerance (from REGRESSION)
* are not affected by which variable is used as the outcome
* variable (so long as the cases used for the analysis
* remain the same).

* Modify the FILE HANDLE below as necessary.
FILE HANDLE TheDataFile /NAME="C:\SPSSdata\survey_sample.sav".

NEW FILE.
DATASET CLOSE all.
GET FILE = "TheDataFile".

DESCRIPTIVES sex race age educ paeduc maeduc speduc.

* Keep only cases that have valid data for all of these variables.
SELECT IF NMISS(sex,race,age,educ,paeduc,maeduc,speduc) EQ 0.
DESCRIPTIVES sex race age educ paeduc maeduc speduc.

* Regression with Y = AGE and X = the 4 EDUC variables.
REGRESSION
  /STATISTICS COEFF OUTS CI(95) R ANOVA COLLIN TOL
  /DEPENDENT age
  /METHOD=ENTER educ paeduc maeduc speduc.

* Repeat, but with Y = Sex.
REGRESSION
  /STATISTICS COEFF OUTS CI(95) R ANOVA COLLIN TOL
  /DEPENDENT sex
  /METHOD=ENTER educ paeduc maeduc speduc.

* Again with Y = race.
REGRESSION
  /STATISTICS COEFF OUTS CI(95) R ANOVA COLLIN TOL
  /DEPENDENT race
  /METHOD=ENTER educ paeduc maeduc speduc.

* With Y = the case number in the file.
COMPUTE CaseNum = $casenum.
REGRESSION
  /STATISTICS COEFF OUTS CI(95) R ANOVA COLLIN TOL
  /DEPENDENT CaseNum
  /METHOD=ENTER educ paeduc maeduc speduc.

* Notice that Tolerance and VIF are the same for all of these models.
* Why? Because they are measures of how well each explanatory
* variable in turn can be predicted from the other explanatory
* variables in the model.

RESULTS:

With Y = Age
  Tol.    VIF
  .581   1.720
  .520   1.925
  .549   1.821
  .639   1.564

With Y = Sex
  Tol.    VIF
  .581   1.720
  .520   1.925
  .549   1.821
  .639   1.564

With Y = Race
  Tol.    VIF
  .581   1.720
  .520   1.925
  .549   1.821
  .639   1.564

With Y = Case number
  Tol.    VIF
  .581   1.720
  .520   1.925
  .549   1.821
  .639   1.564

HTH.
--
Bruce Weaver
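As a quick cross-check of the results table in the post above, the sketch below uses the standard relationships tolerance = 1 - R² (from regressing one predictor on the others) and VIF = 1 / tolerance. Neither involves the outcome variable, which is consistent with the identical values across the four models. Only the four (Tol., VIF) pairs are taken from the post; the variable names are hypothetical.

```python
# (tolerance, VIF) pairs as reported in the RESULTS table.
reported = [(0.581, 1.720), (0.520, 1.925), (0.549, 1.821), (0.639, 1.564)]

for tol, vif in reported:
    r_squared = 1.0 - tol  # share of this predictor explained by the others
    print(f"tol={tol:.3f}  R^2={r_squared:.3f}  "
          f"1/tol={1.0 / tol:.3f}  reported VIF={vif:.3f}")

# Largest disagreement between 1/tolerance and the reported VIF;
# anything this small is explained by rounding to three decimals.
max_gap = max(abs(1.0 / tol - vif) for tol, vif in reported)
print(f"largest gap: {max_gap:.4f}")
```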
Bruce, Jon,
What I wrote with respect to using the regression command and to saying that the DV was also involved in collinearity was wrong. Thank you for correcting my bad information.

Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Bruce Weaver
Sent: Tuesday, July 23, 2013 3:31 PM
To: [hidden email]
Subject: Re: NEGATIVE BINOMIAL -- Interaction terms and check multicollinearity

Maguin, Eugene wrote
> How are you calculating your VIF? Perhaps I'm wrong but I thought that
> multiple regression was the only command that yielded VIF and
> tolerance data. Could you be running multiple regressions to evaluate
> collinearity even though you have a dichotomous DV?

Gene, I'm not sure if you meant to imply that it would be wrong to use the REGRESSION procedure to compute tolerance and VIF when the outcome variable is dichotomous. Given that tolerance and VIF are computed using only the explanatory variables (i.e., it doesn't matter what you use as the DV--see the examples below), I would argue that it is fine to use REGRESSION for that purpose, regardless of the nature of the outcome variable.

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/NEGATIVE-BINOMIAL-Interaction-terms-and-check-multicollinearity-tp5721288p5721312.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.
In reply to this post by Student073
High correlations always imply multicollinearity, but the reverse is true. When you have more than two correlated variables, you can have high multicollinearity even in the absence of high zero-order correlation. A zero-order correlation of .80 would not necessarily produce a VIF greater than 5.

David Greenberg, Sociology Department, New York University

On Tue, Jul 23, 2013 at 6:51 AM, Student073 <[hidden email]> wrote:
> Thanks! Your clear answer helped me a lot.
The reverse is true because the linear combination of two predictors could account for a large proportion of the variance of a third predictor (and one would not necessarily know that by looking at the zero-order correlations).

MTC.

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of David Greenberg

> High correlations always imply multicollinearity, but the reverse is true. When you have more than two correlated variables, you can have high multicollinearity even in the absence of high zero-order correlation. A zero-order correlation of .80 would not necessarily produce a VIF greater than 5.
>
> David Greenberg, Sociology Department, New York University
>
> On Tue, Jul 23, 2013 at 6:51 AM, Student073 <[hidden email]> wrote:
> Thanks! Your clear answer helped me a lot.
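The two numeric points in the posts above can be checked with standard formulas. With only two predictors, VIF = 1 / (1 - r²). With three, the R² from regressing x3 on x1 and x2 follows from the three pairwise correlations. The sketch below is illustrative only; the correlation values and function names are hypothetical.

```python
def vif_two_predictors(r):
    """VIF when the model contains exactly two predictors correlated at r."""
    return 1.0 / (1.0 - r ** 2)


def r2_third_on_two(r12, r13, r23):
    """R^2 from regressing x3 on x1 and x2, given pairwise correlations."""
    return (r13 ** 2 + r23 ** 2 - 2.0 * r12 * r13 * r23) / (1.0 - r12 ** 2)


# Case 1: a single pairwise correlation of .80 and nothing else.
vif_pair = vif_two_predictors(0.80)

# Case 2 (hypothetical): x1 and x2 uncorrelated, each correlated .65 with x3.
# No pairwise correlation exceeds .65, yet x1 and x2 jointly predict x3 well.
r2_x3 = r2_third_on_two(0.0, 0.65, 0.65)
vif_x3 = 1.0 / (1.0 - r2_x3)

print(f"two predictors, r = .80: VIF = {vif_pair:.2f}")
print(f"three predictors: R^2 for x3 = {r2_x3:.3f}, VIF = {vif_x3:.2f}")
```

Under these assumptions, a correlation of .80 gives a VIF below 3, while the three-variable case gives a VIF above 6 even though no single correlation exceeds .65.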
In reply to this post by David Greenberg
David, I think you left out a NOT--i.e., the reverse is NOT true (i.e., multicollinearity does NOT imply high bivariate correlations).
Re the first part of your statement, I would add that high bivariate correlations do not always imply problematic multicollinearity. For example, if I include both X and X-squared in a model, the correlation between them may be quite high (depending on whether I centered or not, etc). But that high correlation would not indicate anything problematic. (By centering, I might make things look better in terms of a lower correlation, but it wouldn't change anything important about the model. I'd get exactly the same fitted value for each case with or without centering.) HTH.
--
Bruce Weaver
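The X and X-squared point in the post above can be sketched numerically. Centering changes the correlation between the linear and squared terms, but any quadratic written in centered X is the same curve rewritten in raw X, so the two parameterizations describe the same set of fitted values. The data and the coefficients `b0`, `b1`, `b2` below are hypothetical, and no model is actually fitted here.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # hypothetical predictor values
m = mean(x)
xc = [v - m for v in x]             # centered predictor

r_raw = pearson(x, [v ** 2 for v in x])
r_centered = pearson(xc, [v ** 2 for v in xc])

# Hypothetical coefficients for a model in centered X ...
b0, b1, b2 = 2.0, 0.5, -0.3
# ... and the algebraically equivalent coefficients in raw X,
# from expanding b0 + b1*(x - m) + b2*(x - m)**2.
a2 = b2
a1 = b1 - 2.0 * b2 * m
a0 = b0 - b1 * m + b2 * m ** 2

fitted_centered = [b0 + b1 * c + b2 * c ** 2 for c in xc]
fitted_raw = [a0 + a1 * v + a2 * v ** 2 for v in x]

print(f"corr(X, X^2) raw:      {r_raw:.3f}")
print(f"corr(X, X^2) centered: {r_centered:.3f}")
print("same fitted values:",
      all(abs(p - q) < 1e-9 for p, q in zip(fitted_centered, fitted_raw)))
```

The centered correlation is essentially zero here only because the hypothetical X values are symmetric about their mean; with skewed data it would shrink but generally not vanish.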