Hey everyone
I am using logistic regression in my assignments, and the correlation matrix shows that only two of the four independent variables are significantly correlated with the outcome. Should I enter only these two variables in the regression analysis, or should I enter all four?
Hello Sahar. Bivariate pre-screening is not recommended. See http://biostat.mc.vanderbilt.edu/wiki/Main/ManuscriptChecklist, and scroll down to the points commenting on use of stepwise regression and "Lack of insignificant [I would say non-significant] variables in the final model". See also Mike Babyak's nice article on over-fitting regression models, which also addresses the issue of bivariate pre-screening.
http://os1.amc.nl/mediawiki/images/Babyak_-_overfitting.pdf
If the variables are interesting, or things you wish to control for, and you have enough events-per-variable to include them (something you can read about in the Babyak article), then you ought to keep them in the model. HTH.
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."
PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
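To make the events-per-variable idea mentioned above concrete, here is a minimal sketch in Python. It assumes the usual definition (the number of cases in the rarer outcome category divided by the number of predictors) and uses made-up data; the function name and the counts are hypothetical. Check the Babyak article for the guidance on how many events per variable is enough.

```python
from collections import Counter


def events_per_variable(outcomes, n_predictors):
    """Return (cases in the rarer outcome category) / (number of predictors).

    Assumes a binary outcome; the rarer category is what limits the model.
    """
    counts = Counter(outcomes)
    if len(counts) != 2:
        raise ValueError("expected a binary outcome with both categories present")
    if n_predictors < 1:
        raise ValueError("need at least one predictor")
    limiting_events = min(counts.values())
    return limiting_events / n_predictors


# Hypothetical assignment data: 200 cases, 60 of them coded 1.
outcomes = [1] * 60 + [0] * 140

epv_all_four = events_per_variable(outcomes, 4)  # all four IVs entered
epv_only_two = events_per_variable(outcomes, 2)  # only the two "significant" IVs

print(f"EPV with 4 predictors: {epv_all_four}")
print(f"EPV with 2 predictors: {epv_only_two}")
```

With these made-up counts, entering all four variables still leaves 15 events per variable, so sample size alone would not be a reason to drop the two non-significant ones.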
In reply to this post by sahar
Sahar,
What you enter into a regression model depends on your purpose for running it. It is certainly possible for a variable that does not have a significant bivariate relationship with your DV to make a significant incremental contribution to the prediction of the DV when it is added to a model at a later stage. The IV in question may have a significant unique contribution in the context of other predictors in the model. However, the choice of what you enter into a regression is best made on conceptual grounds, for reasons that have been discussed at length on this list.
Best Regards,
Stephen Brand
www.StatisticsDoc.com
-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of sahar
Sent: Sunday, December 02, 2012 10:40 PM
To: [hidden email]
Subject: Logistic regression
--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Logistic-regression-tp5716589.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.
=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
In reply to this post by sahar
First, we don't like to answer people's assignments for them, as the purpose of assignments is for the student to learn. However, I do think a short lesson is in order here.
First, with a logistic regression, you can't run a correlation matrix. Your outcome is binary, correct? That means it typically takes two values, often 0 and 1. The correlation matrix would be with the probability distribution, which you would first need to create. Did the correlation matrix come from the logistic regression, or did you simply run a correlation on the variables? If the latter, your findings don't mean anything, right? I personally disagree with the view that you should build an analysis from a more basic analysis to a more advanced one (e.g. from a correlation to a regression), as I feel that often confuses the issues. In this particular case, it doesn't even make sense. A correlation is the degree to which one variable varies along with another (i.e. as one goes up, the other goes up; as one goes down, the other goes down), right? If the variable is simply a 0 or a 1, then you can't have that variable vary with another; it simply is or isn't a certain value.
Instead, a logistic regression models the probability that an individual case is one event rather than the other, using what is known as a logistic function. This function is a function of the explanatory variables (IVs). The DV is categorical and often thought of as an event (e.g. yes or no), but is converted to a continuous variable as probability scores. In this way, something like a common OLS regression can be run on the DV. However, because this is a probability distribution, the residuals are not normally distributed. Remember that you have probably learned that an assumption of linear regression is a normally distributed set of residuals. Since it's not possible to have this with logistic regression, an iterative process known as maximum likelihood estimation is used instead. To keep this topic simple, this is why logistic regression looks and feels so different from linear regression.
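As a rough illustration of the logistic function described above, here is a minimal Python sketch. The intercept, coefficients, and case values are made up purely for illustration, and the function names are hypothetical. It only shows how a linear combination of the IVs is turned into a probability between 0 and 1; it does not fit a model.

```python
import math


def logistic(z):
    """Map any real number z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predicted_probability(intercept, coefficients, values):
    """Probability of the event for one case, given its IV values.

    The linear predictor (intercept + sum of coefficient * value) is on the
    log-odds scale; the logistic function converts it to a probability.
    """
    linear_predictor = intercept + sum(b * x for b, x in zip(coefficients, values))
    return logistic(linear_predictor)


# Hypothetical fitted values for two IVs, and one hypothetical case.
intercept = -1.0
coefficients = [0.5, -0.25]
case = [2.0, 4.0]

p = predicted_probability(intercept, coefficients, case)
print(f"predicted probability: {p:.3f}")
```

However large or small the linear predictor gets, the result stays between 0 and 1, which is why the coefficients cannot be read directly as changes in probability.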
However, I think a student new to logistic regression would be better able to interpret it if they remember that, at its essence, it's very similar to an OLS regression. The difference is that whatever your actual DV is, it's now thought of as probabilities. So, for example, suppose you are looking to see whether someone smokes based on a set of predictors. Your predictors (explanatory variables, or IVs) are age, gender, and socioeconomic status. Your predicted or explained variable (the DV) is whether they smoke. The variable exists as a yes or no, coded as 0 for no and 1 for yes. Then you run your model. The smoking variable is converted to a probability distribution. A regression is run on this distribution, and you are presented with a beta coefficient for each IV on the DV.
How do you interpret these? Remember that, for all practical purposes, you have a linear relationship being predicted between the IV and the DV, the difference being that the DV is not whether they smoke but the probability that they smoke. You interpret it as: for every 1-point increase in the IV, there is an increase of the beta coefficient's value, in points, in the probability that they smoke. So let's say that the coefficient for age turns out to be .89. That means that for every 1 year older, the person is .89 probability points higher on their likelihood of smoking. Well, what is a probability point? It's not intuitive, so few researchers interpret these directly. Instead we take the exponent of the coefficient, which gives us the odds ratio, which is more similar to the kinds of outcomes we have in linear regression (as it meets the criteria of a true continuous variable, it goes from negative infinity to positive infinity). On top of that, it's intuitive to interpret. The odds ratio for a coefficient of .89 is then roughly 2, and can be interpreted as meaning that for every 1 year older someone is, they are 2 times more likely to smoke.
This is thus an interpretation that people can understand.
Matthew J Poes
Research Data Specialist
Center for Prevention Research and Development
University of Illinois
510 Devonshire Dr.
Champaign, IL 61820
Phone: 217-265-4576
email: [hidden email]
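The arithmetic in the smoking example can be checked with a short Python sketch. It assumes the .89 coefficient for age is on the log-odds (logit) scale, which is how logistic regression output is conventionally reported; the starting probabilities and the helper name are hypothetical.

```python
import math

beta_age = 0.89  # coefficient for age from the example above

# Exponentiating the coefficient gives the odds ratio per 1-year increase.
odds_ratio = math.exp(beta_age)
print(f"odds ratio per year: {odds_ratio:.2f}")


def probability_after_one_year(p_start):
    """Apply the odds ratio to a starting probability.

    Converts the probability to odds, multiplies by the odds ratio,
    and converts back to a probability.
    """
    odds_start = p_start / (1.0 - p_start)
    odds_new = odds_start * odds_ratio
    return odds_new / (1.0 + odds_new)


# Hypothetical starting probabilities: the change in probability differs
# for each one, even though the odds ratio is the same.
for p_start in (0.10, 0.50, 0.80):
    p_new = probability_after_one_year(p_start)
    print(f"start {p_start:.2f} -> after one more year {p_new:.3f}")
```

The odds are multiplied by the same factor (about 2.4 here) whatever the starting point, but the change in probability is not constant, so the coefficient is not a fixed number of "probability points".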
In reply to this post by Poes, Matthew Joseph
Matthew: Here is my two cents' worth. I would differentiate between cohort studies and case-control studies, and note that an OR should not be interpreted as a ratio of probabilities. Relative risks are ratios of probabilities; ORs are not. Thus the statement "2 times more likely" is incorrect; it would be better stated as "the odds are two times greater for ........"
martin
Martin F. Sherman, Ph.D.
Professor of Psychology
Director of Masters Education in Psychology: Thesis Track
Loyola University Maryland
Department of Psychology
222 B Beatty Hall
4501 North Charles Street
Baltimore, MD 21210
410-617-2417
[hidden email]
-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Poes, Matthew Joseph
Sent: Monday, December 03, 2012 12:56 PM
To: [hidden email]
Subject: Re: Logistic regression
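The distinction between an odds ratio and a relative risk can be seen with a small Python sketch. The probabilities below are invented for illustration and the function names are hypothetical. In both scenarios the exposed group is exactly twice as likely to have the outcome, yet the odds ratios differ.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def relative_risk(p_exposed, p_unexposed):
    """Ratio of two probabilities."""
    return p_exposed / p_unexposed


def odds_ratio(p_exposed, p_unexposed):
    """Ratio of two odds."""
    return odds(p_exposed) / odds(p_unexposed)


# Hypothetical scenarios: (probability if exposed, probability if unexposed).
scenarios = {
    "common outcome": (0.80, 0.40),
    "rare outcome": (0.02, 0.01),
}

for label, (p1, p0) in scenarios.items():
    rr = relative_risk(p1, p0)
    oratio = odds_ratio(p1, p0)
    print(f"{label}: relative risk = {rr:.2f}, odds ratio = {oratio:.2f}")
```

When the outcome is rare the two measures nearly coincide, but when it is common the odds ratio (about 6 in the first scenario) is far from the relative risk of 2. That is why "the odds are two times greater" is the safer wording for an OR of 2.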