Dear Listers,

First, thanks to Martin Holt, Ryan Black, and Bruce Weaver for helpful comments on another recent logistic regression question. This one is much more basic, but very surprising (to me, anyway).

I have 32 cases, divided into 16 and 16, with a dichotomous outcome. The data look like this (Group is A or B; outcome is Yes or No):

          Yes   No
    A      16    0
    B       6   10

As you might expect, chi-square is highly significant: 14.5, p < .001. However, using these data in a binomial logistic regression with additional continuous predictor variables yielded weirdly high p values for Group: like p = .996.

I eliminated the continuous predictors, so there was just the dichotomous predictor and the dichotomous outcome. Results: the classification table showed overall correct classification as 81.3%, but the "Variables in the Equation" table (Step 1) showed, for Group: B = 21.7, S.E. = 10048.2, Sig. = .998. Obviously the huge SE was what was making it non-significant.

I finally decided the problem had to be the empty cell. I switched one of the outcome values and re-ran:

          Yes   No
    A      15    1
    B       6   10

Results: a less significant chi-square, of course, but the following for Group: B = 3.2, S.E. = 1.2, Sig. = .005.

SPSS Help says, under Data Considerations: "However, your solution may be more stable if your predictors have a multivariate normal distribution. Additionally, as with other forms of regression, multicollinearity among the predictors can lead to biased estimates and inflated standard errors." Inflated? I guess so, by about 10,000 times! It would have been nice if this section simply said, "Does not work with an empty cell."

Anybody know a way around this problem that won't lose power? Remember, I want to include continuous predictors also. I have not tried it with plain MR, but I don't see why that would be different.

Thanks!

Allan Lundy, PhD
Research Consulting
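(For anyone who wants to reproduce this without the original file, something along the following lines should show the same behaviour; the variable names and 0/1 codings are illustrative, not Allan's actual ones, and the zero-frequency row simply contributes no cases.)

* Enter the 2x2 table as weighted cell counts (group: 1=A, 0=B; outcome: 1=Yes, 0=No).
DATA LIST FREE / group outcome freq.
BEGIN DATA
1 1 16
1 0 0
0 1 6
0 0 10
END DATA.
WEIGHT BY freq.
* The empty A/No cell means Group A predicts Yes perfectly, so the coefficient
* wanders off towards infinity and the standard error explodes.
LOGISTIC REGRESSION VARIABLES outcome
  /METHOD=ENTER group.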
Hi Allan,
Remember that it's the **expected** counts that matter, rather than the actual counts.
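To make that concrete with Allan's table (using the usual formula, expected count = row total x column total / N):

    E(A, Yes) = E(B, Yes) = 16 * 22 / 32 = 11.0
    E(A, No)  = E(B, No)  = 16 * 10 / 32 =  5.0

So even with one observed cell at zero, no expected count falls below 5 here.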
The following link is excellent, taking you into, through, and out the other side of 2x2 tables. It expands on the methods section of the paper: Campbell, I. (2007). Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample recommendations. Statistics in Medicine, 26, 3661-3675.

In logistic regression it is common to require "more than 10" events per predictor in the analysis, though some, including me, prefer "more than 15". Peduzzi et al. ran simulation studies and settled on 10:

Babyak, M. A. (2004). What you see may not be what you get: A brief, nontechnical introduction to overfitting in regression-type models. Psychosomatic Medicine, 66, 411-421.

Peduzzi, P., Concato, J., Kemper, E., Holford, T. R., & Feinstein, A. R. (1996). A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology, 49(12), 1373-1379.

I'd concentrate on Ian Campbell's papers and you'll find an answer... but you might not like it :(
Best Wishes,
Martin Holt
From: "Allan Lundy, PhD" <[hidden email]> To: [hidden email] Sent: Saturday, 12 June, 2010 22:40:38 Subject: Logistic Regression fails with empty cell Dear Listers, First, thanks to Martin Holt, Ryan Black, and Bruce Weaver for helpful comments on another recent logistic regression question. This one is much more basic, but very surprising (to me, anyway). I have 32 cases, divided into 16 and 16, with a dichotomous outcome. The data look like this: (Group is A or B; outcome is Yes or No) Yes No A 16 0 B 6 10 As you might expect, chi-square is highly significant: 14.5, p< .001. However, using this data in a binomial logistic regression with additional continuous predictor variables yielded weirdly high p values for Group: like p= .996. I eliminated the continuous predictors, so there was just the dichotomous predictor and dichotomous outcome. Results: The classification table showed overall correct classification as 81.3%. But Variable in the equation (Step1) was, for Group: B= 21.7, S.E.= 10048.2 Sig.= .998. Obviously the huge SE was what was making it non-significant. Finally decided the problem had to be the empty cell. I switched one of the outcome values and re-ran Yes No A 15 1 B 6 10 Results, of course, less significant chi-square, but the following for Group: B= 3.2, S.E.= 1.2 Sig.= .005. SPSS Help says, under Data Considerations: However, your solution may be more stable if your predictors have a multivariate normal distribution. Additionally, as with other forms of regression, multicollinearity among the predictors can lead to biased estimates and inflated standard errors. Inflated? I guess so, by about 10,000 times! It would have been nice if this section simply said, "Does not work with an empty cell." Anybody know a way around this problem that won't lose power? Remember, I want to include continuous predictors also. I have not tried it with plain MR, but I don't see why that would be different. Thanks! Allan Lundy, PhD |
|
Administrator
|
Hi Martin. I think Allan's problem with that table is that computation of the odds ratio entails division by 0.
          Yes   No
    A      16    0
    B       6   10

    OR = (16*10) / (0*6) = ERROR!

One common solution in this case is to add a small amount (usually 0.5) to each cell. IIRC, Agresti argues that this actually gives an improved estimate of the SE of ln(OR), even when division by 0 is not a problem. I don't know off the top of my head how one would make such an adjustment (i.e., adding 0.5 to each cell) when using the LOGISTIC REGRESSION procedure.

Allan, I do agree with Martin's comments on the number of "events" per explanatory variable. And given that you have only 10 "events", you're almost certainly overfitting the model with 2 (or more) explanatory variables.

HTH.
Bruce
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
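Putting rough numbers on Bruce's 0.5 suggestion for Allan's table (these are just the usual hand calculations on the corrected counts, not anything SPSS prints out):

    corrected cells:  a = 16.5, b = 0.5, c = 6.5, d = 10.5
    OR         = (16.5 * 10.5) / (0.5 * 6.5)  ≈ 53.3
    ln(OR)     ≈ 3.98
    SE[ln(OR)] = sqrt(1/16.5 + 1/0.5 + 1/6.5 + 1/10.5) ≈ 1.52
    Wald z     ≈ 3.98 / 1.52 ≈ 2.6   (two-sided p ≈ .009)

So with the correction the Group effect comes out clearly non-null again, rather than the B = 21.7, S.E. = 10048.2 result from the raw table.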
Bruce Weaver wrote:
> Hi Martin. I think Allan's problem with that table is that computation of
> the odds ratio entails division by 0.
>
>           Yes   No
>     A      16    0
>     B       6   10
>
>     OR = (16*10) / (0*6) = ERROR!
>
> One common solution in this case is to add a small amount (usually 0.5) to
> each cell. IIRC, Agresti argues that this actually gives an improved
> estimate of the SE of ln(OR), even when division by 0 is not a problem. I
> don't know off the top of my head how one would make such an adjustment
> (i.e., adding 0.5 to each cell) when using the LOGISTIC REGRESSION
> procedure.

Since only binary predictors are used, one solution I have used myself is to aggregate the dataset, add 0.5 to every frequency, weight by the new frequencies, and run LOGISTIC REGRESSION again. Also, I recall that LOGLINEAR had an option to add 0.5 to each cell automatically.

HTH,
Marta GG
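A sketch of what that could look like in syntax, with made-up variable names (group, outcome) coded 0/1. One caveat: if you AGGREGATE from the case-level file, the empty cell simply won't appear in the aggregated data, so here the cell counts are typed in directly:

* Cell counts entered by hand; the A/No cell is the empty one.
DATA LIST FREE / group outcome freq.
BEGIN DATA
1 1 16
1 0 0
0 1 6
0 0 10
END DATA.
* Add 0.5 to every cell, then weight by the corrected frequencies and re-fit.
COMPUTE freq = freq + 0.5.
EXECUTE.
WEIGHT BY freq.
LOGISTIC REGRESSION VARIABLES outcome
  /METHOD=ENTER group.

This only helps, of course, when all the predictors are categorical, as Marta says; it does not carry over to Allan's model with continuous covariates.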
A few observations, on a slightly different tack from previous replies.
1. When I run this analysis under Binary Logistic Regression, with Group A=1, Group B=2, 1=Yes, 2=No, I get an error message under Model Summary: "Estimation terminated at iteration number 20 because maximum iterations has been reached. Final solution cannot be found." When I increase the maximum iterations, I get a more obvious error message at the head of the output: "Estimation failed due to numerical problem. Possible reasons are: (1) at least one of the convergence criteria LCON, BCON is zero or too small, or (2) the value of EPS is too small (if not specified, the default value that is used may be too small for this data set)."; and SPSS does not attempt to calculate a p-value for Group.

2. These warnings might seem appropriate. However, if I code 2=Yes and 1=No, the analysis goes ahead with no warning, terminating at iteration 37 and giving even wilder results for Group. (It does not, however, seem to make any difference if I swap round the codings for Group, nor if I change the reference category, no matter which way round I code Yes and No.)

3. When I run the same data under Multinomial Logistic Regression, SPSS has equal trouble with the Parameter Estimates, including the odds ratio. However, this procedure also produces Likelihood Ratio tests, which do show p < .001 for the effect of Group. (But it also produces an error message at the top of the output saying the "validity of the model fit is uncertain", so presumably it would have been unwise to rely on these results.)

Mike Griffiths
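In case it helps anyone trying to reproduce Mike's points 1 and 3, the relevant bits of syntax are roughly as below; this is a sketch from memory rather than pasted output, and group and outcome are placeholder names:

* Point 1: raise the iteration limit from the default of 20.
LOGISTIC REGRESSION VARIABLES outcome
  /METHOD=ENTER group
  /CRITERIA=ITERATE(100).

* Point 3: the multinomial procedure, asking for the likelihood-ratio tests.
NOMREG outcome BY group
  /PRINT=PARAMETER SUMMARY LRT.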
Hosmer and Lemeshow's "Applied Logistic Regression" (2nd edition) discusses this and other numerical problems in its Section 4.5.