Logistic Regression fails with empty cell


Allan Lundy, PhD

Dear Listers,
First, thanks to Martin Holt, Ryan Black, and Bruce Weaver for helpful comments on another recent logistic regression question.

This one is much more basic, but very surprising (to me, anyway).  I have 32 cases, divided into 16 and 16, with a dichotomous outcome.  The data look like this:
(Group is A or B; outcome is Yes or No)

       Yes    No
A       16     0
B         6   10

As you might expect, chi-square is highly significant:  14.5, p< .001.
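(The 14.5 can be checked by hand from the expected counts. A quick sketch in plain Python, nothing SPSS-specific:)

```python
# Pearson chi-square for the 2x2 table above, computed from expected counts.
table = [[16, 0], [6, 10]]           # rows: Group A, B; columns: Yes, No
row = [sum(r) for r in table]        # row totals: 16, 16
col = [sum(c) for c in zip(*table)]  # column totals: 22, 10
n = sum(row)                         # 32 cases

chi2 = sum((table[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in range(2) for j in range(2))
print(round(chi2, 1))  # 14.5, matching the reported value
```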

However, using this data in a binomial logistic regression with additional continuous predictor variables yielded weirdly high p values for Group: like p= .996.

I eliminated the continuous predictors, so there was just the dichotomous predictor and dichotomous outcome.  Results:
The classification table showed overall correct classification as 81.3%.

But the Variables in the Equation table (Step 1) showed, for Group:
B = 21.7, S.E. = 10048.2, Sig. = .998.
Obviously the huge SE was what was making it non-significant.
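(The mechanism behind the huge SE can be sketched in a few lines of Python: with the zero cell, the log-likelihood keeps improving as the Group coefficient grows without bound, so the maximum lies at infinity and the printed B and S.E. merely reflect where iteration stopped. This is a sketch, not SPSS output; for simplicity the intercept is held at the finite Group-B log-odds.)

```python
import math

# Cells of the 2x2 table as (x = 1 for Group A, y = 1 for Yes, count).
cells = [(1, 1, 16), (1, 0, 0), (0, 1, 6), (0, 0, 10)]

def loglik(b0, b1):
    """Log-likelihood of logit(p) = b0 + b1*x over the weighted cells."""
    ll = 0.0
    for x, y, w in cells:
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        ll += w * math.log(p if y else 1.0 - p)
    return ll

b0 = math.log(6 / 10)  # Group B log-odds; this part of the fit is finite
for b1 in (2.0, 5.0, 10.0, 20.0):
    print(b1, loglik(b0, b1))
# The log-likelihood increases at every step and approaches, but never
# reaches, its supremum: the maximum lies at b1 = infinity.
```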

I finally decided the problem had to be the empty cell, so I switched one of the outcome values and re-ran:

       Yes    No
A       15     1
B         6   10

Results: the chi-square is less significant, of course, but here is what I get for Group:
B= 3.2, S.E.= 1.2   Sig.= .005.

SPSS Help says, under Data Considerations:

However, your solution may be more stable if your predictors have a multivariate normal distribution. Additionally, as with other forms of regression, multicollinearity among the predictors can lead to biased estimates and inflated standard errors.

Inflated?  I guess so, by about 10,000 times!  It would have been nice if this section simply said, "Does not work with an empty cell."

Anybody know a way around this problem that won't lose power?  Remember, I want to include continuous predictors also.  I have not tried it with plain MR, but I don't see why that would be different.

Thanks!

Allan Lundy, PhD
Research Consulting
[hidden email]

Business & Cell (any time): 215-820-8100
Home (8am-10pm, 7 days/week): 215-885-5313
Address:  108 Cliff Terrace, Wyncote, PA 19095
Visit my Web site at www.dissertationconsulting.net

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L. For a list of commands to manage subscriptions, send the command INFO REFCARD.


Re: Logistic Regression fails with empty cell

Martin Holt
Hi Allan,
Remember that it's the **expected** counts that matter, rather than the actual counts.

The following link is excellent, taking you into and through and out the other side on 2x2 tables: http://www.iancampbell.co.uk/twobytwo/methods.htm. It expands on the methods section published in the paper: Campbell, Ian, 2007, Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample recommendations, Statistics in Medicine, 26, 3661-3675.

In a logistic regression it is common to accept "more than 10" events per factor in the analysis, yet some, including me, prefer "more than 15". Peduzzi et al. ran simulation studies and settled on 10:
 
Michael A. Babyak. What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models. Psychosom Med 2004; 66: 411-421.

Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol. 1996 Dec;49(12):1373-9.

I'd concentrate on Ian Campbell's papers and you'll find an answer... but you might not like it :(
 
Best Wishes,
 
Martin Holt


Re: Logistic Regression fails with empty cell

Bruce Weaver
Hi Martin.  I think Allan's problem with that table is that computation of the odds ratio entails division by 0.  

       Yes    No
A       16     0
B         6   10

OR = (16*10) / (0*6) = ERROR!

One common solution in this case is to add a small amount (usually 0.5) to each cell.  IIRC, Agresti argues that this actually gives an improved estimate of the SE of ln(OR), even when division by 0 is not a problem.  I don't know off the top of my head how one would make such an adjustment (i.e., adding 0.5 to each cell) when using the LOGISTIC REGRESSION procedure.
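For concreteness, here is what the 0.5 adjustment gives for this table; a quick sketch in plain Python (the SE formula is the usual Woolf estimate for ln(OR)):

```python
import math

a, b, c, d = 16, 0, 6, 10                     # A-Yes, A-No, B-Yes, B-No
a, b, c, d = (x + 0.5 for x in (a, b, c, d))  # add 0.5 to every cell

odds_ratio = (a * d) / (b * c)                # (16.5 * 10.5) / (0.5 * 6.5)
se_ln_or = math.sqrt(1/a + 1/b + 1/c + 1/d)   # Woolf SE of ln(OR)
print(round(odds_ratio, 1), round(se_ln_or, 2))  # 53.3 1.52
```

So the corrected table gives a finite ln(OR) of about 3.98 with a sensible SE, instead of the runaway 21.7 / 10048.2.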

Allan, I do agree with Martin's comments on the number of "events" per explanatory variable.  And given that you have only 10 "events", you're almost certainly overfitting the model with 2 (or more) explanatory variables.

HTH.
Bruce


--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."


Re: Logistic Regression fails with empty cell

Marta Garcia-Granero
Hi:

Since only binary predictors are used, one solution I have used myself is to aggregate the dataset, add 0.5 to every cell frequency, weight by the new frequencies, and run LOGISTIC REGRESSION again.

Also, I recall that LOGLINEAR had an option to add 0.5 to each cell automatically.
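To illustrate what the 0.5-adjusted, weighted refit buys you, here is a minimal sketch in plain Python rather than SPSS syntax (the weighted Newton-Raphson below is a stand-in for what LOGISTIC REGRESSION does with WEIGHT in effect; the cell frequencies are Allan's counts plus 0.5):

```python
import math

# Aggregated cells with 0.5 added: (x = Group A dummy, y = Yes, weight).
cells = [(1, 1, 16.5), (1, 0, 0.5), (0, 1, 6.5), (0, 0, 10.5)]

def fit(cells, iters=30):
    """Weighted Newton-Raphson MLE for logit(p) = b0 + b1*x."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, w in cells:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += w * (y - p)           # score (gradient) components
            g1 += w * (y - p) * x
            v = w * p * (1.0 - p)       # information (negative Hessian)
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01     # invert the 2x2 information matrix
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    se_b1 = math.sqrt(h00 / det)        # SE from the inverse information
    return b0, b1, se_b1

b0, b1, se = fit(cells)
print(round(b1, 3), round(se, 3))  # finite B = ln(53.3) with a sensible SE
```

With the adjusted weights the iterations settle on a finite B of about 3.98 with S.E. about 1.52 (which matches the Woolf SE of ln(OR) from the corrected table), in place of the runaway 21.7 / 10048.2.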

HTH,
Marta GG


Re: Logistic Regression fails with empty cell

Mike Griffiths
In reply to this post by Allan Lundy, PhD
A few observations, on a slightly different tack from previous replies.
 
1.  When I run this analysis, under Binary Logistic Regression, with Group A=1, Group B=2, 1=Yes, 2=No, I get an error message under Model Summary: "Estimation terminated at iteration number 20 because maximum iterations has been reached. Final solution cannot be found."  When I increase the maximum iterations, I get a more obvious error message at the head of the output: "Estimation failed due to numerical problem. Possible reasons are: (1) at least one of the convergence criteria LCON, BCON is zero or too small, or (2) the value of EPS is too small (if not specified, the default value that is used may be too small for this data set)."  SPSS then does not attempt to calculate a p-value for Group.

2.  These warnings might seem appropriate.  However, if I code 2=Yes and 1=No, the analysis goes ahead with no warning, terminating at iteration 37 and giving even wilder results for Group.  (It does not, however, seem to make any difference if I swap round the codings for Group, nor if I change the reference category, no matter which way round I code Yes and No.)

3.  When I run the same data under Multinomial Logistic Regression, SPSS has equal trouble with the Parameter Estimates, including the odds ratio.  However, this procedure also produces Likelihood Ratio tests, which do show p<.001 for the effect of Group.  (But it also produces an error message at the top of the output saying the "validity of the model fit is uncertain", so presumably it would have been unwise to rely on these results.)
 
Mike Griffiths


Re: Logistic Regression fails with empty cell

Anthony Babinec

Hosmer and Lemeshow, “Applied Logistic Regression” (2nd edition), discusses this and other numerical problems in its Section 4.5.

Tony Babinec
[hidden email]
