Hi all,

I am trying to specify a logistic regression model predicting a medical condition. In my sample the occurrence is 16.7 percent. So far I have run the regression using the default cut-off point for classification, and I am getting poor results: only around 20% of the cases that occur are predicted correctly. I have searched online and in several books for material on classification tables and cut-off points, but nothing I found is entirely clear to me. Is there a rule of thumb for this issue?

Thanking you all in advance for your help,

Anna
|
Hi Anna:
You should use ROC analysis to determine the optimal cut-off value. If you are using a multiple regression model, save the predicted probabilities and use that new variable as the test variable.

HTH,
Marta

--
For miscellaneous SPSS related statistical stuff, visit: http://gjyp.nl/marta/

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L. For a list of commands to manage subscriptions, send the command INFO REFCARD.
|
I am in agreement with Marta.
The optimal cutoff depends on the relative costs of making false positive vs. false negative errors. In some cases, you may want to set sensitivity at a pre-specified level, or, conversely, specificity, and then select the cutoff accordingly. Alternatively, you may want to find the cutoff that simultaneously maximizes sensitivity and specificity.

Scott R Millis, PhD, ABPP (CN,CL,RP), CStat, CSci
Professor & Director of Research
Dept of Physical Medicine & Rehabilitation
Dept of Emergency Medicine
Wayne State University School of Medicine
261 Mack Blvd
Detroit, MI 48201
Email: [hidden email]
Tel: 313-993-8085
Fax: 313-966-7682
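A note on the last option above: the cutoff that "simultaneously maximizes sensitivity and specificity" is usually taken to be the one maximizing Youden's J statistic (sensitivity + specificity - 1), and it can be computed by hand from the ROC coordinate points. Here is a minimal sketch in Python rather than SPSS syntax; the function name and the data are invented for illustration only:

```python
def youden_cutoff(y_true, y_prob):
    """Return the cutoff maximizing Youden's J = sensitivity + specificity - 1.

    Candidate cutoffs are the distinct observed predicted probabilities."""
    pos = sum(y_true)            # number of actual events
    neg = len(y_true) - pos      # number of actual non-events
    best_cut, best_j = None, -1.0
    for cut in sorted(set(y_prob)):
        # sensitivity = true positives / actual positives
        tp = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p >= cut)
        # specificity = true negatives / actual negatives
        tn = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p < cut)
        j = tp / pos + tn / neg - 1
        if j > best_j:
            best_j, best_cut = j, cut
    return best_cut, best_j

# Invented example data (NOT from Anna's study):
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80]
cut, j = youden_cutoff(y_true, y_prob)
print(f"best cutoff {cut} (Youden J = {j:.2f})")  # best cutoff 0.35 (Youden J = 0.80)
```

Note that Youden's J implicitly weights false positives and false negatives equally; if the costs differ (as they usually do in medicine), the cost-based reasoning above still applies.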
|
In reply to this post by annastella
Dear Anastella,

Logistic regression comes up with predicted values (the chance of occurrence of an event, e.g. the presence of a disease) between 0 and 1. SPSS can save those too, but in the case of missing values you may need a bit of a workaround: logistic regression only has LISTWISE deletion of missing values, so in some cases imputation may be necessary anyway in order not to lose too much of your sample size.

The problem (or trade-off) is that a higher cut-off will usually yield fewer false positives but more false negatives. What you could do is make an ROC curve, ask for the coordinate points, and choose the cut-off with the most desirable balance (this is obviously subjective) between false positives and false negatives. If 'Occur' is the actual occurrence of the event and 'Predicted' is the predicted probability rendered by logistic regression, you could run:

INPUT PROGRAM.
LOOP ID=1 TO 1000.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
COMPUTE Occur=RV.BERNOULLI(0.5).
COMPUTE Predicted=RV.NORMAL(0,0.2) + 0.3*Occur.
RECODE Predicted (LO THRU 0=0.05)(1 THRU HI=0.95)(ELSE=COPY).
ROC Predicted BY Occur (1)
  /PLOT = CURVE
  /PRINT = COORDINATES
  /CRITERIA = CUTOFF(INCLUDE) TESTPOS(LARGE) DISTRIBUTION(FREE) CI(95)
  /MISSING = EXCLUDE.

(The first commands just simulate example data; substitute your own Occur variable and saved predicted probabilities.)

Best regards,
Ruben van den Berg
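The coordinate-point inspection suggested above can also be framed as a simple error table: count false positives and false negatives at a few candidate cutoffs and pick the balance you can live with. A sketch in Python rather than SPSS syntax (function name and data are invented for illustration):

```python
def error_table(y_true, y_prob, cutoffs):
    """For each cutoff, count false positives (predicted 1, actual 0)
    and false negatives (predicted 0, actual 1)."""
    rows = []
    for cut in cutoffs:
        fp = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p >= cut)
        fn = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p < cut)
        rows.append((cut, fp, fn))
    return rows

# Invented data: actual outcomes and model-predicted probabilities.
y_true = [0, 0, 1, 0, 1, 0, 1, 0]
y_prob = [0.10, 0.20, 0.25, 0.30, 0.55, 0.60, 0.70, 0.90]
for cut, fp, fn in error_table(y_true, y_prob, [0.2, 0.5, 0.8]):
    print(f"cutoff {cut:.1f}: {fp} false positives, {fn} false negatives")
# prints:
# cutoff 0.2: 4 false positives, 0 false negatives
# cutoff 0.5: 2 false positives, 1 false negatives
# cutoff 0.8: 1 false positives, 3 false negatives
```

The output illustrates the trade-off: raising the cutoff trades false positives for false negatives, and which point is "best" depends on the relative costs of the two errors.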
|
I agree with Marta’s and Ruben’s advice about using ROC, and with their contention that a cost-benefit analysis between the chances of false positives and false negatives is in order. However, I have a more fundamental consideration to make. To put it bluntly, I do not think that logistic regression should be used to predict individual cases.

The reason lies in the nature of probabilities. When you predict a probability, what you are predicting is that, among people with such and such characteristics (i.e. sharing certain values of the predictors), a proportion p of them will have the outcome; in your case, a proportion p of them will have the medical condition in question. You do NOT know who in particular will get ill within that group: you only predict the proportion. It is just as with tossing coins: when you toss a large number of coins, you know that about 50% will be heads and 50% tails, but you do not know which particular toss will turn either way. In fact, the outcome of any particular toss is strictly INDETERMINATE. The same happens within people belonging to the same group (i.e. having the same combination of values of the predictors) in your medical analysis. You may know that, say, among women in a certain age interval with such and such medical background the probability is 0.20, whilst in another group (males, another age group, another background) the probability is 0.40, but you know exactly nothing more: you do not know (and CANNOT know with this kind of information) whether Mary or Ingrid will be ill in the first group, or whether Jack or Tom will have the medical condition in the second.

The ‘classification table’ intrinsically entails a decision cutoff point: you gamble that anybody with p>T (where T is the cutoff probability) WILL have the outcome and anybody below will not, but that gamble is totally alien to the nature of the problem. In fact, you know in advance that within each group defined by a certain combination of predictors, some (indeterminate) subjects will get the outcome and some will not, and moreover, you know that you are fundamentally incapable of telling which. You know that chain-smoking increases the chances of a premature death, and you may know by how much, but you also know that some chain-smokers live many years (like, say, Winston Churchill), and looking at the information in your data set you cannot tell who will have lung cancer or angina pectoris next month and who will happily survive to a ripe old age, cigars and all.

A more realistic, and conceptually sounder, test of the accuracy of a logistic regression model would be not checking on individuals, but checking whether the actual rates of occurrence match the predicted probabilities. For instance, you may break up the total sample into a number of more homogeneous subgroups (say by sex, age, and medical background), and check whether the actual proportion with the outcome in each group matches the predicted proportion for that group emerging from the logistic regression.

Philosophically, this position looks at probabilities as (essentially) relative frequencies, and not as intrinsic attributes of individuals. You start your analysis with a population where the relative frequency of the outcome is, say, 20%. This does not tell you the names of those that will be among that 20%. Logistic regression will not tell you either: it will only help you split the population into groups with different relative frequencies, that’s all. When logistic regression (based on predictors) assigns a probability of 0.70 to Jack Smith, this tells you nothing about Jack as an individual: he may or may not develop the medical condition. It is only putting him within a group of people among which 70% are predicted to have the outcome, i.e. a group where the relative frequency of your medical condition is predicted to be 70%. You cannot know in advance who will be among the unlucky 70%, and who will not. You could only check whether the group sharing Jack’s characteristics actually shows a frequency of 70%. If there is a good correlation between the actual and predicted relative frequencies for the various groups, then your logistic regression model is good. You would still be at a loss about Jack’s or Mary’s fate, no matter the group where they end up or the probability attached to each group. You may say that Jack is “more likely” to get ill than Mary, but that is only a manner of speaking: what you actually mean is that Jack belongs in a group with a higher relative frequency of occurrence.

These philosophical ramblings may not assist you in the choice of a cutoff point. They are only intended to show that such cutoff points are, in a fundamental sense, pointless.

Hector
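Hector’s proposed check (do observed occurrence rates within subgroups match the predicted probabilities?) amounts to a calibration table, in the same spirit as the Hosmer-Lemeshow grouping. A rough Python sketch on simulated data; everything here is invented, and in practice the probabilities would come from the saved predicted values of the fitted model:

```python
import random

def calibration_table(y_true, y_prob, n_bins=4):
    """Sort cases by predicted probability, split into equal-sized groups,
    and compare each group's mean predicted probability with its observed
    event rate. A well-calibrated model shows the two tracking closely."""
    paired = sorted(zip(y_prob, y_true))
    size = len(paired) // n_bins
    table = []
    for i in range(n_bins):
        # last bin absorbs any remainder cases
        chunk = paired[i * size:(i + 1) * size] if i < n_bins - 1 else paired[i * size:]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(t for _, t in chunk) / len(chunk)
        table.append((round(mean_pred, 3), round(obs_rate, 3)))
    return table

# Simulated data: the true event probability equals the "predicted" value,
# so observed rates should match predictions up to sampling noise.
random.seed(1)
y_prob = [random.random() for _ in range(2000)]
y_true = [1 if random.random() < p else 0 for p in y_prob]
for mean_pred, obs_rate in calibration_table(y_true, y_prob):
    print(f"mean predicted {mean_pred:.3f}  observed rate {obs_rate:.3f}")
```

Note that this checks only calibration (predicted vs. observed frequencies), not discrimination; a model can be well calibrated and still, as Hector argues, say nothing about any particular individual’s fate.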
