Hi all,

I am trying to specify a logistic regression model predicting a medical condition. In my sample the occurrence is 16.7 percent. So far I have run the regression using the default cut-off point for classification, and I am getting poor results: only around 20% of the cases that occur are predicted correctly. I have searched online and in several books for material on classification tables and cut-off points, but nothing I found is entirely clear to me. Is there a rule of thumb for this issue?

Thanking you all in advance for your help,

Anna
|
Hi Anna:
You should use ROC analysis to determine the optimal cut-off value. If you are using a multiple regression model, save the predicted probabilities and use that new variable as the test variable.

HTH,
Marta

--
For miscellaneous SPSS related statistical stuff, visit: http://gjyp.nl/marta/

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L. For a list of commands to manage subscriptions, send the command INFO REFCARD.
|
I am in agreement with Marta.
The optimal cutoff depends on the relative costs of making false positive vs. false negative errors. In some cases, you may want to set sensitivity at a pre-specified level, or, conversely, specificity, and then select the cutoff accordingly. Alternatively, you may want to find the cutoff that simultaneously maximizes sensitivity and specificity.

Scott R Millis, PhD, ABPP (CN,CL,RP), CStat, CSci
Professor & Director of Research
Dept of Physical Medicine & Rehabilitation
Dept of Emergency Medicine
Wayne State University School of Medicine
261 Mack Blvd
Detroit, MI 48201
Email: [hidden email]
Tel: 313-993-8085
Fax: 313-966-7682
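A note on the last option above: the cutoff that "simultaneously maximizes sensitivity and specificity" is usually taken to be the one maximizing Youden's J statistic (sensitivity + specificity - 1), and it can be computed by hand from the ROC coordinate points. Here is a minimal sketch in Python rather than SPSS syntax; the function name and the data are invented for illustration only:

```python
def youden_cutoff(y_true, y_prob):
    """Return the cutoff maximizing Youden's J = sensitivity + specificity - 1.

    Candidate cutoffs are the distinct observed predicted probabilities."""
    pos = sum(y_true)            # number of actual events
    neg = len(y_true) - pos      # number of actual non-events
    best_cut, best_j = None, -1.0
    for cut in sorted(set(y_prob)):
        # sensitivity = true positives / actual positives
        tp = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p >= cut)
        # specificity = true negatives / actual negatives
        tn = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p < cut)
        j = tp / pos + tn / neg - 1
        if j > best_j:
            best_j, best_cut = j, cut
    return best_cut, best_j

# Invented example data (NOT from Anna's study):
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80]
cut, j = youden_cutoff(y_true, y_prob)
print(f"best cutoff {cut} (Youden J = {j:.2f})")  # best cutoff 0.35 (Youden J = 0.80)
```

Note that Youden's J implicitly weights false positives and false negatives equally; if the costs differ (as they usually do in medicine), the cost-based reasoning above still applies.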
|
In reply to this post by annastella
Dear Anastella,

Logistic regression comes up with predicted values (the chance of occurrence of an event, e.g. the presence of a disease) between 0 and 1. SPSS can save those too, but in the case of missing values you may need a bit of a workaround: logistic regression only has LISTWISE deletion of missing values, so in some cases imputation may be necessary anyway in order not to lose too much of your sample size.

The problem (or trade-off) is that a higher cut-off will usually yield fewer false positives but more false negatives. What you could do is make an ROC curve, ask for the coordinate points, and choose the cut-off with the most desirable balance (this is obviously subjective) between false positives and false negatives. If 'Occur' is the actual occurrence of the event and 'Predicted' is the predicted probability rendered by logistic regression, you could run:

INPUT PROGRAM.
LOOP ID=1 TO 1000.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
COMPUTE Occur=RV.BERNOULLI(0.5).
COMPUTE Predicted=RV.NORMAL(0,0.2) + 0.3*Occur.
RECODE Predicted (LO THRU 0=0.05)(1 THRU HI=0.95)(ELSE=COPY).
ROC Predicted BY Occur (1)
  /PLOT = CURVE
  /PRINT = COORDINATES
  /CRITERIA = CUTOFF(INCLUDE) TESTPOS(LARGE) DISTRIBUTION(FREE) CI(95)
  /MISSING = EXCLUDE.

(The first commands just simulate example data; substitute your own Occur variable and saved predicted probabilities.)

Best regards,
Ruben van den Berg
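The coordinate-point inspection suggested above can also be framed as a simple error table: count false positives and false negatives at a few candidate cutoffs and pick the balance you can live with. A sketch in Python rather than SPSS syntax (function name and data are invented for illustration):

```python
def error_table(y_true, y_prob, cutoffs):
    """For each cutoff, count false positives (predicted 1, actual 0)
    and false negatives (predicted 0, actual 1)."""
    rows = []
    for cut in cutoffs:
        fp = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p >= cut)
        fn = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p < cut)
        rows.append((cut, fp, fn))
    return rows

# Invented data: actual outcomes and model-predicted probabilities.
y_true = [0, 0, 1, 0, 1, 0, 1, 0]
y_prob = [0.10, 0.20, 0.25, 0.30, 0.55, 0.60, 0.70, 0.90]
for cut, fp, fn in error_table(y_true, y_prob, [0.2, 0.5, 0.8]):
    print(f"cutoff {cut:.1f}: {fp} false positives, {fn} false negatives")
# prints:
# cutoff 0.2: 4 false positives, 0 false negatives
# cutoff 0.5: 2 false positives, 1 false negatives
# cutoff 0.8: 1 false positives, 3 false negatives
```

The output illustrates the trade-off: raising the cutoff trades false positives for false negatives, and which point is "best" depends on the relative costs of the two errors.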
|
I agree with Marta’s and Ruben’s advice about using ROC, and with their contention that a cost-benefit analysis between the chances of false positives and false negatives is in order. However, I have a more fundamental consideration to make. To put it bluntly, I do not think that logistic regression should be used to predict individual cases.

The reason lies in the nature of probabilities. When you predict a probability, what you are predicting is that, among people with such and such characteristics (i.e. sharing certain values of the predictors), a proportion p of them will have the outcome; in your case, a proportion p of them will have the medical condition in question. You do NOT know who in particular will get ill within that group: you only predict the proportion. It is just as with tossing coins: when you toss a large number of coins, you know that about 50% will be heads and 50% tails, but you do not know which particular toss will turn either way. In fact, the outcome of any particular toss is strictly INDETERMINATE. The same happens within people belonging to the same group (i.e. having the same combination of values of the predictors) in your medical analysis. You may know that, say, among women in a certain age interval with such and such medical background the probability is 0.20, whilst in another group (males, another age group, another background) the probability is 0.40, but you know exactly nothing more: you do not know (and CANNOT know with this kind of information) whether Mary or Ingrid will be ill in the first group, or whether Jack or Tom will have the medical condition in the second.

The ‘classification table’ intrinsically entails a decision cutoff point: you gamble that anybody with p>T (where T is the cutoff probability) WILL have the outcome and anybody below will not, but that gamble is totally alien to the nature of the problem. In fact, you know in advance that within each group defined by a certain combination of predictors, some (indeterminate) subjects will get the outcome and some will not, and moreover, you know that you are fundamentally incapable of telling which. You know that chain-smoking increases the chances of a premature death, and you may know by how much, but you also know that some chain-smokers live many years (like, say, Winston Churchill), and looking at the information in your data set you cannot tell who will have lung cancer or angina pectoris next month and who will happily survive to a ripe old age, cigars and all.

A more realistic, and conceptually sounder, test of the accuracy of a logistic regression model would be not checking on individuals, but checking whether the actual rates of occurrence match the predicted probabilities. For instance, you may break up the total sample into a number of more homogeneous subgroups (say by sex, age, and medical background), and check whether the actual proportion with the outcome in each group matches the predicted proportion for that group emerging from the logistic regression.

Philosophically, this position looks at probabilities as (essentially) relative frequencies, and not as intrinsic attributes of individuals. You start your analysis with a population where the relative frequency of the outcome is, say, 20%. This does not tell you the names of those that will be among that 20%. Logistic regression will not tell you either: it will only help you split the population into groups with different relative frequencies, that’s all. When logistic regression (based on predictors) assigns a probability of 0.70 to Jack Smith, this tells you nothing about Jack as an individual: he may or may not develop the medical condition. It is only putting him within a group of people among which 70% are predicted to have the outcome, i.e. a group where the relative frequency of your medical condition is predicted to be 70%. You cannot know in advance who will be among the unlucky 70%, and who will not. You could only check whether the group sharing Jack’s characteristics actually shows a frequency of 70%. If there is a good correlation between the actual and predicted relative frequencies for the various groups, then your logistic regression model is good. You would still be at a loss about Jack’s or Mary’s fate, no matter the group where they end up or the probability attached to each group. You may say that Jack is “more likely” to get ill than Mary, but that is only a manner of speaking: what you actually mean is that Jack belongs in a group with a higher relative frequency of occurrence.

These philosophical ramblings may not assist you in the choice of a cutoff point. They are only intended to show that such cutoff points are, in a fundamental sense, pointless.

Hector
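Hector’s proposed check (do observed occurrence rates within subgroups match the predicted probabilities?) amounts to a calibration table, in the same spirit as the Hosmer-Lemeshow grouping. A rough Python sketch on simulated data; everything here is invented, and in practice the probabilities would come from the saved predicted values of the fitted model:

```python
import random

def calibration_table(y_true, y_prob, n_bins=4):
    """Sort cases by predicted probability, split into equal-sized groups,
    and compare each group's mean predicted probability with its observed
    event rate. A well-calibrated model shows the two tracking closely."""
    paired = sorted(zip(y_prob, y_true))
    size = len(paired) // n_bins
    table = []
    for i in range(n_bins):
        # last bin absorbs any remainder cases
        chunk = paired[i * size:(i + 1) * size] if i < n_bins - 1 else paired[i * size:]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(t for _, t in chunk) / len(chunk)
        table.append((round(mean_pred, 3), round(obs_rate, 3)))
    return table

# Simulated data: the true event probability equals the "predicted" value,
# so observed rates should match predictions up to sampling noise.
random.seed(1)
y_prob = [random.random() for _ in range(2000)]
y_true = [1 if random.random() < p else 0 for p in y_prob]
for mean_pred, obs_rate in calibration_table(y_true, y_prob):
    print(f"mean predicted {mean_pred:.3f}  observed rate {obs_rate:.3f}")
```

Note that this checks only calibration (predicted vs. observed frequencies), not discrimination; a model can be well calibrated and still, as Hector argues, say nothing about any particular individual’s fate.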
