Logistic Regression and Unequal Distribution of Dependent Variable


Chao Yawo
Hello, I'm preparing to run a logit model predicting the odds of NOT
testing for an STD.

As you can see from the table below, 2934 (about 86%) of respondents have my outcome of interest (i.e., have not tested for an STD).

I realized that because of this unequal distribution of the dependent variable, all crosstabulations show higher proportions in the untested category, regardless of the distribution of the other variable.

I have a feeling that this could bias my estimates in some way, since the not-tested category seems over-represented. For example, given the unequal groupings, I think I am restricted to modeling failure to test (the zero outcome), as modeling ever tested (1) could lead to unstable estimates.

So my question is: is it worth producing any crosstabs showing the distribution of socio-demographic variables within my outcome of interest?

What possible impact will this have on my logistic model, and what can I do about it?  Thanks - Yawo

===================>
Table 1:

RECODE of |
V827      |
(Last     |
test was  |
on your   |
own,      |
offered   |  RECODE of V501 (Current
or        |      marital status)
required) |     0      1      2  Total
----------+---------------------------
 Not Test | 99.37   81.1  99.08  88.75
          |   514   1563    857   2934
          |
 Asked fo | .2992  1.015  .2525  .6992
          |     2     18      2     22
          |
  Offered | .2523  17.63  .1184  10.24
          |     3    427      1    431
          |
 Test Req | .0816   .253  .5512  .3114
          |     1      5      2      8
          |
    Total |   100    100    100    100
          |   520   2013    862   3395
--------------------------------------
  Key:  column percentages
        number of observations

Table 2:

RECODE of |
V827      |
(Last     |
test was  |
on your   |
own,      |
offered   |  RECODE of V106 (Highest
or        |     educational level)
required) |     0      1      2  Total
----------+---------------------------
 Not Test | 83.34  96.84   89.9  88.75
          |   724    273   1937   2934
          |
 Asked fo | .2094  1.662   .777  .6992
          |     2      4     16     22
          |
  Offered | 16.37  1.497  8.887  10.24
          |   209      3    219    431
          |
 Test Req | .0785      0  .4358  .3114
          |     1      0      7      8
          |
    Total |   100    100    100    100
          |   936    280   2179   3395
--------------------------------------
  Key:  column percentages
        number of observations

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Logistic Regression and Unequal Distribution of Dependent Variable

Hector Maletta
Yawo,
Your question betrays a subtle confusion regarding logistic regression, with
which we have dealt before in this forum occasionally. The confusion
concerns the idea that probability is a tool for predicting INDIVIDUAL
outcomes.
First of all let me clarify two points:
1. Logistic regression can be applied to events with whatever probability of
occurring, and not only to those with mid-level probabilities (around 0.5).
The fact that you have a 0.86 versus 0.14 distribution is no problem. The
only problem with a very skewed distribution concerns not logistic
regression itself but the accuracy of sample estimates: a sample estimate
of a very small proportion is more prone to sampling error, but I will not
deal with that problem here.
2. Applying log reg to the event of "not being tested" (p=0.86) is
equivalent to the opposite choice of applying it to the event of "being
tested" (p=0.14). The results would be equivalent: the coefficients keep
the same size and merely change sign.
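
A short derivation may make point 2 concrete. The symbols here are assumptions introduced for illustration: p is the probability of being tested, x_1 ... x_k are the predictors and beta_0 ... beta_k the coefficients of the model fitted to "being tested".

```latex
\begin{aligned}
\operatorname{logit}(p)   &= \ln\frac{p}{1-p}
                           = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k, \\
\operatorname{logit}(1-p) &= \ln\frac{1-p}{p}
                           = -\ln\frac{p}{1-p}
                           = -\beta_0 - \beta_1 x_1 - \dots - \beta_k x_k .
\end{aligned}
```

Each odds ratio exp(beta_j) therefore becomes its reciprocal, 1/exp(beta_j), while the standard errors, the size of the test statistics and the log-likelihood stay the same.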
Second, let us address the probability issue. It has become common practice
to use probabilities arising from logistic regression to estimate the
outcome of individual cases, then compare these estimated or predicted
outcomes to the actual ones, and use this comparison to evaluate the
goodness of fit of the procedure. To make a prediction, usually a subject is
predicted to experience the event if the probability is greater than 0.5.
Now, when the event is relatively rare, such as being tested in your data,
few subjects will have a probability of being tested that is greater than
0.5. In some cases none will, and the predicted number of subjects being
tested will be zero, when in fact 14% were tested. Conversely, when the
event is very common, such as "not being tested" (86%), it is quite
probable that almost everyone will have a probability above 0.5, and
therefore perhaps 100% will be predicted not to be tested when in fact
only 86% were not tested.
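
The effect of the 0.5 cutoff can be sketched with the raw counts from Table 1. This is only an illustration under stated assumptions: it uses the unweighted counts, collapses the three "tested" rows into one, and relies on the fact that a logit model with marital status as its only (categorical) predictor reproduces the observed group proportions as its fitted probabilities. The variable and function names are invented for the example.

```python
# Raw (unweighted) counts from Table 1, keyed by marital-status code:
# (not tested, tested), where "tested" = asked for + offered + required.
counts = {0: (514, 6), 1: (1563, 450), 2: (857, 5)}


def fitted_prob_tested(group):
    """Fitted P(tested) for a model with marital status as the only
    predictor: simply the observed proportion tested in that group."""
    not_tested, tested = counts[group]
    return tested / (not_tested + tested)


def classify_tested(group, cutoff=0.5):
    """Usual classification rule: predict 'tested' if P(tested) > cutoff."""
    return fitted_prob_tested(group) > cutoff


total = sum(n + t for n, t in counts.values())
observed_tested = sum(t for _, t in counts.values())
# Everyone in a group gets the same prediction, so count whole groups.
predicted_tested = sum(n + t for g, (n, t) in counts.items()
                       if classify_tested(g))
correct = sum(t if classify_tested(g) else n
              for g, (n, t) in counts.items())
accuracy = correct / total

for g in counts:
    print(f"marital status {g}: P(tested) = {fitted_prob_tested(g):.3f}")
print(f"observed tested:  {observed_tested} of {total}")
print(f"predicted tested: {predicted_tested} of {total}")
print(f"share classified correctly: {accuracy:.3f}")
```

No group has a fitted probability above 0.5, so nobody is predicted to be tested even though 461 respondents were, and the share classified correctly is just the share of untested respondents.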
The point is that probabilities are NOT about individuals, but about
populations. A population of individuals, your sample, has a 0.14
probability of having been tested. Perhaps a subpopulation (say males of a
given age group) taken as a group has a greater probability, such as 0.25.
These probabilities simply mean that the proportion of individuals tested in
each group is respectively 0.14 and 0.25, but that says nothing about each
individual.
Think of coins. If you throw 1000 fair coins, you will get about 50% tails
and 50% heads, and the probability of heads will be 0.5, but what about the
next coin? In fact, the next coin has no probabilistic attribute: it may be
heads or tails. Moreover, suppose the coins are somehow tricked into
favouring heads, so that they fall heads 65% of the time; even so, the next
coin is indeterminate: it may be heads or tails. It makes no sense to
attribute to the coin a hidden property called "probability" with a numeric
value of 0.5 or 0.65. What you can say is that 50% (or 65%) of a population
of coin throws will be heads and the rest tails. This is the "frequentist"
interpretation of probability, which is the predominant one in modern
scientific thinking about this matter. The opposite conception of
probability leads to a lot of contradictions. Probability, thus, is a
relative frequency, and no more than that.
What, then, of the use of probability as a predictive device? For instance,
a Dean of Admissions at a college may use SAT scores to admit candidates,
based on the probability that a high-score candidate will graduate rather
than drop out. But in fact the Dean knows nothing about each individual
candidate: thousands of things may happen to candidate John or Mary that
could cause them to drop out. But the Dean may confidently say that OUT OF
A LARGE NUMBER OF CANDIDATES with high SAT scores, the percentage of
dropouts will be lower than among a comparable number of candidates with
lower SAT scores. He is minimizing the number of dropouts IN THE POPULATION
OF ADMITTED CANDIDATES, but he cannot tell a thing about John or Mary.
Thus, even if in common language we say that John has "a high probability
of becoming a cum laude graduate", we in fact do not know. He may or may
not. Perhaps Peter, with a low SAT score, would have done better. The Dean
is only playing it safe by selecting only people with high SAT scores, even
knowing that some of them will fail, and (what is worse) knowing that among
those with lower SAT scores there are some hidden late bloomers, like
Albert Einstein, who would have blossomed in college; they are just
difficult to spot by looking at their application forms.
So, coming back to your problem:
1. Apply log reg to whichever event is of interest to you, either being or
not being tested.
2. Do not worry about the cross-classification of predicted and observed
outcomes. It means nothing.
3. To assess the adequacy of the model, use the other statistics available
to assess goodness of fit and significance.
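
As one possible reading of point 3, the sketch below computes a likelihood-ratio chi-square and McFadden's pseudo R-squared for a model with marital status as the only predictor, again from the raw Table 1 counts. The choice of these two statistics is an assumption made for illustration, not something prescribed above, and the names are invented for the example; because the model is saturated in marital status, its fitted probabilities are the observed group proportions, so no iterative fitting is needed.

```python
import math

# Raw (unweighted) counts from Table 1, keyed by marital-status code:
# (not tested, tested), where "tested" = asked for + offered + required.
counts = {0: (514, 6), 1: (1563, 450), 2: (857, 5)}


def binom_loglik(not_tested, tested, p):
    """Bernoulli log-likelihood of the counts when P(tested) = p."""
    return tested * math.log(p) + not_tested * math.log(1 - p)


n_not = sum(n for n, _ in counts.values())
n_yes = sum(t for _, t in counts.values())

# Intercept-only model: one common probability of being tested.
p_null = n_yes / (n_not + n_yes)
ll_null = binom_loglik(n_not, n_yes, p_null)

# Model with marital status: one probability per group.
ll_model = sum(binom_loglik(n, t, t / (n + t)) for n, t in counts.values())

lr_chi2 = 2 * (ll_model - ll_null)    # likelihood-ratio statistic
df = len(counts) - 1                  # two dummy variables
mcfadden_r2 = 1 - ll_model / ll_null  # McFadden's pseudo R-squared

print(f"LR chi-square = {lr_chi2:.1f} on {df} df")
print(f"McFadden pseudo R-squared = {mcfadden_r2:.3f}")
```

The statistic is compared with a chi-square distribution on 2 degrees of freedom. It does not depend on which of the two outcomes is coded as the event, nor on any 0.5 classification cutoff.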
Hector



Re: Logistic Regression and Unequal Distribution of Dependent Variable

Chao Yawo
Hector, thanks very much for your detailed remarks.  I clearly understand your point and won't worry too much about the skewed distribution, whichever of the two outcomes I model.

What I also want to find out is how to present the bivariate portion of my results.  In my field, it is usual to produce a crosstabulation of the relevant independent variables with the dependent variable.  All my tables so far show little variation in the distribution, since far more people have not been tested. For example, a larger proportion of the never married (99%) and the no longer married (99%) than of the married (81%, per Table 1) have not been tested.  For educational level, too, the proportions are skewed towards those who have not been tested.  So I was wondering whether there is any use in reporting these results given the skewed distribution, even though I am comparing categories of the independent variable.

Again, I appreciate your thoughts.

Y1964


--- On Sun, 3/8/09, Hector Maletta <[hidden email]> wrote:

> From: Hector Maletta <[hidden email]>
> Subject: RE: Logistic Regression and Unequal Distribution of Dependent Variable
> To: [hidden email], [hidden email]
> Date: Sunday, March 8, 2009, 4:26 PM
> Yawo,
> Your question betrays a subtle confusion regarding logistic
> regression, with
> which we have dealt before in this forum occasionally. The
> confusion
> concerns the idea that probability is a tool for predicting
> INDIVIDUAL
> outcomes.
> First of all let me clarify two points:
> 1. Logistic regression can be applied to events with
> whatever probability of
> occurring, and not only to those with mid-level
> probabilities (around 0.5).
> The fact that you have a 0.86 versus 0.14 distribution is
> no problem. The
> only problem with very skewed distribution does not concern
> logistic
> regression but accuracy of sample estimates: a sample
> estimate of a very
> small proportion is more prone to sampling error, but I
> will not deal with
> that problem here.
> 2. Applying log reg to the event of "not being
> tested" (p=0.86) is
> equivalent to the opposite choice of applying it to the
> event of "being
> tested" (p=0.14). The results would be equivalent
> (with opposite signs).
> Second, let us address the probability issue. It has become
> common practice
> to use probabilities arising from logistic regression to
> estimate the
> outcome of individual cases, then compare these estimated
> or predicted
> outcomes to the actual ones, and use this comparison to
> evaluate the
> goodness of fit of the procedure. To make a prediction,
> usually a subject is
> predicted to experience the event if the probability is
> greater than 0.5.
> Now, when the event is relatively rare, such as your being
> tested, few
> subjects will have a probability of being tested that is
> greater than 0.5.
> In some case, none will, and the predicted number of
> subject being tested
> will be zero, when in fact it was 14%. On the opposite
> side, when the event
> is very common, such as "not being tested", 86%,
> it is quite probably that
> almost everyone will have a probability above 0.5, and
> therefore perhaps
> 100% will be predicted not to be tested when in fact only
> 86% were not
> tested.
> The point is that probabilities are NOT about individuals,
> but about
> populations. A population of individuals, your sample, has
> a 0.14
> probability of having being tested. Perhaps a subpopulation
> (say males of a
> given age group) taken as a group has a greater
> probability, such as 0.25.
> These probabilities simply mean that the proportion of
> individuals tested in
> each group is respectively 0.14 and 0.25, but that says
> nothing about each
> individual.
> Think of coins. If you throw 1000 coins, you will get 50%
> tails and 50%
> heads, and the probability of heads will be 0.5, but what
> about the next
> coin? In fact, the next coin has no probabilistic
> attribute: it may be heads
> or tails. Moreover, suppose the coins are somehow tricked
> into favouring
> heads, so that they fall heads 65% of the time; even so,
> the next coin is
> indeterminate: it may be heads or tails. It makes no sense
> to attribute to
> the coin a hidden property called "probability"
> with a numeric value of 05
> or 0.65. What you can say is that 50% (or 65%) of a
> population of coin
> throws will be heads and the rest tails. This is the
> "frequentist"
> interpretation of probability, which is the predominant one
> in modern
> scientific thinking about this matter. The opposite
> conception of
> probability leads to a lot of contradictions. Probability,
> thus, is a
> relative frequency, and no more than that.
> What then, of the use of probability as a predictive
> device? For instance, a
> Dean of Admissions at a college may use SAT scores to admit
> candidates,
> based on the probability that a high-score candidate
> results in a college
> graduate instead of resulting in a graduate dropout. But in
> fact the Dean
> knows nothing about each individual candidate: thousands of
> things may
> happen to candidate John or Mary that could cause him to
> drop out. But the
> Dean may confidently say that OUT OF A LARGE NUMBER OF
> CANDIDATES with high
> SAT scores, the percentage of dropouts will be lower than
> the dropouts from
> a comparable number of candidates with lower SAT scores. He
> is minimizing
> the number of dropouts IN THE POPULATION OF ADMITTED
> CANDIDATES, but he
> cannot tell a thing about John or Mary.
> Thus, even in common language we say that John has "a
> high probability of
> becoming a cum laude graduate", we in fact do not
> know. He may or may not.
> Perhaps Peter, with a low SAT score, may have done better.
> The Dean is only
> playing it safe by selecting only people with high SAT
> scores, even knowing
> that some of them will fail, and (what is worst) knowing
> that among those
> with lower SAT scores there are some hidden late bloomers,
> like Albert
> Einstein, that would have blossomed in college; they are
> only difficult to
> spot by looking at their application forms.
> So, coming back to your problem:
> 1. Apply log reg to whatever is the event of your interest,
> either being or
> not being tested.
> 2. Do not care about the cross classification of predicted
> and observed
> outcome. It means nothing.
> 3. To assess the adequacy of the model use the other
> coefficients available
> to assess goodness of fit and significance.
> Hector
>
>
> -----Original Message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]]
> On Behalf Of
> Chao Yawo
> Sent: 08 March 2009 17:50
> To: [hidden email]
> Subject: Logistic Regression and Unequal Distribution of
> Dependent Variable
>
> Hello, I'm preparing to run a logit model predicting
> the odds of NOT
> testing for an STD.
>
> As you can see from the table below, 2934 (about 86%) of
> respondents have my
> outcome of interest (i.e., have not tested for an STD).
>
> I realized that because of this unequal unequal
> distribution of the
> dependent variable, all crosstabulations have higher
> proportions within the
> untested category of those who have not been tested,
> regardless of the
> distribution of the other variable.
>
> I have a feeling that these could bias my estimates in a
> way - since the
> not-tested category seemed over-estimated. For example,
> given the unequal
> groupings, I think I am only restricted to modeling failure
> to test (the
> zero outcome), as modeling for ever tested (1) could lead
> to unstable
> estimates.
>
> So my question is it worth producing any crosstabs showing
> the distribution
> of socio-demographic variables within my outcome of
> interest?
>
> What possible impact will this have on my logistic model,
> and what can I do
> about it?  Thanks - Yawo
>
> ===================>
> Table 1:
>
> RECODE of |
> V827      |
> (Last     |
> test was  |
> on your   |
> own,      |
> offered   |  RECODE of V501 (Current
> or        |      marital status)
> required) |     0      1      2  Total
> ----------+---------------------------
>  Not Test | 99.37   81.1  99.08  88.75
>           |   514   1563    857   2934
>           |
>  Asked fo | .2992  1.015  .2525  .6992
>           |     2     18      2     22
>           |
>   Offered | .2523  17.63  .1184  10.24
>           |     3    427      1    431
>           |
>  Test Req | .0816   .253  .5512  .3114
>           |     1      5      2      8
>           |
>     Total |   100    100    100    100
>           |   520   2013    862   3395
> --------------------------------------
>   Key:  column percentages
>         number of observations
> ----------------------------------
>
>
>
>
>
> Table 2:
>
> RECODE of |
> V827      |
> (Last     |
> test was  |
> on your   |
> own,      |
> offered   |  RECODE of V106 (Highest
> or        |     educational level)
> required) |     0      1      2  Total
> ----------+---------------------------
>  Not Test | 83.34  96.84   89.9  88.75
>           |   724    273   1937   2934
>           |
>  Asked fo | .2094  1.662   .777  .6992
>           |     2      4     16     22
>           |
>   Offered | 16.37  1.497  8.887  10.24
>           |   209      3    219    431
>           |
>  Test Req | .0785      0  .4358  .3114
>           |     1      0      7      8
>           |
>     Total |   100    100    100    100
>           |   936    280   2179   3395
> --------------------------------------
>   Key:  column percentages
>         number of observations
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body
> text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the
> command
> INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD