Logistic Regression Question


Logistic Regression Question

Guerrero, Rodrigo

I forgot to change the subject line.  Sorry for the inconvenience.

 

 

From: Guerrero, Rodrigo
Sent: Friday, July 24, 2009 9:13 AM
To: [hidden email]
Subject: RE: SPSS-Cluster Analysis-Query

 

Hello all,

 

I have two statistical questions about logistic regression, although they are not necessarily SPSS related.  I am building a model to predict response to our direct mail campaign, and it has both ratio-level and categorical independent variables.  Some of the categorical variables have missing values.  For some variables, more than half of the cases are missing.

 

1) What do you think the implications are of treating missing as a valid value and including those records in the equation?

2) I am concerned about overfitting the model.  How can I test the model to avoid this pitfall?
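
To make question 1 concrete, the sketch below shows what "treating missing as a valid value" amounts to in dummy coding: the missing cases get their own indicator column instead of being dropped. The variable name, its levels, and the data are hypothetical and only serve as an illustration; this is not a recommendation for or against the approach.

```python
# Hypothetical illustration: variable name, levels, and values are made up.

def dummy_code(values, levels, reference):
    """Return the indicator column names and one 0/1 row per value.

    The reference level gets no column, as in ordinary dummy coding.
    """
    columns = [lv for lv in levels if lv != reference]
    rows = [[1 if v == lv else 0 for lv in columns] for v in values]
    return columns, rows

# None marks a missing value in a hypothetical categorical predictor.
home_owner = ["yes", "no", None, "yes", None, None]

# Option A: recode missing as its own level, so every case stays in the model.
recoded = ["missing" if v is None else v for v in home_owner]
cols, rows = dummy_code(recoded, levels=["no", "yes", "missing"], reference="no")
print(cols)
print(rows)

# Option B: listwise deletion keeps only the complete cases.
complete = [v for v in home_owner if v is not None]
print(len(recoded), "cases kept with a missing level;",
      len(complete), "cases kept with listwise deletion")
```

With option A the coefficient on the "missing" column compares the cases with missing values to the reference level, so its meaning depends on why the values are missing.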

 

 

Thank you very much for your help.

 

 

Rodrigo

 

 

Rodrigo A. Guerrero | Director Of Marketing Research and Analysis | The Scooter Store | 830.627.4317

 

 


The information transmitted is intended only for the addressee(s) and may contain confidential or privileged material, or both. Any review, receipt, dissemination or other use of this information by non-addressees is prohibited. If you received this in error or are a non-addressee, please contact the sender and delete the transmitted information.
Re: Logistic Regression Question

SR Millis-3
Rodrigo,

1.  I think it depends on the amount and pattern of missing data.  Have you considered multiple imputation?

2.  Regarding overfitting, you need to consider the number of covariates in your model relative to your sample size.  Based on models validated on independent datasets and on simulation studies, sample size requirements are expressed as events per variable (EPV).  Several studies (Harrell FE Jr., Lee KL, Califf RM, Pryor DB, Rosati RA, 1984; Harrell FE Jr., Lee KL, Mark DB, 1996; Harrell FE Jr., Lee KL, Matchar DB, & Reichert TA, 1985) have shown that the minimum EPV for obtaining reliable predictions is 10.  For binary outcome variables, the number of events that limits the EPV is the size of the smaller of the two outcome groups (Harrell F, 2001).  After fitting your model, you can use bootstrapping to estimate the degree of over-optimism in your model.  Harrell (2001) provides some excellent demonstrations of how to do this in R.
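
The events-per-variable arithmetic in point 2 can be sketched as follows. The mailing size, response count, and number of candidate parameters are hypothetical, and the sketch assumes that each dummy variable of a categorical predictor counts as one parameter.

```python
def events_per_variable(n_events, n_nonevents, n_candidate_params):
    """Size of the smaller outcome group divided by the number of candidate parameters."""
    limiting_sample_size = min(n_events, n_nonevents)
    return limiting_sample_size / n_candidate_params

# Hypothetical campaign: 50,000 pieces mailed, 1,000 responders,
# 25 candidate parameters (assumption: each dummy variable counts as one).
n_mailed = 50_000
n_responders = 1_000
n_params = 25

epv = events_per_variable(n_responders, n_mailed - n_responders, n_params)
print(epv)  # 40.0

# Largest number of candidate parameters that still satisfies EPV >= 10.
max_params = min(n_responders, n_mailed - n_responders) // 10
print(max_params)  # 100
```

Note that the limit is driven by the 1,000 responders, not by the 50,000 pieces mailed.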


Harrell F (2001). Regression modeling strategies: With applications to linear models, logistic regression, and survival analysis. New York: Springer-Verlag.
Harrell FE Jr., Lee KL, Califf RM, Pryor DB, Rosati RA (1984). Regression modelling strategies for improved prognostic prediction. Stat Med, 3(2), 143-152.
Harrell FE Jr., Lee KL, Mark DB (1996). Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med, 15(4), 361-387.
Harrell FE Jr., Lee KL, Matchar DB, & Reichert TA (1985). Regression models for prognostic prediction: advantages, problems, and suggested solutions. Cancer Treat Rep, 69(10), 1071-1077.
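
The bootstrap estimate of over-optimism mentioned in point 2 can be sketched roughly as below. This is not Harrell's R code. It is a minimal pure-Python illustration that assumes synthetic data, uses the c-statistic (AUC) as the performance measure, fits the model by plain gradient ascent rather than a production routine, and uses only 40 replicates to keep the run short. All function names are made up for the example.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, iters=100, lr=1.0):
    """Fit intercept and slopes by gradient ascent on the mean log-likelihood."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)  # w[0] is the intercept
    for _ in range(iters):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w

def predict(w, X):
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]

def auc(scores, y):
    """Share of (event, non-event) pairs ranked correctly; ties count one half."""
    pos = [s for s, yi in zip(scores, y) if yi == 1]
    neg = [s for s, yi in zip(scores, y) if yi == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def optimism_corrected_auc(X, y, n_boot=40, seed=1):
    rng = random.Random(seed)
    n = len(y)
    apparent = auc(predict(fit_logistic(X, y), X), y)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # skip a replicate that contains only one outcome class
        wb = fit_logistic(Xb, yb)
        # Performance on the bootstrap sample minus performance on the original data.
        diffs.append(auc(predict(wb, Xb), yb) - auc(predict(wb, X), y))
    optimism = sum(diffs) / len(diffs)
    return apparent, optimism, apparent - optimism

# Synthetic data: one informative predictor and two pure-noise predictors.
data_rng = random.Random(0)
X = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(120)]
y = [1 if data_rng.random() < sigmoid(-0.5 + 1.0 * xi[0]) else 0 for xi in X]

apparent, optimism, corrected = optimism_corrected_auc(X, y)
print("apparent AUC: ", round(apparent, 3))
print("optimism:     ", round(optimism, 3))
print("corrected AUC:", round(corrected, 3))
```

The corrected figure is the apparent performance minus the average amount by which models fitted to bootstrap samples look better on their own sample than on the original data. In practice more replicates would typically be used.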


Scott Millis

--- On Fri, 7/24/09, Guerrero, Rodrigo <[hidden email]> wrote:


> 1) What do you think the implications are of treating missing as a
> valid value and including those records in the equation?

> 2) I am concerned about overfitting the model.  How can I test the
> model to avoid this pitfall?

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Re: Logistic Regression Question

Guerrero, Rodrigo
Thanks Scott, that really helps.  I don't think that multiple imputation
will be an option for us, since a third party will be running the final
model against their own data.  There is only so much they will be willing
and able to do.

Thanks.

RG

Rodrigo A. Guerrero | Director Of Marketing Research and Analysis | The
Scooter Store | 830.627.4317





