|
I forgot to change the subject line. Sorry for the inconvenience.

From: Guerrero, Rodrigo

Hello all,

I have two statistical questions about logistic regression, although they are not necessarily SPSS related. I am running a model to predict response to our direct mail campaign and have both ratio and categorical independent variables. Some of these categorical variables have missing data for some cases. In some variables, more than half the data is missing.

1) What do you think the implications are of treating missing as a valid value and including those records in the equation?

2) I am concerned about overfitting the model. How can I test the model to avoid this pitfall?

Thank you very much for your help.

Rodrigo

Rodrigo A. Guerrero | Director Of Marketing Research and Analysis | The Scooter Store | 830.627.4317

The information transmitted is intended only for the addressee(s) and may contain confidential or privileged material, or both. Any review, receipt, dissemination or other use of this information by non-addressees is prohibited. If you received this in error or are a non-addressee, please contact the sender and delete the transmitted information. |
|
Rodrigo,
1. I think it depends on the amount and pattern of missing data. Have you considered multiple imputation?

2. Regarding overfitting, you need to consider the number of covariates in your model relative to your sample size. On the basis of models validated on independent datasets and simulation studies, sample size requirements are formulated as events per variable (EPV). Several studies (Harrell FE Jr., Lee KL, Califf RM, Pryor DB, Rosati RA, 1984; Harrell FE Jr., Lee KL, Mark DB, 1996; Harrell FE Jr., Lee KL, Matchar DB, & Reichert TA, 1985) have shown that the minimum EPV for obtaining reliable predictions is 10. For binary outcome variables, the limiting sample size used in determining the EPV is the size of the smaller of the two outcome groups (Harrell F, 2001).

After fitting your model, you can use bootstrapping to obtain an estimate of the degree of over-optimism in your model. Harrell (2001) provides some excellent demonstrations on how to do this in R.

Harrell F (2001). Regression modeling strategies: With applications to linear models, logistic regression, and survival analysis. New York: Springer-Verlag.

Harrell FE Jr., Lee KL, Califf RM, Pryor DB, Rosati RA (1984). Regression modelling strategies for improved prognostic prediction. Stat Med, 3(2), 143-152.

Harrell FE Jr., Lee KL, Mark DB (1996). Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med, 15(4), 361-387.

Harrell FE Jr., Lee KL, Matchar DB, & Reichert TA (1985). Regression models for prognostic prediction: advantages, problems, and suggested solutions. Cancer Treat Rep, 69(10), 1071-1077.

Scott Millis

--- On Fri, 7/24/09, Guerrero, Rodrigo <[hidden email]> wrote:

> 1) What do you think the implications are of treating missing as a valid value and including those records in the equation?
> 2) I am concerned about over fitting the model. How can I test the model to avoid this pitfall?
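
For readers who want to see the two checks from this reply in concrete form, here is a minimal Python sketch. It is not from the thread and is not Harrell's R code. It assumes one common form of the optimism bootstrap: refit the model on each resample, then subtract its performance on the original data from its performance on the resample. The choice of the c-statistic (AUC) as the performance measure, the hand-rolled gradient-ascent fit, all function names, and the simulated data are assumptions made for illustration only.

```python
import math
import random


def events_per_variable(y, n_params):
    """EPV = size of the smaller outcome group / number of candidate parameters."""
    events = sum(y)
    return min(events, len(y) - events) / n_params


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, iters=200, lr=0.5):
    """Fit by plain gradient ascent; returns [intercept, b1, ..., bp]."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(iters):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(w[0] + sum(b * x for b, x in zip(w[1:], xi)))
            grad[0] += err
            for j, x in enumerate(xi):
                grad[j + 1] += err * x
        w = [b + lr * g / n for b, g in zip(w, grad)]
    return w


def linear_scores(w, X):
    return [w[0] + sum(b * x for b, x in zip(w[1:], xi)) for xi in X]


def auc(scores, y):
    """c-statistic: share of (event, non-event) pairs ranked correctly."""
    pos = [s for s, yi in zip(scores, y) if yi == 1]
    neg = [s for s, yi in zip(scores, y) if yi == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_optimism(X, y, n_boot=20, seed=0):
    """Return (apparent AUC, mean optimism, optimism-corrected AUC)."""
    rng = random.Random(seed)
    n = len(y)
    apparent = auc(linear_scores(fit_logistic(X, y), X), y)
    total, done = 0.0, 0
    while done < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        if len(set(yb)) < 2:  # resample lacks one outcome class; redraw
            continue
        wb = fit_logistic(Xb, yb)
        # Optimism: AUC on the bootstrap sample minus AUC on the original data.
        total += auc(linear_scores(wb, Xb), yb) - auc(linear_scores(wb, X), y)
        done += 1
    optimism = total / n_boot
    return apparent, optimism, apparent - optimism


# Simulated data (illustrative only): 5 predictors, only the first drives response.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(100)]
y = [1 if rng.random() < sigmoid(1.5 * row[0] - 0.5) else 0 for row in X]

print("EPV:", events_per_variable(y, n_params=5))
apparent, optimism, corrected = bootstrap_optimism(X, y)
print("apparent AUC:", round(apparent, 3))
print("estimated optimism:", round(optimism, 3))
print("optimism-corrected AUC:", round(corrected, 3))
```

In practice one would use far more than 20 resamples and repeat any variable selection inside each resample, so that the optimism estimate reflects the whole modeling process rather than only the final fit.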
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
Thanks Scott, that really helps. I don't think that multiple imputation will be an option for us since a third person will be running the final model against their data. There is only so much they will be willing and able to do.

Thanks.

RG

Rodrigo A. Guerrero | Director Of Marketing Research and Analysis | The Scooter Store | 830.627.4317

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Scott Millis
Sent: Friday, July 24, 2009 10:13 AM
To: [hidden email]
Subject: Re: Logistic Regression Question |
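
If missing is kept as a valid value, as the original question proposes, one way to code it is as an explicit extra level of each categorical predictor. The sketch below is only an illustration of that coding: the function name, the `MISSING` label, and the `housing` variable are hypothetical and do not come from the thread. Each added level is one more parameter to estimate, so it counts toward the events-per-variable budget discussed above, and its coefficient only describes how records with missing values differ in response from the reference level.

```python
def code_with_missing_level(values, reference, missing_label="MISSING"):
    """Dummy-code a categorical variable, treating None as its own level.

    Returns the dummy column names (every level except the reference)
    and one row of 0/1 indicators per input value.
    """
    filled = [missing_label if v is None else v for v in values]
    levels = sorted(set(filled) - {reference})
    rows = [[1 if v == lvl else 0 for lvl in levels] for v in filled]
    return levels, rows


# Hypothetical predictor with two missing records.
housing = ["own", "rent", None, "own", None]
columns, rows = code_with_missing_level(housing, reference="own")

print(columns)
for original, row in zip(housing, rows):
    print(original, row)
```

Because the third party only needs to apply the same recoding before scoring, this approach avoids an imputation step on their side, at the cost of treating all missing records as one homogeneous group.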
