Re: Logistic Regression

Ornelas, Fermin-2
When I was at Amex we had regular modeling meetings; my models usually had between 7 and 12 variables. Others presented theirs with around 18 variables. We implemented and tracked the models, and their performance sometimes suffered after a couple of years. I particularly did not like having too many variables, as some of them suffered from collinear relationships. But since we were interested in prediction, that did not seem to be an issue.

-----Original Message-----
From: Ergul, Emel A. [mailto:[hidden email]]
Sent: Wednesday, April 22, 2009 4:10 PM
To: Ornelas, Fermin; [hidden email]
Subject: RE: Re: Logistic Regression


OK.
I remember from journal reviewers that the total number of predictors for LR should be at most the number of events divided by 10. They say that above this number the model becomes unstable... How about that?
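A rough worked illustration of that rule of thumb, as a minimal sketch; the EPV = 10 cutoff and the 3000-responder figure come from the messages quoted below, everything else is illustrative:

# Minimal sketch of the "events per variable" (EPV) rule of thumb:
# cap the number of predictors at (number of events) / 10.
def max_predictors(n_events: int, epv: int = 10) -> int:
    """Suggested upper bound on predictors under an EPV-style rule."""
    return n_events // epv

# With the ~3000 responders mentioned in the original post below:
print(max_predictors(3000))  # -> 300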



-----Original Message-----
From: SPSSX(r) Discussion on behalf of Ornelas, Fermin
Sent: Wed 4/22/2009 4:55 PM
To: [hidden email]
Subject: Re: Logistic Regression

This answer is on (1) and (2). There is no magic number for the set of predictor variables in a model, but once you clean the data and the model itself you could end up with between 7 and 18 predictors. That was my experience.

It is possible for a model to perform better or worse in a validation sample than in a training sample. However, to ensure that the model performs equally well you need to make sure that your descriptive statistics on the data are similar in the validation and training samples. If the performance difference is large, that could pose a problem when implementing the model, particularly if the performance is worse, which is not your case.
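One way to run that descriptive-statistics check is sketched below, assuming pandas and two DataFrames of predictors; the names train and valid are illustrative, not from the original messages:

import pandas as pd

def compare_samples(train: pd.DataFrame, valid: pd.DataFrame) -> pd.DataFrame:
    """Side-by-side descriptive statistics for the numeric predictors in the
    training and validation samples, plus the difference in means."""
    stats = pd.concat(
        {"train": train.describe().T, "valid": valid.describe().T}, axis=1
    )
    stats[("diff", "mean")] = stats[("valid", "mean")] - stats[("train", "mean")]
    return stats

# Usage (hypothetical data):
# print(compare_samples(train_df[predictors], valid_df[predictors]))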

Regarding the test, I cannot give input off the top of my head for fear of getting some uncomfortable feedback. But I used to graph a Lorenz curve, plotting both training and validation, and calculate the lift of the curve.
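A minimal sketch of that kind of gains/lift comparison, assuming numpy arrays of model scores and 0/1 response flags; the array names are illustrative, not from the original post:

import numpy as np

def cumulative_gains(scores: np.ndarray, y: np.ndarray, n_bins: int = 10) -> np.ndarray:
    """Fraction of all responders captured in the top k score deciles."""
    order = np.argsort(-scores)          # highest scores first
    y_sorted = y[order]
    cum_events = np.cumsum(y_sorted) / y_sorted.sum()
    cuts = (np.arange(1, n_bins + 1) * len(y)) // n_bins
    return cum_events[cuts - 1]          # gains at each decile boundary

def lift(scores: np.ndarray, y: np.ndarray, n_bins: int = 10) -> np.ndarray:
    """Lift per decile: capture rate divided by the random-model baseline."""
    baseline = np.arange(1, n_bins + 1) / n_bins
    return cumulative_gains(scores, y, n_bins) / baseline

# Usage (hypothetical arrays): compare the top deciles of train vs. validation
# print(lift(train_scores, train_y)[:3], lift(valid_scores, valid_y)[:3])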

________________________________
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of <R. Abraham>
Sent: Wednesday, April 22, 2009 1:37 PM
To: [hidden email]
Subject: Logistic Regression


I have 3 questions on Predictive Modeling:

1. I am building a logistic regression model with about 480 predictors. The 'training' sample has about 18000 records with about 3000 responders. I would like to know how many significant predictors the model can have. Is there a suggested number of significant variables for a model?

2. Can a predictive model perform better on the 'validation' sample than on the 'training' sample? My validation-sample results are at least 15% better than the 'training'-sample results in the top prediction deciles.

3. Does the "Kolmogorov-Smirnov test" help in finding out how much the 'Validation' sample results can differ from the 'Training' sample results? If so, can someone give me some pointers on how to perform the test?
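For reference, a minimal sketch of a two-sample Kolmogorov-Smirnov test on model scores, assuming scipy; the score arrays below are random stand-ins, not real data:

import numpy as np
from scipy.stats import ks_2samp

# Hypothetical stand-ins for predicted probabilities in the two samples.
rng = np.random.default_rng(0)
train_scores = rng.beta(2, 5, size=15000)
valid_scores = rng.beta(2, 5, size=3000)

# Two-sample K-S test: could the two score distributions come from the
# same underlying distribution?
stat, p_value = ks_2samp(train_scores, valid_scores)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")

# Note: in scoring-model work "KS" often means the maximum separation between
# responders' and non-responders' score distributions within a single sample,
# i.e. ks_2samp(scores[y == 1], scores[y == 0]).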

Thanks.
R. Abraham



Re: Logistic Regression

Hector Maletta
In reply to this post by <R. Abraham>

Yes. “Statistically significant” is not identical to “Substantively significant” or “Predictively worthwhile”.

Hector

 


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of <R. Abraham>
Sent: 23 April 2009 10:49
To: [hidden email]
Subject: Re: Logistic Regression

 


That makes sense. So you suggest eliminating the significant variables at the bottom with negligible betas from the model, and rerunning it. I am going to try that.

R. Abraham



"Hector Maletta" <[hidden email]>

04/22/2009 06:41 PM

To

<[hidden email]>, <[hidden email]>

cc

 

Subject

RE: Logistic Regression

 

 

 




If you have a model with 40 significant coefficients, just take it and discard the other 440 variables.
 
But even if your model with 40 variables performs well, it may be the case that the last 5 or 10 of those 40 predictors add very little to the result. They are certainly statistically significant in the sense that the probability of their coefficients being zero in the population is lower than 5%, but the coefficients may still be quite small, and when multiplied by a variable with low absolute values they may modify the result by an almost imperceptible amount. If this is the case, you may try a leaner model without those last variables. Unless, of course, there are strong theoretical reasons to have them in the model.
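A minimal sketch of that idea: rank predictors by a crude contribution measure (|beta| times the predictor's standard deviation) and refit without the weakest ones. It assumes statsmodels; the function and data names are illustrative, not from this exchange:

import pandas as pd
import statsmodels.api as sm

def refit_without_smallest(X: pd.DataFrame, y: pd.Series, n_drop: int = 5):
    """Fit a logit, rank predictors by |beta| * std(x) (roughly how much each
    one can move the linear predictor), drop the n_drop weakest contributors,
    and refit the leaner model."""
    full = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    contrib = (full.params.drop("const").abs() * X.std()).sort_values()
    keep = contrib.index[n_drop:]            # discard the n_drop smallest
    lean = sm.Logit(y, sm.add_constant(X[keep])).fit(disp=0)
    return full, lean, contrib

# Usage (hypothetical data):
# full, lean, contrib = refit_without_smallest(X_train, y_train, n_drop=10)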
 
Hector
 
 
 
 

 



From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of <R. Abraham>
Sent: 22 April 2009 18:39
To: [hidden email]
Subject: Re: Logistic Regression

 

Hector,

I started with about 700 variables (including demographic, lifestyle, census, and household cluster variables). By eliminating variables not deemed useful or less useful, and correlated variables, I brought it down to somewhere around 375; the rest (480 - 375) were dummy variables derived from categorical variables. So I am really not shooting in the dark. But I can't think of a way to eliminate further variables, unless I start eliminating even faintly correlated variables (say, at 0.4), or start running a series of regression models with different sets of variables and eliminating the insignificant ones from all the models, thereby reducing the number of variables to be included in the final model. I already do that to a certain extent; I'm not sure it's advisable, but it does work.
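A minimal sketch of that kind of correlation screen, assuming pandas; the 0.4 cutoff echoes the post, and the function and variable names are illustrative:

import pandas as pd

def prune_correlated(X: pd.DataFrame, threshold: float = 0.4) -> list:
    """Greedy screen: walk through the predictors and drop any column whose
    absolute correlation with an already-kept column exceeds the threshold."""
    corr = X.corr().abs()
    kept = []
    for col in X.columns:
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)
    return kept

# Usage (hypothetical data):
# kept_vars = prune_correlated(candidate_df, threshold=0.4)
# print(len(kept_vars), "variables survive the 0.4 correlation screen")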


My final model is quite stable, and in fact it performs better in the validation sample when tested. However, the number of significant variables in my final model is somewhere around 40. I know that modelers usually suggest around 10-18 variables, similar to what Fermin suggested in the earlier post. I get the same number of variables when I run the regression model using the 'Enter' method in SPSS, and also when running it in SAS (stepwise). The Forward: LR stepwise method in SPSS takes days to complete, so I am still waiting for its results. I can limit the number of significant model variables in SPSS with the stepwise method by selecting the appropriate step in the model.


But since I am getting good results with the roughly 40-variable model that I already have, I was wondering whether it's OK to accept it.


And any suggestions on my third question regarding the K-S test?


Thank you so much.


R. Abraham

"Hector Maletta" <[hidden email]>

04/22/2009 04:57 PM

 

To

<[hidden email]>, <[hidden email]>

cc

 

Subject

RE: Logistic Regression


 

 

 

 





The number of records is, I think, irrelevant if the sample consists of 3000 subjects; the sample size is 3000. If the sample is a random sample, there are ways to judge the marginal increase in significance due to the addition of one more predictor.
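One standard way to judge that marginal contribution is a likelihood-ratio test between nested logit models. A minimal sketch, assuming statsmodels and scipy; the data names are illustrative, not from the original message:

import statsmodels.api as sm
from scipy.stats import chi2

def lr_test_added_predictors(X_base, X_extra, y):
    """Likelihood-ratio test: do the extra column(s) significantly improve
    the base logit model?"""
    base = sm.Logit(y, sm.add_constant(X_base)).fit(disp=0)
    full = sm.Logit(y, sm.add_constant(X_base.join(X_extra))).fit(disp=0)
    lr_stat = 2 * (full.llf - base.llf)               # log-likelihood gain
    p_value = chi2.sf(lr_stat, df=X_extra.shape[1])
    return lr_stat, p_value

# Usage (hypothetical data):
# stat, p = lr_test_added_predictors(X_train[current], X_train[["new_var"]], y_train)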

Now, using 480 predictors seems a bit of an overkill strategy. Is there any actual theory with 480 mutually independent additive factors jointly influencing the probability (or the odds) of occurrence of your event? Or are you simply shooting in the dark? I bet you can obtain a reasonably good model with just a smaller number of judiciously chosen predictors. Choosing judiciously may indeed involve some initial shooting in the dark, until you find out which are the very best predictors. Choose the best and ignore the rest, unless you have good theoretical reasons to include them all.

Besides, remember that the outcome of logistic regression is NOT the prediction of individual outcomes but the prediction of population proportions. You may throw a coin 1000 times, predict heads for 50% of the throws and tails for the other 50%, and fail miserably on most individual throws; but the coin will still show heads about 50% of the time even if your predictions of individual throws mostly fail. Each particular throw is indeterminate, but the population of 1000 throws will have about 500 heads and 500 tails (you simply do not know which).

 
Hector

 


 




