SPSSX Discussion

best subset regression


Kornbrot, Diana

Apr 07, 2014; 2:58pm

best subset regression

186 posts

How does SPSS decide on the number of coefficients when fitting best subsets?

The number of coefficients, n, used in the model appears to be the number of predictors with p < .10, no matter what criterion is used.

Thus the corrected model fit uses n, even when it has decided that the number of significant predictors is m.

The difference n - m can vary from 0 to as much as 8 if a tough criterion is specified. This seems rather odd to me.

Please do NOT tell me that best subsets, like all forms of 'automatic' variable selection, is evil.

I know that this is a widely accepted view. However, in some situations it is better [less biased] than the researcher's 'theoretical' opinion or model.

all help gratefully received

best

diana

___________

Professor Diana Kornbrot

Work

University of Hertfordshire

College Lane, Hatfield, Hertfordshire AL10 9AB, UK

+44 (0) 170 728 4626

[hidden email]

http://dianakornbrot.wordpress.com/

http://go.herts.ac.uk/Diana_Kornbrot

skype: kornbrotme

Home

19 Elmhurst Avenue

London N2 0LT, UK

+44 (0) 208 444 2081

Jon K Peck

Apr 07, 2014; 10:14pm

Re: best subset regression

1976 posts

With best subsets, individual significance levels for entry and removal do not apply - you can see that they are disabled in the dialog box in the Build Options tab if best subsets is used for model selection. The include/remove settings are in the Stepwise selection group. The confidence level on the Basics tab is not related to model selection (as indicated in the help).

You can color variables by significance using the sliders in the output model view coefficients pane, but that is just a display option. As you can see from that pane, the uncolored coefficients are still there in the model.
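To illustrate why non-significant coefficients can remain in a best-subsets model, here is a minimal Python sketch. It is not SPSS's implementation: the function names, the simulated data, and the textbook forms of adjusted R-squared and AICc are all assumptions made for illustration. The point it shows is that the subset is chosen by a whole-model criterion, so no per-coefficient p-value threshold is ever applied. For adjusted R-squared in particular, adding a predictor raises the criterion whenever its |t| exceeds 1, which is far weaker than p < .05.

```python
import itertools
import math
import random


def ols_sse(X, y):
    """Residual sum of squares from an OLS fit of y on X (intercept added)."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, k):
            f = M[r][col] / M[col][col]
            for j in range(col, k + 1):
                M[r][j] -= f * M[col][j]
    # Back substitution for the coefficient vector b.
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (M[i][k] - sum(M[i][j] * b[j] for j in range(i + 1, k))) / M[i][i]
    return sum((yi - sum(bi * xi for bi, xi in zip(b, r))) ** 2
               for r, yi in zip(rows, y))


def adjusted_r2(sse, sst, n, p):
    """Textbook adjusted R-squared for a model with p predictors plus intercept."""
    return 1.0 - (sse / (n - p - 1)) / (sst / (n - 1))


def aicc(sse, n, p):
    """Textbook Gaussian AICc (assumed form; SPSS's exact formula may differ)."""
    k = p + 2  # slopes + intercept + error variance
    return n * math.log(sse / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)


def best_subset(data, y, names, criterion="adjr2"):
    """Score every non-empty predictor subset; return (score, chosen names)."""
    n = len(y)
    ybar = sum(y) / n
    sst = sum((v - ybar) ** 2 for v in y)
    best = None
    for size in range(1, len(names) + 1):
        for combo in itertools.combinations(range(len(names)), size):
            X = [[row[j] for j in combo] for row in data]
            sse = ols_sse(X, y)
            if criterion == "adjr2":
                score = adjusted_r2(sse, sst, n, size)
            else:
                score = -aicc(sse, n, size)  # negated so that larger is better
            if best is None or score > best[0]:
                best = (score, [names[j] for j in combo])
    return best


def make_demo_data(n=60, seed=0):
    """Simulated data: y depends on x1 and x2; x3 is pure noise."""
    rng = random.Random(seed)
    data = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
    y = [2 * r[0] - 3 * r[1] + rng.gauss(0, 1) for r in data]
    return data, y


if __name__ == "__main__":
    data, y = make_demo_data()
    for crit in ("adjr2", "aicc"):
        print(crit, best_subset(data, y, ["x1", "x2", "x3"], crit))
```

Every predictor in the winning subset is kept in the fitted model and counted in its degrees of freedom, whatever its individual p-value turns out to be.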

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621


Bruce Weaver

Apr 08, 2014; 1:21am

Re: best subset regression

Administrator

3512 posts

I assume this discussion is about the "Automatic Linear Modeling" procedure (ALM)--is that right?


--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Bruce Weaver

Apr 08, 2014; 10:06pm

Re: best subset regression

Administrator

3512 posts

Now that I've had time to glance at the FM, I see that the procedure for "Automatic Linear Modeling" is actually called LINEAR, not ALM. Under Build Options, one can select FORWARDSTEPWISE, BESTSUBSETS, or NONE. Here is what the FM says about BESTSUBSETS.

BESTSUBSETS. This checks "all possible" models, or at least a larger subset of the possible models than forward stepwise, to choose the best according to the best subsets criterion. The model with the greatest value of the criterion is chosen as the best model. Note that Best subsets selection is more computationally intensive than forward stepwise selection. When best subsets is performed in conjunction with boosting, bagging, or very large datasets, it can take considerably longer to build than a standard model built using forward stepwise selection.

CRITERIA_BEST_SUBSETS. This is the statistic used to choose the "best" model when best subsets selection is used. If MODEL_SELECTION = FORWARDSTEPWISE is not specified, this keyword is ignored.

****************
Jon, Rick & other IBM-SPSS folks who may be lurking: That looks like a typo to me in the CRITERIA_BEST_SUBSETS section. I believe it should say, "If MODEL_SELECTION = BESTSUBSETS is not specified, this keyword is ignored."
****************

The options for CRITERIA_BEST_SUBSETS are as follows.

AICC. Information Criterion (AICC) is based on the likelihood of the data given the model, and is adjusted to penalize overly complex models.

ADJUSTEDRSQUARED. Adjusted R-squared is based on the fit of the data, and is adjusted to penalize overly complex models.

ASE. Overfit Prevention Criterion (ASE) is based on the fit of the overfit prevention set. The overfit prevention set is a random subsample of approximately 30% of the original dataset that is not used to train the model.

Q. Why is ASE short for Overfit Prevention Criterion? ASE is more often short for asymptotic standard error!
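The holdout idea behind the ASE criterion can be sketched as follows. This assumes that ASE denotes the average squared error on the overfit prevention set; the manual excerpt above does not expand the acronym, so treat that reading as an assumption. The function names, the one-predictor model, and the fixed seed are hypothetical choices for illustration, not SPSS's procedure.

```python
import random


def split_holdout(rows, holdout_fraction=0.3, seed=0):
    """Randomly set aside about 30% of the rows; return (train, holdout)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_hold = round(len(shuffled) * holdout_fraction)
    return shuffled[n_hold:], shuffled[:n_hold]


def fit_simple(train):
    """Least-squares intercept and slope for (x, y) pairs."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx
    return my - slope * mx, slope


def average_squared_error(model, rows):
    """Mean squared prediction error of the model on rows it was not fitted to."""
    b0, b1 = model
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in rows) / len(rows)
```

Under this reading, a candidate model is fitted on the training rows only and scored on the held-out rows, so a smaller ASE indicates a better model.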
