Addition of covariates in forward regression analyses


Addition of covariates in forward regression analyses

Peters Gj (PSYCHOLOGY)
Dear list,

[if this question is inappropriate (as it discusses a topic not limited
to SPSS) please tell me; I could not find rules prohibiting this online]

In forward selection multiple linear regression, which of these factors
influence whether a covariate is added to the model?

  - the size of the regression weight the covariate would get
  - the standard error of that regression weight
  - the complete sample size

I suspect that both the size & standard error of the regression weight
are of influence, and that the sample size influences the standard error
of the regression weight.

If you don't want to know why I'm asking this, you can stop reading now
:-)
In any case thanks in advance :-)

Why I want to know this:

I am conducting several very exploratory regression analyses, regressing
the same criterion on the same covariates in a number of different
subsamples (persons with a different value on a certain variable; in
this case for example ecstasy use status (non-users, users & ex-users)).
I use the forward method to probe which covariates yield a significant
addition to the model. The covariates are placed in six blocks (on the
basis of theoretical proximity to the criterion; the idea is that more
distal covariates only enter the model if they explain a significant
portion of the criterion variance over and above the more proximal
covariates already in the model). P to enter is .05. (peripheral
question: am I correct in assuming that this is the p-value associated
with the t-value of the beta of the relevant covariate?)

The sample sizes of the samples are unequal (e.g., ranging from 200 to
500). I get the strong impression that the number of covariates in the
final model depends on the sample size. This would imply that covariates
with less 'impact' would be added to the model when the model is
developed with a larger sample (e.g., with equal standard errors of the
parameter weight, when a covariate increases 1 standard deviation, an
increase of the criterion of 0.2 * Y's standard deviation could suffice
(lead to inclusion) with n=500, but not with n=200).

Is this correct? And if so, is there a way to 'correct' the p-to-enter
for sample size, so that all final models comprise covariates with
roughly equal relevance? (except for selecting sub-subsamples from all
subsamples of the size of the smallest subsample)

My goal in the end is to cursorily compare the models in the different
subsamples (no, sorry, I'm not going to use SEM; given the number of
potential predictors, the sample sizes are too small). This is not very
'fair' if the model in one subsample has lower thresholds for
'inclusion' than the model in another.

If what I'm trying is completely insane/stupid/otherwise inadvisable,
I'm of course eager to learn :-)

Thanks a lot in advance (if nothing else, for reading this far :-)),

Gjalt-Jorn
_____________________________________
Gjalt-Jorn Ygram Peters

PhD Student
Department of Experimental Psychology
Faculty of Psychology
University of Maastricht
Maastricht, The Netherlands

Re: Addition of covariates in forward regression analyses

Hector Maletta
The criteria for inclusion (as explained in the SPSS documentation) are two:
significance of the new variable (its contribution to explaining variance of
the dependent variable) and tolerance (related to collinearity).
The significance criterion is F (additional variance explained relative to
the residual variance remaining once the new variable is in the equation).
It can be fixed as an absolute F value (default is 3.84) or as a probability
(default is 0.05).
The tolerance indicator is the proportion of variance in the new variable
that is NOT explained by other variables in the equation. When the tolerance
indicator is below a certain threshold (default 0.0001) the new variable is
not accepted because it is close to an exact function of the other
variables.
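
As a rough illustration of how the significance criterion interacts with
sample size: for a single candidate variable, F-to-enter equals the squared
t-value of its coefficient and can be written as F = r**2 * df / (1 - r**2),
where r is the partial correlation between the candidate and the dependent
variable (given the variables already in the equation) and df = n - k - 1 is
the residual degrees of freedom after entry. The sketch below solves this for
the smallest r that passes a probability-to-enter of 0.05. The values of n and
k (200 and 500 cases, 3 predictors in the equation after entry) and the names
df, fcrit and rmin are assumptions chosen for illustration only.

```spss
* Hypothetical illustration: smallest partial correlation that passes PIN=.05 *.
* n = subsample size, k = predictors in the equation after entry (assumed) *.
DATA LIST LIST / n k.
BEGIN DATA
200 3
500 3
END DATA.

* Residual degrees of freedom after the candidate has entered *.
COMPUTE df = n - k - 1.
* Critical F for 1 and df degrees of freedom at alpha = .05 *.
COMPUTE fcrit = IDF.F(0.95, 1, df).
* Solve F = r**2 * df / (1 - r**2) for r *.
COMPUTE rmin = SQRT(fcrit / (fcrit + df)).
FORMATS fcrit rmin (F8.3).
LIST.
```

Under these assumptions rmin comes out at roughly 0.14 for n = 200 and roughly
0.09 for n = 500, so the larger subsample admits weaker covariates at the same
p-to-enter.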
Hector
-----Original message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On behalf of
Peters Gj (PSYCHOLOGY)
Sent: Wednesday, July 26, 2006 1:24 PM
To: [hidden email]
Subject: Addition of covariates in forward regression analyses


Re: Addition of covariates in forward regression analyses

Marta García-Granero
In reply to this post by Peters Gj (PSYCHOLOGY)
Hi Gjalt-Jorn

PGP> [if this question is inappropriate (as it discusses a topic not limited
PGP> to SPSS) please tell me; I could not find rules prohibiting this online]

As I told you in my private message to you: theoretical questions are
neither inappropriate nor prohibited in this list.

OK. I have given you several days to "digest" the PDF file with chapter
7 of Rawlings' book, concerning multiple regression models. Now we can
start discussing your questions (this thread is of course open to
everyone who wants to add something/correct anything of what I say
here).

First of all:

Why do you split your dataset into several subgroups according to a
categorical variable? You lose power working with smaller sample
sizes. You could add the categorical variable to your model as an
extra covariate, and check if the models you obtain are different by
adding interaction terms between that variable and the predictors of
interest (see below for a very simple example).

What I would do instead is select a random sample of cases (around
10%), keep it aside and develop my model with the other 90%. This
"small" sample could be used later to cross-validate the model
(evaluating shrinkage).
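
A minimal sketch of this holdout approach in SPSS syntax, assuming a
criterion y and predictors x1 and x2 (all hypothetical names, as are holdout
and pred_y; the seed value is arbitrary). It flags roughly 10% of the cases at
random, fits the model on the rest via the SELECT subcommand, and then
correlates observed and predicted values in the held-out cases. It relies on
REGRESSION computing predicted values for unselected cases as well, which is
how the SELECT subcommand is documented to behave.

```spss
* Hypothetical sketch: keep about 10% of the cases aside for cross-validation *.
SET SEED=20060726.
COMPUTE holdout = (RV.UNIFORM(0,1) < 0.10).

* Fit the model on the development cases only (holdout = 0) *.
REGRESSION
  /SELECT=holdout EQ 0
  /STATISTICS COEFF R ANOVA
  /DEPENDENT y
  /METHOD=ENTER x1 x2
  /SAVE PRED(pred_y).

* Correlation between observed and predicted values in the holdout cases *.
TEMPORARY.
SELECT IF (holdout = 1).
CORRELATIONS /VARIABLES=y pred_y.
```

Squaring that holdout correlation and comparing it with the R-square of the
development sample gives a simple estimate of shrinkage.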

Now, if you are interested in adding "distal covariates" only if they
explain an important portion (don't use the word "significant" here) of
the variance, then you should consider examining the change in adjusted
R-square, or the decrease in residual variance, instead of simply the
significance of the variable in the full model.
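
One way to look at this in SPSS, sketched under assumptions: enter the
proximal and distal covariates as successive blocks and request the CHANGE
statistic, which reports the R-square change and its F test per step
alongside the adjusted R-square. The variable names y, prox1, prox2, dist1
and dist2 are hypothetical placeholders.

```spss
* Hypothetical sketch: blockwise forward selection with change statistics *.
REGRESSION
  /STATISTICS COEFF R ANOVA CHANGE
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT y
  /METHOD=FORWARD prox1 prox2
  /METHOD=FORWARD dist1 dist2.
```

Comparing the adjusted R-square before and after the distal block shows how
much the block adds in explained variance, rather than only whether the step
is significant.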

Now, the example I mentioned:

DATA LIST LIST/ deadspac height age group (4 F4).
BEGIN DATA
 44 110  5 1
 31 116  5 0
 43 124  6 1
 45 129  7 1
 56 131  7 1
 79 138  6 0
 57 142  6 1
 56 150  8 1
 58 153  8 1
 92 155  9 0
 78 156  7 0
 64 159  8 1
 88 164 10 0
112 168 11 0
101 174 14 0
END DATA.

VAR LABEL deadspac'Pulmonary anatomical deadspace (ml)'.
VAR LABEL height'Height (cm)'.
VAR LABEL age'Age (years)'.
VAR LABEL group'Status'.
VAL LABEL group 0'Normal' 1'Asthma'.

* Two independent models: one for normal children and another one for asthmatic *.
SORT CASES BY group .
SPLIT FILE  SEPARATE BY group .

REGRESSION
  /STATISTICS COEFF OUTS CI R ANOVA
  /NOORIGIN
  /DEPENDENT deadspac
  /METHOD=ENTER height age  .

* As you can see, it looks like in normal children neither height nor
  age is significant (although the model is significant) *.

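SPLIT FILE  OFF.

* One model for all children, with group as an extra covariate *.
REGRESSION
  /STATISTICS COEFF OUTS CI R ANOVA
  /NOORIGIN
  /DEPENDENT deadspac
  /METHOD=ENTER height age group .

* Group-by-height interaction: a significant grphgt coefficient suggests
  that the height slope differs between normal and asthmatic children *.
COMPUTE grphgt=group*height.

REGRESSION
  /STATISTICS COEFF OUTS CI R ANOVA
  /NOORIGIN
  /DEPENDENT deadspac
  /METHOD=ENTER height age group grphgt.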



--
Regards,
Dr. Marta García-Granero,PhD           mailto:[hidden email]
Statistician

---
"It is unwise to use a statistical procedure whose use one does
not understand. SPSS syntax guide cannot supply this knowledge, and it
is certainly no substitute for the basic understanding of statistics
and statistical thinking that is essential for the wise choice of
methods and the correct interpretation of their results".

(Adapted from WinPepi manual - I'm sure Joe Abramson will not mind)