Re: Addition of covariates in forward regression analyses


Re: Addition of covariates in forward regression analyses

Peters Gj (PSYCHOLOGY)
Dear Marta & list,

I have just found your post to the SPSS list about the problem I posed earlier concerning my dataset. I'm sorry I missed it earlier!

> Why do you split your dataset in several subgroups according to a
> categorical variable? You lose power working with smaller sample
> sizes. You could add the categorical variable to your model as an
> extra covariate, and check if the models you obtain are different by
> adding interaction terms between that variable and the predictors of
> interest (see below for a very simple example).

I actually have 25 subsamples. The total sample is about 7000 participants, with between 200 and 500 participants in each subsample. The questionnaire was administered online, and heavily tailored. The study is about ecstasy use, and there have been very few quantitative studies (10-15) examining the determinants (psychological antecedent variables) of the various behaviours related to responsible (or less irresponsible ☺) ecstasy use (e.g., drinking sufficient water, getting your ecstasy tested (we do that in the Netherlands), not using too often, etc.). So, given the possibilities of internet research, I decided to examine all these behaviours: I (more or less randomly) assigned each participant to a different 'subsample', and everybody answered questions about a different behaviour. I'm not sure whether it would make much sense to create the 16 dummy variables I would need to specify the subsample of each participant.
In smaller comparative analyses, for example to see whether the determinants of ecstasy use are the same for users and non-users, your idea of creating an interaction term to test whether the model differs between the groups is ingenious! I was afraid I would have to resort to SEM to test this, for which I have too few participants; I also don't know of any SEM program that allows explorative regression analyses (thank goodness; SEM is abused as a fishing technique too much as it is!). Because I would have to enter all determinants into the model, I think I would need about 25 * 15 = 375 participants per subsample, assuming you need as many participants as you would with simple OLS regression. I'm very happy with your suggestion!

> Now, if you are interested in adding "distal covariates" only if they
> explain an important portion (don't use the word "significant" here) of
> the variance, then you should consider examining the change in adjusted
> R-square, or the decrease in residual variance, instead of simply the
> significance of the variable in the full model.

Regarding examination of R^2, you're right that this would make more sense. However, you can't 'tell' SPSS to use R^2 as a criterion for the inclusion of a variable, sadly (as far as I know, anyway ☺). I would ideally tell SPSS to add all variables that 1) 'are significant' (i.e., have beta weights that are not zero and are estimated with reasonable accuracy) and 2) have substantial influence (e.g., standardised beta weights of .1 or higher). As I have to develop an intervention, variables that have only a small influence need not enter the model (not that I'm not interested in answering more fundamental questions, but for the time being intervention development has priority).
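As an aside, those two inclusion criteria are easy to apply outside SPSS once the coefficients have been exported. A minimal Python sketch, where all the variable names and numbers are invented for illustration (1.96 is a normal approximation to the t cut-off):

```python
# Keep only predictors that are both "significant" (|t| = |beta/SE| > 1.96)
# and substantial (|standardised beta| >= .1). All values below are invented.

def keep(predictors, t_crit=1.96, min_beta=0.1):
    return [name for name, beta_std, se in predictors
            if abs(beta_std / se) > t_crit and abs(beta_std) >= min_beta]

candidates = [
    # (name, standardised beta, standard error of that beta)
    ("attitude",    0.35, 0.04),  # significant and substantial -> keep
    ("social_norm", 0.08, 0.03),  # significant but trivial     -> drop
    ("risk_belief", 0.15, 0.12),  # substantial but imprecise   -> drop
]

print(keep(candidates))  # ['attitude']
```

Both thresholds are of course up for debate; the point is only that the filter is mechanical once the fitted weights are in hand.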

Again, Marta, my apologies that I missed your post!
And thank you for it (the post)! ☺

Kind regards,

Gjalt-Jorn
______________________________________
Gjalt-Jorn Ygram Peters

Ph.D. student
Department of Experimental Psychology
Faculty of Psychology
University of Maastricht

________________________________________

Hi Gjalt-Jorn

PGP> [if this question is inappropriate (as it discusses a topic not limited
PGP> to SPSS) please tell me; I could not find rules prohibiting this online]

As I told you in my private message: theoretical questions are
neither inappropriate nor prohibited on this list.

OK. I have given you several days to "digest" the PDF file with chapter
7 of Rawlings' book, concerning multiple regression models. Now we can
start discussing your questions (this thread is of course open to
everyone who wants to add to, or correct, anything I say here).

First of all:

Why do you split your dataset into several subgroups according to a
categorical variable? You lose power working with smaller sample
sizes. You could add the categorical variable to your model as an
extra covariate, and check if the models you obtain are different by
adding interaction terms between that variable and the predictors of
interest (see below for a very simple example).

What I would do instead is select a random sample of cases (around
10%), keep it aside and develop my model with the other 90%. This
"small" sample could be used later to cross-validate the model
(evaluating shrinkage).

Now, if you are interested in adding "distal covariates" only if they
explain an important portion (don't use the word "significant" here) of
the variance, then you should consider examining the change in adjusted
R-square, or the decrease in residual variance, instead of simply the
significance of the variable in the full model.
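Adjusted R-square is easy to compute by hand, and it shows directly why a weak extra covariate can raise R-square while lowering adjusted R-square. A small Python illustration (the R-square values are invented):

```python
def adj_r2(r2, n, p):
    # Adjusted R-square for n cases and p predictors (excluding the intercept).
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 60
before = adj_r2(0.400, n, 5)  # model with 5 predictors, R^2 = .400
after  = adj_r2(0.405, n, 6)  # one more predictor adds only .005 to R^2

print(round(before, 3), round(after, 3))  # 0.344 0.338
```

R-square went up, but the adjusted value went down: the extra predictor did not earn its degree of freedom.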

Now, the example I mentioned:

DATA LIST LIST/ deadspac height age group (4 F4).
BEGIN DATA
 44 110  5 1
 31 116  5 0
 43 124  6 1
 45 129  7 1
 56 131  7 1
 79 138  6 0
 57 142  6 1
 56 150  8 1
 58 153  8 1
 92 155  9 0
 78 156  7 0
 64 159  8 1
 88 164 10 0
112 168 11 0
101 174 14 0
END DATA.

VAR LABEL deadspac 'Pulmonary anatomical deadspace (ml)'.
VAR LABEL height 'Height (cm)'.
VAR LABEL age 'Age (years)'.
VAR LABEL group 'Status'.
VAL LABEL group 0 'Normal' 1 'Asthma'.

* Two independent models: one for normal children and another for asthmatic children *.
SORT CASES BY group.
SPLIT FILE SEPARATE BY group.

REGRESSION
  /STATISTICS COEFF OUTS CI R ANOVA
  /NOORIGIN
  /DEPENDENT deadspac
  /METHOD=ENTER height age .

* As you can see, it looks like in normal children neither height nor
  age is significant (although the model is significant) *.

SPLIT FILE OFF.

REGRESSION
  /STATISTICS COEFF OUTS CI R ANOVA
  /NOORIGIN
  /DEPENDENT deadspac
  /METHOD=ENTER height age group .

COMPUTE grphgt=group*height.

REGRESSION
  /STATISTICS COEFF OUTS CI R ANOVA
  /NOORIGIN
  /DEPENDENT deadspac
  /METHOD=ENTER height age group grphgt.
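The interaction model above can be mirrored outside SPSS as a sanity check. Below is a self-contained Python sketch (not SPSS output): ordinary least squares solved via the normal equations, fitted to the 15 cases listed above, comparing the model with and without the group*height term:

```python
# OLS via the normal equations (Gauss-Jordan elimination), fitted to the
# 15 cases from the example above: deadspace, height, age, group.

data = [  # deadspace, height, age, group (1 = asthma, 0 = normal)
    (44, 110, 5, 1), (31, 116, 5, 0), (43, 124, 6, 1), (45, 129, 7, 1),
    (56, 131, 7, 1), (79, 138, 6, 0), (57, 142, 6, 1), (56, 150, 8, 1),
    (58, 153, 8, 1), (92, 155, 9, 0), (78, 156, 7, 0), (64, 159, 8, 1),
    (88, 164, 10, 0), (112, 168, 11, 0), (101, 174, 14, 0),
]

def ols(X, yv):
    """Solve (X'X) b = X'y by Gauss-Jordan elimination with partial pivoting."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] +
         [sum(row[i] * yi for row, yi in zip(X, yv))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c and A[r][c]:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def r_squared(X, yv, b):
    yhat = [sum(bi * xi for bi, xi in zip(b, row)) for row in X]
    ybar = sum(yv) / len(yv)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(yv, yhat))
    ss_tot = sum((yi - ybar) ** 2 for yi in yv)
    return 1 - ss_res / ss_tot

y = [d[0] for d in data]
X_main = [[1, h, a, g] for _, h, a, g in data]        # height, age, group
X_int = [row + [row[3] * row[1]] for row in X_main]   # + group*height

r2_main = r_squared(X_main, y, ols(X_main, y))
r2_int = r_squared(X_int, y, ols(X_int, y))
print(round(r2_main, 3), round(r2_int, 3))  # the interaction model's R^2 is at least as large
```

Whether the R-square gain from the interaction is worth keeping is exactly the question the SPSS output above answers via the coefficient's t test.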

PGP> In forward selection multiple linear regression, which of these factors
PGP> influence whether a covariate is added to the model?

PGP>   - the size of the regression weight the covariate would get
PGP>   - the standard error of that regression weight
PGP>   - the complete sample size

PGP> I suspect that both the size & standard error of the regression weight
PGP> are of influence, and that the sample size influences the standard error
PGP> of the regression weight.

PGP> If you don't want to know why I'm asking this, you can stop reading now
PGP> :-)
PGP> In any case thanks in advance :-)

PGP> Why I want to know this:

PGP> I am conducting several very exploratory regression analyses, regressing
PGP> the same covariates on the same criterion in a number of different
PGP> subsamples (persons with a different value on a certain variable; in
PGP> this case for example ecstasy use status (non-users, users & ex-users)).
PGP> I use the forward method to probe which covariates yield a significant
PGP> addition to the model. The covariates are placed in six blocks (on the
PGP> basis of theoretical proximity to the criterion; the idea is that more
PGP> distal covariates only enter the model if they explain a significant
PGP> portion of the criterion variance over and above the more proximal
PGP> covariates already in the model). P to enter is .05. (peripheral
PGP> question: am I correct in assuming that this is the p-value associated
PGP> with the t-value of the beta of the relevant covariate?)

PGP> The sample sizes of the samples are unequal (e.g., ranging from 200 to
PGP> 500). I get the strong impression that the number of covariates in the
PGP> final model depends on the sample size. This would imply that covariates
PGP> with less 'impact' would be added to the model when the model is
PGP> developed with a larger sample (e.g., with equal standard errors of the
PGP> parameter weight, when a covariate increases 1 standard deviation, an
PGP> increase of the criterion of 0.2 * Y's standard deviation could suffice
PGP> (lead to inclusion) with n=500, but not with n=200).

PGP> Is this correct? And if so, is there a way to 'correct' the p-to-enter
PGP> for sample size, so that all final models comprise covariates of
PGP> roughly equal relevance? (other than selecting sub-subsamples, of the
PGP> size of the smallest subsample, from all subsamples)

PGP> My goal in the end is to cursorily compare the models in the different
PGP> subsamples (no, sorry, I'm not going to use SEM; given the amount of
PGP> potential predictors, the sample sizes are too small). This is not very
PGP> 'fair' if the model in one subsample has lower thresholds for
PGP> 'inclusion' than the model in another.

PGP> If what I'm trying is completely insane/stupid/otherwise inadvisable,
PGP> I'm of course eager to learn :-)
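As a side note, the suspicion about sample size quoted above is easy to verify with a back-of-the-envelope calculation: for a single standardised weight beta in a simple regression, t is roughly beta * sqrt(n - 2) / sqrt(1 - beta^2), so the same modest effect can clear the p = .05 threshold at n = 500 and miss it at n = 200. A quick Python check (using 1.96 as a normal approximation to the t cut-off; in a multiple regression with correlated predictors the details differ, but the sqrt(n) behaviour is the same):

```python
import math

def t_stat(beta_std, n):
    # t statistic for a standardised weight in a simple regression.
    return beta_std * math.sqrt(n - 2) / math.sqrt(1 - beta_std ** 2)

beta = 0.1                         # a modest standardised weight
print(round(t_stat(beta, 200), 2))  # below 1.96: not entered at n = 200
print(round(t_stat(beta, 500), 2))  # above 1.96: entered at n = 500
```

So with p-to-enter fixed at .05, larger subsamples will indeed admit weaker covariates, which is why a minimum-effect-size criterion (or comparing adjusted R-square) is attractive when the models are to be compared across subsamples.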
Re: Addition of covariates in forward regression analyses

Marta García-Granero
Hi Gjalt-Jorn

There is something else important to consider: you simply CAN'T let
SPSS (or any other statistical software, no criticism meant for this
particular product, @spss.com people) develop a model for you without
your control. The researcher should have the last word concerning
which variables enter/leave the model, and no stepwise method,
governed by any rule, is a good substitute for common sense. Stepwise
regression has been considered a form of "unwise regression".

Now, having said that, let me contribute something useful. You can
use a kind of stepwise method to see the effect of including/excluding
a block of variables (this is called "chunkwise" regression). It will
allow you to check whether a whole group of variables (that belong
together for theoretical reasons) is worth analyzing in more detail
(because the overall test for this block of variables, the chunk test,
is significant, and the increase in R-square for the whole block of
variables is important). Unfortunately, this can be done easily with
BMDP, but not so easily with SPSS (I mean, the inclusion/exclusion of
the block of variables is not automatic, but decided by the user).
I'm working on an SPSS adaptation of an example I found in a book
(solved with BMDP).
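The chunk test itself is just a partial F test on the block of variables, and can be computed by hand from the R-square values of the two nested models. A small Python sketch (the R-square values, sample size, and model sizes below are invented):

```python
def chunk_f(r2_full, r2_reduced, n, p_full, q):
    """Partial (chunk) F test for adding a block of q variables.

    r2_full / r2_reduced: R-square with and without the block;
    n: sample size; p_full: predictors in the full model (incl. the block).
    The statistic has (q, n - p_full - 1) degrees of freedom.
    """
    num = (r2_full - r2_reduced) / q
    den = (1 - r2_full) / (n - p_full - 1)
    return num / den

# Example: a block of 4 distal variables raises R^2 from .30 to .36, n = 300.
f = chunk_f(0.36, 0.30, n=300, p_full=10, q=4)
print(round(f, 2))  # 6.77, well above the F(4, 289) 5% critical value of roughly 2.4
```

If the chunk test is significant and the R-square gain is substantively important, the block is worth examining variable by variable; otherwise the whole block can be set aside at once.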

Let me know if you are interested, and I can send you what I get.

PGP> Regarding examination of R^2, you're right that this would
PGP> make more sense. However, you can't 'tell' SPSS to use R^2 as a
PGP> criterion for the inclusion of a variable, sadly (as far as I
PGP> know, anyway). I would ideally tell SPSS to add all variables
PGP> that 1) 'are significant' (i.e., have beta weights that are not
PGP> zero and are estimated with reasonable accuracy) and 2) have
PGP> substantial influence (e.g., standardised beta weights of .1 or
PGP> higher). As I have to develop an intervention, variables that
PGP> have only a small influence need not enter the model (not that
PGP> I'm not interested in answering more fundamental questions, but
PGP> for the time being intervention development has priority).

Marta
Reply | Threaded
Open this post in threaded view
|

....wise regression analysis (was: "Addition of covariates in forward regression analyses")

Peters Gj (PSYCHOLOGY)
Hey Marta & list,

Thank you for your reply.

>Marta>> The researcher should have the last word concerning
>Marta>> which variables enter/leave the model, and no stepwise method,
>Marta>> governed by any rule, is a good substitute for common sense.
>Marta>> Stepwise regression has been considered a form of "unwise
>Marta>> regression".

I have heard this argument many times before, but I still do not
wholly agree. Imagine the following situation (completely
coincidentally my own :-)).
You are performing a very exploratory, applied study, measuring
variables from several theories, and trying to find out whether, to
what degree, and how they can explain your dependent variable
('criterion'). You have a set of 25 variables which could theoretically
each be argued to predict the criterion. The variables have differing
proximities to the criterion, and it could be argued that the variables
partly comprise each other (e.g., X1 could be a combination of a part of
X2, a part of X3, and for the rest consist of variables you did not
measure). Your mission (should you choose to accept it :-)) is to build
the model that explains the largest amount of variance in the criterion,
but with the restriction that you would ideally not want any distal
variables in your model (unless they explain parts of the criterion that
cannot be explained by more proximal variables). In addition, for
pragmatic/practical reasons, you have no interest in interactions,
moderations or mediations. You want to know whether, and to what degree,
your proximal variables suffice to explain the criterion. In the example
I gave above (X1, X2 & X3), if X1 were distal and X2 & X3 were
proximal, you would want to enter X2 and X3 into the model first, and
then see whether X1 still explains sufficient variance to warrant
inclusion in the model.

I don't see how there's a lot of danger in using stepwise (or, unwise if
you prefer, semantics and stones don't hurt :-P) regression. Given my
limited experience with statistics though, I'm probably missing some
points :-)

Regarding the chunkwise regression: doesn't it suffice to manually
add a chunk of variables using ENTER and inspect the R^2 change?
Our university doesn't have BMDP, and it's kind of expensive :-) (well,
not really, but not worth buying for this project alone). I am not
sure whether the example you would send me would be too complex for
me, but I can always try :-)

Thank you again for your help!

Kind regards,

Gjalt-Jorn
___________________________________________________________________
Gjalt-Jorn Ygram Peters

## Ph.D. student
   Department of Experimental Psychology
   Faculty of Psychology, University of Maastricht, The Netherlands

## Contact:
   P.O. Box 616                              Phone: +31 43 388 4508
   6200 MD Maastricht, The Netherlands       Msn:       [hidden email]
Re: ....wise regression analysis (was: "Addition of covariates in forward regression analyses")

Marta García-Granero
Hi list

It looks like I made a mistake (the one I don't want people to make)
and sent this message to Gjalt-Jorn privately instead of to the whole
list (I hit "Reply" instead of "Reply to all"), rather than what I
intended (and almost always do).

Following his suggestion, I'm editing it a bit and sending it to the
list, just in case someone finds it interesting "food for thought".

Thursday, August 17, 2006, 12:55:15 PM, I wrote:

MGG> Hi Gjalt-Jorn

MGG> All your arguments below are in favour of my theory: you are not
MGG> running stepwise regression (in the classical sense), but a very
MGG> carefully planned model development, attending to the characteristics
MGG> of the variables involved. The fact that it's exploratory doesn't
MGG> change this. To my knowledge, the degree of control and intelligence
MGG> needed for this task can't be supplied by any program (SPSS, BMDP,
MGG> SAS...), but only by the user. This means that you can't hope to find
MGG> a way of running the task automatically. At every step, you'll have
MGG> to examine the results and make decisions that no program can make
MGG> for you.

MGG> Anyway, since you have "only" 25 variables, there could be a
MGG> second approach: get the adjusted R-square and other interesting
MGG> statistics (residual SD, information criteria...) for all the
MGG> possible models (yes, I know this means 2**25-1 = 33,554,431
MGG> models! But I already have code for that, although in some test
MGG> runs it took almost an hour), save that information to a new
MGG> file, and then use your criteria to filter, sort, list,
MGG> eliminate... models until you find those (a very small number,
MGG> hopefully) that are interesting enough to evaluate fully
MGG> (using SPSS' REGRESSION procedure).
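All-possible-subsets screening of this kind is straightforward to sketch. The toy Python version below (not Marta's macros; it uses three simulated predictors, so only 2**3 - 1 = 7 models) ranks every non-empty subset by adjusted R-square:

```python
import itertools
import random

# Simulated data: y depends on x1 (strongly) and x2; x3 is pure noise.
random.seed(1)
n = 40
cols = {name: [random.gauss(0, 1) for _ in range(n)] for name in ("x1", "x2", "x3")}
y = [2 * a + b + random.gauss(0, 1) for a, b in zip(cols["x1"], cols["x2"])]

def ols(X, yv):
    """Solve (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] +
         [sum(row[i] * yi for row, yi in zip(X, yv))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c and A[r][c]:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def adjusted_r2(X, yv):
    b = ols(X, yv)
    yhat = [sum(bi * xi for bi, xi in zip(b, row)) for row in X]
    ybar = sum(yv) / len(yv)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(yv, yhat))
    ss_tot = sum((yi - ybar) ** 2 for yi in yv)
    p = len(X[0]) - 1  # predictors, excluding the intercept
    return 1 - (ss_res / ss_tot) * (len(yv) - 1) / (len(yv) - p - 1)

# Fit every non-empty subset of predictors and rank by adjusted R-square.
results = []
for k in range(1, 4):
    for combo in itertools.combinations(("x1", "x2", "x3"), k):
        X = [[1] + [cols[v][i] for v in combo] for i in range(n)]
        results.append((adjusted_r2(X, y), combo))
results.sort(reverse=True)
for r2a, combo in results:
    print(round(r2a, 3), combo)
```

With 25 variables the same loop runs over 2**25 - 1 subsets, which is exactly why Marta's macros take on the order of an hour; the filtering step afterwards (by adjusted R-square, residual SD, etc.) is what keeps the approach manageable.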

MGG> I can send you the macros privately (since I have to publish them
MGG> first, on my university's orders, and I don't want them running free
MGG> on the Web, because I already had a nasty experience a year ago
MGG> or so, when someone thought he could improve his curriculum by
MGG> plagiarizing every piece of interesting SPSS and GAUSS code
MGG> he found while searching the web, and a great part of
MGG> Raynald's web page was looted).

Gjalt-Jorn had written before...

PGP>> I have heard this argument many times before, but I still do not
PGP>> wholly agree. Imagine the following situation (completely
PGP>> coincidentally my own :-)).
PGP>> You are performing a very exploratory, applied study, measuring
PGP>> variables from several theories, and trying to find out whether, to
PGP>> what degree, and how they can explain your dependent variable
PGP>> ('criterion'). You have a set of 25 variables which could theoretically
PGP>> each be argued to predict the criterion. The variables have differing
PGP>> proximities to the criterion, and it could be argued that the variables
PGP>> partly comprise each other (e.g., X1 could be a combination of a part of
PGP>> X2, a part of X3, and for the rest consist of variables you did not
PGP>> measure). Your mission (should you choose to accept it :-)) is to build
PGP>> the model that explains the largest amount of variance in the criterion,
PGP>> but with the restriction that you would ideally not want any distal
PGP>> variables in your model (unless they explain parts of the criterion that
PGP>> cannot be explained by more proximal variables). In addition, for
PGP>> pragmatic/practical reasons, you have no interest in interactions,
PGP>> moderations or mediations. You want to know whether, and to what degree,
PGP>> your proximal variables suffice to explain the criterion. In the example
PGP>> I gave above (X1, X2 & X3), if X1 were distal and X2 & X3 were
PGP>> proximal, you would want to enter X2 and X3 into the model first, and
PGP>> then see whether X1 still explains sufficient variance to warrant
PGP>> inclusion in the model.

PGP>> I don't see how there's a lot of danger in using stepwise (or, unwise if
PGP>> you prefer, semantics and stones don't hurt :-P) regression. Given my
PGP>> limited experience with statistics though, I'm probably missing some
PGP>> points :-)

PGP>> Regarding the chunkwise regression: doesn't it suffice to manually
PGP>> add a chunk of variables using ENTER and inspect the R^2 change?



--
Regards,
Dr. Marta García-Granero, PhD          mailto:[hidden email]
Statistician

---
"It is unwise to use a statistical procedure whose use one does
not understand. An SPSS syntax guide cannot supply this knowledge, and
it is certainly no substitute for the basic understanding of statistics
and statistical thinking that is essential for the wise choice of
methods and the correct interpretation of their results".

(Adapted from the WinPepi manual - I'm sure Joe Abramson will not mind)