Dear Marta & list,
I have just found your post to the SPSS list about the problem I posed earlier concerning my dataset. I am sorry; I missed it earlier!

> Why do you split your dataset in several subgroups according to a
> categorical variable? You lose power working with smaller sample
> sizes. You could add the categorical variable to your model as an
> extra covariate, and check if the models you obtain are different by
> adding interaction terms between that variable and the predictors of
> interest (see below for a very simple example).

I actually have 25 subsamples. The total sample is about 7000 participants, with between 200 and 500 participants in each subsample. The questionnaire was administered online, and heavily tailored. The study is about ecstasy use, and there have been virtually no quantitative studies (10-15) examining the determinants (psychological antecedent variables) of the various behaviours related to responsible (less irresponsible ☺) ecstasy use (e.g., drinking sufficient water, getting your ecstasy tested (we do that in the Netherlands), not using too often, etc.). So, given the possibilities of internet research, I decided to examine all these behaviours: I (more or less randomly) assigned each participant to a different 'subsample', and everybody answered questions about a different behaviour. I'm not sure whether it would make much sense to create the 16 dummy variables I would need to specify the subsample of each participant.

In small comparative analyses, for example to see whether the determinants of using ecstasy are the same for users and non-users, your idea of creating an interaction term to test whether the model differs between the groups is simply ingenious! I was afraid I would have to resort to SEM to test this (for which I have too few participants, as I don't know of any SEM program allowing exploratory regression analyses; thank god! SEM is being abused as a fishing technique too much as it is anyway!).
Because I would have to enter all determinants into the model, I think I would need about 25 * 15 = 375 participants per subsample (assuming you need as many participants as you would with simple OLS regression). I'm very happy with your suggestion!

> Now, if you are interested in adding "distal covariates" only if they
> explain an important part (don't use the word "significant" here) of the
> variance, then you should consider examining the change in adjusted
> R-square, or the decrease in residual variance, instead of simply the
> significance of the variable in the full model.

Regarding examination of R^2, you're right that this would make more sense. However, you can't 'tell' SPSS to use R^2 as a criterion for inclusion of a variable, sadly (as far as I know, anyway ☺). I would ideally tell SPSS to add all variables that 1) 'are significant' (i.e., have beta weights that are not zero and are estimated with reasonable accuracy) and 2) have substantial influence (e.g., standardised beta weights of .1 or higher). As I have to develop an intervention, variables that have only a small influence need not enter the model (not that I'm not interested in answering more fundamental questions, but for the time being the intervention development should have priority).

Again, Marta, my apologies that I missed your post! And thank you for it (the post)! ☺

Kind regards,

Gjalt-Jorn
______________________________________
Gjalt-Jorn Ygram Peters
PhD Student
Department of Experimental Psychology
Faculty of Psychology
University of Maastricht
________________________________________

Hi Gjalt-Jorn

PGP> [if this question is inappropriate (as it discusses a topic not limited
PGP> to SPSS) please tell me; I could not find rules prohibiting this online]

As I told you in my private message to you: theoretical questions are neither inappropriate nor prohibited in this list. OK.
I have given you several days to "digest" the PDF file with chapter 7 of Rawlings' book, concerning multiple regression models. Now we can start discussing your questions (this thread is of course open to everyone who wants to add something/correct anything of what I say here).

First of all: why do you split your dataset in several subgroups according to a categorical variable? You lose power working with smaller sample sizes. You could add the categorical variable to your model as an extra covariate, and check if the models you obtain are different by adding interaction terms between that variable and the predictors of interest (see below for a very simple example). What I would do instead is select a random sample of cases (around 10%), keep it aside and develop my model with the other 90%. This "small" sample could be used later to cross-validate the model (evaluating shrinkage).

Now, if you are interested in adding "distal covariates" only if they explain an important part (don't use the word "significant" here) of the variance, then you should consider examining the change in adjusted R-square, or the decrease in residual variance, instead of simply the significance of the variable in the full model.

Now, the example I mentioned:

DATA LIST LIST/ deadspac height age group (4 F4).
BEGIN DATA
 44 110  5 1
 31 116  5 0
 43 124  6 1
 45 129  7 1
 56 131  7 1
 79 138  6 0
 57 142  6 1
 56 150  8 1
 58 153  8 1
 92 155  9 0
 78 156  7 0
 64 159  8 1
 88 164 10 0
112 168 11 0
101 174 14 0
END DATA.
VAR LABEL deadspac 'Pulmonary anatomical deadspace (ml)'.
VAR LABEL height 'Height (cm)'.
VAR LABEL age 'Age (years)'.
VAR LABEL group 'Status'.
VAL LABEL group 0 'Normal' 1 'Asthma'.

* Two independent models: one for normal children and another one for asthmatic children *.
SORT CASES BY group.
SPLIT FILE SEPARATE BY group.
REGRESSION
  /STATISTICS COEFF OUTS CI R ANOVA
  /NOORIGIN
  /DEPENDENT deadspac
  /METHOD=ENTER height age .
* As you can see, it looks like in normal children neither height nor age is significant (although the model is significant) *.

SPLIT FILE OFF.
REGRESSION
  /STATISTICS COEFF OUTS CI R ANOVA
  /NOORIGIN
  /DEPENDENT deadspac
  /METHOD=ENTER height age group .

COMPUTE grphgt=group*height.
REGRESSION
  /STATISTICS COEFF OUTS CI R ANOVA
  /NOORIGIN
  /DEPENDENT deadspac
  /METHOD=ENTER height age group grphgt .

PGP> In forward selection multiple linear regression, which of these factors
PGP> influence whether a covariate is added to the model?
PGP>  - the size of the regression weight the covariate would get
PGP>  - the standard error of that regression weight
PGP>  - the complete sample size
PGP> I suspect that both the size & standard error of the regression weight
PGP> are of influence, and that the sample size influences the standard error
PGP> of the regression weight.
PGP> If you don't want to know why I'm asking this, you can stop reading now
PGP> :-)
PGP> In any case thanks in advance :-)
PGP> Why I want to know this:
PGP> I am conducting several very exploratory regression analyses, regressing
PGP> the same covariates on the same criterion in a number of different
PGP> subsamples (persons with a different value on a certain variable; in
PGP> this case for example ecstasy use status (non-users, users & ex-users)).
PGP> I use the forward method to probe which covariates yield a significant
PGP> addition to the model. The covariates are placed in six blocks (on the
PGP> basis of theoretical proximity to the criterion; the idea is that more
PGP> distal covariates only enter the model if they explain a significant
PGP> portion of the criterion variance over and above the more proximal
PGP> covariates already in the model). P to enter is .05. (Peripheral
PGP> question: am I correct in assuming that this is the p-value associated
PGP> with the t-value of the beta of the relevant covariate?)
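For readers without SPSS at hand, the same interaction test can be sketched in Python. This is not part of the original thread; it is a minimal, standard-library-only sketch that refits the dead-space example above with and without the group*height term (ordinary least squares via the normal equations, which is adequate for a problem this small):

```python
# Sketch (not from the original post): compare the main-effects model
# deadspac ~ height + age + group with the model that adds the
# group*height interaction, using plain-Python least squares.

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols_r2(X, y):
    """Fit y = X b by least squares (X includes the intercept column);
    return the model R-square."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    b = solve(xtx, xty)
    yhat = [sum(bi * xi for bi, xi in zip(b, row)) for row in X]
    ybar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Data from the DATA LIST block: deadspac, height, age, group.
data = [(44, 110, 5, 1), (31, 116, 5, 0), (43, 124, 6, 1), (45, 129, 7, 1),
        (56, 131, 7, 1), (79, 138, 6, 0), (57, 142, 6, 1), (56, 150, 8, 1),
        (58, 153, 8, 1), (92, 155, 9, 0), (78, 156, 7, 0), (64, 159, 8, 1),
        (88, 164, 10, 0), (112, 168, 11, 0), (101, 174, 14, 0)]
y = [d[0] for d in data]
main = [[1, d[1], d[2], d[3]] for d in data]       # intercept height age group
inter = [row + [row[3] * row[1]] for row in main]  # ... + group*height (grphgt)
r2_main, r2_inter = ols_r2(main, y), ols_r2(inter, y)
print(round(r2_main, 3), round(r2_inter, 3))
```

Because the two models are nested, R-square can only rise when grphgt is added; whether the rise is worth it is exactly what the interaction's t test (or an R-square-change F test) judges.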
PGP> The sample sizes of the samples are unequal (e.g., ranging from 200 to
PGP> 500). I get the strong impression that the number of covariates in the
PGP> final model depends on the sample size. This would imply that covariates
PGP> with less 'impact' would be added to the model when the model is
PGP> developed with a larger sample (e.g., with equal standard errors of the
PGP> parameter weight, when a covariate increases 1 standard deviation, an
PGP> increase of the criterion of 0.2 * Y's standard deviation could suffice
PGP> (lead to inclusion) with n=500, but not with n=200).
PGP> Is this correct? And if so, is there a way to 'correct' the p-to-enter
PGP> for sample size, so that all final models comprise covariates with
PGP> roughly equal relevance? (Except for selecting sub-subsamples from all
PGP> subsamples of the size of the smallest subsample.)
PGP> My goal in the end is to cursorily compare the models in the different
PGP> subsamples (no, sorry, I'm not going to use SEM; given the number of
PGP> potential predictors, the sample sizes are too small). This is not very
PGP> 'fair' if the model in one subsample has lower thresholds for
PGP> 'inclusion' than the model in another.
PGP> If what I'm trying is completely insane/stupid/otherwise unadvisable,
PGP> I'm of course eager to learn :-)
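The intuition in the quoted question is right, and can be checked with a back-of-the-envelope calculation (not from the original thread). For a single predictor, the t statistic of its beta is roughly r*sqrt(n-2)/sqrt(1-r^2), so the smallest correlation that clears a fixed p-to-enter shrinks as n grows. A standard-library sketch, using 1.96 as a normal approximation to the t critical value:

```python
import math

def min_significant_r(n, crit=1.96):
    """Smallest simple correlation whose t statistic,
    t = r*sqrt(n-2)/sqrt(1-r^2), reaches the critical value.
    Solving t = crit for r gives r = crit/sqrt(n - 2 + crit^2)."""
    return crit / math.sqrt(n - 2 + crit ** 2)

# The two subsample sizes mentioned in the question:
r200 = min_significant_r(200)
r500 = min_significant_r(500)
print(round(r200, 3), round(r500, 3))
```

So with n=500 a predictor of noticeably smaller 'impact' clears the same p-to-enter as a larger effect does with n=200, which is exactly why the final models differ in size across subsamples. Screening retained predictors on a minimum standardised effect size (as suggested elsewhere in the thread) is one way to keep the models comparable.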
Hi Gjalt-Jorn
There is something else important to consider: you simply CAN'T let SPSS (or any other statistical software; no criticism meant for this particular product, @spss.com people) develop a model for you without your control. The researcher should have the last word concerning which variables enter/leave the model, and no stepwise method, governed by any rule, is a good substitute for common sense. Stepwise regression has been considered a form of "unwise regression".

Now, having said that, let me contribute something useful. You can use a kind of stepwise method to see the effect of including/excluding a block of variables (it is called "chunkwise" regression). This will allow you to check whether a whole group of variables (that are together for theoretical reasons) is worth analyzing in more detail (because the overall test for this block of variables, the chunk test, is significant, and the increase in R-square for the whole block of variables is important). Unfortunately, this can be done easily with BMDP, but not so easily with SPSS (I mean, the inclusion/exclusion of the block of variables is not automatic, but decided by the user). I'm working on the adaptation to SPSS of an example I found in a book (solved with BMDP). Let me know if you are interested, and I can send you what I get.

PGP> Regarding examination of R^2, you're right that this would
PGP> make more sense. However, you can't tell SPSS to use R^2 as a
PGP> criterion for inclusion of a variable, sadly (as far as I
PGP> know, anyway ☺). I would ideally tell SPSS to add all variables
PGP> that 1) are significant (i.e., have beta weights that are not
PGP> zero and are estimated with reasonable accuracy) and 2) have
PGP> substantial influence (e.g., standardised beta weights of .1 or
PGP> higher).
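The "chunk test" mentioned above is the usual R-square-change F test for a pair of nested models; a sketch of the arithmetic follows (not from the original thread; the R-square values, n and predictor counts are hypothetical numbers chosen for illustration):

```python
def chunk_f(r2_full, r2_reduced, n, p_full, p_extra):
    """F statistic for adding a chunk of p_extra variables to a model:
    F = ((R2_full - R2_reduced) / p_extra) / ((1 - R2_full) / (n - p_full - 1)),
    with p_extra and n - p_full - 1 degrees of freedom."""
    num = (r2_full - r2_reduced) / p_extra
    den = (1 - r2_full) / (n - p_full - 1)
    return num / den

# Hypothetical example: a block of 4 distal variables raises R^2
# from .30 to .36 in a subsample of n = 300 (8 predictors in all).
f = chunk_f(0.36, 0.30, n=300, p_full=8, p_extra=4)
print(round(f, 2))  # -> 6.82
```

This is the same quantity SPSS reports as "F Change" when a block is entered with a separate /METHOD=ENTER subcommand and CHANGE statistics are requested, which is why the block-by-block approach works even without BMDP's automatic chunk selection.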
PGP> As I have to develop an intervention, variables that
PGP> only have a small influence need not enter the model (not that
PGP> I'm not interested in answering more fundamental questions, but
PGP> for the time being the intervention development should have
PGP> priority).

Marta
Hey Marta & list,
Thank you for your reply.

>Marta>> The researcher should have the last word concerning
>Marta>> which variables enter/leave the model, and no stepwise method,
>Marta>> governed by any rule, is a good substitute for common sense.
>Marta>> Stepwise regression has been considered a form of "unwise
>Marta>> regression".

I have heard this argument many times before, but still do not wholly agree. Imagine the following situation (completely coincidentally my own :-)).

You are performing a very exploratory, applied study, measuring variables from several theories, and trying to find out whether, to what degree, and how they can explain your dependent variable ('criterion'). You have a set of 25 variables which could theoretically each be argued to predict the criterion. The variables have differing proximities to the criterion, and it could be argued that the variables partly comprise each other (e.g., X1 could be a combination of a part of X2, a part of X3, and for the rest consist of variables you did not measure). Your mission (should you choose to accept it :-)) is to build the model that explains the largest amount of variance in the criterion, but with the restriction that you would ideally not want any distal variables in your model (unless they explain parts of the criterion that cannot be explained by more proximal variables). In addition, for pragmatic/practical reasons, you have no interest in interactions, moderations or mediations. You want to know whether, and to what degree, your proximal variables suffice to explain the criterion. In the example I gave above (X1, X2 & X3), if X1 were distal and X2 & X3 were proximal, you would want to first enter X2 and X3 into the model, and then see whether X1 still explains sufficient variance to warrant inclusion in the model.

I don't see how there's a lot of danger in using stepwise (or, unwise if you prefer; semantics and stones don't hurt :-P) regression.
Given my limited experience with statistics, though, I'm probably missing some points :-)

Regarding the chunkwise regression: doesn't it suffice to manually add a chunk of variables using ENTER and inspect the R^2 change? Our university doesn't have BMDP, and it's kind of expensive :-) (well, not really, but not worth buying for this project alone). I am not sure whether the example you would send me would be too complex. I can always try :-)

Thank you again for your help!

Kind regards,

Gjalt-Jorn
___________________________________________________________________
Gjalt-Jorn Ygram Peters
PhD Student
Department of Experimental Psychology
Faculty of Psychology, University of Maastricht, The Netherlands

Contact:
P.O. Box 616
6200 MD Maastricht, The Netherlands
Phone: +31 43 388 4508
Msn: [hidden email]
Hi list
It looks like I made a mistake (the one I don't want people to make) and sent this message to Gjalt-Jorn privately instead of to the whole list (I hit "Reply" instead of "Reply to all"), contrary to what I intended (and almost always do). Following his suggestion, I'm editing it a bit and sending it to the list, just in case someone finds it interesting "food for thought".

Thursday, August 17, 2006, 12:55:15 PM, I wrote:

MGG> Hi Gjalt-Jorn
MGG> All your arguments below are in favour of my theory: you are not
MGG> running stepwise regression (in the classical sense), but a very
MGG> carefully planned model development, attending to characteristics of
MGG> the variables involved. The fact that it's exploratory doesn't modify
MGG> this. To my knowledge, the degree of control and intelligence needed
MGG> for this task can't be supplied by any program (SPSS, BMDP, SAS...),
MGG> but by the user only. This means that you can't hope to find a way of
MGG> running the task automatically. At every step, you'll have to examine
MGG> the results and make decisions that no program can make for you.
MGG> Anyway, since you have "only" 25 variables, there could be a
MGG> second approach: get adjusted R-square and other interesting
MGG> statistics (residual SD, information criteria...) for all the
MGG> possible models (yes, I know this means 2**25-1 = 33,554,431
MGG> models!, but I already have code for that, although in some test
MGG> runs it took almost one hour to run), save that info to a new
MGG> file and then use your criteria to filter, sort, list,
MGG> eliminate... models until you find those (a very small number,
MGG> hopefully) that are interesting enough to evaluate fully
MGG> (using SPSS's REGRESSION procedure).
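The all-possible-subsets count in the message, and the adjusted R-square criterion used to rank models of different sizes, are easy to verify with a small sketch (not from the original thread; the macros themselves are not public, so this only illustrates the bookkeeping):

```python
from itertools import combinations

def adj_r2(r2, n, p):
    """Adjusted R-square: 1 - (1 - R^2)*(n - 1)/(n - p - 1); the
    size-penalized criterion suggested for comparing candidate models."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

predictors = ["x%d" % i for i in range(1, 26)]  # 25 candidate variables
n_models = 2 ** len(predictors) - 1             # every non-empty subset
print(n_models)                                 # 33554431, as in the message

# In miniature (4 predictors), the subsets such a run would have to fit:
small = predictors[:4]
subsets = [c for k in range(1, len(small) + 1)
           for c in combinations(small, k)]
print(len(subsets))                             # 2**4 - 1 = 15 models

# A fuller model is only preferred if adjusted R^2 rises; e.g. a raw
# R^2 of .50 with 10 predictors and n = 100 is penalized noticeably:
print(round(adj_r2(0.50, 100, 10), 3))          # -> 0.444
```

The one-hour runtime mentioned above is plausible given the count: 33.5 million regressions is a lot of work for any program, which is why filtering the saved statistics afterwards, rather than inspecting models one by one, is the practical route.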
MGG> I can send you the macros privately (since I have to publish them
MGG> first, on the University's orders, and I don't want them to run free
MGG> on the Web, because I already had a nasty experience a year ago
MGG> or so, when someone thought he could improve his curriculum by
MGG> plagiarizing every piece of interesting SPSS and GAUSS code
MGG> that he found searching the web, and a great part of
MGG> Raynald's web page was looted).

Gjalt-Jorn had written before...

PGP>> I have heard this argument many times before, but still do not wholly
PGP>> agree. Imagine the following situation (completely coincidentally my own
PGP>> :-)).
PGP>> You are performing a very exploratory, applied study, measuring
PGP>> variables from several theories, and trying to find out whether, to
PGP>> what degree, and how they can explain your dependent variable
PGP>> ('criterion'). You have a set of 25 variables which could theoretically
PGP>> each be argued to predict the criterion. The variables have differing
PGP>> proximities to the criterion, and it could be argued that the variables
PGP>> partly comprise each other (e.g., X1 could be a combination of a part of
PGP>> X2, a part of X3, and for the rest consist of variables you did not
PGP>> measure). Your mission (should you choose to accept it :-)) is to build
PGP>> the model that explains the largest amount of variance in the criterion,
PGP>> but with the restriction that you would ideally not want any distal
PGP>> variables in your model (unless they explain parts of the criterion that
PGP>> cannot be explained by more proximal variables). In addition, for
PGP>> pragmatic/practical reasons, you have no interest in interactions,
PGP>> moderations or mediations. You want to know whether, and to what degree,
PGP>> your proximal variables suffice to explain the criterion.
PGP>> In the example
PGP>> I gave above (X1, X2 & X3), if X1 were distal and X2 & X3 were
PGP>> proximal, you would want to first enter X2 and X3 into the model, and
PGP>> then see whether X1 still explains sufficient variance to warrant
PGP>> inclusion in the model.
PGP>> I don't see how there's a lot of danger in using stepwise (or, unwise
PGP>> if you prefer; semantics and stones don't hurt :-P) regression. Given
PGP>> my limited experience with statistics though, I'm probably missing
PGP>> some points :-)
PGP>> Regarding the chunkwise regression: doesn't it suffice to manually
PGP>> add a chunk of variables using ENTER and inspect the R^2 change?

--
Regards,
Dr. Marta García-Granero, PhD
mailto:[hidden email]
Statistician

---
"It is unwise to use a statistical procedure whose use one does not understand. An SPSS syntax guide cannot supply this knowledge, and it is certainly no substitute for the basic understanding of statistics and statistical thinking that is essential for the wise choice of methods and the correct interpretation of their results." (Adapted from the WinPepi manual - I'm sure Joe Abrahmson will not mind)