http://spssx-discussion.165.s1.nabble.com/R-and-R-square-over-90-for-regression-with-imputed-dataset-tp5717033p5717060.html
Hi Rich,
(Note that I am not the original poster)
I do not follow the distinction you are making in your second paragraph - are you referring to specific methods of imputation, or to different flavours of multiple imputation?
Now that I am in front of my references, let me quote a passage from one of the classic references: JL Schafer & JW Graham (2002) Missing Data: Our View of the State of the Art. Psychological Methods 7(2): 147-177. On page 167, under 'Choosing the Imputation Model':
====
Notice that the MI procedure described above is based on a joint normality assumption for Y1, Y2, and Y3. This model makes no distinctions between response (dependent) or predictor (independent) variables but treats all three as a multivariate response. The
imputation model is not intended to provide a parsimonious description of the data, nor does it represent structural or causal relationships among variables. The model is merely a device to preserve important features of the joint distribution (means, variances,
and correlations) in the imputed values. A procedure that preserves the joint distribution of Y1, Y2, and Y3 will automatically preserve the linear regression of any of these variables on the others. Therefore, in a subsequent analysis of the imputed data,
any variable could be treated as a response or as a predictor. For example, we may regress Y2 on Y1 in each imputed data set and combine the estimated intercepts and slopes by Rubin's (1987) rules. Distinctions between dependent and independent variables and
substantive interpretation of relationships should be left to postimputation analyses.
====
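To make the reference to Rubin's (1987) rules concrete, here is a minimal numerical sketch of the pooling step - the per-imputation estimates and standard errors below are made-up numbers, purely for illustration:

```python
import numpy as np

# Hypothetical slope estimates and squared standard errors from m = 5
# imputed data sets (illustrative numbers only, not from the thread).
estimates = np.array([0.52, 0.48, 0.55, 0.50, 0.49])
variances = np.array([0.010, 0.012, 0.009, 0.011, 0.010])  # SE^2 per data set

m = len(estimates)

# Rubin's (1987) pooling rules:
pooled_estimate = estimates.mean()                  # average of the m estimates
within_var = variances.mean()                       # W: average within-imputation variance
between_var = estimates.var(ddof=1)                 # B: between-imputation variance
total_var = within_var + (1 + 1 / m) * between_var  # T = W + (1 + 1/m) * B
pooled_se = np.sqrt(total_var)
```

The total variance T is always at least the average within-imputation variance W; the between-imputation component B is what reflects the extra uncertainty due to the missing data.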
As I have had it explained to me before, the aim of MI is to preserve the covariances between variables. Hence it doesn't care about the dependent/independent distinction, and if the dependent variable has a relationship with independent variables that have missing data, then it should be included in the imputation model.
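As a small illustration of what 'including the outcome in the imputation model' can look like in practice, here is a sketch using scikit-learn's IterativeImputer on simulated data. The variable names and data-generating setup are my own invention, and note that this produces a single imputation - proper MI would draw several imputed data sets (e.g. with sample_posterior=True) and pool the results by Rubin's rules:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Simulated data: two predictors and an outcome, with ~30% of x1
# missing completely at random (setup is illustrative only).
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.5 * x1 + 0.3 * x2 + rng.normal(scale=0.5, size=n)
x1_obs = x1.copy()
x1_obs[rng.random(n) < 0.3] = np.nan

# Impute x1 using x2 AND the outcome y: the imputer treats all
# columns jointly, mirroring the multivariate view quoted above.
data = np.column_stack([x1_obs, x2, y])
imputed = IterativeImputer(random_state=0).fit_transform(data)
```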
Having said that, Rich's misconception is certainly a common one - particularly amongst the medical clinicians I work with. Even the most stats-savvy often see the process of MI as somewhat 'magical', and feel that using an outcome variable in an imputation model is somehow more 'iffy' than using other predictors. However, as far as I am aware this is misguided, so while not directly relevant to the initial post I wanted to clarify the point for the benefit of the archives.
There are many out there who know more than me on this though so happy to continue the discussion if I've overstated anything.
Thanks,
Kylie.
From: Rich Ulrich [[hidden email]]
Sent: Wednesday, 19 December 2012 5:55 PM
To: Kylie Lange; SPSS list
Subject: RE: R and R square over .90? for regression with imputed dataset
Okay. Not only am I no expert in imputation, but I managed to avoid it forever.
My data seldom had much missing (and never the outcome), and I got by with
relabeling (like "Yes/not-yes" instead of Yes/No) or adding a Missing category.
What I wrote is the obvious starting point -- you can't use the information in
the outcome to determine what you set a predictor to. You *might* use some
probabilistic approach that avoids creating a relationship, if you have such
a large amount of missing data to account for that you have to do this to salvage
an analysis.
Your results imply that you did the former - incorporating the information - and
not the latter.
Frank Harrell is reliable. I googled for him on the subject and came up with this
comment by someone else --
http://lists.utsouthwestern.edu/pipermail/impute/2001-February/000104.html
- which FH agrees with, in the next post in the thread.
I also noticed a comment worrying about Missing at Random, but your change
in results seems too drastic for that to matter.
--
Rich Ulrich
Date: Wed, 19 Dec 2012 06:37:58 +0000
From: [hidden email]
Subject: Re: R and R square over .90? for regression with imputed dataset
To: [hidden email]
Hi Rich,
Could you please expand on your comment? I have always been under the impression that one should include the outcome in an imputation model to ensure that
all relevant relationships are accounted for, and that excluding the outcome could/would introduce bias (see, for example [1]). Are you aware of scenarios in which that isn’t the case?
I know that there is debate about whether to include cases that have had the outcome imputed in the final analysis (multiple imputation then deletion,
[2]), but that appears to be a separate issue to what you describe.
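For the archives, the 'multiple imputation then deletion' idea in [2] can be sketched roughly as follows - the tiny data array and the crude mean-fill step are placeholders for illustration only, not von Hippel's actual procedure:

```python
import numpy as np

# "Multiple imputation, then deletion": impute using all cases
# (outcome included), but drop cases whose outcome was itself imputed
# before fitting the analysis model.
# `data` is a hypothetical (n, p) array whose last column is the outcome.
data = np.array([
    [1.0,    2.0,    0.9],
    [np.nan, 1.5,    1.1],
    [0.8,    np.nan, np.nan],  # outcome missing here
    [1.2,    2.2,    1.3],
])
y_was_missing = np.isnan(data[:, -1])

# Stand-in for a real imputation step: column-mean fill (illustration only).
imputed = np.where(np.isnan(data), np.nanmean(data, axis=0), data)

# Deletion step: keep only cases whose outcome was actually observed.
analysis_rows = imputed[~y_was_missing]
```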
Thanks,
Kylie.
[1] Moons KGM, Donders RART, Stijnen T, Harrell FE. (2006) Using the outcome for imputation of missing predictor values was preferred. Journal of Clinical Epidemiology 59: 1092-1101.
[2] von Hippel PT. (2007) Regression with missing Ys: An improved strategy for analysing multiply imputed data. Sociological Methodology 37(1): 83-117.
From: SPSSX(r) Discussion [mailto:[hidden email]]
On Behalf Of Rich Ulrich
Sent: Wednesday, 19 December 2012 4:38 PM
To: [hidden email]
Subject: Re: R and R square over .90? for regression with imputed dataset
It is a no-no to use your criterion variable when imputing values for
your predictors. Probably because that could account for this sort
of result.
--
Rich Ulrich
> Date: Tue, 18 Dec 2012 19:51:51 -0800
> From: [hidden email]
> Subject: R and R square over .90? for regression with imputed dataset
> To: [hidden email]
>
> Hello,
>
> I was running regressions with the imputed variables, and I got R and R2
> over .90. Is this even possible? The highest R2 I had with the original dataset
> was over .50.
>
> I used SPSS Multiple Imputation and Missing Value Analysis functions by
> following all steps suggested by the IBM SPSS guide. The original dataset
> had a significant amount of missing values (about 10-40%). Some variables were
> imputed at the scale level as well as at the individual-item level. I
> spoke to my advisor and she, too, is skeptical about this result.
>
> Any suggestions would be appreciated. Thanks much.
>
> ...