Hello,
I was running regressions with the imputed variables, and I got R and R2 over .90. Is this ever possible? The highest R2 I had with the original dataset was just over .50. I used the SPSS Multiple Imputation and Missing Value Analysis functions, following all the steps suggested by the IBM SPSS guide. The original dataset had a significant amount of missing values (about 10-40%). Some variables were imputed at the scale level as well as at the individual item level. I spoke to my advisor, and she, too, is skeptical about this result. Any suggestions would be appreciated. Thanks much. |
It is a no-no to use your criterion variable for imputing values for your predictors. That could account for this sort of result.

-- Rich Ulrich |
Hi Rich,

Could you please expand on your comment? I have always been under the impression that one should include the outcome in an imputation model to ensure that all relevant relationships are accounted for, and that excluding the outcome could or would introduce bias (see, for example, [1]). Are you aware of scenarios in which that isn't the case? I know that there is debate about whether to include cases that have had the outcome imputed in the final analysis (multiple imputation then deletion, [2]), but that appears to be a separate issue from the one you describe.

Thanks, Kylie.

[1] Moons KGM, Donders RART, Stijnen T, Harrell FE. (2006) Using the outcome for imputation of missing predictor values was preferred. Journal of Clinical Epidemiology 59: 1092-1101.
[2] von Hippel PT. (2007) Regression with missing Ys: An improved strategy for analysing multiply imputed data. Sociological Methodology 37(1): 83-117. |
Okay. Not only am I no expert in imputation, but I managed to avoid it forever. My data seldom had much missing (and never the outcome), and I got by with relabeling (like "Yes/not-yes" instead of Yes/No) or adding a Missing category.

What I wrote is the obvious starting point -- you can't use the information in the outcome to determine what you set a predictor to. You *might* use some probabilistic approach that avoids creating a relationship, if you have such a large amount of missing data to account for that you have to do this to salvage an analysis. Your results imply that you did the former (incorporating information) and not the latter.

Frank Harrell is reliable. I googled for him on the subject and came up with this comment by someone else -- http://lists.utsouthwestern.edu/pipermail/impute/2001-February/000104.html -- which FH agrees with in the next post in the thread. I also noticed a comment worrying about Missing at Random, but your change in results seems too drastic for that to matter.

-- Rich Ulrich |
In reply to this post by tonishi@iupui.edu
I agree with the two replies Rich sent.

In addition: Why are the data missing? How did you gather the data? You say that data are missing at the scale level and at the item level. Did you create the scales yourself? To what degree are your missing data attributable to a few variables? Are they items in a scale you put together?
Art Kendall
Social Research Consultants |
In reply to this post by Rich Ulrich
Hi Rich and Art,

Thanks much for your responses. I actually found that there were two different datasets produced by one MI attempt. One lists all variables generated by all imputations: the original data plus imputations 1-5. Since I have about 150 cases, this dataset has 150 x 5 imputed cases plus the 150 original cases. The other dataset has only 150 cases. I thought I should use the second dataset, and that is where I got the R over .90. But I went through some archived messages in this list and found it wasn't the correct one to use for regression. I then used the variables imputed by the 5th imputation and got an R of about .5, which seems right.

But I would like to hear more of your thoughts on missing values (or avoiding them) for my future survey studies. Rich, it is great to hear you managed to limit missing values in your datasets. My study needs to use several common scales from my field (management), and my target organizations are notorious for not participating in surveys (venture capital funds, foundations, etc.). Almost all prior studies' response rates were about 20%. I got a response rate of over 50%, but am still struggling with many unanswered questions. I used paper and online versions of the questionnaire, following Dillman's recommendations, so I couldn't "force" respondents to answer all questions. Is there any way that I can still limit missing values?

Art, I think the reason participants didn't answer some questions is that they considered the answers to be "no." For instance, for the question about whether they participate in certain network organizations, some didn't circle the number. My assumption, based on the organizations' profiles, is that they are not affiliated with such network organizations. But there is no way to prove that all participants had the same intention.

I used scales that are common in my field. The reason I needed to use the scales rather than individual items is that SPSS didn't allow me to run MI on the individual items because there are too many of them (even after I changed the level of measurement to scale). Also, those scales have good alphas; I kept individual items only for scales whose alphas were below .70. Those decisions were based on suggestions in the prior literature, such as Rubin, Little, Allison, Graham, and some other empirical studies using imputed values. Should I do anything else? |
In reply to this post by Rich Ulrich
Hi Rich,
(Note that I am not the original poster.) I do not follow the distinction you are making in your second paragraph. Are you referring to specific methods of imputation, or to different flavours of multiple imputation?

Now that I am in front of my references, let me quote a passage from one of the classic references, JL Schafer & JW Graham (2002) Missing Data: Our View of the State of the Art. Psychological Methods 7(2): 147-177. On page 167, under 'Choosing the Imputation Model':

====
Notice that the MI procedure described above is based on a joint normality assumption for Y1, Y2, and Y3. This model makes no distinctions between response (dependent) or predictor (independent) variables but treats all three as a multivariate response. The imputation model is not intended to provide a parsimonious description of the data, nor does it represent structural or causal relationships among variables. The model is merely a device to preserve important features of the joint distribution (means, variances, and correlations) in the imputed values. A procedure that preserves the joint distribution of Y1, Y2, and Y3 will automatically preserve the linear regression of any of these variables on the others. Therefore, in a subsequent analysis of the imputed data, any variable could be treated as a response or as a predictor. For example, we may regress Y2 on Y1 in each imputed data set and combine the estimated intercepts and slopes by Rubin's (1987) rules. Distinctions between dependent and independent variables and substantive interpretation of relationships should be left to postimputation analyses.
====

As I have had it explained to me before, the aim of MI is to reclaim the covariances between the variables. Hence it doesn't care about dependent versus independent variables, and if a dependent variable has a relationship with independent variables that have missing data, then it should be included in the imputation model.

Having said that, Rich's misconception is certainly a common one, particularly amongst the medical clinicians I work with. Even amongst the most stats-savvy, they often see the process of MI as somewhat 'magical', and using an outcome variable in an imputation model as somehow more 'iffy' than using other predictors. As far as I am aware this is misguided, so while not directly relevant to the initial post, I wanted to clarify this point for the benefit of the archives. There are many out there who know more than me on this, though, so I am happy to continue the discussion if I've overstated anything.

Thanks, Kylie. |
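The Schafer & Graham point quoted above — that regression slopes are functions only of the means, variances, and covariances the imputation model preserves, so any variable can later serve as response or predictor — can be illustrated with a quick simulation. This is a minimal sketch in Python/NumPy (not SPSS), with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a joint distribution for two variables (bivariate normal).
n = 100_000
mean = [0.0, 0.0]
cov = [[1.0, 0.6],
       [0.6, 2.0]]
y1, y2 = rng.multivariate_normal(mean, cov, size=n).T

# The OLS slope of y2 on y1 is cov(y1, y2) / var(y1) -- a function of the
# joint moments only.  Any procedure that preserves those moments therefore
# preserves the regression, in either direction.
slope_from_moments = cov[0][1] / cov[0][0]   # 0.6 / 1.0
slope_fitted = np.polyfit(y1, y2, 1)[0]      # empirical OLS slope

print(slope_from_moments, slope_fitted)
```

With a large sample the fitted slope sits right on the moment-based value, which is the sense in which the imputation model "does not care" which variable is the outcome.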
In reply to this post by tonishi@iupui.edu
Hi,

For multiple imputation you should be running your regression on the entire imputed dataset (i.e., the one with the original data and the 5 imputed datasets in it), not just selecting out the 5th imputed dataset. With your data file split by the Imputation_ variable (I think that is what it is called; I can't check at the moment, sorry), SPSS will detect that it is a multiply imputed dataset and analyse the data appropriately. Specifically, it will run your regression on each of the 5 imputed datasets and present results for each one separately, plus a 'pooled' set of results. It is these pooled results that you want.

Hope this helps.

Cheers, Kylie. |
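The 'pooled' results described above come from Rubin's (1987) combining rules: the pooled point estimate is the average of the per-imputation estimates, and its variance combines the average within-imputation variance with the between-imputation variance. A minimal sketch of the arithmetic (in Python rather than SPSS, using hypothetical per-imputation slope estimates for illustration):

```python
import numpy as np

# Hypothetical slope estimates and squared standard errors from a
# regression run separately on each of m = 5 imputed datasets.
estimates = np.array([0.48, 0.52, 0.50, 0.47, 0.53])
variances = np.array([0.010, 0.012, 0.011, 0.010, 0.013])
m = len(estimates)

pooled_estimate = estimates.mean()               # average of the m estimates
within = variances.mean()                        # average within-imputation variance
between = estimates.var(ddof=1)                  # between-imputation variance
total_variance = within + (1 + 1 / m) * between  # Rubin's total variance

pooled_se = np.sqrt(total_variance)
print(pooled_estimate, pooled_se)
```

Note that the pooled standard error is larger than any single imputation's standard error would suggest, because the between-imputation term carries the extra uncertainty due to the missing data — which is exactly why analysing only the 5th imputed dataset understates the uncertainty.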
In reply to this post by tonishi@iupui.edu
Do you use the MEAN.n function or the SUM.n function to get your scores?

For your scales, how many items are there in each?

compute ItemsScale1 = nvalid(item1, item3, item11...).

What are you planning to do with the items on network membership?
Art Kendall
Social Research Consultants |
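Art's questions refer to the SPSS convention of computing a scale score only when enough items were answered: MEAN.n / SUM.n return a value only if at least n arguments are non-missing, and NVALID counts the non-missing items. A rough equivalent of that logic, as a sketch in Python/pandas with hypothetical item names:

```python
import numpy as np
import pandas as pd

# Hypothetical responses to a 4-item scale; NaN marks a skipped question.
items = pd.DataFrame({
    "item1": [4, 5, np.nan, np.nan],
    "item2": [3, np.nan, np.nan, 4],
    "item3": [5, 4, np.nan, np.nan],
    "item4": [4, 4, 2, np.nan],
})

n_valid = items.notna().sum(axis=1)   # like NVALID(item1 TO item4)
scale_mean = items.mean(axis=1)       # mean of whatever items were answered

# Like MEAN.3(item1 TO item4): require at least 3 valid items,
# otherwise the scale score itself is treated as missing.
scale_score = scale_mean.where(n_valid >= 3)
print(scale_score.tolist())
```

The distinction Art is probing matters here: a mean over however many items happen to be valid quietly changes what the score measures as missingness grows, whereas enforcing a minimum item count makes the trade-off explicit.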