Trying to Do Principal Components Analysis With Lots of Pairwise Missing Data


Zachary Feinstein
I have a situation where there are a total of 150 attributes.  Each respondent to my survey randomly answers about 1/3 of the questions, so they each get about 50 questions.
 
I wish to do a Factor Analysis/PCA on the data but clearly that much missing data is a problem.  I can get a correlation matrix and try to run the PCA off of that.  But the PCA will not work because the matrix is not "positive definite."  I figured changing all of the counts to something constant in my correlation "matrix" data file (the one with the ROWTYPE_ and other such variables) would trick SPSS into not seeing all of the pairwise missing data but I still get the same error message.
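For what it's worth, here is a minimal Python sketch (not SPSS syntax) of why a correlation matrix built with pairwise deletion can fail to be positive definite. The toy data and the function names are made up for illustration: three items, with each respondent answering only two of them. Each pairwise correlation is valid on its own, but the three together could not have come from any complete dataset.

```python
from math import sqrt

def pairwise_corr(rows, i, j):
    """Pearson r using only the rows where both item i and item j were answered."""
    pairs = [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)

def quadratic_form(matrix, x):
    """Compute x A x' for a square matrix A and a vector x."""
    n = len(x)
    return sum(x[i] * matrix[i][j] * x[j] for i in range(n) for j in range(n))

# Hypothetical toy survey: None marks a question the respondent never saw.
rows = [
    (1, 1, None), (2, 2, None), (3, 3, None),   # answered items 0 and 1
    (1, None, 1), (2, None, 2), (3, None, 3),   # answered items 0 and 2
    (None, 1, 3), (None, 2, 2), (None, 3, 1),   # answered items 1 and 2
]

k = 3
corr = [[1.0 if i == j else pairwise_corr(rows, i, j) for j in range(k)]
        for i in range(k)]

# Items 0-1 and 0-2 correlate at +1, yet items 1-2 correlate at -1.
# The vector below gives a negative quadratic form, so corr is not positive definite.
witness = quadratic_form(corr, [1, -1, -1])
print(corr)
print(witness)
```

Changing the N values stored in the matrix file does not change these correlations, which would explain why the error message stays the same.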
 
So yes I am trying to trick SPSS into not viewing the plethora of missing data.  Below are some ideas.  I would love any and all feedback on my ideas as well as some other ideas:
 
1.    Mean-substitute the data heavily.  This means 2/3 of the values will be mean-substituted.  I figure I could substitute using the variable mean and the person's own average too.
2.    Somehow add random noise to either the raw data or the correlation matrix.  Not entirely sure what this would accomplish besides getting rid of some linear dependencies.
3.    Seek out the linear dependencies and maybe drop a few variables (or randomly adjust them).  I have a MANOVA command I did this with but I think that MANOVA does not want missing data.  Correct me if I am wrong.
4.    Bootstrapping.  But this will take a long time to bootstrap missing raw data.
5.    Hot-Deck Imputation.  Have heard a bit about this but do not know much about it.
6.    Missing-Value module in SPSS.
7.    Amelia module that I used many years ago.  I did not like the missing-value imputation that it did.
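
Since hot-deck imputation is on the list above, a rough Python sketch of one common variant (nearest-neighbour hot deck) may help show the idea. Everything here is an assumption for illustration: the toy ratings, the distance measure, and the function names. Real hot-deck procedures differ in how donors are chosen, and with only about 1/3 of the items answered per person the overlap between respondent and donor may be thin.

```python
def distance(a, b):
    """Mean absolute difference over items both respondents answered (None if no overlap)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    return sum(abs(x - y) for x, y in shared) / len(shared)

def hot_deck(rows):
    """Fill each missing cell with the value from the most similar respondent who answered it."""
    filled = [list(r) for r in rows]
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if value is not None:
                continue
            donors = []
            for d, other in enumerate(rows):
                if d == r or other[c] is None:
                    continue
                dist = distance(row, other)   # similarity judged on original answers only
                if dist is not None:
                    donors.append((dist, d))
            if donors:
                _, best = min(donors)
                filled[r][c] = rows[best][c]
    return filled

# Hypothetical ratings on four items; None means the item was not asked.
rows = [
    [5, 4, None, 2],
    [5, 4, 1, None],
    [1, None, 5, 4],
    [None, 2, 5, 4],
]
completed = hot_deck(rows)
print(completed)
```

Cells with no eligible donor are left as None rather than guessed.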
 
Yes, I recognize that we are replacing structurally missing data almost as if it were randomly missing.  But surely there must be a way.  I know that it is not kosher to run PCA with so much missing data but I need to figure something out.  I am very interested in your feedback.  Thank you.
 
Zachary
(651) 698-2184
 
 


Re: Trying to Do Principal Components Analysis With Lots of Pairwise Missing Data

Hector Maletta

Zachary,

The fact that your correlation matrix is not positive definite is a completely different problem from the numerous missing values your dataset contains. (Of course, if you had no missing values at all, the correlation matrix might have different values in its cells, and then it might be positive definite, but that’s just hypothetical: it may well happen that even with no missing data the matrix still fails to be positive definite.) A square symmetric n×n matrix A is positive definite if xAx’ > 0 for every nonzero x, where x is a row vector of n real numbers and x’ is its (column vector) transpose. This is normally the case for correlation matrices, but it may fail in particular cases of singularity or collinearity or some other quirk. It might conceivably arise even in the absence of missing values.
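
The definition can be checked mechanically. One standard test: a symmetric matrix is positive definite exactly when its Cholesky factorization succeeds with strictly positive pivots. Below is a minimal pure-Python sketch; the name is_positive_definite, the tolerance, and the example matrices are illustrative choices, not anything SPSS exposes.

```python
from math import sqrt

def is_positive_definite(a, tol=1e-12):
    """Attempt a Cholesky factorization; fail as soon as a pivot is not clearly positive."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = a[i][i] - s
                if pivot <= tol:
                    return False          # zero or negative pivot: not positive definite
                lower[i][j] = sqrt(pivot)
            else:
                lower[i][j] = (a[i][j] - s) / lower[j][j]
    return True

ordinary = [[1.0, 0.5], [0.5, 1.0]]        # a well-behaved correlation matrix
collinear = [[1.0, 1.0], [1.0, 1.0]]       # two perfectly correlated variables (singular)
inconsistent = [[1.0, 1.0, 1.0],           # pairwise values no complete dataset could give
                [1.0, 1.0, -1.0],
                [1.0, -1.0, 1.0]]

print(is_positive_definite(ordinary))      # True
print(is_positive_definite(collinear))     # False
print(is_positive_definite(inconsistent))  # False
```

The second matrix fails with complete data (pure collinearity), which is the case described above.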

 

Now, leaving this problem aside, the large amount of missing data in your dataset would almost certainly preclude any useful attempt to perform PCA, unless you are ready to adopt heroic assumptions and to engage in no less audacious procedures, some of which you suggest.

 

Replacing missing data with the grand mean is not advisable. There are much better ways, such as those included in the SPSS Missing Values module, e.g. assigning values for a missing variable based on a regression that predicts that variable from other related variables. Your problem is that in most cases for which variable X is missing, the attempt to predict X as a function of other variables U, V, W, …, Z may fail because one or more of those predictors would probably also be missing. You may have to hunt around for the best set of non-missing predictors to predict each particular missing value, but this may lead to inconsistencies: you would use some predictors to predict the missing AGE of John, and another set of predictors for the AGE of Mary, depending on which predictors are missing for John and which for Mary (and for each of your other subjects in the sample). Your message does not say how many cases are in your data set, but this may be a long endeavor involving thousands of individual missing cells, each to be predicted by a different equation. I do not know whether any Hot Deck software can automate this process, but I do have doubts about its reliability if it exists.
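
To make the John and Mary point concrete, here is a small Python sketch. It is a deliberately simplified stand-in (one predictor per equation, taking the first usable one), not what the SPSS Missing Values module does; the data, column layout, and function names are hypothetical.

```python
def fit_line(pairs):
    """Ordinary least squares for y = intercept + slope * x; None if x has no variance."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        return None
    slope = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / sxx
    return slope, mean_y - slope * mean_x

def impute_cell(rows, person, target, min_pairs=3):
    """Predict rows[person][target] from the first predictor that person actually answered."""
    for predictor, value in enumerate(rows[person]):
        if predictor == target or value is None:
            continue
        pairs = [(r[predictor], r[target]) for r in rows
                 if r[predictor] is not None and r[target] is not None]
        if len(pairs) < min_pairs:
            continue                      # too few complete pairs to fit anything
        line = fit_line(pairs)
        if line is None:
            continue
        slope, intercept = line
        return predictor, intercept + slope * value
    return None                           # no usable predictor for this person

# Hypothetical columns: 0 = X (to be imputed), 1 = U, 2 = V.
rows = [
    (2, 1, None), (4, 2, None), (6, 3, None),   # answered X and U
    (3, None, 1), (5, None, 2), (7, None, 3),   # answered X and V
    (None, 4, None),                            # "John": only U available
    (None, None, 4),                            # "Mary": only V available
]
john = impute_cell(rows, 6, 0)
mary = impute_cell(rows, 7, 0)
print(john, mary)
```

John's X is predicted from U and Mary's from V, so the two imputed values come from different equations fitted on different subsets of respondents.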

 

If “Each respondent … randomly answers about 1/3 of the questions”, then probably the questions are to some extent interchangeable. John answered some questions, Mary answered others, but if the survey design let them “randomly answer about 1/3 of the questions”, it looks as if the questions answered by each person are more or less interchangeable, any with any, or some with some. Are there subsets of questions, such as a set of questions about one subject matter and another subset about another? Some questions about how the product smells, some about its status meaning, some about its health properties, and so on? You may want to treat attributes belonging to the same “family” of attributes as equivalent or interchangeable, thus greatly simplifying your work: Mary answered four “smell” questions, no matter which specifically she chose, and so you have valid values for Mary in four smell variables, even if she only answered four of the 12 smell questions available.
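
One simple way to act on this idea is to score each respondent on each family by averaging whatever items in that family they answered. A sketch in Python follows; the family names, the item-to-family assignment, and the ratings are invented for illustration.

```python
def family_scores(row, families):
    """Average the answered items within each family; None if the family was never answered."""
    scores = {}
    for name, items in families.items():
        answered = [row[i] for i in items if row[i] is not None]
        scores[name] = sum(answered) / len(answered) if answered else None
    return scores

# Hypothetical layout: items 0-2 are "smell" questions, items 3-5 are "status" questions.
families = {"smell": [0, 1, 2], "status": [3, 4, 5]}

mary = [4, None, 2, None, 5, None]
john = [None, 3, None, None, None, None]

print(family_scores(mary, families))   # {'smell': 3.0, 'status': 5.0}
print(family_scores(john, families))   # {'smell': 3.0, 'status': None}
```

With far fewer family-level variables than the 150 original items, most respondents would have a value on most of them, although treating items within a family as interchangeable is itself a strong assumption.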

 

Hope any of this rambling answer helps.

 

Hector

 


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Zachary Feinstein
Sent: 10 July 2009 13:42
To: [hidden email]
Subject: Trying to Do Principal Components Analysis With Lots of Pairwise Missing Data

 
