I have a situation where there are a total of 150 attributes. Each respondent to my survey randomly answers about 1/3 of the questions, so they each get about 50 questions.
I wish to do a Factor Analysis/PCA on the data but clearly that much missing data is a problem. I can get a correlation matrix and try to run the PCA off of that. But the PCA will not work because the matrix is not "positive definite." I figured changing all of the counts to something constant in my correlation "matrix" data file (the one with the ROWTYPE_ and other such variables) would trick SPSS into not seeing all of the pairwise missing data but I still get the same error message.
So yes I am trying to trick SPSS into not viewing the plethora of missing data. Below are some ideas. I would love any and all feedback on my ideas as well as some other ideas:
1. Mean-sub the data like crazy. This means 2/3 of the data will be mean-subbed values. I figure I would mean-sub by the variable mean and by the person's average too.
2. Somehow add random noise to either the raw data or the correlation matrix. Not entirely sure what this would accomplish besides getting rid of some linear dependencies.
3. Seek out the linear dependencies and maybe drop a few variables (or randomly adjust them). I have a MANOVA command I did this with but I think that MANOVA does not want missing data. Correct me if I am wrong.
4. Bootstrapping. But this will take a long time to bootstrap missing raw data.
5. Hot-Deck Imputation. Have heard a bit about this but do not know much about it.
6. Missing-Value module in SPSS.
7. Amelia module that I used many years ago. I did not like the missing-value imputation that it did.
Yes, I recognize that we are replacing structurally missing data almost as if it were randomly missing. But surely there must be a way. I know that it is not kosher to run PCA with so much missing data, but I need to figure something out. I am very interested in your feedback. Thank you.
Zachary
(651) 698-2184
Zachary,
The fact that your correlation matrix is not positive definite is a completely different problem from the numerous missing values your dataset contains. (Of course, if you had no missing values at all, the correlation matrix might have different values in its cells, and then it might be positive definite, but that is just hypothetical: it may well happen that even with no missing data the matrix still fails to be positive definite.)
A square symmetric n×n matrix A is positive definite if xAx’ > 0 for every nonzero x, where x is a row vector of n real numbers and x’ is its (column vector) transpose. This is normally the case for correlation matrices, but it may fail in particular cases of singularity, collinearity or some other quirk. It might conceivably arise even in the absence of missing values.
Now, leaving this problem aside, the large amount of missing data in your dataset would almost certainly preclude any useful attempt to perform PCA, unless you are ready to adopt heroic assumptions and to engage in no less audacious procedures, some of which you suggest.
Replacing missing data with the grand mean is not advisable. There are much better ways, such as those included in the SPSS Missing Values module, e.g. assigning values for a missing variable based on a regression that predicts that variable from other related variables. Your problem is that in most cases for which variable X is missing, the attempt to predict X as a function of other variables U, V, W, …, Z may fail because one or more of those predictors would probably also be missing. You may have to hunt around for the best set of non-missing predictors to predict each particular missing value, but this may lead to inconsistencies: you would use some predictors to predict the missing AGE of John, and another set of predictors for the AGE of Mary, depending on which predictors are missing for John and which for Mary (and for each of your other subjects in the sample).
Your message does not say how many cases are in your data set, but this may be a long endeavor involving thousands of individual missing cells, each to be predicted by a different equation. I do not know whether any Hot Deck software can automate this process, but if it exists I have doubts about its reliability.
If “Each respondent … randomly answers about 1/3 of the questions”, then probably the questions are to some extent interchangeable. John answered some questions, Mary answered others, but if the survey design let them “randomly answer about 1/3 of the questions”, it would seem that the questions answered by each person are more or less interchangeable, any with any, or some with some. Are there subsets of questions, such as a set of questions about one subject matter and another subset about another? Some questions about how the product smells, some about its status meaning, some about its health properties, and so on?
You may want to treat attributes belonging to the same “family” of attributes as equivalent or interchangeable, thus greatly simplifying your work: Mary answered four “smell” questions, no matter which specifically she chose, and so you have valid values for Mary in four smell variables, even if she only answered four of the 12 smell questions available.
Hope any of this rambling answer helps.
Hector
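To make the "family" suggestion and the positive-definiteness condition a bit more concrete, here is a rough sketch in plain Python (not SPSS syntax). Everything specific in it is an assumption made up for illustration: the three family names, 12 items per family, four answered items per family, and the simulated answers. It averages whatever items each respondent answered within a family, so every respondent ends up with a complete set of family scores, and then uses a Cholesky factorization as one way to test whether a correlation matrix is positive definite in the xAx' > 0 sense described above.

```python
import math
import random

# Hypothetical layout: 36 items in 3 families of 12 (names and sizes are made up).
FAMILIES = {
    "smell": list(range(0, 12)),
    "status": list(range(12, 24)),
    "health": list(range(24, 36)),
}

def simulate_respondent(rng, families, answered_per_family=4):
    """Fake respondent: answers a random subset of each family, None elsewhere."""
    row = [None] * sum(len(items) for items in families.values())
    for items in families.values():
        latent = rng.gauss(0, 1)  # assumed common trait behind the family
        for i in rng.sample(items, answered_per_family):
            row[i] = latent + rng.gauss(0, 0.5)
    return row

def family_scores(row, families):
    """Average the answered (non-None) items within each family."""
    scores = []
    for items in families.values():
        answered = [row[i] for i in items if row[i] is not None]
        scores.append(sum(answered) / len(answered) if answered else None)
    return scores

def correlation_matrix(rows):
    """Pearson correlation matrix for complete data (no None values)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) for b in range(p)] for a in range(p)]
    return [[cov[a][b] / math.sqrt(cov[a][a] * cov[b][b]) for b in range(p)]
            for a in range(p)]

def is_positive_definite(matrix, tol=1e-12):
    """A symmetric matrix is positive definite iff its Cholesky factorization succeeds."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - s
                if pivot <= tol:  # zero or negative pivot: not positive definite
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return True

rng = random.Random(0)
rows = [simulate_respondent(rng, FAMILIES) for _ in range(200)]
scores = [family_scores(r, FAMILIES) for r in rows]  # complete: no missing values
corr = correlation_matrix(scores)
print(is_positive_definite(corr))

# Three correlations that cannot all hold at once in a single complete dataset.
inconsistent = [[1.0, 0.9, -0.9], [0.9, 1.0, 0.9], [-0.9, 0.9, 1.0]]
print(is_positive_definite(inconsistent))
```

The second matrix is there as a contrast: correlations of 0.9, 0.9 and -0.9 among three variables are mutually inconsistent, which is the kind of thing that can happen when each correlation is computed from a different subset of respondents, and a factor or PCA routine would reject it. Whether averaging within families is defensible depends entirely on whether the items in a family really are interchangeable, which only the survey design can tell you.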
