SPSSX Discussion

Trying to Do Principal Components Analysis With Lots of Pairwise Missing Data

Larry Hotchkiss-2

Hi,

I'm commenting on the question of doing a factor analysis with 2/3 of the cases missing on each variable (see below).

First, the fact that the correlation matrix is not positive definite is not necessarily a problem from a computational standpoint. Principal components should still execute; the method only needs the eigenvalues and eigenvectors of the matrix, and those exist for any symmetric matrix. The SPSS error message is bogus on this point. Try another software package. I just ran a PCA in R on data constructed to ensure that the correlation matrix is not positive definite, and got standard output matching the structure I had built into the constructed data.
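
The kind of check described above can be sketched roughly as follows. This is not the R run from the message; it is a Python/NumPy illustration under assumed data: three variables, the third an exact sum of the first two, so that the correlation matrix is singular (positive semi-definite but not positive definite). The variable names and the sample size of 500 are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Constructed data (assumption for illustration): x3 is an exact linear
# combination of x1 and x2, so the correlation matrix is singular.
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2
X = np.column_stack([x1, x2, x3])

R = np.corrcoef(X, rowvar=False)

# PCA via eigendecomposition of the correlation matrix.
eigvals, eigvecs = np.linalg.eigh(R)          # eigh returns ascending order
order = np.argsort(eigvals)[::-1]             # re-sort, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings are eigenvectors scaled by sqrt(eigenvalue); clip at zero so that
# rounding error around the zero eigenvalue cannot produce a NaN.
loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))

print("eigenvalues:", np.round(eigvals, 6))
print("proportion of variance:", np.round(eigvals / eigvals.sum(), 4))
```

The smallest eigenvalue comes out at zero up to rounding, and the decomposition still runs and reproduces the matrix from the loadings, which is the sense in which a singular matrix is not a computational obstacle.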

Second, a correlation matrix on complete data is, however, guaranteed to be positive semi-definite. This means that the minimum eigenvalue cannot be negative. If your patched-together correlation matrix contains negative eigenvalues that can't be attributed to rounding error, then it is a problem: the matrix is no longer one that any complete data set could have produced.
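
One way to look for negative eigenvalues is sketched below in Python/NumPy. The helper name `min_eigenvalue_report`, the tolerance of 1e-8 used to separate rounding error from a real negative eigenvalue, and the 3x3 example matrix are all assumptions made for illustration; the example matrix is hypothetical and chosen only because its pairwise correlations are mutually inconsistent.

```python
import numpy as np

def min_eigenvalue_report(R, tol=1e-8):
    """Return (smallest eigenvalue, True if it is negative beyond tol).

    tol is an assumed cutoff for what counts as rounding error.
    """
    R = np.asarray(R, dtype=float)
    # Symmetrize first so tiny asymmetries from rounding do not matter.
    eigvals = np.linalg.eigvalsh((R + R.T) / 2.0)   # ascending order
    smallest = float(eigvals[0])
    return smallest, smallest < -tol

# Hypothetical "correlation" matrix of the kind pairwise deletion can yield:
# every entry is a legal correlation, but the three are jointly impossible.
R_bad = np.array([[ 1.0, 0.9, -0.9],
                  [ 0.9, 1.0,  0.9],
                  [-0.9, 0.9,  1.0]])

smallest, is_problem = min_eigenvalue_report(R_bad)
print("smallest eigenvalue:", round(smallest, 6), "| problem:", is_problem)
```

For this example the smallest eigenvalue is about -0.8, far beyond anything rounding error could explain, so the check flags it.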

Third, since each respondent answered a random third of the items, approximately 11% of the sample (1/3 of 1/3, or 1/9) should have data for each pair of items, implying that you have a sample estimate of each correlation. If the sample size is very large, you might be able to get usable results from the correlation matrix. Probably you need in the neighborhood of 1000 or more observations for each correlation, meaning maybe 9000 in total. The fact that your correlation matrix is not positive definite suggests that the sample size may not be large enough. Certainly, as N goes to infinity, the correlation matrix constructed from this type of data converges on the population correlation matrix, which very likely is positive definite and, in any case, cannot contain negative eigenvalues.
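
The pairwise-coverage arithmetic can be checked with a small simulation. The Python/NumPy sketch below assumes each respondent answers exactly 50 of the 150 items, chosen uniformly at random (the original question says only "about 1/3"), and uses the 9000 respondents mentioned above; only the pattern of who answered what is simulated, not the answers themselves.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed design: exactly 50 of 150 items per respondent, 9000 respondents.
n_items, n_answered, n_resp = 150, 50, 9000

# answered[i, j] is True when respondent i was asked item j.
answered = np.zeros((n_resp, n_items), dtype=bool)
for i in range(n_resp):
    answered[i, rng.choice(n_items, size=n_answered, replace=False)] = True

# Pairwise counts: entry (j, k) is the number of respondents with both items.
A = answered.astype(np.int64)
pair_n = A.T @ A
off_diag = pair_n[~np.eye(n_items, dtype=bool)]

# Chance that one respondent has both items of a given pair.
expected_frac = (n_answered / n_items) * ((n_answered - 1) / (n_items - 1))

print("expected share per pair:", round(expected_frac, 4))
print("observed mean share:   ", round(off_diag.mean() / n_resp, 4))
print("smallest pairwise n:   ", off_diag.min())
```

Under these assumptions the share per pair is just under 11%, so 9000 respondents give roughly 1000 cases per correlation on average, with some pairs falling noticeably below that.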

Fourth, standard inferential statistics with data like these are not correct. You don't get inferential output from a PCA, but you do with maximum likelihood factor analysis and from most other analyses you might do. Check the documentation for the data to see what it says about statistical inference.

Larry Hotchkiss
------------------------------------------------------------------------------------

Date: Fri, 10 Jul 2009 09:41:50 -0700
From: Zachary Feinstein <[hidden email]>
Subject: Trying to Do Principal Components Analysis With Lots of Pairwise Missing Data

I have a situation where there are a total of 150 attributes. Each respondent to my survey randomly answers about 1/3 of the questions, so they each get about 50 questions.

I wish to do a Factor Analysis/PCA on the data but clearly that much missing data is a problem. I can get a correlation matrix and try to run the PCA off of that. But the PCA will not work because the matrix is not "positive definite." I figured changing all of the counts to something constant in my correlation "matrix" data file (the one with the ROWTYPE_ and other such variables) would trick SPSS into not seeing all of the pairwise missing data but I still get the same error message.

So yes I am trying to trick SPSS into not viewing the plethora of missing data. Below are some ideas. I would love any and all feedback on my ideas as well as some other ideas:

1. Mean-sub the data like crazy. This means 2/3 of the data will be based on mean-subbed data. I figure mean sub-by the variable and by the person average too.
2. Somehow add random noise to either the raw data or the correlation matrix. Not entirely sure what this would accomplish besides getting rid of some linear dependencies.
3. Seek out the linear dependencies and maybe drop a few variables (or randomly adjust them). I have a MANOVA command I did this with but I think that MANOVA does not want missing data. Correct me if I am wrong.
4. Bootstrapping. But this will take a long time to bootstrap missing raw data.
5. Hot-Deck Imputation. Have heard a bit about this but do not know much about it.
6. Missing-Value module in SPSS.
7. Amelia module that I used many years ago. I did not like the missing-value imputation that it did.

Yes, I recognize that we are replacing structurally missing data almost as if it is randomly missing. But surely there must be a way. I know that it is not Kosher to run PCA with so much missing data but I need to figure something out. I am very interested in your feedback. Thank you.

Zachary
