PCA: R-Matrix Determinant =0 and "not positive Definite"

mzalikhan
Hi Everybody

I am new to stats and am running a PCA in SPSS 16.0 on some meteorological variables to identify synoptic meteorological patterns. The problem is that the correlation matrix has a determinant of 0, with a warning that the matrix is "not positive definite".

My question: is this going to affect the results of the PCA? The analysis does not stop after the warning; it carries on. If it does affect them, what can be done in this case?

Peace.
Muhammad Zeeshan

Re: PCA: R-Matrix Determinant =0 and "not positive Definite"

Art Kendall
Here are some questions to get at extreme cases in which one or more variables are perfect linear predictors of one or more other variables.

Do you have more variables than cases, either before you start or after removing cases with missing data?

Do you have a variable in the specification more than once? Something like this?
 variables = a b c a d e

Do you have sets of dummy variables that represent all categories of a nominal-level variable?

Do you have variables that add up to another variable, such as daily and weekly rainfall?

Is your correlation matrix small enough to check by eye? If so, do you have any correlations near +/- 1?

Please describe your data in more detail.
What constitutes a case (row)? How many do you have before accounting for missing data? How many have complete data?
What are your variables? How many are there? Are they standardized?

If you have a large correlation matrix, the RELIABILITY procedure produces statistics on the distribution of correlation coefficients. Pretend that your variables are items in a scale. In this context, item means variable. Look at the item-total correlations for high absolute values. Look at the alpha-if-item-deleted to see if some item(s) is/are redundant.
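
The "variables that add up to another variable" case can be illustrated with a minimal Python/NumPy sketch. The data and variable names below are made up for illustration and are not from the thread. It suggests that an exact linear dependency drives the determinant and the smallest eigenvalue of the correlation matrix to (numerically) zero, even when no single pairwise correlation is close to +/- 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases = 200

# Three made-up, independent "measurements" per case.
a = rng.normal(size=n_cases)
b = rng.normal(size=n_cases)
c = rng.normal(size=n_cases)

# A fourth variable that is an exact linear combination of the others,
# like a total built from its own components.
total = a + b + c

X = np.column_stack([a, b, c, total])
R = np.corrcoef(X, rowvar=False)           # 4 x 4 correlation matrix

det_R = np.linalg.det(R)
eigenvalues = np.linalg.eigvalsh(R)        # returned in ascending order

print("determinant:", det_R)
print("smallest eigenvalue:", eigenvalues[0])

# Largest absolute correlation off the diagonal.
off_diag = np.abs(R[~np.eye(4, dtype=bool)])
print("largest |r| off the diagonal:", off_diag.max())
```

If this sketch is representative, checking the smallest eigenvalues may catch redundancy that an eyeball scan for correlations near +/- 1 would miss.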

Art Kendall
Social Research Consultants

Re: PCA: R-Matrix Determinant =0 and "not positive Definite"

Rich Ulrich
In reply to this post by mzalikhan
Art covered the issue of "why".

The PCA will continue as a PCA and should give consistent
results.  If you have fewer cases than variables, you might
try, in addition, a reduced set of variables, in order to see
how much difference it makes -- It is usually desirable,
unless correlations are large (and structure is plain), to have
far more cases than variables.

--
Rich Ulrich


Re: PCA: R-Matrix Determinant =0 and "not positive Definite"

mzalikhan
In reply to this post by Art Kendall
Thanks Art.

Yes, I had some variables that were highly correlated (> 0.9) with one another. Excluding those variables solves the "not positive definite" issue. However, the R-matrix determinant is still very low (on the order of E-10). I have read in the literature that it should not be less than 0.00001.

Is it necessary for the significance level (1-tailed) reported for the R-matrix to be greater than 0.001?

I am doing PCA followed by a 2-step cluster analysis on met data to find synoptic met patterns. I have around 30 variables (which I may have to reduce to around 18) and 1330 observations, but the results show VALID N = 890.

What would be the effect if I replaced the missing values with means rather than excluding cases listwise?

Any idea what I should be looking for in the results to judge accuracy?
I would also be thankful if you could guide me once I have the PCA results and move on to clustering.

Peace.

Re: PCA: R-Matrix Determinant =0 and "not positive Definite"

Art Kendall
When you use factor analysis (PCA is one kind of factor analysis), you find a small number of dimensions that "account for" much of the variance in the original data. You can think of each new dimension (factor) as pulling together a number of imperfect variables into an internally consistent measure of a new construct (idea). The original variables are redundant measures of the new construct. If you use varimax rotation, you have a new set of measures that cover pretty much the same hyperspace but are independent of each other, which is what is desirable for clustering.

The size of the determinant is not extremely important as long as it is not zero. A small determinant means it will take more iterations to come up with a solution, but that should not be a problem with today's computers.
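
One way to put a determinant of about E-10 in perspective is to recall that the determinant of a correlation matrix is the product of its eigenvalues. The short Python sketch below uses the figures quoted in the thread (about 30 variables, determinant on the order of 1E-10); the 5-variable comparison is a made-up illustration, not something from the original posts.

```python
p = 30                      # number of variables mentioned in the thread
det_reported = 1e-10        # order of magnitude reported for the R-matrix

# det(R) is the product of the p eigenvalues of R, so the geometric mean
# eigenvalue is det(R) ** (1 / p).
geometric_mean_eigenvalue = det_reported ** (1.0 / p)
print(round(geometric_mean_eigenvalue, 3))

# The same geometric mean eigenvalue with only 5 variables would give a
# far larger determinant, so thresholds depend heavily on p.
det_with_5_variables = geometric_mean_eigenvalue ** 5
print(det_with_5_variables)
```

Under these assumptions, a tiny determinant with 30 correlated variables does not by itself imply that any single eigenvalue is close to zero.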

WRT the significance level, it is not important.

Why is your data missing? Is there anything meaningful about what is missing?

Does your version of SPSS have the RMV -replace missing values- procedure?

I do not understand your question about accuracy.

Members of this list will find it difficult to contribute to this conversation without a more detailed understanding of what you are trying to do and what your data are.
What constitutes a case?  What do the variables mean?

Art Kendall
Social Research Consultants

Re: PCA: R-Matrix Determinant =0 and "not positive Definite"

mzalikhan
The data is missing because it's observational data, and either "some variables in some cases" or "some cases as a whole" were not recorded.
I am using SPSS 16, and it gives three options in Factor Analysis regarding missing values:
1- Exclude cases listwise (which reduces my cases from 1343 to 810).
2- Exclude cases pairwise (gives a "not positive definite" matrix).
3- Replace with means (simply replaces the missing values with the variable means, so valid N = total N).
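
Why pairwise exclusion can produce a "not positive definite" matrix may be easier to see with a deliberately extreme toy example. The Python/NumPy sketch below uses made-up data (not the poster's) in which each pair of variables is observed on a different block of cases; `pairwise_corr` is a hypothetical helper written for this illustration, not an SPSS or NumPy function.

```python
import numpy as np

nan = np.nan
# Hypothetical data: each pair of variables is observed together on a
# different block of cases, which is what pairwise deletion exploits.
data = np.array([
    # x     y     z
    [1.0,  1.0,  nan],   # block 1: x and y observed, y = x
    [2.0,  2.0,  nan],
    [3.0,  3.0,  nan],
    [nan,  1.0,  1.0],   # block 2: y and z observed, z = y
    [nan,  2.0,  2.0],
    [nan,  3.0,  3.0],
    [1.0,  nan, -1.0],   # block 3: x and z observed, z = -x
    [2.0,  nan, -2.0],
    [3.0,  nan, -3.0],
])

def pairwise_corr(data):
    """Correlation matrix where each r uses only cases complete for that pair."""
    p = data.shape[1]
    R = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            ok = ~np.isnan(data[:, i]) & ~np.isnan(data[:, j])
            r = np.corrcoef(data[ok, i], data[ok, j])[0, 1]
            R[i, j] = R[j, i] = r
    return R

R_pairwise = pairwise_corr(data)
eigenvalues = np.linalg.eigvalsh(R_pairwise)   # ascending order
print(R_pairwise)
print("eigenvalues:", eigenvalues)

# Listwise deletion keeps only rows with no missing values at all.
complete = ~np.isnan(data).any(axis=1)
print("cases left under listwise deletion:", int(complete.sum()))
```

Each correlation here is individually valid, but together they are mutually inconsistent (x tracks y, y tracks z, yet x opposes z), so the assembled matrix has a negative eigenvalue. Real data would be far less extreme, but the mechanism is the same.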

I am sorry for not being detailed enough about the data.

It's daily meteorological data recorded at around 15 stations across South East Asia, with Bangkok at the center. Meteorological parameters such as daily temperature, humidity, cloud cover, etc. (a total of 41 parameters for all 15 stations) are the variables, and the set of all variables for one day is a case. The data I am dealing with are for the summer season (March-June) of 2000-2010, so it is initially a 1343 (days) x 41 (parameters) matrix. The objective of my study is to find the meteorological patterns prevailing in the region. The methodology I am going to follow is:
1- Find the minimum number of PCs that represent the maximum variance in the data (of course, I can exclude some of the 41 variables to achieve this).
2- Once I have (for example) six PCs with the respective loadings of the different variables (based on my literature review, I expect no more than 6 PCs), calculate the scores of the 6 PCs for each day (*). This should be a 1343 x 6 matrix.
3- Group the days in this matrix using a 2-stage clustering technique:
    i- Apply an average linkage clustering method (**) to this 1343 x 6 matrix to determine the initial number of clusters and the mean conditions within each cluster (the cluster mean component scores).
    ii- Modify these initial clusters using the K-means clustering technique, with the initial number of clusters and their mean component scores as the initial seed values. This procedure classifies the 1343 days into a certain number of meteorologically homogeneous clusters.
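
The two-stage clustering in step 3 can be sketched in Python with NumPy only. This is an illustration under stated assumptions, not the SPSS procedure: the score matrix is made up (three well-separated blobs in 2-D instead of 1343 x 6), the number of clusters is fixed at 3 rather than read off a dendrogram, and `average_linkage` and `kmeans_from_seeds` are hypothetical helper names with a naive implementation that only suits small samples.

```python
import numpy as np

def average_linkage(points, n_clusters):
    """Naive agglomerative clustering with average linkage (small n only)."""
    # Pairwise Euclidean distances between all points.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average of all between-cluster point distances.
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    labels = np.empty(len(points), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

def kmeans_from_seeds(points, seeds, n_iter=50):
    """Lloyd's k-means started from the given seed centroids."""
    centroids = seeds.copy()
    for _ in range(n_iter):
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([points[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Made-up stand-in for a (days x components) score matrix.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
scores = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Stage 1: average linkage gives initial groups and their mean scores.
initial_labels = average_linkage(scores, n_clusters=3)
seeds = np.array([scores[initial_labels == k].mean(axis=0) for k in range(3)])

# Stage 2: k-means refines the groups, seeded with the stage-1 means.
final_labels, final_centroids = kmeans_from_seeds(scores, seeds)
print(np.bincount(final_labels))
```

For a matrix with 1343 rows, a library routine would be a more sensible choice than this naive loop; SciPy, for instance, provides `scipy.cluster.hierarchy.linkage` with `method="average"`.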

This is the summary of my objectives and methodology. Since I don't have a stats background (in fact, I am using SPSS for the very first time), I am having issues that I need to discuss with you guys.

To rephrase the questions I have right now:

1- Once I get 6 PCs and the variable loadings (say 15 variables with loadings greater than 0.4 contribute to the 6 PCs with eigenvalues greater than 1), I get a 15 x 6 matrix. How can I calculate the scores of the 6 PCs for each day (*)?

2- How do I apply an average linkage clustering method to this 1343 x 6 matrix (**)?
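
On question 1, a rough Python/NumPy sketch of how component scores are usually computed for an unrotated principal components solution on the correlation matrix may help. The data are simulated and the sizes (100 days, 5 variables, 2 components) are assumptions chosen to keep the example small. Note that the scores are built from all standardized variables, not only from those with loadings above a cut-off such as 0.4.

```python
import numpy as np

rng = np.random.default_rng(2)
n_days, n_vars, n_components = 100, 5, 2

# Made-up data standing in for the (days x variables) matrix.
latent = rng.normal(size=(n_days, 2))
mixing = rng.normal(size=(2, n_vars))
X = latent @ mixing + 0.3 * rng.normal(size=(n_days, n_vars))

# Standardize, because the PCA here is on the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)

# Eigen-decomposition of R, sorted from largest to smallest eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Loadings: correlations between variables and components (n_vars x k).
loadings = eigenvectors[:, :n_components] * np.sqrt(eigenvalues[:n_components])

# Standardized component scores (n_days x k): each column of loadings is
# divided by its eigenvalue before multiplying the standardized data.
scores = Z @ (loadings / eigenvalues[:n_components])

print(scores.shape)
print(np.round(np.cov(scores, rowvar=False), 6))
```

In this sketch the scores come out uncorrelated with unit variance, which is the property that makes them convenient input for a distance-based clustering step. Rotated solutions need a different score coefficient matrix, which this example does not cover.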

Peace.

Re: PCA: R-Matrix Determinant =0 and "not positive Definite"

David Marso
Administrator
Those are a LOT of questions and considering...
"This is the summary of my objectives and methodology. Since I don't have stat background, in fact I am using SPSS for the very first time, so I am having issues which I need to discuss with you guys."
Not to seem rude, but...
Sounds like you are in way over your head!!
Your associated entity should perhaps consider hiring a consultant or trainer?
"How can i calculate the scores of 6 PCs for each day(*)"
See /SAVE subcommand on FACTOR.
OTOH, an analysis over ALL of these data points is sure to be a complete mess. Maybe group the data by WEEK, then get the PCA and scores within each week? See SPLIT FILE.
Re missing values?
Mean substitution from the entire data file? Are you kidding? Maybe substitute the value from the same day at the closest other station?
That's about all I have to contribute at the moment.
You really have a 1343x15x41 matrix (or are these aggregated across station? -in which case ignore the MEAN SUB idea-).
RE AVG LINKAGE?  Please consult the DOCS on cluster analysis.
HTH,
Gun For Hire, David
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: PCA: R-Matrix Determinant =0 and "not positive Definite"

mzalikhan
Thanks, David, for your comments and for sharing your views. I would be thankful if you could share the reasons that lead you to conclude that the analysis is going to be a mess.

Peace.

Re: PCA: R-Matrix Determinant =0 and "not positive Definite"

David Marso
Administrator
Autocorrelation, Seasonality... to name a few reasons (Google these concepts and while you are at it ARIMA models and related things).  If you are "new to stats" you are out of your depth trying to work with this.  There are many people "trained" in statistics who would make a mess of this analysis!
Confused re your level of observation.  You say you have some 15? sites and some 41? measures on each, daily for several years.  But your N is for each day.  Are these measures averages across the 15 sites?
There are MANY complex issues in data analysis which are going to bite you on the a__ here.

Re: PCA: R-Matrix Determinant =0 and "not positive Definite"

David Marso
Administrator
In reply to this post by mzalikhan
To clarify other things...
Adjacent observations are correlated (autocorrelation).
Observations across years are correlated (seasonality).
Also, the various parameters are correlated (the object of study). HOWEVER, they may not simply be correlated at the time point of interest but may be offset in time (a lag or lead): cross-correlation.
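
Autocorrelation and lagged cross-correlation can be illustrated with a small Python/NumPy sketch. The two series below are simulated (an AR(1) process and a copy of it delayed by 3 days plus noise), and `lagged_corr` is a hypothetical helper written for this example. Nothing here comes from the poster's data.

```python
import numpy as np

def lagged_corr(x, y, lag):
    """Correlation between x[t] and y[t + lag] for lag >= 0."""
    if lag == 0:
        return np.corrcoef(x, y)[0, 1]
    return np.corrcoef(x[:-lag], y[lag:])[0, 1]

rng = np.random.default_rng(3)
n = 500

# Made-up daily series: x is an AR(1) process, so adjacent days are correlated.
x = np.zeros(n)
noise = rng.normal(size=n)
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + noise[t]

# y follows x with a 3-day delay plus a little noise.
y = np.zeros(n)
y[3:] = x[:-3]
y += 0.2 * rng.normal(size=n)

print("lag-1 autocorrelation of x:", round(lagged_corr(x, x, 1), 2))
for lag in range(6):
    print("lag", lag, "cross-correlation:", round(lagged_corr(x, y, lag), 2))
```

In this simulation the same-day correlation between x and y understates the relationship, which is strongest at the 3-day lag. A correlation matrix computed only on same-day values would miss that offset.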
If you google Multivariate Time Series Analysis you will find TONS of info and numerous PDF files (for example: faculty.washington.edu/ezivot/econ584/notes/multivariatetimeseries.pdf ).
Browsing through this I am scratching my head muttering WTF??? (and BTW I have over 20 years of experience studying statistics and would find this a b**ch to implement in SPSS).
For a seriously 'dumbed down' discussion of Time Series Analysis see here:
http://en.wikipedia.org/wiki/Time_series
(I really hate Wikipedia for anything serious about stats but it will at least make a few issues come to light ).
I am not familiar with the latest and greatest methods of dealing with this sort of data, but way back when the devil was a little boy I took a sequence of graduate stat courses dealing with ARIMA models (the prerequisites were 3 other stats courses, one focusing on Linear Models: ANOVA, Multiple Regression, etc.). Towards the end of the semester we attempted to fit multivariate models with, say, 3-4 variables.
say X,Y,Z?
Fit an ARIMA model to each of X,Y,Z
Verify residuals from models are White Noise... Ex, Ey, Ez
Estimate CrossCorrelation function of Ex, Ey, Ez ...
Identify Transfer functions or Reduce dimensionality (PCA etc)?
etc....
I.e., once you have white-noise residuals, have identified the proper lags, and have transformed appropriately, it MIGHT be appropriate to run a PCA and reduce the dimensionality of the process/vector space.
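
The "fit a model, then check that the residuals are white noise" step can be sketched in Python with NumPy. As a loudly stated simplification, a least-squares AR(1) fit stands in for a full ARIMA model here, the series is simulated, and `ar1_residuals` and `lag1_autocorr` are hypothetical helper names. A real analysis would use a proper time-series package and would also have to deal with seasonality.

```python
import numpy as np

def lag1_autocorr(x):
    """Correlation between the series and itself shifted by one step."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

def ar1_residuals(x):
    """Fit x[t] = c + phi * x[t-1] + e[t] by least squares; return phi and e."""
    design = np.column_stack([np.ones(len(x) - 1), x[:-1]])
    coef, *_ = np.linalg.lstsq(design, x[1:], rcond=None)
    residuals = x[1:] - design @ coef
    return coef[1], residuals

# Simulated autocorrelated series (true phi = 0.7), purely for illustration.
rng = np.random.default_rng(4)
n = 1000
x = np.zeros(n)
shocks = rng.normal(size=n)
for t in range(1, n):
    x[t] = 0.7 * x[t - 1] + shocks[t]

phi, e = ar1_residuals(x)
print("estimated phi:", round(phi, 2))
print("lag-1 autocorrelation before:", round(lag1_autocorr(x), 2))
print("lag-1 autocorrelation after: ", round(lag1_autocorr(e), 2))
```

Repeating this for each variable and then examining the cross-correlations of the residual series would correspond, loosely, to the steps listed above before any PCA is attempted.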
---
Your modelling problem is complicated further by having multiple locations.
Spatial Correlation...
OTOH, it sounds like your measures are averaged across the locations.
If so then how is it then that you have missing data?  Are the parameters missing for ALL locations on a particular day?
Not sure that averaging these is a proper procedure either.
-----
In other words, this whole thing is much more complicated than dropping a bunch of variables into a dialog listbox, checking a SAVE box and clicking OK !
I'm not sure what others in your field have done previously, but if they did what you are proposing then it's really a GIGO fiasco!
Now please don't get me wrong here, I don't consider myself an **EXPERT** in statistics (there are people on this list that can run circles around my bony a$$). There have been a lot of advances since I was in school, and I don't have ready access to any scholarly journals such as Psychometrika, the British Journal of Statistics, the Journal of Time Series and Forecasting, Structural Equations Modelling, ....
OTOH:  I know enough to see a train wreck about to occur and thought it prudent to intervene!
HTH, David
