Trying to reproduce PCA analysis of a published paper, but not getting same results


Trying to reproduce PCA analysis of a published paper, but not getting same results

Daradai
Hello everyone,

This is my first time posting, so if I need to change anything in this post
please let me know!
I am working on a paper myself and came across this research topic called
AESPI (Aggregated Energy Security Performance Indicator). (The paper can be
found here for those interested:
https://www.sciencedirect.com/science/article/pii/S0306261912007337)

The same author has applied AESPI to the case of Thailand
(https://www.sciencedirect.com/science/article/pii/S0306261914003985) and
has included the standardized data for all 25 indicators over 45 years.

Here is my dilemma: I have tried to reproduce their results in SPSS by
performing a PCA on the standardized data, but including all variables
leads to the "This matrix is not positive definite" error when I request
a KMO test.
Additionally, the eigenvalues reported in the paper are different
from mine: I get only 3 components, while the paper reports 5.

I have included the SPSS file I was using and a picture of the original data.
<http://spssx-discussion.1045642.n5.nabble.com/file/t341397/DataPic.png>
<http://spssx-discussion.1045642.n5.nabble.com/file/t341397/Rotated_Values.png>
Thailand_Data_Test_Comparison.sav
<http://spssx-discussion.1045642.n5.nabble.com/file/t341397/Thailand_Data_Test_Comparison.sav>  








--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Trying to reproduce PCA analysis of a published paper, but not getting same results

Bruce Weaver
Administrator
List members who do not use Nabble can find links to the uploaded files here:

http://spssx-discussion.1045642.n5.nabble.com/Trying-to-reproduce-PCA-analysis-of-a-published-paper-but-not-getting-same-results-td5735493.html

HTH.








--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: Trying to reproduce PCA analysis of a published paper, but not getting same results

Mike
In reply to this post by Daradai
Okay, I'll make a couple of simple points and I'm sure that if
I am terribly wrong (or even slightly), someone will correct me.
So:

(1)  You probably have negative eigenvalues.  You may not realize
this because, for reasons known only to the original programmers,
the Factor procedure prints only positive eigenvalues.
If you do a principal components analysis and have negative eigenvalues,
then your correlation matrix (I assume a 25 x 25 matrix based on
45 units of analysis -- a prize to the first person to explain this
problem) is not positive definite or positive semidefinite.  But
maybe you don't really want to do a principal components analysis.

(2)  I thank God every day for providing us with the UCLA IDRE
center even though it is far from perfect.  Why am I so glad?
Consider the following link that presents a Principal FACTOR
analysis done with SAS:
https://stats.idre.ucla.edu/sas/output/factor-analysis/

Now, you might be asking "Why is he showing me SAS
output when I'm doing SPSS?"  Well, the answer is that
the SAS output is better annotated than the SPSS output.
For example, consider the following quote:

"b.  Eigenvalue – This is the initial eigenvalue.  An eigenvalue
is the variance of the factor.  Because this is an unrotated
solution, the first factor will account for the most variance,
the second will account for the second highest amount of
variance, and so on.  Some of the eigenvalues are negative
because the matrix is not of full rank.  This means that there
are probably only four dimensions (corresponding to the four
factors whose eigenvalues are greater than zero).  Although
it is strange to have a negative variance, this happens because
the factor analysis is only analyzing the common variance,
which is less than the total variance.  *******If we were doing
a principal components analysis, we would have had 1’s on
the diagonal, which means that all of the variance is being
analyzed (which is another way of saying that we are assuming
that we have no measurement error), and we would not have
negative eigenvalues.  In general, it is not uncommon to have
negative eigenvalues.********"


So, make sure that you don't have any negative eigenvalues
if you are doing a principal components analysis.  Otherwise,
you ****might***** want to do a principal factor analysis instead
(which may be what your original source did but did not report
it correctly).  I note that the SPSS output for factor does not
provide this warning.

(3) The UCLA IDRE center does provide an annotated output
for a principal factor analysis, which you can examine here:
https://stats.idre.ucla.edu/spss/output/factor-analysis/

However, let me point out something that is presented in
the front matter of this webpage.  Quoting:

" Factor analysis is a technique that requires a large sample
size.  Factor analysis is based on the correlation matrix of
the variables involved, and correlations usually need a large
sample size before they stabilize.  Tabachnick and Fidell
(2001, page 588) cite Comrey and Lee’s (1992) advise
regarding sample size: 50 cases is very poor, 100 is poor,
200 is fair, 300 is good, 500 is very good, and 1000 or more
is excellent.
  As a rule of thumb, a bare minimum of 10 observations
per variable is necessary to avoid computational difficulties."

You say that you have 45 years, but the table you present
indicates that a few variables do not have values for
certain years.  This means that, if year is the unit of
analysis, you have fewer than 45 complete years, or little more
than 1 case per variable.  What is wrong with this picture?

I will leave it to others to suggest ways of dealing with this
situation.
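
As a small illustration of the UCLA quote in point (2), here is a minimal sketch in Python. It assumes just two variables with a hypothetical correlation of 0.5 (not taken from the Thailand data), so that the eigenvalues have a closed form. With 1's on the diagonal (principal components) both eigenvalues are non-negative; with squared multiple correlations on the diagonal (principal factor, where for two variables the SMC is simply r squared) one eigenvalue goes negative.

```python
import math


def eigenvalues_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], largest first."""
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return (mean + radius, mean - radius)


r = 0.5  # hypothetical correlation between two variables

# Principal components: full correlation matrix, 1's on the diagonal.
pca_eigs = eigenvalues_2x2(1.0, r, 1.0)

# Principal factor: diagonal replaced by communality estimates.
# With only two variables the squared multiple correlation is r ** 2.
smc = r ** 2
paf_eigs = eigenvalues_2x2(smc, r, smc)

print("PCA eigenvalues:", pca_eigs)
print("Principal factor eigenvalues:", paf_eigs)
```

The sketch only covers a correlation matrix computed from complete data; it says nothing about what the authors of the paper actually ran.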

-Mike Palij
New York University
 



Re: Trying to reproduce PCA analysis of a published paper, but not getting same results

Rich Ulrich
In reply to this post by Daradai

I looked at Tables 4 and 5 (your data, I think), and here are some notes.

The columns ("observations") run from 1986 to 2030; however, the set with "complete data", which a factor analysis will use, starts in 2004. Twenty-seven observations will be highly unstable for components or factors of 25 variables. If you are analyzing correlations (rather than raw values), you are at about the minimum for full rank. The KMO result says to me that your data are not full-rank. Somewhere, you have collinearity.

The first three variables (eco-1.1 to eco-1.3) appear to be practically identical across the range of years. Are these versions of each other?

Most or all of the variables show a strong "year" trend. Since we haven't yet seen the years 2018 to 2030, I assume that these are projections. The formulas for projecting would produce collinearity if the later columns are linear combinations of the earlier columns.

A principal component analysis /can/ show you as many components (if full rank) as you have variables. Getting 3 or 5 from a set of data depends on what you specify as options.
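
To make the last point concrete, here is a minimal Python sketch of how the retention rule alone changes the number of components. The eigenvalues are hypothetical (eight variables, chosen to sum to 8 and to include two zeros for a rank-deficient matrix); they are not the values from the paper or from the posted file. The function name `retained_components` is made up for illustration; the two rules mimic the usual "eigenvalue greater than 1" criterion and a fixed number of components requested by the analyst.

```python
def retained_components(eigenvalues, min_eigenvalue=1.0, n_components=None):
    """Return the eigenvalues that would be kept under one of two rules.

    If n_components is given, keep that many of the largest eigenvalues;
    otherwise keep every eigenvalue strictly greater than min_eigenvalue.
    """
    ordered = sorted(eigenvalues, reverse=True)
    if n_components is not None:
        return ordered[:n_components]
    return [ev for ev in ordered if ev > min_eigenvalue]


# Hypothetical eigenvalues of an 8 x 8 correlation matrix (they sum to 8).
eigs = [3.9, 1.8, 1.2, 0.6, 0.3, 0.2, 0.0, 0.0]

by_eigenvalue_rule = retained_components(eigs)           # eigenvalue > 1
by_fixed_count = retained_components(eigs, n_components=5)

print(len(by_eigenvalue_rule), "components kept by the eigenvalue rule")
print(len(by_fixed_count), "components kept when five are requested")
```

The same eigenvalues give three components under one rule and five under the other, so a difference in component count does not by itself show that the underlying data differ.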


--

Rich Ulrich




Re: Trying to reproduce PCA analysis of a published paper, but not getting same results

Kirill Orlov
In reply to this post by Daradai
Not positive definite (p.d.) means that the correlation matrix has some zero eigenvalues, some negative eigenvalues, or both. Zero eigenvalues appear when there are linear dependencies among the variables or when N<P (the number of cases is less than the number of variables). Negative eigenvalues may appear if there were missing data that you deleted in a "pairwise" manner, or when the correlation matrix was not computed from data but estimated somehow, or simply borrowed and entered with insufficient precision.

Note also that the KMO index isn't needed in PCA; it is of value in factor analysis. PCA easily tolerates a non-p.d. matrix, but most factor analysis methods don't. If you intend to use PCA as "factor analysis" (i.e., to interpret the components as real latent variables generating the data), your matrix should be p.d.
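
The pairwise-deletion case can be shown with a tiny made-up data set. The sketch below is plain Python with invented values (nothing here comes from the Thailand file): each pair of variables is observed on a different subset of cases, so the three correlations are mutually inconsistent. For a 3 x 3 correlation matrix the determinant is the product of the eigenvalues, so a negative determinant means at least one eigenvalue is negative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def pairwise_r(rows, i, j):
    """Correlate columns i and j using only cases where both are observed."""
    pairs = [(row[i], row[j]) for row in rows
             if row[i] is not None and row[j] is not None]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


# Invented data: None marks a missing value. No case is complete,
# so listwise deletion would leave nothing to analyze.
rows = [
    (1, 1, None), (2, 2, None), (3, 3, None),   # x and y rise together
    (1, None, 1), (2, None, 2), (3, None, 3),   # x and z rise together
    (None, 1, 3), (None, 2, 2), (None, 3, 1),   # y and z move oppositely
]

r12 = pairwise_r(rows, 0, 1)
r13 = pairwise_r(rows, 0, 2)
r23 = pairwise_r(rows, 1, 2)

# Determinant of [[1, r12, r13], [r12, 1, r23], [r13, r23, 1]].
det = 1 + 2 * r12 * r13 * r23 - r12 ** 2 - r13 ** 2 - r23 ** 2

print("r12, r13, r23:", r12, r13, r23)
print("determinant:", det)
```

No real variables can be related this way at the same time (x tracks y, x tracks z, yet y and z move in opposite directions), which is exactly why the assembled matrix fails to be positive semidefinite.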


Re: Trying to reproduce PCA analysis of a published paper, but not getting same results

Daradai
Thank you, everyone, for your great help. Unfortunately, I have just started
delving into statistics and SPSS so it will take me some time to understand
all of the intricacies you have discussed here.

I want to include an answer I just received by e-mail from someone who
found the solution to my main issue (my results deviating from the source):
"
It seems that the authors of the article used the option "Replace with mean"
under factor analysis Options/Missing Values. In SPSS version 24 this seems
to produce the same summary statistics (Table 7) and rotated loadings (Table
8), but slightly different KMO/Bartlett results (Table 6).
"
"

Again, thank you everyone for your great help!
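
For readers wondering why the missing-value option changes the results so much, here is a minimal Python sketch with invented numbers (not the Thailand data). It contrasts listwise deletion, which drops every case with any missing value, with mean replacement, which keeps all cases and fills each gap with that variable's mean. As far as I know, the dialog option quoted above corresponds to the FACTOR subcommand /MISSING=MEANSUB in syntax; the function names below are made up for illustration.

```python
def listwise_delete(rows):
    """Keep only the cases (rows) that have no missing values."""
    return [row for row in rows if all(v is not None for v in row)]


def replace_with_mean(rows):
    """Fill each missing value with the mean of the observed values
    in the same column, keeping every case."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


# Invented data: rows are years, columns are indicators, None is missing.
rows = [
    [None, 2.0, 1.0],
    [None, 3.0, 2.0],
    [1.0, 4.0, None],
    [2.0, 5.0, 4.0],
    [3.0, 6.0, 5.0],
]

complete = listwise_delete(rows)
imputed = replace_with_mean(rows)

print(len(complete), "cases survive listwise deletion")
print(len(imputed), "cases are analyzed after mean replacement")
```

Because the two options hand the procedure different sets of cases, the correlation matrix, and therefore the eigenvalues and loadings, can differ noticeably between them. Mean replacement also shrinks the variance of the filled variables, so matching the paper's numbers this way does not make the approach statistically sound.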



