Interpretation of PCA


Interpretation of PCA

Barry-43
Dear All,

I have started to look at PCA in SPSS and have a question regarding
interpreting some of the output, and how this relates to the "mathematical
theory".  I have the definition of PCA (greatly simplified) for the first
component:

For an (n x p) matrix of raw (unstandardised) data X, the 1st PC can be
given by

Y = a1X1 + a2X2 + ..... + apXp

where Y is an n x 1 matrix (of new component scores?), ai is an element of
the eigenvector which corresponds to the largest eigenvalue of the
correlation matrix, and Xi is the n x 1 vector corresponding to column i of
the data matrix X.

So, then I run PCA (unrotated) in SPSS and get, amongst other things, (1)
the (loadings) Component Matrix, (2) the Component Score Coefficient
Matrix, and (3) the factor scores, which are saved in my SPSS data sheet.

So basically, my question is, where in the SPSS output, if anywhere, is my
Y matrix in the above definition, and my weighting values ai?

Am I right in saying that my definition relates to the raw (unstandardised)
data, but the SPSS output relates to standardised data?

Am I right in saying that Y, as it is defined in the definition above, is
not displayed in SPSS, but the normalised Y matrix is, and that this
normalised matrix Y is in fact saved as fac1_1 in my SPSS data sheet?

Am I right in saying that the (loadings) Component Matrix is a normalised
version of my eigenvector matrix, and if I divide each loading by
sqrt(eigenvalue), I will get my eigenvector elements, ai, above?  I don't
think this is correct, because it is the Component Score Coefficient Matrix
that is used to calculate fac1_1 etc.

I am confused.

Sorry this is so long, I am just trying to straighten it out in my head.

Help appreciated.
Barry

Re: Interpretation of PCA

Dan Zetu
Barry:

The ai's that you are referring to are the Factor Score coefficients as
displayed in the Factor Score Coefficient Matrix. You are right that these
ai's are applied to the standardized X's, so if you need to relate the
unstandardized original variables to your factor scores, you need to divide
the ai's by the standard deviation of the corresponding original X
variables.

Dan
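
The conversion Dan describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data and the coefficients `w` are made up, the standard deviations are sample SDs (dividing by N-1), and the constant term is an addition not mentioned in the post; it is needed because the raw variables do not have mean zero.

```python
from statistics import mean, stdev

# Made-up data and weights, for illustration only.
x1 = [2.0, 4.0, 6.0, 8.0, 10.0]
x2 = [1.0, 3.0, 2.0, 5.0, 4.0]
columns = [x1, x2]
w = [0.53, 0.53]  # hypothetical coefficients for the standardized variables

means = [mean(c) for c in columns]
sds = [stdev(c) for c in columns]  # sample SD (divides by N-1)

# Score from standardized variables: sum_i w_i * (x_i - mean_i) / sd_i
score_from_z = [
    sum(wi * (c[k] - m) / s for wi, c, m, s in zip(w, columns, means, sds))
    for k in range(len(x1))
]

# Same score from raw variables: b_i = w_i / sd_i, plus a constant that
# absorbs the means.
b = [wi / s for wi, s in zip(w, sds)]
constant = -sum(bi * m for bi, m in zip(b, means))
score_from_raw = [
    constant + sum(bi * c[k] for bi, c in zip(b, columns))
    for k in range(len(x1))
]

for z, raw in zip(score_from_z, score_from_raw):
    print(f"{z: .6f}  {raw: .6f}")
```

Both columns of the printout should agree, which shows that dividing the coefficients by the standard deviations only changes the scale on which the weights are expressed, not the scores themselves.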



Re: Interpretation of PCA

Kooij, A.J. van der
In reply to this post by Barry-43
PCA works on standardized data. If you save the
standardized variables, using Descriptives, in a
data file called tmp.sav, you can run the syntax below
to see how PCA works, either on the standardized data
using singular value decomposition, or on the correlation
matrix using eigenvalue decomposition. The results of
both are equal up to a possible difference in sign for the components.

The loadings are in the component matrix.
The component (factor) scores (what you call Y) are not displayed but can be saved as variables.
The component score coefficient matrix displays the regression
coefficients for the regression of a component score on the standardized variables.
 
Regards,
Anita van der Kooij
Data Theory Group
Leiden University

DESCRIPTIVES VARIABLES= varlist  /SAVE.
 
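MATRIX.
* PCA on standardized data (normalized on 1 instead of N-1) (SVD) *.
* tmp.sav is assumed to hold only the standardized variables *.
get zdata /file = 'c:\path\tmp.sav'.
compute N = NROW(zdata).
compute M = NCOL(zdata).
* Rescale so that the cross-product of zdata is the correlation matrix *.
compute zdata = zdata / SQRT(N-1).
* K holds the left singular vectors and L the right ones (eigenvectors) *.
CALL SVD (zdata, K, singval, L).
compute singval = singval(1:m,1:m).
compute eigval = singval**2.
compute load = L * singval.
* Rescale the unit-length columns of K to unit variance *.
compute fscores = K( : ,1:m) * SQRT(N-1).
print eigval.
print load.
print fscores.
END MATRIX.
 
MATRIX.
* PCA on correlation matrix (EVD) *.
* tmp.sav is assumed to hold only the standardized variables *.
get zdata /file = 'c:\path\tmp.sav'.
compute N = NROW(zdata).
compute M = NCOL(zdata).
compute zdata = zdata / SQRT(N-1).
* R is the correlation matrix *.
compute R = T(zdata) * zdata .
* L holds the eigenvectors and eigval the eigenvalues *.
CALL EIGEN (R, L, eigval).
compute eigval = MDIAG(eigval).
compute load= L * SQRT(eigval).
* Project the data on the eigenvectors and rescale to unit length *.
compute K = (zdata * L) * INV(SQRT(eigval)).
compute fscores = K( : ,1:m) * SQRT(N-1).
print eigval.
print load.
print fscores.
END MATRIX.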
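
For readers without SPSS at hand, the eigenvalue route of the syntax above can be mirrored in plain Python for the special case of two variables. This is a sketch, not a general PCA routine: the data are made up, and it relies on the closed-form eigendecomposition of a 2 x 2 correlation matrix [[1, r], [r, 1]] (eigenvalues 1 + r and 1 - r, eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2)), which is an assumption that only holds for two variables.

```python
import math
from statistics import mean, stdev

# Made-up data for two variables.
x1 = [2.0, 4.0, 6.0, 8.0, 10.0]
x2 = [1.0, 3.0, 2.0, 5.0, 4.0]
n = len(x1)

def zscore(col):
    """Standardize a column using the sample SD (divides by N-1)."""
    m, s = mean(col), stdev(col)
    return [(v - m) / s for v in col]

z1, z2 = zscore(x1), zscore(x2)

# Correlation between the two standardized variables.
r = sum(a * b for a, b in zip(z1, z2)) / (n - 1)

# Closed-form eigendecomposition of [[1, r], [r, 1]] (two variables only).
eigval = [1 + r, 1 - r]
eigvec = [
    [1 / math.sqrt(2), 1 / math.sqrt(2)],   # component 1
    [1 / math.sqrt(2), -1 / math.sqrt(2)],  # component 2
]

# Loadings: eigenvector element times sqrt(eigenvalue).
load = [[a * math.sqrt(lam) for a in vec] for vec, lam in zip(eigvec, eigval)]

# Score coefficients: eigenvector element divided by sqrt(eigenvalue).
coef = [[a / math.sqrt(lam) for a in vec] for vec, lam in zip(eigvec, eigval)]

# Component scores with unit variance, one list per component.
fscores = [[c[0] * u + c[1] * v for u, v in zip(z1, z2)] for c in coef]

print("eigenvalues:", eigval)
print("loadings:", load)
print("score coefficients:", coef)
```

In this sketch each score coefficient equals the corresponding loading divided by the eigenvalue, and the scores come out with standard deviation 1, which matches the role of `fscores` in the MATRIX syntax.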


Re: Interpretation of PCA

Barry-43
In reply to this post by Barry-43
Dear Dan and Anita,

Thank you for taking the time to reply.  However, having read a bit more, I
want to question this.  I've been reading Dunteman's PCA book, and still
haven't got this right in my head.  This is what I understand.  Please
correct me in places where I am wrong.

Assumption: we are using standardised data in this analysis.

On using the correlation matrix, our aim is initially to calculate the
eigenvalues (lambda{i}) and corresponding eigenvectors (a{i}) of the
correlation matrix.  The eigenvalues identify the amount of variance
accounted for by each PC.  The eigenvectors (the a{i}'s) are the weightings
of the principal components and can be used in the linear expressions of
each PC to determine the PC scores.  By the scores, I mean the values of
the new components that are going to be used to replace our original number
of correlated variables, and that can be used in any future analysis
instead of the original variables.  To obtain the PC loadings, we multiply
a{i} by the sq root of lambda{i}.

Now, in SPSS, one of the outputs is the loadings matrix (given in SPSS as
the Component matrix).  I’m assuming that this is equal to the PC loadings
I have mentioned above.  So in theory, if I divide each of these loadings
by the sq root of the corresponding eigenvalue (that is, lambda{i}), then I
get the eigenvectors (the a{i}'s that I am talking about above).  But SPSS
doesn't actually display these eigenvectors (and hence does not specify the
weightings which are used in the linear expression for each PC).  The
expression I mean is one of the form (for example, for the
1st component)

Y{1} = a{11}X1 + a{12}X2 + so on for whatever number of variables we have.

SPSS does however display these things called Component Score coefficients
(in the Component Score Coefficient Matrix), and it is these that are used
to calculate the component scores (according to SPSS and I think what
others have said), which can be saved into the SPSS worksheet.

However, as far as I understand (and can see), these Component Score
coefficients are not the same as a{i}, eigenvectors, or weightings, which
are used in the linear expression for each PC expression.  So the component
scores calculated in SPSS are not the same as the PC scores I am talking
about above.

It is this that is causing the confusion in my head.

Can you please advise what I have misunderstood in what I have
said above?

Many thanks.
Barry
Reply | Threaded
Open this post in threaded view
|

Re: Interpretation of PCA

Kooij, A.J. van der
 >Please correct me in places where I am wrong.
Corrections inserted below.
 
Regards,
Anita
 
 
>On using the correlation matrix, our aim is initially to calculate the
>eigenvalues (lambda{i}) and corresponding eigenvectors (a{i}) of the
>correlation matrix.  The eigenvalues identify the amount of variance
>accounted for by each PC.  
Yes.
>The eigenvectors (the a{i}'s) are the weightings
>of the principal components and can be used in the linear expressions of
>each PC to determine the PC scores.
No, the loadings are the weights.
>By the scores, I mean the values of
>the new components that are going to be used to replace our original number
>of correlated variables, and that can be used in any future analysis
>instead of the original variables.  To obtain the PC loadings, we multiply
>a{i} by the sq root of lambda{i}.
Yes.
>Now, in SPSS, one of the outputs is the loadings matrix (given in SPSS as
>the Component matrix).  I'm assuming that this is equal to the PC loadings
>I have mentioned above.
Yes.
>So in theory, if I divide each of these loadings
>by the sq root of the corresponding eigenvalue (that is, lambda{i}), then I
>get the eigenvectors (the a{i}'s that I am talking about above).
Yes.
>But SPSS doesn't actually display these eigenvectors (and hence does not specify the
>weightings which are used in the linear expression for each PC
>expression.)
See above: the loadings are the weights to use, not the eigenvectors.

>Y{1} = a{11}X1 + a{12}X2 + so on for whatever number of variables we have.
>
>SPSS does however display these things called Component Score coefficients
>(in the Component Score Coefficient Matrix), and it is these that are used
>to calculate the component scores (according to SPSS and I think what
>others have said), which can be saved into the SPSS worksheet.
No, again, the loadings are used.

>However, as far as I understand (and can see), these Component Score
>coefficients are not the same as a{i}, eigenvectors, or weightings, which
>are used in the linear expression for each PC expression.  So the component
>scores calculated in SPSS are not the same as the PC scores I am talking
>about above.
To obtain the component scores, sum the variables weighted with the loadings and standardize the result:
Y{1} = a{11}X1 + a{12}X2 + ...
where a{11} is a loading = eigenvector{11} * SQRT(eigenvalue{1}).
Component score {1} is ZY{1}, the standardized Y{1}.

The component score coefficients are the (standardized) regression coefficients you get from a regression with ZY{1} as the dependent variable and the variables as the independents.
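
The regression statement can be checked with a short derivation. The notation here is an assumption introduced for the sketch, not taken from the post: Z is the n x m matrix of standardized variables, R = Z'Z/(n-1) = A Λ A' is the correlation matrix with eigenvectors A and eigenvalues Λ (assumed nonsingular), L = A Λ^{1/2} are the loadings, and F = Z A Λ^{-1/2} are the unit-variance component scores.

```latex
\begin{aligned}
B &= (Z^{\top}Z)^{-1} Z^{\top} F
   = \bigl((n-1)R\bigr)^{-1} Z^{\top} Z \, A \Lambda^{-1/2} \\
  &= R^{-1} R \, A \Lambda^{-1/2}
   = A \Lambda^{-1/2}
   = L \Lambda^{-1}.
\end{aligned}
```

Under these assumptions each score coefficient is the loading divided by the eigenvalue, or equivalently the eigenvector element divided by the square root of the eigenvalue, so weighting by loadings, by eigenvectors, or by score coefficients gives the same scores up to a rescaling per component.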
 


Re: Interpretation of PCA (more about normalization)

Kooij, A.J. van der
You can compute Y{p} using loadings or using eigenvectors.

If you compute Y{p} using the loadings, the mean is 0 and the sum of squares is the square of the pth eigenvalue, so ZY{p} (standardized Y{p}) = Y{p} / eigval{p}.

If you compute Y{p} using the eigenvectors, the mean is 0 and the sum of squares is the pth eigenvalue, so ZY{p} = Y{p} / sqrt(eigval{p}).

You can see this by writing the SVD of the data: X = K D A'

X is the data matrix (standardized and divided by sqrt(N-1), as in the syntax posted earlier), K the left singular vectors, A the right singular vectors (equal to the eigenvectors), and D the diagonal matrix of singular values, with K' K = A' A = I. (Eigendecomposition of the correlation matrix R: R = X' X = A D^2 A' ; D^2 holds the eigenvalues.)

L = A D is the loadings

K = component scores = X A D^-1 = X L D^-2

So, K{p} =  ( l{1p}*X1 + l{2p}*X2 + ... ) / eigval{p}
         =  ( a{1p}*sqrt(eigval{p})*X1 + a{2p}*sqrt(eigval{p})*X2 + ... ) / eigval{p}
         =  ( a{1p}*X1 + a{2p}*X2 + ... ) * sqrt(eigval{p}) / eigval{p}
         =  ( a{1p}*X1 + a{2p}*X2 + ... ) / sqrt(eigval{p})

Here "standardized" means a sum of squares of 1; as in the syntax, multiplying K by sqrt(N-1) gives scores with variance 1.
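
The two sums of squares stated above can be verified numerically. The following is a minimal sketch for the first component of a two-variable example; the data are made up, and the eigenvector (1, 1)/sqrt(2) with eigenvalue 1 + r is the closed-form result for a 2 x 2 correlation matrix, so this does not generalize to more variables as written.

```python
import math
from statistics import mean, stdev

# Made-up data for two variables.
x1 = [2.0, 4.0, 6.0, 8.0, 10.0]
x2 = [1.0, 3.0, 2.0, 5.0, 4.0]
n = len(x1)

def scaled(col):
    """Standardize, then divide by sqrt(N-1) so that X'X is the correlation matrix."""
    m, s = mean(col), stdev(col)
    return [(v - m) / s / math.sqrt(n - 1) for v in col]

X1, X2 = scaled(x1), scaled(x2)
r = sum(u * v for u, v in zip(X1, X2))  # off-diagonal element of R = X'X

# First component of a 2 x 2 correlation matrix (closed form).
eigval = 1 + r
a = [1 / math.sqrt(2), 1 / math.sqrt(2)]   # eigenvector
l = [ai * math.sqrt(eigval) for ai in a]   # loadings

# Weighted sums using loadings and using eigenvectors.
y_load = [l[0] * u + l[1] * v for u, v in zip(X1, X2)]
y_eig = [a[0] * u + a[1] * v for u, v in zip(X1, X2)]

ss_load = sum(y * y for y in y_load)  # expected: eigval squared
ss_eig = sum(y * y for y in y_eig)    # expected: eigval

# Both routes give the same unit-length component score K.
k_from_load = [y / eigval for y in y_load]
k_from_eig = [y / math.sqrt(eigval) for y in y_eig]

print(ss_load, eigval ** 2)
print(ss_eig, eigval)
```

Dividing by eigval in one case and by sqrt(eigval) in the other leads to the same vector, which is why it does not matter for the final scores whether loadings or eigenvectors are used as weights.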

 

This is the standard normalization in PCA (called 'variable normalization'); with this normalization the eigenvalues are in the loadings: L' L = D' A' A D = D^2 = eigenvalues.

Other normalizations are often used for biplots (plots of variables and subjects).

For example, with 'subject normalization' the eigenvalues are in the component scores: loadings = A, component scores = K D. With symmetrical normalization the eigenvalues are spread equally over both the loadings and the component scores: loadings = A D^(1/2), component scores = K D^(1/2).
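
One way to see that these normalizations describe the same decomposition is to write them with a single exponent. The parameter α is an assumption introduced here for illustration and does not appear in the post; K, D and A are the SVD factors from X = K D A'.

```latex
X = K D A^{\top} = \bigl(K D^{\alpha}\bigr)\bigl(A D^{1-\alpha}\bigr)^{\top},
\qquad
\begin{cases}
\alpha = 0: & \text{scores } K,\ \text{loadings } A D \quad \text{(variable normalization)} \\
\alpha = 1: & \text{scores } K D,\ \text{loadings } A \quad \text{(subject normalization)} \\
\alpha = \tfrac{1}{2}: & \text{scores } K D^{1/2},\ \text{loadings } A D^{1/2} \quad \text{(symmetrical)}
\end{cases}
```

In each case the product of scores and transposed loadings reproduces X, so the choice only decides whether the eigenvalues are carried by the variables, by the subjects, or split between them in the plot.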

 

With the CATPCA procedure in the Categories module (nonlinear PCA for both categorical and numerical data, which can also perform linear PCA), you can request the biplot and choose a normalization option.

Regards,
Anita van der Kooij
Data Theory Group
Leiden University

