Principal components analysis + regression?

Principal components analysis + regression?

cynicalflyer
Here's my conundrum-

Independent Variable: I have a survey of 50 states indicating the amount of control the state board of education has in 31 areas answered on a three point scale (1 = total control; 2 = partial control; 3 = no control). I have a solid theoretical underpinning for all 31, or to be more precise, the literature review found evidence for all 31 as being important (Study X found items 1, 4 and 7; Study Y found items 2, 9, and 11, etc.)

Dependent variable: % of students graduating HS within 4 years.

1) In SPSS Analyze -> Dimension Reduction -> Factor
2) Descriptives: Initial Solution
3) Extraction: Method = Principal components; Analyze = Correlation matrix; Display = Unrotated factor solution and Scree plot; Extract: Based on Eigenvalue greater than 1; Maximum Iterations for Convergence = 25
4) Rotation: Method = Varimax; Display = Rotated Solution and Loading Plots; Maximum Iterations for Convergence = 25
5) Scores: Save as variables; Method = Regression; Display factor score coefficient matrix
6) Options: Exclude cases listwise; Suppress small coefficients [with] absolute value below .10

The result is 9 saved columns (FAC1_1, FAC1_2, FAC1_3...FAC1_9) in the SPSS sheet.

The Total Variance Explained -> Rotation Sums of Squared Loadings indicates that the first 5 of these explain 51.51% of the variance.
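
For readers who want to check the arithmetic behind "eigenvalue greater than 1" and "% of variance": with a correlation matrix, each standardized item contributes one unit of variance, so 31 items give a total of 31, and 51.51% corresponds to about 15.97 units. Below is a minimal sketch of that bookkeeping. The function name `kaiser_summary` and the six eigenvalues are hypothetical, chosen only for illustration; they are not the values from this analysis.

```python
def kaiser_summary(eigenvalues):
    """Count components with eigenvalue > 1 and list cumulative % of variance.

    For a correlation matrix the eigenvalues sum to the number of items,
    so each percentage is the running sum divided by that total.
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    retained = sum(1 for ev in ordered if ev > 1.0)
    cumulative = []
    running = 0.0
    for ev in ordered:
        running += ev
        cumulative.append(round(100.0 * running / total, 2))
    return retained, cumulative


# Hypothetical eigenvalues for a 6-item correlation matrix (they sum to 6).
example_eigenvalues = [2.5, 1.5, 0.75, 0.5, 0.5, 0.25]
retained, cumulative = kaiser_summary(example_eigenvalues)
print("components retained:", retained)
print("cumulative % variance:", cumulative)
```

The same calculation applies to the rotated sums of squared loadings, since an orthogonal rotation such as varimax redistributes variance among the retained components without changing their total.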

Should I then go back into SPSS and run a linear regression (Analyze -> Regression -> Linear) with the Dependent Variable being % of students graduating HS within 4 years and the Independent Variables being FAC1_1, FAC1_2, FAC1_3, FAC1_4, and FAC1_5?

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Principal components analysis + regression?

Mike
A few questions:

(1) You say you have "a survey of 50 states" -- does this mean you have N = 50 (e.g., mean values for each state), or do you have an N for each state and the total N is the sum of these?

(2) Was the principal components analysis done on 31 variables? If so, your answer to (1) becomes very important (it should not be N=50).

(3) You might consider recoding your variables to use a scale of increasing "degree of control", for example: 0 = no control, 1 = partial control, 2 = total control. Consider calculating polychoric correlations among the variables (recoded or not, but they will give opposite results) if you meet the assumptions (e.g., underlying normal distribution for each variable). Consider doing a factor analysis on the polychoric correlations. If this is unfamiliar to you, take a look at the literature on this. One starting point is the following article:

Flora, D. B., LaBrish, C., & Chalmers, R. P. (2012). Old and new ideas for data screening and assumption testing for exploratory and confirmatory factor analysis. Frontiers in Psychology, 3.
Available at:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3290828/

NOTE: Assuming you use something like MPLUS or other structural equation modeling (SEM) software, and if you have a good factor model for the polychoric correlations, you can add your "dependent variable" into the analysis and regress it on the derived factors/latent variables (which is sort of what you're trying to do with the principal components regression).

(4) If you stick with a form of principal components regression, I would suggest you first take a look at the simple Pearson r between your dependent variable and *all* of the component scores. Why all instead of just the largest? Because research using this technique shows that the largest components may not be related to the dependent variable while some smaller ones are. For more on this point, see:

Jolliffe, I. T. (1982). A note on the use of principal components in regression. Applied Statistics, 300-303.
Check scholar.google.com for where you can get a copy of this article.
and
Hadi, A. S., & Ling, R. F. (1998). Some cautionary notes on the use of principal components regression. The American Statistician, 52(1), 15-19.
NOTE: article is available at:
www.uvm.edu/~rsingle/stat380/F04/possible/Hadi+Ling-AmStat-1998_PCRegression.pdf
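
The screening step in (4) can be sketched outside SPSS as follows. This is only an illustration under stated assumptions: the helper names `pearson_r` and `screen_components` are hypothetical, the five data rows are made up, and in practice the score columns would be the saved FAC1_1 ... FAC1_9 variables for all cases.

```python
import math


def pearson_r(x, y):
    """Plain Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def screen_components(scores, outcome):
    """Correlate every component score with the outcome, largest |r| first."""
    results = [(name, pearson_r(values, outcome)) for name, values in scores.items()]
    return sorted(results, key=lambda item: abs(item[1]), reverse=True)


# Made-up graduation rates and component scores for five cases (illustration only).
grad_rate = [70.0, 75.0, 80.0, 85.0, 90.0]
scores = {
    "FAC1_1": [0.5, -1.0, 1.2, -0.9, 0.2],   # "largest" component, weakly related here
    "FAC1_2": [-1.2, -0.6, 0.1, 0.5, 1.2],   # smaller component, strongly related here
}
ranked = screen_components(scores, grad_rate)
for name, r in ranked:
    print(name, round(r, 3))
```

In this made-up example the second component is the one related to the outcome, which is the situation the Jolliffe and Hadi & Ling papers warn about.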

I'm sure others will have something to say in addition.

-Mike Palij
New York University
[hidden email]



----- Original Message -----
From: "cynicalflyer" <[hidden email]>
To: <[hidden email]>
Sent: Tuesday, August 26, 2014 11:48 AM
Subject: Principal components analysis + regression?



Re: Principal components analysis + regression?

cynicalflyer

1) I have a survey that consisted of 31 questions, administered to each of the 50 states.

2) I'm not even sure what this is asking. My data is as follows:

State      State control over X   State control over Y   State control over Z
Alabama    1                      2                      3
Alaska     2                      2                      3
Arizona    2                      2                      3

In SPSS I ran Analyze -> Dimension Reduction -> Factor, where the Variables were "State control over X", "State control over Y", "State control over Z", and so on, for a total of 31 variables.

3) I'll recode it the other way.

4) I'll look at the simple Pearson r between my dependent variable and *all* of the component scores.
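
For item 3, the recode from 1/2/3 (total/partial/no control) to 2/1/0 (so that larger numbers mean more control) amounts to subtracting each value from 3. A minimal sketch, using the three example rows above; the function name `reverse_code` is hypothetical, and in SPSS itself this would normally be done with a RECODE or COMPUTE step instead.

```python
def reverse_code(value, low=1, high=3):
    """Map 1/2/3 (total/partial/no control) to 2/1/0 (more control = larger)."""
    if not low <= value <= high:
        raise ValueError(f"value {value} is outside the {low}-{high} scale")
    return high - value


# The three example rows from the post: control over X, Y, Z.
original = {
    "Alabama": [1, 2, 3],
    "Alaska": [2, 2, 3],
    "Arizona": [2, 2, 3],
}
recoded = {
    state: [reverse_code(v) for v in values]
    for state, values in original.items()
}
for state, values in recoded.items():
    print(state, values)
```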


Re: Principal components analysis + regression?

Rich Ulrich
I'm pretty sure that the point of Mike's (2) is that an N of 50 is far too small for a PCA or PFA on 31 items. The rule of thumb says 10 or 20 times as many cases as items. Smaller N's work when there are higher correlations and more evident structure than what you are apt to see with items scored as 0,1,2. So your PCA should be pretty thoroughly unreliable.
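
To put numbers on that rule of thumb: 31 items would call for roughly 310 to 620 cases, whereas 50 states give about 1.6 cases per item. A small sketch of the arithmetic; the function name `required_cases` is made up for this illustration.

```python
def required_cases(n_items, ratio_low=10, ratio_high=20):
    """Range of cases suggested by a 10-to-20 cases-per-item rule of thumb."""
    return n_items * ratio_low, n_items * ratio_high


n_items = 31   # survey items entered into the PCA
n_cases = 50   # one row per state
low, high = required_cases(n_items)
cases_per_item = n_cases / n_items

print(f"rule of thumb: {low} to {high} cases")
print(f"available: {n_cases} cases, {cases_per_item:.2f} per item")
```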

I might try a PFA, assuming common factors; and if there are some factors that
come out of varimax rotation which also have face validity, I would score up those
factors from the high loadings.  Every variable not in one of those factors would
be preserved to look at separately.  The whole set of analyses, with too many
variables for the N, should be regarded as largely exploratory.
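
One way to read "score up those factors from the high loadings" is a unit-weighted composite: assign each item to the factor on which it loads highest, average the assigned items, and keep the rest as separate variables. The sketch below assumes a cutoff of 0.5 for a "high" loading, which is an illustrative choice and not something stated in this thread; the function names and the loadings are likewise hypothetical.

```python
def assign_items(loadings, threshold=0.5):
    """Group items by the factor with their largest absolute loading.

    loadings: dict mapping item name -> list of loadings, one per factor.
    Items whose best loading is below the threshold are returned separately.
    Note: an item with a high negative loading would need reverse scoring
    before averaging; that step is not handled here.
    """
    factors = {}
    leftover = []
    for item, row in loadings.items():
        best = max(range(len(row)), key=lambda j: abs(row[j]))
        if abs(row[best]) >= threshold:
            factors.setdefault(best, []).append(item)
        else:
            leftover.append(item)
    return factors, leftover


def composite_score(case, items):
    """Unit-weighted score: the mean of the case's values on the given items."""
    return sum(case[item] for item in items) / len(items)


# Hypothetical rotated loadings for three items on two factors.
loadings = {"X": [0.8, 0.1], "Y": [0.7, 0.2], "Z": [0.1, 0.3]}
factors, leftover = assign_items(loadings)

# One case, already coded 0/1/2 with larger values meaning more control.
alabama = {"X": 2, "Y": 1, "Z": 0}
score = composite_score(alabama, factors[0])
print(factors, leftover, score)
```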

ANOTHER ISSUE.
I would expect that several states would show heterogeneity in the outcome (graduation rate) between cities, or between cities and rural areas; that situation (if it exists) would imply that using an average figure is not a very good idea if you want to explain those rates. Similarly: Is "State control over X" homogeneous within states, or is that also problematic?

--
Rich Ulrich


Date: Tue, 26 Aug 2014 10:41:41 -0700
From: [hidden email]
Subject: Re: Principal components analysis + regression?
To: [hidden email]


