Posted by Mike on Aug 26, 2014; 5:09pm
URL: http://spssx-discussion.165.s1.nabble.com/Principal-components-analysis-regression-tp5727080p5727081.html
A few questions:
(1) You say you have "a survey of 50 states" -- does this mean you have
N = 50 (e.g., mean values for each state), or do you have a separate N
within each state, with the total N being the sum of these?
(2) Was the principal components analysis done on 31 variables? If so,
your answer to (1) becomes very important (it should not be N=50).
(3) You might consider recoding your variables to use a scale of
increasing "degree of control", for example: 0 = no control,
1 = partial control, 2 = total control.
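A minimal sketch of that recoding in Python, assuming the original coding from the question (1 = total, 2 = partial, 3 = no control). The function name `recode_control` and the sample responses are made up for illustration:

```python
# Assumed mapping: original survey code -> "degree of control" code.
#   1 (total control)   -> 2
#   2 (partial control) -> 1
#   3 (no control)      -> 0
RECODE = {1: 2, 2: 1, 3: 0}

def recode_control(values):
    """Reverse-code one item; missing responses (None) stay missing."""
    return [RECODE[v] if v is not None else None for v in values]

# Made-up responses for a single item (not from the actual survey).
original = [1, 3, 2, 2, 1]
print(recode_control(original))  # prints [2, 0, 1, 1, 2]
```

In SPSS itself the same thing would be done with a recode of the 31 items; the Python version is only meant to make the mapping explicit.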
Consider calculating polychoric correlations among the variables
(recoded or not, though the two codings will give opposite results) if
you meet the assumptions (e.g., an underlying normal distribution for
each variable). Consider doing a factor analysis on the polychoric
correlations. If this is unfamiliar to you, take a look at the
literature on it. One starting point is the following article:
Flora, D. B., LaBrish, C., & Chalmers, R. P. (2012). Old and new ideas
for data screening and assumption testing for exploratory and
confirmatory factor analysis. Frontiers in Psychology, 3.
Available at:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3290828/
NOTE: Assuming you use something like Mplus or other structural
equation modeling (SEM) software, and if you have a good factor model
for the polychoric correlations, you can add your "dependent variable"
into the analysis and regress it on the derived factors/latent
variables (which is sort of what you're trying to do with the
principal components regression).
(4) If you stick with a form of principal components regression, I
would suggest you first take a look at the simple Pearson r between
your dependent variable and *all* of the component scores. Why all
instead of just the first few largest? Because research using this
technique shows that the largest components may be unrelated to the
dependent variable while some smaller ones are related to it. For more
on this point, see:
Jolliffe, I. T. (1982). A note on the use of principal components in
regression. Applied Statistics, 300-303.
and
Hadi, A. S., & Ling, R. F. (1998). Some cautionary notes on the use of
principal components regression. The American Statistician, 52(1),
15-19.
NOTE: article is available at:
www.uvm.edu/~rsingle/stat380/F04/possible/Hadi+Ling-AmStat-1998_PCRegression.pdf
Check scholar.google.com for where you can get a copy of the Jolliffe
article.
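As a rough illustration of the screening step in (4), here is a small Python sketch that correlates the dependent variable with every saved component score and ranks the components by |r|. The numbers are made up (6 cases, 3 components), the column names simply follow the ones in the question, and the helpers `pearson_r` and `rank_components` are hypothetical names:

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rank_components(scores, dv):
    """Return (name, r) pairs sorted by |r|, strongest first."""
    rs = [(name, pearson_r(col, dv)) for name, col in scores.items()]
    return sorted(rs, key=lambda pair: abs(pair[1]), reverse=True)

# Made-up component scores and graduation rates (NOT real data).
scores = {
    "FAC1_1": [1.0, -1.0, 1.0, -1.0, 0.0, 0.0],
    "FAC1_2": [0.0, 0.0, 1.0, 1.0, -1.0, -1.0],
    "FAC1_3": [1.0, 1.0, -0.5, -0.5, -0.5, -0.5],
}
grad_rate = [83.0, 85.0, 77.0, 79.0, 78.0, 78.0]

ranked = rank_components(scores, grad_rate)
for name, r in ranked:
    print(f"{name}: r = {r:+.3f}")
```

In this toy data the third component has the strongest correlation with the dependent variable, which is the kind of situation Jolliffe and Hadi & Ling warn about: selecting components by variance explained alone would have dropped it.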
I'm sure others will have something to say in addition.
-Mike Palij
New York University
[hidden email]
----- Original Message -----
From: "cynicalflyer" <[hidden email]>
To: <[hidden email]>
Sent: Tuesday, August 26, 2014 11:48 AM
Subject: Principal components analysis + regression?
Here's my conundrum-
Independent Variable: I have a survey of 50 states indicating the amount
of control the state board of education has in 31 areas answered on a
three point scale (1 = total control; 2 = partial control; 3 = no
control). I have a solid theoretical underpinning for all 31, or to be
more precise, the literature review found evidence for all 31 as being
important (Study X found items 1, 4 and 7; Study Y found items 2, 9, and
11, etc.)
Dependent variable: % of students graduating HS within 4 years.
1) In SPSS Analyze -> Dimension Reduction -> Factor
2) Descriptives: Initial Solution
3) Extraction: Method = Principal components; Analyze = Correlation
matrix; Display = Unrotated factor solution and Scree plot; Extract:
Based on Eigenvalue greater than 1; Maximum Iterations for Convergence =
25
4) Rotation: Method = Varimax; Display = Rotated Solution and Loading
Plots; Maximum Iterations for Convergence = 25
5) Scores: Save as variables; Method = Regression; Display factor score
coefficient matrix
6) Options: Exclude cases listwise; Suppress small coefficients [with]
absolute value below .10
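For readers who want to see what the extraction steps above amount to numerically, here is a hedged Python/NumPy sketch. It uses made-up random data (50 cases by 6 items coded 1-3) in place of the real survey, and it deliberately leaves out the Varimax rotation and the regression-method score standardization that SPSS applies, so it will not reproduce the saved SPSS columns:

```python
import numpy as np

# Made-up stand-in data: 50 cases x 6 items coded 1-3 (NOT the survey).
rng = np.random.default_rng(0)
X = rng.integers(1, 4, size=(50, 6)).astype(float)

# PCA on the correlation matrix (SPSS: Analyze = Correlation matrix).
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]      # re-sort, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Extraction rule from step 3: keep components with eigenvalue > 1.
keep = eigvals > 1.0
n_keep = int(keep.sum())

# Percent of total variance per component; the eigenvalues of a
# correlation matrix sum to the number of variables.
pct_var = 100.0 * eigvals / eigvals.sum()

# Unrotated component scores computed from standardized items.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
scores = Z @ eigvecs[:, keep]

print("components kept:", n_keep)
print("cumulative % variance:", np.round(np.cumsum(pct_var[:n_keep]), 2))
```

The `scores` array plays the role of the saved component columns, up to the rotation and scaling differences noted above.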
The result is 9 saved columns (FAC1_1, FAC1_2, FAC1_3...FAC1_9) in the
SPSS sheet.
The Total Variance Explained -> Rotation Sums of Squared Loadings
indicates that the first 5 of these explain 51.51% of the variance.
Should I then go back into SPSS and run a linear regression (Analyze ->
Regression -> Linear) with the Dependent Variable being % of students
graduating HS within 4 years and the Independent Variables being FAC1_1,
FAC1_2, FAC1_3, FAC1_4, and FAC1_5?
=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
=====================