Here's my conundrum-
Independent variable: I have a survey of 50 states indicating the amount of control the state board of education has in 31 areas, answered on a three-point scale (1 = total control; 2 = partial control; 3 = no control). I have a solid theoretical underpinning for all 31, or to be more precise, the literature review found evidence for all 31 as being important (Study X found items 1, 4 and 7; Study Y found items 2, 9, and 11, etc.).

Dependent variable: % of students graduating HS within 4 years.

1) In SPSS: Analyze -> Dimension Reduction -> Factor
2) Descriptives: Initial Solution
3) Extraction: Method = Principal components; Analyze = Correlation matrix; Display = Unrotated factor solution and Scree plot; Extract: Based on Eigenvalue greater than 1; Maximum Iterations for Convergence = 25
4) Rotation: Method = Varimax; Display = Rotated Solution and Loading Plots; Maximum Iterations for Convergence = 25
5) Scores: Save as variables; Method = Regression; Display factor score coefficient matrix
6) Options: Exclude cases listwise; Suppress small coefficients [with] absolute value below .10

The result is 9 saved columns (FAC1_1, FAC1_2, FAC1_3...FAC1_9) in the SPSS sheet. The Total Variance Explained -> Rotation Sums of Squared Loadings table indicates that the first 5 of these explain 51.51% of the variance.

Should I then go back into SPSS and run a linear regression (Analyze -> Regression -> Linear) with the dependent variable being % of students graduating HS within 4 years and the independent variables being FAC1_1, FAC1_2, FAC1_3, FAC1_4, and FAC1_5?

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
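For readers who want to see what the extraction steps amount to numerically, here is a minimal Python/NumPy sketch. It is not the SPSS procedure itself: it uses simulated random responses (not the survey data), leaves out the Varimax rotation, and produces unstandardized component scores, so the numbers will not match the saved FAC1_* columns. The variable names are made up for illustration.

```python
import numpy as np

# Simulated stand-in for the survey: 50 states x 31 items coded 1-3.
# These are random numbers, NOT the poster's data.
rng = np.random.default_rng(0)
items = rng.integers(1, 4, size=(50, 31)).astype(float)

# Standardize each item so the analysis is on the correlation matrix.
z = (items - items.mean(axis=0)) / items.std(axis=0, ddof=1)
corr = np.corrcoef(items, rowvar=False)

# Eigen-decomposition of the symmetric correlation matrix.
# eigh returns eigenvalues in ascending order, so sort descending.
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# "Eigenvalue greater than 1" extraction rule (Kaiser criterion).
n_keep = int((eigvals > 1.0).sum())

# Unrotated, unstandardized component scores for the retained components.
scores = z @ eigvecs[:, :n_keep]

# With a correlation matrix, total variance equals the number of items,
# so each component's share is eigenvalue / 31.
explained = eigvals / eigvals.sum()
print(n_keep, round(float(explained[:n_keep].sum()), 3))
```

Because the simulated items are pure noise, any components retained here reflect sampling error only, which is worth keeping in mind when reading an "eigenvalue greater than 1" result from 50 cases.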
A few questions:
(1) You say you have "a survey of 50 states" -- does this mean you have N = 50 (e.g., mean values for each state), or do you have an N for each state, with the total N being the sum of these?

(2) Was the principal components analysis done on 31 variables? If so, your answer to (1) becomes very important (it should not be N=50).

(3) You might consider recoding your variables to use a scale of increasing "degree of control", for example: 0 = no control, 1 = partial control, 2 = total control. Consider calculating polychoric correlations among the variables (recoded or not, but they will give opposite results) if you meet the assumptions (e.g., an underlying normal distribution for each variable). Consider doing a factor analysis on the polychoric correlations. If this is unfamiliar to you, take a look at the literature on this. One starting point is the following article:

Flora, D. B., LaBrish, C., & Chalmers, R. P. (2012). Old and new ideas for data screening and assumption testing for exploratory and confirmatory factor analysis. Frontiers in Psychology, 3.
Available at: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3290828/

NOTE: Assuming you use something like MPLUS or other structural equation modeling (SEM) software, and if you have a good factor model for the polychoric correlations, you can add your "dependent variable" into the analysis and regress it on the derived factors/latent variables (which is sort of what you're trying to do with the principal components regression).

(4) If you stick with a form of principal components regression, I would suggest you first take a look at the simple Pearson r between your dependent variable and *all* of the component scores. Why all instead of just the largest few? Because research using this technique shows that it could be that the largest components are not related to the dependent variable but some smaller ones are. For more on this point, see:

Jolliffe, I. T. (1982). A note on the use of principal components in regression. Applied Statistics, 300-303.
Check scholar.google.com for where you can get a copy of this article.

and

Hadi, A. S., & Ling, R. F. (1998). Some cautionary notes on the use of principal components regression. The American Statistician, 52(1), 15-19.
NOTE: article is available at: www.uvm.edu/~rsingle/stat380/F04/possible/Hadi+Ling-AmStat-1998_PCRegression.pdf

I'm sure others will have something to say in addition.

-Mike Palij
New York University
[hidden email]

----- Original Message -----
From: "cynicalflyer" <[hidden email]>
To: <[hidden email]>
Sent: Tuesday, August 26, 2014 11:48 AM
Subject: Principal components analysis + regression?
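The point in (4) about small components can be illustrated with a toy case. The sketch below uses only the Python standard library and made-up simulated data with just two predictors, where the principal components of the correlation matrix are known in closed form: for two standardized variables with correlation r > 0, they are (z1 + z2)/sqrt(2) with eigenvalue 1 + r and (z1 - z2)/sqrt(2) with eigenvalue 1 - r. The dependent variable is constructed, by assumption, to depend on the difference between the predictors, so it tracks the small component.

```python
import math
import random

def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def standardize(v):
    """Center and scale a list to mean 0 and sample SD 1."""
    n = len(v)
    m = sum(v) / n
    sd = math.sqrt(sum((a - m) ** 2 for a in v) / (n - 1))
    return [(a - m) / sd for a in v]

random.seed(1)
n = 50

# Two hypothetical predictors that share most of their variance.
shared = [random.gauss(0, 1) for _ in range(n)]
x1 = [s + 0.3 * random.gauss(0, 1) for s in shared]
x2 = [s + 0.3 * random.gauss(0, 1) for s in shared]

# Hypothetical outcome driven by the DIFFERENCE between the predictors.
y = [a - b + 0.1 * random.gauss(0, 1) for a, b in zip(x1, x2)]

z1, z2 = standardize(x1), standardize(x2)
r12 = pearson_r(x1, x2)

# Closed-form components for the two-variable correlation matrix.
pc_large = [(a + b) / math.sqrt(2) for a, b in zip(z1, z2)]  # eigenvalue 1 + r
pc_small = [(a - b) / math.sqrt(2) for a, b in zip(z1, z2)]  # eigenvalue 1 - r
eig_large, eig_small = 1 + r12, 1 - r12

r_large = pearson_r(y, pc_large)
r_small = pearson_r(y, pc_small)
print(round(eig_large, 2), round(r_large, 2))
print(round(eig_small, 2), round(r_small, 2))
```

Under an "eigenvalue greater than 1" rule only the first component would be kept here, even though the outcome was built to follow the second, which is why checking the correlation with every component before discarding any seems prudent.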
In reply to this post by cynicalflyer
1) I have a survey that consisted of 31 questions, administered to each of the 50 states.
2) I'm not even sure what this is asking. My data is as follows
In SPSS I chose Analyze -> Dimension Reduction -> Factor -> Variables, where the variables were "State control over X", "State control over Y", "State control over Z", and so on, for a total of 31 variables.
3) I'll recode it the other way.
4) I'll look at the simple Pearson r between my dependent variable and *all* of the component scores.
I'm pretty sure that the point of Mike's (2) is that an N of 50 is far too small for a PCA or PFA on 31 items. The rule of thumb says 10 or 20 times as many cases as items. Smaller N's work when there are higher correlations and more evident structure than what you are apt to see with items scored as 0,1,2.

- So - Your PCA should be pretty thoroughly unreliable.

I might try a PFA, assuming common factors; and if there are some factors that come out of varimax rotation which also have face validity, I would score up those factors from the high loadings. Every variable not in one of those factors would be preserved to look at separately. The whole set of analyses, with too many variables for the N, should be regarded as largely exploratory.

ANOTHER ISSUE. I would expect that several states would show heterogeneity in the outcome (graduation rate) between cities, or between cities and rural areas; that situation (if it exists) would imply that using an average figure is not a very good idea if you want to explain those rates. Similarly: Is "State control over X" homogeneous within states, or is that also problematic?

-- Rich Ulrich

Date: Tue, 26 Aug 2014 10:41:41 -0700
From: [hidden email]
Subject: Re: Principal components analysis + regression?
To: [hidden email]
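To put numbers on the rule of thumb, and to show one possible reading of "score up those factors from the high loadings", here is a short Python sketch. The unit-weighted (simple mean) scoring is an assumption about what is meant, and the item positions and responses are hypothetical.

```python
n_items = 31
n_cases = 50

# Rule of thumb quoted in the thread: 10 to 20 cases per item.
needed_low, needed_high = 10 * n_items, 20 * n_items
print(needed_low, needed_high)        # 310 620
print(round(n_cases / n_items, 2))    # 1.61 cases per item

def unit_weighted_score(responses, item_indexes):
    """Mean of the items that load highly on one factor (equal weights)."""
    return sum(responses[i] for i in item_indexes) / len(item_indexes)

# Hypothetical: one state's 0-2 responses to seven items, with a factor
# assumed to be defined by the items at positions 0, 3 and 6.
state = [2, 0, 1, 2, 1, 0, 2]
print(unit_weighted_score(state, [0, 3, 6]))  # 2.0
```

Equal weights avoid estimating 31 separate scoring coefficients from 50 cases; items that do not load highly on any factor would stay out of the composites and be examined on their own.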