Posted by Hector Maletta on Aug 18, 2006; 2:12pm
URL: http://spssx-discussion.165.s1.nabble.com/Reference-category-for-dummies-in-factor-analysis-tp1070339p1070344.html
Art,
Thanks for your interesting response. We used PCA with communalities of 1.00,
i.e. extracting 100% of the variance. By "classical" I meant parametric factor
analysis and not any form of optimal scaling or alternating least squares
forms of data reduction. I am now considering these other alternatives under
advice of Anita van der Kooij.
I share your uncertainty about why results change depending on which
category is omitted, and that was the original question starting this
thread. Since nobody else seems to have an answer I will offer one purely
numerical hypothesis. The exercise with the varying results, it turns out,
was done by my colleague not with the entire census but with a SAMPLE of the
Peru census (about 200,000 households, still a lot but perhaps not so much
for so many variables and factors), and the contributions of the later factors
were pretty small. SPSS provides, as is well known, a precision of no more
than about 15 significant digits. So it is just possible that some matrix
figures for some of the minor factors differed only in the 15th or 16th
digit (or further down) and were then taken as equal, and this may
have caused some matrix to be singular or (most probably) near-singular, and
the results to be still computable but unstable. Moreover, some of the
categories in census questions used as reference or omitted categories were
populated by very few cases, which may have compounded the problem. Since
running this on the entire census (which would enhance statistical
significance and stability of results) takes a lot of computer time and has
to be done several times with different reference categories, we have not
done it yet but will proceed soon and report back. But I wanted to know
whether some mathematical reason existed for the discrepancy.
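
As a minimal illustration of the rounding effect described above, here is a sketch in Python rather than SPSS. It assumes SPSS stores numbers as 64-bit IEEE doubles, as Python does; that assumption is not confirmed anywhere in this thread. Under it, a difference around the 16th decimal place of a number near 1 is simply lost:

```python
import sys

# Decimal digits a double can always represent faithfully
# (15 for IEEE-754 doubles).
digits = sys.float_info.dig
# Smallest step above 1.0 that a double can represent (about 2.2e-16).
eps = sys.float_info.epsilon

# A difference in the 15th decimal place survives ...
survives = (1.0 + 1e-15) != 1.0
# ... but one in the 16th decimal place is rounded away.
vanishes = (1.0 + 1e-16) == 1.0

print(digits, eps, survives, vanishes)
```

Whether such rounding is actually what made the scales differ is a separate question; the sketch only shows that two matrix entries this close together would compare as equal.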
About why I would want to create a single score out of multiple factors, let
us leave it for another occasion since it is a rather complicated story of a
project connecting factor analysis with index number theory and economic
welfare theory.
Hector
_____
From: Art Kendall [mailto:[hidden email]]
Sent: Friday, August 18, 2006 9:34 AM
To: Kooij, A.J. van der; [hidden email]
CC: [hidden email]
Subject: Re: Reference category for dummies in factor analysis
This has been an interesting discussion. I don't know why the FA and scores
would change depending on which category is omitted. Were there errors in
recoding to dummies that could have created different missing values?
You also said classical FA, but then said PCA. What did you use for
communality estimates? 1.00? Squared multiple correlations?
(I'm not sure why you would create a single score if you have multiple
factors either, but that is another question.)
What I do know is that people who know a lot more about CA, MDS, and factor
analysis than I do (like Joe Kruskal, Doug Carroll, Willem Heiser, Phipps
Arabie, Shizuhiko Nishisato, et al.) follow the class-l and mpsych-l
discussion lists.
See
http://aris.ss.uci.edu/smp/mpsych.html and
http://www.classification-society.org/csna/lists.html#class-l
Art Kendall
[hidden email]
Kooij, A.J. van der wrote:
> ... trouble because any category of each original census question would be
> an exact linear function of the remaining categories of the question.
Yes, but this gives trouble in regression, not in PCA, as far as I know.
> In the indicator matrix, one category will have zeroes on all indicator
> variables.
No, and, sorry, I was confused with CA on indicator matrix, but this is
"sort of" PCA. See syntax below (object scores = component scores are equal
to the row scores of CA; category quantifications are equal to the column
scores of CA).
Regards,
Anita.
data list free/v1 v2 v3.
begin data.
1 2 3
2 1 3
2 2 2
3 1 1
2 3 4
2 2 2
1 2 4
end data.
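* Multiple correspondence analysis of the raw categories, 2 dimensions.
Multiple Correspondence v1 v2 v3
  /analysis v1 v2 v3
  /dim=2
  /critit .0000001
  /print discrim quant obj
  /plot none.
* CATPCA with all variables scaled as multiple nominal, 2 dimensions.
catpca v1 v2 v3
  /analysis v1 v2 v3 (mnom)
  /dim=2
  /critit .0000001
  /print quant obj
  /plot none.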
* Same 7 cases as a 7 x 10 indicator matrix with one column per category.
data list free/v1cat1 v1cat2 v1cat3 v2cat1 v2cat2 v2cat3 v3cat1 v3cat2
  v3cat3 v3cat4 .
begin data.
1 0 0 0 1 0 0 0 1 0
0 1 0 1 0 0 0 0 1 0
0 1 0 0 1 0 0 1 0 0
0 0 1 1 0 0 1 0 0 0
0 1 0 0 0 1 0 0 0 1
0 1 0 0 1 0 0 1 0 0
1 0 0 0 1 0 0 0 0 1
end data.
* Simple correspondence analysis of the indicator matrix as a 7 x 10 table.
CORRESPONDENCE
  TABLE = all (7,10)
  /DIMENSIONS = 2
  /NORMALIZATION = cprin
  /PRINT = RPOINTS CPOINTS
  /PLOT = none .
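
The second data list above appears to be the indicator (dummy) coding of the first one. A minimal Python sketch, not part of the original syntax, rebuilds it from the raw categories and compares the two. The helper name `indicator_row` is hypothetical, and the category counts (3, 3, 4) are read off the variable names v1cat1 to v3cat4:

```python
# Raw data from the first DATA LIST (v1 v2 v3).
raw = [
    (1, 2, 3),
    (2, 1, 3),
    (2, 2, 2),
    (3, 1, 1),
    (2, 3, 4),
    (2, 2, 2),
    (1, 2, 4),
]
# Number of categories per variable, read off v1cat1..v3cat4.
n_cats = (3, 3, 4)


def indicator_row(case):
    """Return one indicator-matrix row: a 0/1 column per category."""
    row = []
    for value, k in zip(case, n_cats):
        row.extend(1 if value == cat else 0 for cat in range(1, k + 1))
    return row


indicator = [indicator_row(case) for case in raw]

# Indicator matrix as typed in the second DATA LIST.
listed = [
    [1, 0, 0, 0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 0, 0, 1],
]

# Each row has exactly one 1 per variable, so every row sums to 3.
row_sums = [sum(row) for row in indicator]

print(indicator == listed, row_sums)
```

No category is omitted here: every variable contributes one column per category, which is the coding the CORRESPONDENCE run works on.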
________________________________
From: SPSSX(r) Discussion on behalf of Hector Maletta
Sent: Thu 17/08/2006 19:56
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis
Thank you, Anita. I will certainly look into your suggestion about CATPCA.
However, I suspect some mathematical properties of the scores generated by
CATPCA are not the ones I hope to have in our scale, because of the
non-parametric nature of the procedure (too long to explain here, and I am
not sure I understand it myself).
As for your second idea, I think if you try to apply PCA on dummies not
omitting any category you'd run into trouble because any category of each
original census question would be an exact linear function of the remaining
categories of the question. In the indicator matrix, one category will have
zeroes on all indicator variables, and that one is the "omitted" category.
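
The linear dependence mentioned above can be checked directly. Here is a minimal Python sketch with made-up data (the variable `walls`, its three categories, and the helper `dummies` are all hypothetical). Within one question the category indicators sum to 1 for every case, so the omitted dummy equals 1 minus the sum of the others, and switching the reference category is a linear recoding of the same information:

```python
# Hypothetical census question with categories 1..3, one value per household.
walls = [1, 2, 2, 3, 2, 2, 1]


def dummies(values, omit, k=3):
    """0/1 dummies for categories 1..k, leaving out the reference `omit`."""
    kept = [c for c in range(1, k + 1) if c != omit]
    return [[1 if v == c else 0 for c in kept] for v in values]


first_omitted = dummies(walls, omit=1)  # columns: cat2, cat3
last_omitted = dummies(walls, omit=3)   # columns: cat1, cat2

# The omitted category is an exact linear function of the kept ones.
cat1_rebuilt = [1 - (c2 + c3) for c2, c3 in first_omitted]

# Recover the "last omitted" coding from the "first omitted" coding.
recoded = [[1 - (c2 + c3), c2] for c2, c3 in first_omitted]

print(recoded == last_omitted)
```

Households in the reference category are the ones with zeroes on all the dummies that were kept.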
Hector
-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On behalf of
Kooij, A.J. van der
Sent: Thursday, August 17, 2006 2:37 PM
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis
CATPCA (in the Data Reduction menu, under Optimal Scaling) is PCA for
(ordered/ordinal and unordered/nominal) categorical variables; no need to
use dummies then.
When using PCA on dummies, I think you should not omit dummies: for nominal
variables you can do PCA on an indicator matrix, which has columns that can
be regarded as dummy variables, one column for each category, thus without
omitting one.
Regards,
Anita van der Kooij
Data Theory Group
Leiden University.
________________________________
From: SPSSX(r) Discussion on behalf of Hector Maletta
Sent: Thu 17/08/2006 17:52
To: [hidden email]
Subject: Reference category for dummies in factor analysis
Dear colleagues,
I am re-posting (slightly re-phrased for added clarity) a question I sent
the list about a week ago without eliciting any response as yet. I hope some
factor analysis experts may be able to help.
In a research project on which we work together, a colleague of mine
constructed a scale based on factor scores obtained through classical factor
analysis (principal components) of a number of categorical census variables
all transformed into dummies. The variables concerned the standard of living
of households and included quality of dwelling and basic services such as
sanitation, water supply, electricity and the like. (The scale was not
simply the score for the first factor, but the average score of several
factors, weighted by their respective contribution to explaining the overall
variance of observed variables, but this is, I surmise, beside the point.)
Now, he found out that the choice of reference or "omitted" category for
defining the dummies has an influence on results. He first ran the analysis
using the first category of all categorical variables as the reference
category, and then repeated the analysis using the last category as the
reference or omitted category, whatever they might be. He found that the
resulting scale varied not only in absolute value but also in the shape of
its distribution.
I can understand that the absolute value of the factor scores may change and
even the ranking of the categories of the various variables (in terms of
their average scores) may also be different, since after all the list of
dummies used has varied and the categories are tallied each time against a
different reference category. But the shape of the scale distribution should
not change, I guess, especially not in a drastic manner. In this case the
shape of the scale frequency distribution did change. Both distributions
were roughly normal, with a kind of "hump" on one side, one of them on the
left and the other on the right, probably due to the change in reference
categories, but also with changes in the range of the scale and other
details.
Also, he found that the two scales were not perfectly correlated and,
moreover, that their correlation was negative. That the correlation was
negative may be understandable: the first category in such census variables
is usually a "good" one (for instance, a home with walls made of brick or
concrete) and the last one is frequently a "bad" one (earthen floor) or a
residual heterogeneous one including bad options ("other" kinds of roof).
But since the two scales are just different combinations of the same
categorical variables based on the same statistical treatment of their given
covariance matrix, one should expect a closer, indeed a perfect correlation,
even if a negative one is possible for the reasons stated above. Changing
the reference category should be like changing the unit of measurement or
the position of the zero point (like passing from Celsius to Fahrenheit), a
decision not affecting the correlation coefficient with other variables. In
this case, instead, the two scales had r = -0.54, implying they shared only
29% of their variance (r squared = 0.29). This held even in the extreme case
when ALL the possible factors (as many as variables) were extracted and all
their scores averaged into the scale, so that the entire variance, common or
specific, of the whole set of variables was taken into account.
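
The Celsius/Fahrenheit analogy and the 29% figure can both be checked numerically. Here is a minimal Python sketch with made-up numbers (the `pearson` helper and both data lists are illustrative only, not census data):

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


celsius = [12.0, 18.5, 21.0, 25.5, 30.0]
other = [3.1, 2.4, 4.0, 4.8, 5.5]  # any other variable

# Change of unit and zero point: F = 1.8 * C + 32.
fahrenheit = [1.8 * c + 32.0 for c in celsius]

r_c = pearson(celsius, other)
r_f = pearson(fahrenheit, other)
# Reversing the direction of the scale only flips the sign of r.
r_neg = pearson([-c for c in celsius], other)

# Shared variance implied by the reported correlation of -0.54.
shared = (-0.54) ** 2

print(r_c, r_f, r_neg, round(shared, 4))
```

A pure change of unit, zero point or direction leaves the absolute correlation untouched, which is why r = -0.54 between the two scales is the puzzling part.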
I should add that the dataset was a large sample of census data, and all the
results were statistically significant.
Any ideas why choosing different reference categories for dummy conversion
could have such impact on results? I would greatly appreciate your thoughts
in this regard.
Hector