SPSSX Discussion - Re: Reference category for dummies in factor analysis

Posted by Kooij, A.J. van der on Aug 18, 2006; 6:08pm
URL: http://spssx-discussion.165.s1.nabble.com/Reference-category-for-dummies-in-factor-analysis-tp1070339p1070351.html

> can you say that the solution achieved by CATPCA on a set of
> categorical (nominal) variables is the same achieved by PCA by
> converting all categories (minus 1) of each categorical variable
> into dummies?

No. CATPCA with the multiple nominal scaling level is (Multiple)
Correspondence Analysis (with only 1 dimension it is also equal to
CATPCA with the nominal scaling level), and its solution is the same as
the solution obtained from analyzing the indicator matrix (see previous
mail). This solution is not equal to the PCA solution for dummies, with
or without omitting categories. Maybe it is possible to compute the
CATPCA results from the dummy solution and vice versa (as is possible
for CATREG nominal and linear regression on dummies), but I don't know
how, and actually I don't think so.
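
The point that PCA on dummies depends on which category is omitted can be checked on a very small case. The following is only a sketch, not part of the original exchange: it takes the single variable v1 from the toy data in the syntax quoted further down in this thread, assumes a PCA on the correlation matrix of the two retained dummies, and uses helper names (dummies, cov, corr_eigenvalues) that are made up for the illustration. It also shows that the full set of indicator columns is exactly collinear, since every row sums to 1.

```python
import math

# v1 from the toy data set in the syntax quoted in this thread
v1 = [1, 2, 2, 3, 2, 2, 1]


def dummies(values, categories):
    """Return one 0/1 column per requested category."""
    return [[1.0 if v == c else 0.0 for v in values] for c in categories]


def cov(x, y):
    """Population covariance (divide by n)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n


def corr_eigenvalues(x, y):
    """Eigenvalues of the 2x2 correlation matrix of x and y: 1 + |r|, 1 - |r|."""
    r = cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))
    return (1 + abs(r), 1 - abs(r))


d1, d2, d3 = dummies(v1, [1, 2, 3])

# The full indicator set is exactly collinear: every row sums to 1.
row_sums = [a + b + c for a, b, c in zip(d1, d2, d3)]

# Reference category 1 (keep d2, d3) versus reference category 3 (keep d1, d2).
eig_ref1 = corr_eigenvalues(d2, d3)
eig_ref3 = corr_eigenvalues(d1, d2)

print("row sums of the full indicator set:", row_sums)
print("eigenvalues, category 1 omitted:", eig_ref1)
print("eigenvalues, category 3 omitted:", eig_ref3)
```

The two sets of eigenvalues differ because dropping a different category replaces one dummy by 1 minus the sum of the others, which is not a rotation of the retained columns. The sketch says nothing about how large the effect is in a real census data set with many variables.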

Regards,
Anita

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Hector Maletta
Sent: 18 August 2006 18:16
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis

Anita,
You are saying that the solution attained with CATPCA by assigning
numeric values to all categories is the same solution attained by PCA
run on the interval variables resulting from the transformation of
categories by CATPCA. This I do not dispute. Let me ask you this: can
you say that the solution achieved by CATPCA on a set of categorical
(nominal) variables is the same achieved by PCA by converting all
categories (minus 1) of each categorical variable into dummies? If so,
it would follow that the choice of the omitted category should not
matter: all solutions in PCA, with different omitted categories, should
coincide with CATPCA. Is it so? Hector

-----Original Message-----
From: Kooij, A.J. van der [mailto:[hidden email]]
Sent: Friday, August 18, 2006 1:02 PM
To: Hector Maletta
Subject: RE: Re: Reference category for dummies in factor analysis

> ... we are now analyzing whether by using categorical factor analysis
> we may lose some useful mathematical properties of PCA

I don't think you will lose anything (but if you do, let me know) because
the CATPCA model is the linear PCA model for scaling levels other than
mnom (multiple nominal), and also with mnom for a 1-dimensional solution.
To check this you can save the transformed data and use these as input to
FACTOR PCA: results are equal to CATPCA results (sometimes slightly
different with default iteration options; in that case adjust the
convergence criterion and/or the maximum number of iterations). With
mnom scaling level and a multi-dimensional solution you have transformed
data for each dimension; FACTOR PCA on the transformed variables for a
dimension is then equal to the CATPCA results for that dimension.

Another suggestion regarding regions or something like that:
To explore if regions differ, you can include a region variable as
supplementary. This variable is not used in the analysis, but is fitted
into the solution. If it does not fit well, regions do not differ with
respect to the solution; if it does fit well you can inspect the
categories plot to see how regions relate to the analysis variables.

Regards,
Anita

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Hector Maletta
Sent: 18 August 2006 17:19
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis

Art,
Interesting thoughts, as usual from you. I didn't know about the higher
precision of exponents. About non parametric factor analysis based on
alternating least squares I have received excellent advice from one of
the major figures in the field, Anita van der Kooij (one of the
developers of the Categories SPSS module where these procedures are
included). Of course in those approaches you do not drop any category,
all categories are quantified and you get a unique set of factor scores,
so my problem disappears. Since our project involves using the scores in
a model involving index number theory and neoclassical economics, we are
now analyzing whether by using categorical factor analysis we may lose
some useful mathematical properties of PCA. Regarding your idea of
differentiating by region or sector: Of course, this kind of scale is
computed in order to be used in analytical and practical applications
involving geographical breakdown (e.g. to improve or refine targeting of
social programs) and other analytical subdivisions (such as sector of
employment). For some applications it makes sense to compute a
region-based scale and for other purposes a nationwide one. In our case
we are working in the context of nationally adopted goals of equitable
development within so-called United Nations Millennium Development
Goals, and therefore the standards for the standard of living should
be set at national level, because governments establish such goals for
all the people of their nations. However, in equations where the scale
is a predictor, regions are certainly another likely predictor along
with other variables of interest (employment and education level of
household adults, say), to predict outcomes such as children dropping
out of school or child mortality which figure outstandingly in
Millennium Goals. Hector

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Art Kendall
Sent: Friday, August 18, 2006 11:53 AM
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis

If you think that you might have been approaching a zero determinant,
you might try a pfa and see what the communalities look like. Obviously
the determinant would be zero if all of the categories for a single
variable had dummies. Also, the 16 or so decimal digits is only in the
mantissa, the exponent can go to something like 1022 or 1023 depending
on the sign.
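
The distinction between digits of precision and range of the exponent can be inspected directly. The following is only a sketch in Python, under the assumption that the values in question are IEEE 754 double-precision numbers (whether a given SPSS computation behaves exactly like this is an assumption here, not something established in this thread). Note that Python reports the exponent limits in the C convention, which is offset by one from the IEEE 754 convention of -1022 to 1023.

```python
import sys

# Assumes IEEE 754 double precision, which CPython uses on common platforms.
info = sys.float_info

print("significand bits:", info.mant_dig)
print("guaranteed decimal digits:", info.dig)
print("binary exponent limits (C convention):", info.min_exp, info.max_exp)
print("decimal exponent limits:", info.min_10_exp, info.max_10_exp)

# Two values that differ only far beyond the 16th significant digit
# are stored as the same double.
a = 1.0
b = 1.0 + 1e-17
same = (a == b)
print("1.0 and 1.0 + 1e-17 compare equal:", same)

# Magnitude is a separate matter: a very small number is still
# representable and keeps its full relative precision.
tiny = 1.2345678901234567e-300
print(tiny)
```

So small eigenvalues or matrix entries are not lost merely because they are small; what can be lost is the difference between two numbers of similar size that agree in their first 15 or 16 significant digits.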

It really would be interesting to hear what some of the inventors of
extensions to FA, like CA, INDSCAL, ALSCAL, PROXSCAL have to say about
your problem. I would urge you to post a description of what you are
trying to do and the problem you ran into on the lists I mentioned.

Just to stir the pot. Is it possible that the relation of the variables
would vary by market areas e.g., manufacturing zones, fishing coastal,
vs field agricultural vs herding etc.? So that something like an INDSCAL
with a correlation or other variable similarity measure per "region"
would be informative and fit the data even better?

Art
Social Research Consultants
[hidden email]

Hector Maletta wrote:

> Art,
>
> Thanks for your interesting response. We used PCA, with 1.00
> communality, i.e. extracting 100% variance. By "classical" I meant
> parametric factor analysis and not any form of optimal scaling or
> alternating least squares forms of data reduction. I am now
> considering these other alternatives under advice of Anita van der
> Kooij.
>
>
>
> I share your uncertainty about why results change depending on which
> category is omitted, and that was the original question starting this
> thread. Since nobody else seems to have an answer I will offer one
> purely numerical hypothesis. The exercise with the varying results, it
> turns out, was done by my colleague not with the entire census but
> with a SAMPLE of the Peru census (about 200,000 households, still a
> lot but perhaps not so much for so many variables and factors), and
> the contributions of latter factors were pretty small. SPSS provides,
> as is well known, a precision no higher than 15 decimal places approx.
> So it is just possible that some matrix figures for some of the minor
> factors differed only on the 15th or 16th decimal place (or further
> down), and then were taken as equal, and this may have caused some
> matrix to be singular or (most probably) near singular, and the
> results to be still computable but unstable. Moreover, some of the
> categories in census questions used as reference or omitted categories
> were populated by very few cases, which may have compounded the
> problem. Since running this on the entire census (which would enhance
> statistical significance and stability of results) takes a lot of
> computer time and has to be done several times with different
> reference categories, we have not done it yet but will proceed soon
> and report back. But I wanted to know whether some mathematical reason
> existed for the discrepancy.
>
>
>
> About why I would want to create a single score out of multiple
> factors, let us leave it for another occasion since it is a rather
> complicated story of a project connecting factor analysis with index
> number theory and economic welfare theory.
>
>
>
> Hector
>
>
>
> ------------------------------------------------------------------------
>
> From: Art Kendall [mailto:[hidden email]]
> Sent: Friday, August 18, 2006 9:34 AM
> To: Kooij, A.J. van der; [hidden email]
> CC: [hidden email]
> Subject: Re: Reference category for dummies in factor analysis
>
>
>
> This has been an interesting discussion. I don't know why the FA and
> scores would change depending on which category is omitted. Were
> there errors in recoding to dummies that could have created different
> missing values?
>
>
> You also said classical FA, but then said PCA. What did you use for
> communality estimates? 1.00? Squared multiple correlations?
>
> (I'm not sure why you would create a single score if you have multiple
> factors either, but that is another question.)
>
> What I do know is that people who know a lot more about CA, MDS, and
> factor analysis than I do (like Joe Kruskal, Doug Carroll, Willem
> Heiser, Phipps Arabie, Shizuhiko Nishisato, et al.) follow the
> class-l and mpsych-l discussion lists. See
>
> http://aris.ss.uci.edu/smp/mpsych.html
>
> and
>
> http://www.classification-society.org/csna/lists.html#class-l
>
> Art Kendall
> [hidden email] <mailto:[hidden email]>
>
>
>
> Kooij, A.J. van der wrote:
>
>>... trouble because any category of each original census question
>>would be an exact linear function of the remaining categories of the
>>question.
>
>Yes, but this gives trouble in regression, not in PCA, as far as I
>know.
>
>>In the indicator matrix, one category will have zeroes on all
>>indicator variables.
>
>No, and, sorry, I was confused with CA on indicator matrix, but this is
>"sort of" PCA. See syntax below (object scores=component scores are
>equal to row scores CA, category quantifications equal to column scores
>CA).
>
>Regards,
>Anita.
>
>* Toy data: 7 cases, 3 nominal variables.
>data list free/v1 v2 v3.
>begin data.
>1 2 3
>2 1 3
>2 2 2
>3 1 1
>2 3 4
>2 2 2
>1 2 4
>end data.
>
>* Multiple correspondence analysis on the category codes.
>Multiple Correspondence v1 v2 v3
> /analysis v1 v2 v3
> /dim=2
> /critit .0000001
> /print discrim quant obj
> /plot none.
>
>* CATPCA with multiple nominal scaling level on the same variables.
>catpca v1 v2 v3
> /analysis v1 v2 v3 (mnom)
> /dim=2
> /critit .0000001
> /print quant obj
> /plot none.
>
>* The same 7 cases as an indicator matrix, one column per category.
>data list free/v1cat1 v1cat2 v1cat3 v2cat1 v2cat2 v2cat3 v3cat1 v3cat2
> v3cat3 v3cat4 .
>begin data.
>1 0 0 0 1 0 0 0 1 0
>0 1 0 1 0 0 0 0 1 0
>0 1 0 0 1 0 0 1 0 0
>0 0 1 1 0 0 1 0 0 0
>0 1 0 0 0 1 0 0 0 1
>0 1 0 0 1 0 0 1 0 0
>1 0 0 0 1 0 0 0 0 1
>end data.
>
>* Correspondence analysis on the 7 x 10 indicator table.
>CORRESPONDENCE
> TABLE = all (7,10)
> /DIMENSIONS = 2
> /NORMALIZATION = cprin
> /PRINT = RPOINTS CPOINTS
> /PLOT = none .
>
>________________________________
>
>From: SPSSX(r) Discussion on behalf of Hector Maletta
>Sent: Thu 17/08/2006 19:56
>To: [hidden email] <mailto:[hidden email]>
>Subject: Re: Reference category for dummies in factor analysis
>
>Thank you, Anita. I will certainly look into your suggestion about
>CATPCA. However, I suspect some mathematical properties of the scores
>generated by CATPCA are not the ones I hope to have in our scale,
>because of the non-parametric nature of the procedure (too long to
>explain here, and not sure of understanding it myself).
>As for your second idea, I think if you try to apply PCA on dummies not
>omitting any category you'd run into trouble because any category of
>each original census question would be an exact linear function of the
>remaining categories of the question. In the indicator matrix, one
>category will have zeroes on all indicator variables, and that one is
>the "omitted" category.
>Hector
>
>-----Original Message-----
>From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
>Kooij, A.J. van der
>Sent: Thursday, August 17, 2006 2:37 PM
>To: [hidden email] <mailto:[hidden email]>
>Subject: Re: Reference category for dummies in factor analysis
>
>CATPCA (in Data Reduction menu, under Optimal Scaling) is PCA for
>(ordered/ordinal and unordered/nominal) categorical variables; no need
>to use dummies then.
>Using PCA on dummies I think you should not omit dummies (for nominal
>variables you can do PCA on an indicator matrix (that has columns that
>can be regarded as dummy variables; a column for each category, thus
>without omitting one)).
>
>Regards,
>Anita van der Kooij
>Data Theory Group
>Leiden University.
>
>________________________________
>
>From: SPSSX(r) Discussion on behalf of Hector Maletta
>Sent: Thu 17/08/2006 17:52
>To: [hidden email] <mailto:[hidden email]>
>Subject: Reference category for dummies in factor analysis
>
>Dear colleagues,
>
>I am re-posting (slightly re-phrased for added clarity) a question I
>sent the list about a week ago without eliciting any response as yet. I
>hope some factor analysis experts may be able to help.
>
>In a research project on which we work together, a colleague of mine
>constructed a scale based on factor scores obtained through classical
>factor analysis (principal components) of a number of categorical census
>variables all transformed into dummies. The variables concerned the
>standard of living of households and included quality of dwelling and
>basic services such as sanitation, water supply, electricity and the
>like. (The scale was not simply the score for the first factor, but the
>average score of several factors, weighted by their respective
>contribution to explaining the overall variance of observed variables,
>but this is, I surmise, beside the point.)
>
>Now, he found out that the choice of reference or "omitted" category
>for defining the dummies has an influence on results. He first ran the
>analysis using the first category of all categorical variables as the
>reference category, and then repeated the analysis using the last
>category as the reference or omitted category, whatever they might be.
>He found that the resulting scale varied not only in absolute value but
>also in the shape of its distribution.
>
>I can understand that the absolute value of the factor scores may
>change and even the ranking of the categories of the various variables
>(in terms of their average scores) may also be different, since after
>all the list of dummies used has varied and the categories are tallied
>each time against a different reference category. But the shape of the
>scale distribution should not change, I guess, especially not in a
>drastic manner. In this case the shape of the scale frequency
>distribution did change. Both distributions were roughly normal, with a
>kind of "hump" on one side, one of them on the left and the other on
>the right, probably due to the change in reference categories, but also
>with changes in the range of the scale and other details.
>
>Also, he found that the two scales were not perfectly correlated, and
>moreover, that their correlation was negative. That the correlation was
>negative may be understandable: the first category in such census
>variables is usually a "good" one (for instance, a home with walls made
>of brick or concrete) and the last one is frequently a "bad" one
>(earthen floor) or a residual heterogeneous one including bad options
>("other" kinds of roof). But since the two scales are just different
>combinations of the same categorical variables based on the same
>statistical treatment of their given covariance matrix, one should
>expect a closer, indeed a perfect correlation, even if a negative one
>is possible for the reasons stated above. Changing the reference
>category should be like changing the unit of measurement or the
>position of the zero point (like passing from Celsius to Fahrenheit), a
>decision not affecting the correlation coefficient with other
>variables. In this case, instead, the two scales had r = -0.54,
>implying they shared only 29% of their variance, even in the extreme
>case when ALL the possible factors (as many as variables) were
>extracted and all their scores averaged into the scale (and therefore
>the entire variance, common or specific, of the whole set of variables
>was taken into account).
>
>I should add that the dataset was a large sample of census data, and
>all the results were statistically significant.
>
>Any ideas why choosing different reference categories for dummy
>conversion could have such impact on results? I would greatly
>appreciate your thoughts in this regard.
>
>Hector
>
>**********************************************************************
>This email and any files transmitted with it are confidential and
>intended solely for the use of the individual or entity to whom they
>are addressed. If you have received this email in error please notify
>the system manager.
>**********************************************************************