SPSSX Discussion - Re: Reference category for dummies in factor analysis

Posted by Kooij, A.J. van der on Aug 18, 2006; 2:03pm
URL: http://spssx-discussion.165.s1.nabble.com/Reference-category-for-dummies-in-factor-analysis-tp1070339p1070353.html

Hector,
Using all the categories of a categorical variable, factor analysis and
regression are not impossible; they are impossible only with the
mathematical approach that requires the inverse of the matrix X'X (X
being the matrix of dummies). CATPCA and CATREG find their solutions
iteratively from the data itself, not from the correlation matrix of the
data, so there is no problem.
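
A minimal Python sketch of why the X'X route fails, assuming an intercept
column is placed next to the dummies of v1 (taken from the example data
further down this thread). With all three dummies the columns are linearly
dependent and det(X'X) is exactly zero; dropping one dummy makes X'X
invertible. The helper names gram and det are illustrative only and have
nothing to do with SPSS.

```python
from fractions import Fraction

def gram(X):
    """Return X'X for a matrix X given as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def det(M):
    """Exact determinant by Gaussian elimination over fractions."""
    A = [[Fraction(v) for v in row] for row in M]
    n, sign, result = len(A), 1, Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if A[r][c] != 0), None)
        if pivot is None:          # no usable pivot: the matrix is singular
            return Fraction(0)
        if pivot != c:
            A[c], A[pivot] = A[pivot], A[c]
            sign = -sign
        result *= A[c][c]
        for r in range(c + 1, n):
            factor = A[r][c] / A[c][c]
            A[r] = [x - factor * y for x, y in zip(A[r], A[c])]
    return sign * result

v1 = [1, 2, 2, 3, 2, 2, 1]   # v1 from the thread's example data

# Intercept plus a dummy for every category (redundant coding).
full = [[1] + [int(v == k) for k in (1, 2, 3)] for v in v1]
# Intercept plus dummies for categories 2 and 3 (category 1 omitted).
reduced = [[1] + [int(v == k) for k in (2, 3)] for v in v1]

det_full = det(gram(full))        # zero: X'X cannot be inverted
det_reduced = det(gram(reduced))  # non-zero: X'X can be inverted
print(det_full, det_reduced)
```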

If you have dummies for all categories and use CATREG for regression
applying numerical scaling level, you do not need to leave dummies out,
and the result is equal to linear regression on dummies omitting 1 dummy
for each variable.
For PCA I really don't think you should omit a dummy. PCA on nominal
variables is Multiple Correspondence Analysis (that is, CATPCA applying
the multiple nominal (mnom) scaling level to all variables), which
analyzes the indicator matrix in the way I described in my previous mail,
thus using all columns of the indicator matrix = using all dummies.

About the maximum number of components: we agree. I was not clear in
writing it; it should have been
"The maximum number of dimensions is number of categories minus 1 for
each variable, summed over variables".
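
As a worked check of that rule, here is a small Python sketch. The function
name max_dimensions and the values of m and k are assumptions made for
illustration; the category counts 3, 3 and 4 are those of v1, v2 and v3 in
the example further down this thread.

```python
def max_dimensions(categories_per_variable):
    """Sum over variables of (number of categories - 1)."""
    return sum(k - 1 for k in categories_per_variable)

# Example from this thread: v1 and v2 have 3 categories, v3 has 4.
thread_example = max_dimensions([3, 3, 4])   # (3-1) + (3-1) + (4-1)

# Hector's general case: m variables with k categories each gives m*k - m.
m, k = 5, 4                                  # hypothetical values
general_case = max_dimensions([k] * m)

print(thread_example, general_case, m * k - m)
```

This is an upper limit: it can only be reached when the data contain enough
distinct cases.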

In a previous mail I said the transformed variables are z-scores, but this
is not true with the mnom scaling level: the variables are then centered
but not standardized.
(With the mnom scaling level, the quantification of a category in a
dimension/component is the centroid of the component scores of the
cases that scored in that category.)
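
The centroid rule can be sketched in Python as follows. The object scores
are invented for illustration (chosen to have mean zero) and are not CATPCA
output; v1 is taken from the example data further down this thread.

```python
from collections import defaultdict

v1 = [1, 2, 2, 3, 2, 2, 1]                       # v1 from the thread's data
scores = [1.5, -0.5, 0.0, -2.0, 0.5, 0.0, 0.5]   # hypothetical centered scores

# Collect the scores of the cases that fall in each category.
by_category = defaultdict(list)
for category, score in zip(v1, scores):
    by_category[category].append(score)

# Quantification of a category = centroid (mean) of those scores.
quantification = {c: sum(s) / len(s) for c, s in by_category.items()}

# The transformed variable replaces each category code by its quantification.
transformed = [quantification[c] for c in v1]
mean = sum(transformed) / len(transformed)
variance = sum((t - mean) ** 2 for t in transformed) / len(transformed)

print(quantification, mean, variance)
```

With these made-up scores the transformed variable has mean zero but a
variance different from 1, that is, centered but not standardized.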

Regards,
Anita

-----Original Message-----
From: Hector Maletta [mailto:[hidden email]]
Sent: 18 August 2006 05:20
To: Kooij, A.J. van der
Cc: [hidden email]
Subject: RE: Reference category for dummies in factor analysis

Anita,

Thanks again.

You wrote: "The rescaled G'G matrix is not positive definite,
thus cannot be analyzed using SPSS Factor. Maybe this is the trouble you
think of when using dummies for all categories?"

Quite possibly. I'm not sure we're talking about the same
matrix. However, in fact, when you use all the categories of a
categorical variable, one of the categories is redundant and factor
analysis (or regression) is impossible.

You also wrote: "The maximum number of dimensions is the sum
over variables of number of categories minus 1. Maybe this is what you
are thinking of when omitting a category?" Not exactly: In the case of
categorical variables converted into dummies one category is omitted
from each variable, not one category over all variables. So for m
variables with k categories each, you have a sum total of km categories.
If you convert them into dummies you get mk-m dummies, and that's the
maximum number of dimensions; as you explain it, in your case the
maximum number of dimensions would be mk-1 > mk-m.

The difference is due to the fact that CATPCA does not require
excluding one category in each categorical variable.

But that was not my original question, which is still
unanswered.

Hector

Hector had written:

>And do not forget my original question about the impact of different
>omitted categories in factor analysis.

Anita responded:

I don't know about PCA on dummy variables, so I don't know about
omitting a category, but I do know how to obtain the solution from the
indicator matrix; maybe that will help a bit.

data list free/v1 v2 v3.
begin data.
1 2 3
2 1 3
2 2 2
3 1 1
2 3 4
2 2 2
1 2 4
end data.

catpca v1 v2 v3
  /analysis v1 v2 v3 (mnom)
  /dim=6
  /critit .0000001
  /print vaf quant obj
  /plot none.

resulting eigenvalues:
2.587
1.608
1.500
1.083
.222
.000

Indicator matrix G is:

data list free/v1cat1 v1cat2 v1cat3 v2cat1 v2cat2 v2cat3 v3cat1
  v3cat2 v3cat3 v3cat4 .
begin data.
1 0 0 0 1 0 0 0 1 0
0 1 0 1 0 0 0 0 1 0
0 1 0 0 1 0 0 1 0 0
0 0 1 1 0 0 1 0 0 0
0 1 0 0 0 1 0 0 0 1
0 1 0 0 1 0 0 1 0 0
1 0 0 0 1 0 0 0 0 1
end data.
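
For readers who want to see how G follows from the raw data, here is a
minimal Python sketch that rebuilds it from v1, v2 and v3. The function
name indicator_row is an assumption for illustration; categories are
assumed to be coded 1..k, as in the data above.

```python
# Raw data from the thread: one (v1, v2, v3) tuple per case.
data = [(1, 2, 3), (2, 1, 3), (2, 2, 2), (3, 1, 1),
        (2, 3, 4), (2, 2, 2), (1, 2, 4)]
n_categories = (3, 3, 4)   # number of categories of v1, v2, v3

def indicator_row(case):
    """One 0/1 column per category of each variable, in variable order."""
    row = []
    for value, k in zip(case, n_categories):
        row.extend(int(value == c) for c in range(1, k + 1))
    return row

G = [indicator_row(case) for case in data]
for row in G:
    print(*row)

# Category frequencies are the column sums of G.
freq = [sum(col) for col in zip(*G)]
```

Every row of G sums to 3, the number of variables, which is the redundancy
that makes the columns of G linearly dependent.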

Eigenvalue decomposition of the rescaled G'G gives the CATPCA solution
(but with 1 trivial/extraneous eigenvalue equal to the number of variables
because G is not centered, which is avoided in CATPCA and Multiple
Correspondence by centering the quantifications).

MATRIX.
get g /file = 'e:\...\g.sav'.
compute gg = T(g) * g.
compute freq = CSUM(g).
compute d= MDIAG(freq).
compute mat = INV(SQRT(d)) * gg * INV(SQRT(d)).
CALL EIGEN (mat,eigvec,eigval).
print eigval.
END MATRIX.

result:

EIGVAL
3.000000000
2.586836818
1.607968964
1.500000000
1.083262358
.221931860
.000000000
.000000000
.000000000
.000000000
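
The trivial eigenvalue can be checked without a full eigendecomposition.
The Python sketch below (standard library only, not a port of the MATRIX
job) rebuilds the rescaled G'G from the thread's data and verifies two
things: the vector of square roots of the category frequencies is an
eigenvector with eigenvalue 3, the number of variables, and the trace of
the matrix is 10, the number of categories. Variable names mirror the
MATRIX code where possible; u, mat_u and ratios are illustrative.

```python
import math

data = [(1, 2, 3), (2, 1, 3), (2, 2, 2), (3, 1, 1),
        (2, 3, 4), (2, 2, 2), (1, 2, 4)]
n_categories = (3, 3, 4)

# Indicator matrix G: one 0/1 column per category of each variable.
g = [[int(v == c) for v, k in zip(case, n_categories) for c in range(1, k + 1)]
     for case in data]

n_cols = len(g[0])
freq = [sum(row[j] for row in g) for j in range(n_cols)]          # CSUM(g)
gg = [[sum(row[i] * row[j] for row in g) for j in range(n_cols)]  # T(g) * g
      for i in range(n_cols)]

# D^(-1/2) * G'G * D^(-1/2), written element by element.
mat = [[gg[i][j] / math.sqrt(freq[i] * freq[j]) for j in range(n_cols)]
       for i in range(n_cols)]

# Candidate trivial eigenvector: square roots of the category frequencies.
u = [math.sqrt(f) for f in freq]
mat_u = [sum(mat[i][j] * u[j] for j in range(n_cols)) for i in range(n_cols)]
ratios = [mu / ui for mu, ui in zip(mat_u, u)]   # each should be close to 3

# The trace equals the sum of all eigenvalues.
trace = sum(mat[i][i] for i in range(n_cols))

print(ratios)
print(trace)
```

The eigenvalues printed above sum to 10 as well (3 + 2.587 + 1.608 + 1.500
+ 1.083 + .222), which agrees with the trace.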

Some thoughts:

The rescaled G'G matrix is not positive definite, thus cannot be
analyzed using SPSS Factor. Maybe this is the trouble you think of when
using dummies for all categories?

The maximum number of dimensions is the sum over variables of
number of categories minus 1. Maybe this is what you are thinking of
when omitting a category?

Regards,

Anita

________________________________

From: SPSSX(r) Discussion on behalf of Hector Maletta
Sent: Thu 17/08/2006 22:06
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis

Anita, you ARE indeed a useful source of advice in these
abstruse matters.

First, sorry for alluding mistakenly to ALSCAL. In fact I was
thinking of CATPCA when I wrote that phrase about categorical factor
analysis.

Now, if you could just possibly find that UN piece you seem to
recall having seen, I would be eternally grateful. In fact our work
started in the context of the 2005 Human Development Report for Bolivia,
funded by the UNDP, though it is now running independently of any UN
support.

Just to shed some additional light into my brick-and-mud head: I
suppose that with CATPCA, if the factor score is a (linear?) function of
the transformed variables, it can also be expressed as a function of the
original categories. Brick and mud example: suppose having a brick wall
is quantified as 3.40, and a mud wall is 1.35; assume the Wall
categorical variable enters a factor score with a coefficient of 0.20.
Thus having a brick wall contributes 0.20x3.40=0.68 towards the factor
score, and a mud wall contributes 0.20x1.35=0.27. Is that so? Also: Are
these factor scores measured as z-scores, with zero mean and unit STD
DEV? What are the measurement units, means and SD of the transformed
variables?
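
Hector's brick-and-mud arithmetic, written out as a small Python sketch.
The numbers are the ones in his example; the variable names are
illustrative. Whether CATPCA scores actually decompose this way is the
question he is asking, and the sketch does not settle it.

```python
coefficient = 0.20                              # coefficient of the Wall variable
quantification = {"brick": 3.40, "mud": 1.35}   # category quantifications

# Contribution of each wall type = coefficient * quantification.
contribution = {wall: coefficient * q for wall, q in quantification.items()}

for wall, value in contribution.items():
    print(wall, round(value, 2))
```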

And do not forget my original question about the impact of
different omitted categories in factor analysis.

Thanks again for your help.

Hector

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On behalf of Kooij, A.J. van der
Sent: Thursday, August 17, 2006 2:51 PM
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis

Hector,

Some remarks:

>...categorical factor analysis by alternating least squares (ALSCAL in
>SPSS jargon) ...

ALSCAL is MDS (Multi Dimensional Scaling). The preferred
procedure to use for MDS is PROXSCAL, added to SPSS some versions ago.

>... such as optimal scaling or multiple correspondence, but initially
>tried PCA because of its mathematical properties, which come in handy
>for the intended use of the scale in the project. Notice that in this
>particular application we use factor analysis only as an intermediate
>step, i.e. as a way of constructing a scale that is a linear combination
>of variables taking their covariances into account. We are not
>interested in the factors themselves.

With optimal scaling you obtain transformed (that is, optimally
quantified) variables that are continuous. All mathematical properties
of PCA apply also to CATPCA, but with respect to the transformed
variables. The scale you obtain using CATPCA is continuous.

Some years ago a UN paper was published using CATPCA to create a
scale for variables much the same as you describe. If you are interested
I can try to find the reference.

Regards,

Anita van der Kooij

Data Theory Group

Leiden University.

________________________________

From: SPSSX(r) Discussion on behalf of Hector Maletta
Sent: Thu 17/08/2006 19:04
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis

Dan,

Yours is a sound question. Latent classes unfortunately would
not do in this case because we need a continuous scale, not a set of
discrete classes, even if they are ordered. We have considered using
categorical factor analysis by alternating least squares (ALSCAL in SPSS
jargon) or other non parametric procedures such as optimal scaling or
multiple correspondence, but initially tried PCA because of its
mathematical properties, which come in handy for the intended use of the
scale in the project. Notice that in this particular application we use
factor analysis only as an intermediate step, i.e. as a way of
constructing a scale that is a linear combination of variables taking
their covariances into account. We are not interested in the factors
themselves.

Now about the use of FA with dummy variables: there are
conflicting opinions in the literature about this. Half the library is
in favour and the other half is against. Dummies can indeed be
considered as interval scales, since they have only one interval between
their two values, and that interval is implicitly used as their unit of
measurement. The main objection is about normality of their sampling
distribution. Binary random variables have a binomial distribution,
which approximates the normal as n (sample size) grows larger. Another
frequent objection is about normality of residuals in regression:
obviously, if you predict a binary with a binary prediction, your
predicted value would be either 1 or 0, and the residual would be
either 0 or 1, so you'll have either all residuals to one side of your
predictions, or all residuals to the other side, and you'll never have
residuals normally distributed around your prediction. Take your pick
in the library.

However, I do not wish for this thread to become a discussion of
our use of factor analysis in this way, but only of the particular
question of the impact of choosing one or another reference category.
The other discussion is most interesting, but we can address it later.

Hector

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On behalf of Dan Zetu
Sent: Thursday, August 17, 2006 1:36 PM
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis

Hector:

What I am having a little difficulty comprehending is how a
classical factor analysis can be conducted on a set of dummy (binary)
variables? I thought that's what latent class analysis was for. Perhaps
I am missing something in your post?

Dan

>From: Hector Maletta <[hidden email]>
>Reply-To: Hector Maletta <[hidden email]>
>To: [hidden email]
>Subject: Reference category for dummies in factor analysis
>Date: Thu, 17 Aug 2006 12:52:55 -0300
>
>Dear colleagues,
>
>I am re-posting (slightly re-phrased for added clarity) a question I
>sent the list about a week ago without eliciting any response as yet. I
>hope some factor analysis experts may be able to help.
>
>In a research project on which we work together, a colleague of mine
>constructed a scale based on factor scores obtained through classical
>factor analysis (principal components) of a number of categorical
>census variables all transformed into dummies. The variables concerned
>the standard of living of households and included quality of dwelling
>and basic services such as sanitation, water supply, electricity and
>the like. (The scale was not simply the score for the first factor, but
>the average score of several factors, weighted by their respective
>contribution to explaining the overall variance of observed variables,
>but this is, I surmise, beside the point.)
>
>Now, he found out that the choice of reference or "omitted" category
>for defining the dummies has an influence on results. He first ran the
>analysis using the first category of all categorical variables as the
>reference category, and then repeated the analysis using the last
>category as the reference or omitted category, whatever they might be.
>He found that the resulting scale varied not only in absolute value but
>also in the shape of its distribution.
>
>I can understand that the absolute value of the factor scores may
>change and even the ranking of the categories of the various variables
>(in terms of their average scores) may also be different, since after
>all the list of dummies used has varied and the categories are tallied
>each time against a different reference category. But the shape of the
>scale distribution should not change, I guess, especially not in a
>drastic manner. In this case the shape of the scale frequency
>distribution did change. Both distributions were roughly normal, with
>a kind of "hump" on one side, one of them on the left and the other on
>the right, probably due to the change in reference categories, but also
>with changes in the range of the scale and other details.
>
>Also, he found that the two scales had not a perfect correlation, and
>moreover, that their correlation was negative. That the correlation was
>negative may be understandable: the first category in such census
>variables is usually a "good" one (for instance, a home with walls made
>of brick or concrete) and the last one is frequently a "bad" one
>(earthen floor) or a residual heterogeneous one including bad options
>("other" kinds of roof). But since the two scales are just different
>combinations of the same categorical variables based on the same
>statistical treatment of their given covariance matrix, one should
>expect a closer, indeed a perfect correlation, even if a negative one
>is possible for the reasons stated above. Changing the reference
>category should be like changing the unit of measurement or the
>position of the zero point (like passing from Celsius to Fahrenheit), a
>decision not affecting the correlation coefficient with other
>variables. In this case, instead, the two scales had r = -0.54,
>implying they shared only 29% of their variance, even in the extreme
>case when ALL the possible factors (as many as variables) were
>extracted and all their scores averaged into the scale (and therefore
>the entire variance, common or specific, of the whole set of variables
>was taken into account).
>
>I should add that the dataset was a large sample of census data, and
>all the results were statistically significant.
>
>Any ideas why choosing different reference categories for dummy
>conversion could have such impact on results? I would greatly
>appreciate your thoughts in this regard.
>
>Hector
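
On the Celsius-to-Fahrenheit analogy and the 29% figure in the original
post: the Python sketch below illustrates that a change of unit and zero
point leaves a Pearson correlation unchanged, and that r = -0.54
corresponds to about 29% shared variance. The temperature and comparison
values are invented for illustration, and pearson is a hand-written helper.
The sketch only illustrates the expectation stated in the post; it does not
address whether changing the reference category amounts to such a
transformation.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

celsius = [10.0, 15.0, 20.0, 25.0, 30.0]   # hypothetical values
other = [1.0, 3.0, 2.0, 5.0, 4.0]          # hypothetical second variable
fahrenheit = [1.8 * c + 32.0 for c in celsius]

r_celsius = pearson(celsius, other)
r_fahrenheit = pearson(fahrenheit, other)   # same up to rounding error

shared_variance = (-0.54) ** 2              # r = -0.54 as reported in the post

print(r_celsius, r_fahrenheit, shared_variance)
```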
