
Re: Reference category for dummies in factor analysis

Posted by Hector Maletta on Aug 18, 2006; 4:19am
URL: http://spssx-discussion.165.s1.nabble.com/Reference-category-for-dummies-in-factor-analysis-tp1070339p1070359.html

Anita,

Thanks again.

You wrote: "The rescaled G'G matrix is not positive definite, thus cannot be
analyzed using SPSS Factor. Maybe this is the trouble you think of when
using dummies for all categories?"

Quite possibly, though I'm not sure we're talking about the same matrix. In any
case, when you use dummies for all the categories of a categorical variable, one
of the categories is redundant (the dummies sum to one, so any one of them is
determined by the others and the correlation matrix is singular), and factor
analysis (or regression) is impossible.

You also wrote: "The maximum number of dimensions is the sum over variables
of number of categories minus 1. Maybe this is what you are thinking of when
omitting a category?" Not exactly: in the case of categorical variables
converted into dummies, one category is omitted from each variable, not one
category over all variables. So for m variables with k categories each, you
have a sum total of km categories. If you convert them into dummies you get
mk-m dummies, and that is the maximum number of dimensions; as you explain
it, in your case the maximum number of dimensions would be mk-1 > mk-m.

The difference is due to the fact that CATPCA does not require excluding one
category in each categorical variable.
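
Just to make the two counts concrete, using as an illustration the three toy
variables in your example below (which have 3, 3 and 4 categories): the total
number of categories is 3 + 3 + 4 = 10; with one category omitted per variable
that leaves 10 - 3 = 7 dummies, whereas "total categories minus one" would give
10 - 1 = 9. (In the toy data the attainable dimensionality is smaller still,
because there are only 7 cases, so neither count shows up directly in the
eigenvalue list.)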

But that was not my original question, which is still unanswered.

Hector



Hector had written:

>And do not forget my original question about the impact of different
>omitted categories in factor analysis.



Anita responded:

I don't know about PCA on dummy variables, so I don't know about omitting a
category, but I know how to obtain the solution from the indicator matrix;
maybe that will help a bit.



data list free/v1 v2 v3.
begin data.
1 2 3
2 1 3
2 2 2
3 1 1
2 3 4
2 2 2
1 2 4
end data.

catpca v1 v2 v3
 /analysis v1 v2 v3 (mnom)
 /dim=6
 /critit .0000001
 /print vaf quant obj
 /plot none.



resulting eigenvalues:

2.587
1.608
1.500
1.083
.222
.000

Indicator matrix G is:

data list free/v1cat1 v1cat2 v1cat3 v2cat1 v2cat2 v2cat3 v3cat1 v3cat2
 v3cat3 v3cat4.
begin data.
1 0 0 0 1 0 0 0 1 0
0 1 0 1 0 0 0 0 1 0
0 1 0 0 1 0 0 1 0 0
0 0 1 1 0 0 1 0 0 0
0 1 0 0 0 1 0 0 0 1
0 1 0 0 1 0 0 1 0 0
1 0 0 0 1 0 0 0 0 1
end data.

Eigenvalue decomposition of the rescaled G'G gives the Catpca solution (but
with one trivial/extraneous eigenvalue equal to the number of variables,
because G is not centered; this is avoided in Catpca and Multiple
Correspondence Analysis by centering the quantifications):



MATRIX.
get g /file = 'e:\...\g.sav'.
compute gg = T(g) * g.
compute freq = CSUM(g).
compute d = MDIAG(freq).
compute mat = INV(SQRT(d)) * gg * INV(SQRT(d)).
CALL EIGEN (mat,eigvec,eigval).
print eigval.
END MATRIX.



result:

EIGVAL
   3.000000000
   2.586836818
   1.607968964
   1.500000000
   1.083262358
    .221931860
    .000000000
    .000000000
    .000000000
    .000000000
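
To see the centering remark in action, here is a minimal variation of the same
MATRIX job (a sketch only, assuming the same 'e:\...\g.sav' file as above):
subtracting the column means from G mimics the centering of the
quantifications, so the trivial eigenvalue should drop to zero while the other
eigenvalues stay as they are.

MATRIX.
get g /file = 'e:\...\g.sav'.
compute n = NROW(g).
compute freq = CSUM(g).
compute d = MDIAG(freq).
* Center the indicator matrix by subtracting each column's mean.
compute gc = g - MAKE(n,1,1) * (freq / n).
* Rescale the cross-product of the centered indicator matrix.
compute mat = INV(SQRT(d)) * T(gc) * gc * INV(SQRT(d)).
CALL EIGEN (mat,eigvec,eigval).
print eigval.
END MATRIX.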







Some thoughts:



The rescaled G'G matrix is not positive definite, thus cannot be analyzed
using SPSS Factor. Maybe this is the trouble you think of when using dummies
for all categories?



The maximum number of dimensions is the sum over variables of number of
categories minus 1. Maybe this is what you are thinking of when omitting a
category?







Regards,



Anita





________________________________



From: SPSSX(r) Discussion on behalf of Hector Maletta
Sent: Thu 17/08/2006 22:06
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis

Anita, you ARE indeed a useful source of advice in these abstruse matters.

First, sorry for alluding mistakenly to ALSCAL. In fact I was thinking of
CATPCA when I wrote that phrase about categorical factor analysis.

Now, if you could just possibly find that UN piece you seem to recall having
seen, I would be eternally grateful. In fact our work started in the context
of the 2005 Human Development Report for Bolivia, funded by the UNDP, though
it is now running independently of any UN support.

Just to shed some additional light into my brick-and-mud head: I suppose
that with CATPCA, if the factor score is a (linear?) function of the
transformed variables, it can also be expressed as a function of the
original categories. Brick and mud example: suppose having a brick wall is
quantified as 3.40, and a mud wall is 1.35; assume the Wall categorical
variable enters a factor score with a coefficient of 0.20. Thus having a
brick wall contributes 0.20x3.40=0.68 towards the factor score, and a mud
wall contributes 0.20x1.35=0.27. Is that so? Also: Are these factor scores
measured as z-scores, with zero mean and unit STD DEV? What are the
measurement units, means and SD of the transformed variables?

And do not forget my original question about the impact of different omitted
categories in factor analysis.

Thanks again for your help.

Hector



-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On behalf of
Kooij, A.J. van der
Sent: Thursday, August 17, 2006 2:51 PM
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis



Hector,

Some remarks:

>...categorical factor analysis by alternating least squares (ALSCAL in
>SPSS jargon) ...

ALSCAL is MDS (Multi Dimensional Scaling). The preferred procedure to use
for MDS is PROXSCAL, added to SPSS some versions ago.

>... such as optimal scaling or multiple correspondence, but initially
>tried PCA because of its mathematical properties, which come in handy
>for the intended use of the scale in the project. Notice that in this
>particular application we use factor analysis only as an intermediate step,
>i.e. as a way of constructing a scale that is a linear combination of
>variables taking their covariances into account. We are not interested in
>the factors themselves.

With optimal scaling you obtain transformed (that is, optimally quantified)
variables that are continuous. All the mathematical properties of PCA also
apply to CATPCA, but with respect to the transformed variables. The scale you
obtain using CATPCA is continuous.
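
For what it is worth, here is a sketch of how one might pull that continuous
scale out of CATPCA, using purely illustrative variable names v1 to v3 and
with the caveat that the exact SAVE keywords may differ between SPSS versions:

catpca v1 v2 v3
 /analysis v1 v2 v3 (ordi)
 /dim=2
 /save trdata object.

The saved object scores would serve as the continuous scale; TRDATA saves the
optimally quantified versions of the variables.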

Some years ago a UN paper was published using CATPCA to create a scale for
variables much the same as you describe. If you are interested I can try to
find the reference.



Regards,

Anita van der Kooij

Data Theory Group

Leiden University.







________________________________



From: SPSSX(r) Discussion on behalf of Hector Maletta
Sent: Thu 17/08/2006 19:04
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis

Dan,

Yours is a sound question. Latent classes unfortunately would not do in this
case because we need a continuous scale, not a set of discrete classes, even
if they are ordered. We have considered using categorical factor analysis by
alternating least squares (ALSCAL in SPSS jargon) or other non parametric
procedures such as optimal scaling or multiple correspondence, but initially
tried PCA because of its mathematical properties, which come in handy for
the intended use of the scale in the project. Notice that in this particular
application we use factor analysis only as an intermediate step, i.e. as a
way of constructing a scale that is a linear combination of variables taking
their covariances into account. We are not interested in the factors
themselves.

Now about the use of FA with dummy variables: there are conflicting opinions
in the literature about this. Half the library is in favour and the other
half is against. Dummies can indeed be considered as interval scales, since
they have only one interval between their two values, and that interval is
implicitly used as their unit of measurement. The main objection is about
normality of their sampling distribution. Binary random variables have a
binomial distribution, which approximates the normal as n (sample size)
grows larger. Another frequent objection is about normality of residuals in
regression: obviously, if you predict a binary with a binary prediction,
your predicted value would be either 1 or 0, and the residual would be either
0 or 1, so you'll have either all residuals to one side of your predictions,
or all residuals to the other side, and you'll never have residuals normally
distributed around your prediction. Take your pick in the library.
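
A tiny made-up illustration of that last point (hypothetical data, plain
REGRESSION syntax; the only point is that, with a dummy as dependent variable
and a dummy as predictor, the residuals can take only a handful of distinct
values and so can never be normally distributed around the prediction):

data list free / y x.
begin data.
0 0
1 0
0 1
1 1
1 1
end data.
* Regress the dummy y on the dummy x and save the residuals.
regression /dependent = y /method = enter x /save resid(res_y).
* res_y can take at most four distinct values (two fitted values,
* two observed values), whatever the sample size.
frequencies variables = res_y.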

However, I do not wish for this thread to become a discussion of our use of
factor analysis in this way, but only of the particular question of the
impact of choosing one or another reference category. The other discussion
is most interesting, but we can address it later.



Hector



-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On behalf of Dan Zetu
Sent: Thursday, August 17, 2006 1:36 PM
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis



Hector:



What I am having a little difficulty comprehending is how a classical factor
analysis can be conducted on a set of dummy (binary) variables. I thought
that's what latent class analysis was for. Perhaps I am missing something in
your post?



Dan





>From: Hector Maletta <[hidden email]>
>Reply-To: Hector Maletta <[hidden email]>
>To: [hidden email]
>Subject: Reference category for dummies in factor analysis
>Date: Thu, 17 Aug 2006 12:52:55 -0300
>
>Dear colleagues,
>
>I am re-posting (slightly re-phrased for added clarity) a question I
>sent the list about a week ago without eliciting any response as yet. I
>hope some factor analysis experts may be able to help.
>
>In a research project on which we work together, a colleague of mine
>constructed a scale based on factor scores obtained through classical
>factor analysis (principal components) of a number of categorical
>census variables all transformed into dummies. The variables concerned
>the standard of living of households and included quality of dwelling
>and basic services such as sanitation, water supply, electricity and
>the like. (The scale was not simply the score for the first factor, but
>the average score of several factors, weighted by their respective
>contribution to explaining the overall variance of observed variables,
>but this is, I surmise, beside the point.)
>
>Now, he found out that the choice of reference or "omitted" category
>for defining the dummies has an influence on results. He first ran the
>analysis using the first category of all categorical variables as the
>reference category, and then repeated the analysis using the last
>category as the reference or omitted category, whatever they might be.
>He found that the resulting scale varied not only in absolute value but
>also in the shape of its distribution.
>
>I can understand that the absolute value of the factor scores may
>change and even the ranking of the categories of the various variables
>(in terms of their average scores) may also be different, since after
>all the list of dummies used has varied and the categories are tallied
>each time against a different reference category. But the shape of the
>scale distribution should not change, I guess, especially not in a
>drastic manner. In this case the shape of the scale frequency
>distribution did change. Both distributions were roughly normal, with
>a kind of "hump" on one side, one of them on the left and the other on
>the right, probably due to the change in reference categories, but also
>with changes in the range of the scale and other details.
>
>Also, he found that the two scales did not have a perfect correlation, and
>moreover, that their correlation was negative. That the correlation was
>negative may be understandable: the first category in such census
>variables is usually a "good" one (for instance, a home with walls made
>of brick or
>concrete) and the last one is frequently a "bad" one (earthen floor) or
>a residual heterogeneous one including bad options ("other" kinds of roof).
>But since the two scales are just different combinations of the same
>categorical variables based on the same statistical treatment of their
>given covariance matrix, one should expect a closer, indeed a perfect
>correlation, even if a negative one is possible for the reasons stated
>above. Changing the reference category should be like changing the unit
>of measurement or the position of the zero point (like passing from
>Celsius to Fahrenheit), a decision not affecting the correlation
>coefficient with other variables. In this case, instead, the two scales
>had r = -0.54, implying they shared only 29% of their variance, even in
>the extreme case when ALL the possible factors (as many as variables)
>were extracted and all their scores averaged into the scale, and
>therefore the entire variance, common or specific, of the whole set of
>variables was taken into account.
>
>I should add that the dataset was a large sample of census data, and
>all the results were statistically significant.
>
>Any ideas why choosing different reference categories for dummy
>conversion could have such impact on results? I would greatly
>appreciate your thoughts in this regard.
>
>Hector