Reference category for dummies in factor analysis

Reference category for dummies in factor analysis

Hector Maletta
Dear colleagues,

I am re-posting (slightly re-phrased for added clarity) a question I sent
the list about a week ago without eliciting any response as yet. I hope some
factor analysis experts may be able to help.

In a research project we are working on together, a colleague of mine
constructed a scale based on factor scores obtained through classical factor
analysis (principal components) of a number of categorical census variables,
all transformed into dummies. The variables concerned the standard of living
of households and included quality of dwelling and basic services such as
sanitation, water supply, electricity and the like. (The scale was not
simply the score for the first factor, but the average score of several
factors, weighted by their respective contributions to explaining the overall
variance of the observed variables; but this is, I surmise, beside the point.)

Now, he found out that the choice of reference or "omitted" category for
defining the dummies has an influence on results. He first ran the analysis
using the first category of all categorical variables as the reference
category, and then repeated the analysis using the last category as the
reference or omitted category, whatever they might be. He found that the
resulting scale varied not only in absolute value but also in the shape of
its distribution.

I can understand that the absolute value of the factor scores may change and
even the ranking of the categories of the various variables (in terms of
their average scores) may also be different, since after all the list of
dummies used has varied and the categories are tallied each time against a
different reference category. But the shape of the scale distribution should
not change, I guess, especially not in a drastic manner. In this case the
shape of the scale frequency distribution did change.  Both distributions
were roughly normal, with a kind of "hump" on one side, one of them on the
left and the other on the right, probably due to the change in reference
categories, but also with changes in the range of the scale and other
details.

Also, he found that the two scales did not have a perfect correlation, and,
moreover, that their correlation was negative. That the correlation was
negative may be understandable: the first category in such census variables
is usually a "good" one (for instance, a home with walls made of brick or
concrete) and the last one is frequently a "bad" one (earthen floor) or a
residual heterogeneous one including bad options ("other" kinds of roof).
But since the two scales are just different combinations of the same
categorical variables, based on the same statistical treatment of their given
covariance matrix, one should expect a closer, indeed a perfect correlation,
even if a negative one is possible for the reasons stated above. Changing
the reference category should be like changing the unit of measurement or
the position of the zero point (like passing from Celsius to Fahrenheit), a
decision not affecting the correlation coefficient with other variables. In
this case, instead, the two scales had r = -0.54, implying they shared only
29% of their variance, even in the extreme case when ALL the possible
factors (as many as variables) were extracted and all their scores averaged
into the scale (and therefore the entire variance, common or specific, of
the whole set of variables was taken into account).

I should add that the dataset was a large sample of census data, and all the
results were statistically significant.

Any ideas why choosing different reference categories for dummy conversion
could have such an impact on results? I would greatly appreciate your thoughts
in this regard.

Hector
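
For concreteness, here is a minimal SPSS sketch of the comparison described
above, assuming a single hypothetical three-category census variable WALL
(the actual analysis of course used many variables and averaged several
saved factors):

* Build the dummies from the hypothetical WALL variable (coded 1, 2, 3).
RECODE wall (1=1) (ELSE=0) INTO wall_1.
RECODE wall (2=1) (ELSE=0) INTO wall_2.
RECODE wall (3=1) (ELSE=0) INTO wall_3.
EXECUTE.

* Run 1: first category omitted (wall_1 is the reference).
FACTOR
  /VARIABLES wall_2 wall_3
  /EXTRACTION PC
  /CRITERIA FACTORS(1)
  /SAVE REG(ALL).

* Run 2: last category omitted (wall_3 is the reference).
FACTOR
  /VARIABLES wall_1 wall_2
  /EXTRACTION PC
  /CRITERIA FACTORS(1)
  /SAVE REG(ALL).

* Compare the two saved component scores (SPSS names them FAC1_1 and FAC1_2
* by default).
CORRELATIONS /VARIABLES=FAC1_1 FAC1_2.

In the real application each FACTOR run would list the dummies of all the
census variables (omitting one per variable), and several saved factors would
be weighted and averaged into the scale before the two versions are
correlated.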

Re: Reference category for dummies in factor analysis

Dan Zetu
Hector:

What I am having a little difficulty comprehending is how a classical factor
analysis can be conducted on a set of dummy (binary) variables. I thought
that's what latent class analysis was for. Perhaps I am missing something in
your post?

Dan


Re: Reference category for dummies in factor analysis

Hector Maletta
Dan,
Yours is a sound question. Latent classes unfortunately would not do in this
case because we need a continuous scale, not a set of discrete classes, even
if they are ordered. We have considered using categorical factor analysis by
alternating least squares (ALSCAL in SPSS jargon) or other non-parametric
procedures such as optimal scaling or multiple correspondence, but initially
tried PCA because of its mathematical properties, which come in handy for
the intended use of the scale in the project. Notice that in this particular
application we use factor analysis only as an intermediate step, i.e. as a
way of constructing a scale that is a linear combination of variables taking
their covariances into account. We are not interested in the factors
themselves.
Now about the use of FA with dummy variables: there are conflicting opinions
in the literature about this. Half the library is in favour and the other
half is against. Dummies can indeed be considered as interval scales, since
they have only one interval between their two values, and that interval is
implicitly used as their unit of measurement. The main objection is about
normality of their sampling distribution. Binary random variables have a
binomial distribution, which approximates the normal as n (sample size)
grows larger. Another frequent objection is about normality of residuals in
regression: obviously, if you predict a binary with a binary prediction,
your predicted value would be either 1 or 0, and the residual would be either 0
or 1, so you'll have either all residuals to one side of your predictions,
or all residuals to the other side, and you'll never have residuals normally
distributed around your prediction. Take your pick in the library.
However, I do not wish for this thread to become a discussion of our use of
factor analysis in this way, but only of the particular question of the
impact of choosing one or another reference category. The other discussion
is most interesting, but we can address it later.

Hector


Re: Reference category for dummies in factor analysis

Kooij, A.J. van der
In reply to this post by Hector Maletta
CATPCA (in the Data Reduction menu, under Optimal Scaling) is PCA for (ordered/ordinal and unordered/nominal) categorical variables; no need to use dummies then.
Using PCA on dummies, I think you should not omit dummies (for nominal variables you can do PCA on an indicator matrix, which has columns that can be regarded as dummy variables: a column for each category, thus without omitting one).
 
Regards,
Anita van der Kooij
Data Theory Group
Leiden University.
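
As a rough sketch of this suggestion, using hypothetical census variables
WALL, ROOF and WATER treated as ordinal (the variable names and the ordinal
treatment are assumptions here; see the worked syntax further down the thread
for the multiple-nominal case):

catpca wall roof water
 /analysis wall (ordi) roof (ordi) water (ordi)
 /dim=1
 /print quant obj
 /plot none.

The object scores on the single dimension would then play the role of the
continuous scale.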


Re: Reference category for dummies in factor analysis

Kooij, A.J. van der
In reply to this post by Hector Maletta
Hector,
Some remarks:
>...categorical factor analysis by alternating least squares (ALSCAL in SPSS jargon) ...
ALSCAL is MDS (multidimensional scaling). The preferred procedure for MDS is PROXSCAL, added to SPSS a few versions ago.
 
>... such as optimal scaling or multiple correspondence, but initially
>tried PCA because of its mathematical properties, which come in handy for
>the intended use of the scale in the project. Notice that in this particular
>application we use factor analysis only as an intermediate step, i.e. as a
>way of constructing a scale that is a linear combination of variables taking
>their covariances into account. We are not interested in the factors
>themselves.
With optimal scaling you obtain transformed (that is, optimally quantified) variables that are continuous. All mathematical properties of PCA also apply to CATPCA, but with respect to the transformed variables. The scale you obtain using CATPCA is continuous.
Some years ago a UN paper was published using CATPCA to create a scale for variables much the same as you describe. If you are interested I can try to find the reference.
 
Regards,
Anita van der Kooij
Data Theory Group
Leiden University.




Re: Reference category for dummies in factor analysis

Hector Maletta
In reply to this post by Kooij, A.J. van der
Thank you, Anita. I will certainly look into your suggestion about CATPCA.
However, I suspect some mathematical properties of the scores generated by
CATPCA are not the ones I hope to have in our scale, because of the
non-parametric nature of the procedure (too long to explain here, and I am
not sure I understand it myself).
As for your second idea, I think if you try to apply PCA on dummies not
omitting any category you'd run into trouble because any category of each
original census question would be an exact linear function of the remaining
categories of the question. In the indicator matrix, one category will have
zeroes on all indicator variables, and that one is the "omitted" category.
Hector
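
A quick way to check this point empirically (a sketch, again with a
hypothetical three-category WALL variable): keep all three indicators and
look at the eigenvalues that FACTOR reports.

* All three indicators kept, none omitted; they sum to 1 for every case,
* so their correlation matrix is singular (determinant 0).
RECODE wall (1=1) (ELSE=0) INTO wall_1.
RECODE wall (2=1) (ELSE=0) INTO wall_2.
RECODE wall (3=1) (ELSE=0) INTO wall_3.
EXECUTE.

FACTOR
  /VARIABLES wall_1 wall_2 wall_3
  /EXTRACTION PC
  /PRINT INITIAL EXTRACTION.

* One initial eigenvalue per original variable comes out as (essentially)
* zero, and FACTOR may warn that the determinant is zero.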



Re: Reference category for dummies in factor analysis

Kooij, A.J. van der
>... trouble because any category of each original census question would be an exact linear
>function of the remaining categories of the question.
Yes, but this gives trouble in regression, not in PCA, as far as I know.
 
> In the indicator matrix, one category will have zeroes on all indicator variables.
No. And, sorry, I was confusing this with CA (correspondence analysis) on an indicator matrix, but that is "sort of" PCA anyway. See the syntax below (object scores = component scores are equal to the CA row scores; category quantifications are equal to the CA column scores).
Regards,
Anita.
 

data list free/v1 v2 v3.
begin data.
1 2 3
2 1 3
2 2 2
3 1 1
2 3 4
2 2 2
1 2 4
end data.

Multiple Correspondence v1 v2 v3
 /analysis v1 v2 v3
 /dim=2
 /critit .0000001
 /print discrim quant obj
 /plot none.

catpca v1 v2 v3
 /analysis v1 v2 v3 (mnom)
 /dim=2
 /critit .0000001
 /print quant obj
 /plot none.

data list free/v1cat1 v1cat2 v1cat3 v2cat1 v2cat2 v2cat3 v3cat1 v3cat2 v3cat3 v3cat4 .
begin data.
1 0 0 0 1 0 0 0 1 0
0 1 0 1 0 0 0 0 1 0
0 1 0 0 1 0 0 1 0 0
0 0 1 1 0 0 1 0 0 0
0 1 0 0 0 1 0 0 0 1
0 1 0 0 1 0 0 1 0 0
1 0 0 0 1 0 0 0 0 1
end data.

CORRESPONDENCE
  TABLE = all (7,10)
  /DIMENSIONS = 2
  /NORMALIZATION = cprin
  /PRINT = RPOINTS CPOINTS
  /PLOT = none .

 
 

Re: Reference category for dummies in factor analysis

Hector Maletta
In reply to this post by Kooij, A.J. van der
Anita, you ARE indeed a useful source of advice in these abstruse matters.
First, sorry for alluding mistakenly to ALSCAL. In fact I was thinking of
CATPCA when I wrote that phrase about categorical factor analysis.
Now, if you could just possibly find that UN piece you seem to recall having
seen, I would be eternally grateful. In fact our work started in the context
of the 2005 Human Development Report for Bolivia, funded by the UNDP, though
it is now running independently of any UN support.
Just to shed some additional light into my brick-and-mud head: I suppose
that with CATPCA, if the factor score is a (linear?) function of the
transformed variables, it can also be expressed as a function of the
original categories. Brick and mud example: suppose having a brick wall is
quantified as 3.40, and a mud wall is 1.35; assume the Wall categorical
variable enters a factor score with a coefficient of 0.20. Thus having a
brick wall contributes 0.20x3.40=0.68 towards the factor score, and a mud
wall contributes 0.20x1.35=0.27. Is that so? Also: Are these factor scores
measured as z-scores, with zero mean and unit STD DEV? What are the
measurement units, means and SD of the transformed variables?
And do not forget my original question about the impact of different omitted
categories in factor analysis.
Thanks again for your help.
Hector


Re: Reference category for dummies in factor analysis

Kooij, A.J. van der
> ...with CATPCA, if the factor score is a (linear?) function of the
>transformed variables
Yes, linear; the same as in linear PCA. The CATPCA model is the same linear model as in PCA; the nonlinearity is only in the transformations of the variables.
 
>...Thus having a brick wall contributes 0.20x3.40=0.68 towards the factor score, and a mud
> wall contributes 0.20x1.35=0.27. Is that so?
Yes.

>Also: Are these factor scores measured as z-scores, with zero mean and unit STD DEV?
Yes.
>What are the measurement units, means and SD of the transformed variables?
z-scores, with zero mean and unit STD DEV
(although they differ slightly from true z-scores by a constant factor, because CATPCA standardizes on N instead of N-1).
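
In other words (a small worked check, writing s_N for the standard deviation
computed with divisor N and s_(N-1) for the usual one):

  s_N = s_(N-1) * sqrt((N-1)/N),  so  CATPCA score = z-score * sqrt(N/(N-1)),

a factor that is negligible for a sample as large as a census.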

Regards,
Anita
________________________________

From: SPSSX(r) Discussion on behalf of Hector Maletta
Sent: Thu 17/08/2006 22:06
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis



Anita, you ARE indeed a useful source of advice in these abstruse matters.
First, sorry for alluding mistakenly to ALSCAL. In fact I was thinking of
CATPCA when I wrote that phrase about categorical factor analysis.
Now, if you could just possibly find that UN piece you seem to recall having
seen, I would be eternally grateful. In fact our work started in the context
of the 2005 Human Development report for Bolivia, funded by the UNDP, though
is now running independently of any UN support.
Just to shed some additional light into my brick-and-mud head: I suppose
that with CATPCA, if the factor score is a (linear?) function of the
transformed variables, it can also be expressed as a function of the
original categories. Brick and mud example: suppose having a brick wall is
quantified as 3.40, and a mud wall is 1.35; assume the Wall categorical
variable enters a factor score with a coefficient of 0.20. Thus having a
brick wall contributes 0.20x3.40=0.68 towards the factor score, and a mud
wall contributes 0.20x1.35=0.27. Is that so? Also: Are these factor scores
measured as z-scores, with zero mean and unit STD DEV? What are the
measurement units, means and SD of the transformed variables?
And do not forget my original question about the impact of different omitted
categories in factor analysis.
Thanks again for your help.
Hector


Re: Reference category for dummies in factor analysis

Kooij, A.J. van der
In reply to this post by Hector Maletta
>And do not forget my original question about the impact of different omitted
>categories in factor analysis.
 

I don't know about PCA on dummy variables, so I don't know about omitting a
category, but I know how to obtain the solution from the indicator matrix;
maybe that will help a bit.

data list free/v1 v2 v3.
begin data.
1 2 3
2 1 3
2 2 2
3 1 1
2 3 4
2 2 2
1 2 4
end data.

catpca v1 v2 v3
 /analysis v1 v2 v3 (mnom)
 /dim=6
 /critit .0000001
 /print vaf quant obj
 /plot  none.

resulting eigenvalues:
2.587
1.608
1.500
1.083
.222
.000

 

Indicator matrix G is:

data list free/v1cat1 v1cat2 v1cat3 v2cat1 v2cat2 v2cat3 v3cat1 v3cat2 v3cat3 v3cat4 .
begin data.
1 0 0 0 1 0 0 0 1 0
0 1 0 1 0 0 0 0 1 0
0 1 0 0 1 0 0 1 0 0
0 0 1 1 0 0 1 0 0 0
0 1 0 0 0 1 0 0 0 1
0 1 0 0 1 0 0 1 0 0
1 0 0 0 1 0 0 0 0 1
end data.

 

Eigenvalue decomposition of the rescaled G'G gives the CATPCA solution (but
with 1 trivial/extraneous eigenvalue equal to the number of variables, because
G is not centered; this is avoided in CATPCA and Multiple Correspondence by
centering the quantifications).

MATRIX.
get g /file = 'e:\...\g.sav'.
compute gg = T(g) * g.
compute freq = CSUM(g).
compute d= MDIAG(freq).
compute mat = INV(SQRT(d)) * gg * INV(SQRT(d)).
CALL EIGEN (mat,eigvec,eigval).
print eigval.
END MATRIX.

result:

EIGVAL
   3.000000000
   2.586836818
   1.607968964
   1.500000000
   1.083262358
    .221931860
    .000000000
    .000000000
    .000000000
    .000000000
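
The same decomposition can be reproduced outside SPSS. Below is a minimal
sketch in Python/numpy, purely to illustrate the algebra (it is not how CATPCA
itself is implemented): it recovers the eigenvalues just listed from the
indicator matrix G, and shows that centering the columns of G turns the
trivial eigenvalue into zero while leaving the other eigenvalues unchanged.

import numpy as np

# Indicator matrix G from the example above: 7 cases, 3 variables
# with 3, 3 and 4 categories (10 indicator columns in total).
G = np.array([
    [1,0,0, 0,1,0, 0,0,1,0],
    [0,1,0, 1,0,0, 0,0,1,0],
    [0,1,0, 0,1,0, 0,1,0,0],
    [0,0,1, 1,0,0, 1,0,0,0],
    [0,1,0, 0,0,1, 0,0,0,1],
    [0,1,0, 0,1,0, 0,1,0,0],
    [1,0,0, 0,1,0, 0,0,0,1],
], dtype=float)

freq = G.sum(axis=0)                  # category frequencies, as CSUM(g) above
Dis = np.diag(1.0 / np.sqrt(freq))    # INV(SQRT(d))
M = Dis @ G.T @ G @ Dis               # the rescaled G'G
print(np.sort(np.linalg.eigvalsh(M))[::-1])
# 3.000  2.587  1.608  1.500  1.083  0.222  0 ...  (matches EIGVAL above)

# Centering the columns of G removes the trivial eigenvalue (it becomes 0);
# the non-trivial eigenvalues stay exactly the same.
Gc = G - G.mean(axis=0)
Mc = Dis @ Gc.T @ Gc @ Dis
print(np.sort(np.linalg.eigvalsh(Mc))[::-1])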

 

Some thoughts:

The rescaled G'G matrix is not positive definite, thus cannot be analyzed using SPSS Factor. Maybe this is the trouble you think of when using dummies for all categories?

The maximum number of dimensions is the sum over variables of number of categories minus 1. Maybe this is what you are thinking of when omitting a category?

 

Regards,

Anita



Re: Reference category for dummies in factor analysis

Hector Maletta
Anita,

Thanks again.

You wrote: "The rescaled G'G matrix is not positive definite, thus cannot be
analyzed using SPSS Factor. Maybe this is the trouble you think of when
using dummies for all categories?"

Quite possibly. I'm not sure we're talking about the same matrix. However,
in fact, when you use all the categories of a categorical variable, one of
the categories is redundant and factor analysis (or regression) is
impossible.

You also wrote: "The maximum number of dimensions is the sum over variables
of number of categories minus 1. Maybe this is what you are thinking of when
omitting a category?" Not exactly: in the case of categorical variables
converted into dummies, one category is omitted from each variable, not one
category over all variables. So for m variables with k categories each, you
have a sum total of km categories. If you convert them into dummies you get
mk-m dummies, and that is the maximum number of dimensions; whereas, as you
explain it, the maximum number of dimensions would be mk-1 > mk-m.

The difference is due to the fact that CATPCA does not require excluding one
category in each categorical variable.
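
To make the two counts concrete, take the toy example with three variables
having 3, 3 and 4 categories; a few lines of plain Python just to spell out
the arithmetic:

# number of categories per variable in the toy example
k = [3, 3, 4]
m = len(k)

# one category omitted from each variable: sum of (k_j - 1) dummies/dimensions
print(sum(kj - 1 for kj in k))   # 7  (= mk - m when every variable has k categories)

# a single category omitted over all variables: total categories minus 1
print(sum(k) - 1)                # 9  (= mk - 1 when every variable has k categories)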

But that was not my original question, which is still unanswered.

Hector




Re: Reference category for dummies in factor analysis

Art Kendall
In reply to this post by Kooij, A.J. van der
This has been an interesting discussion. I don't know why the FA and
scores would change depending on which category is omitted. Were there
errors in recoding to dummies that could have created different missing
values?


You also said classical FA, but then said PCA. What did you use for
communality estimates? 1.00? Squared multiple correlations?

(I'm not sure why you would create a single score if you have multiple
factors either, but that is another question.)

What I do know is that people who know a lot more about CA, MDS, and
factor analysis than I do (like Joe Kruskal, Doug Carroll, Willem
Heiser, Phipps Arabie, Shizuhiko Nishisato, et al.) follow the class-l
and mpsych-l discussion lists.
see

http://aris.ss.uci.edu/smp/mpsych.html

and

http://www.classification-society.org/csna/lists.html#class-l

Art Kendall
[hidden email]



Kooij, A.J. van der wrote:

>>... trouble because any category of each original census question would be an exact linear
>>function of the remaining categories of the question.
>>
>>
>Yes, but this gives trouble in regression, not in PCA, as far as I know.
>
>
>
>>In the indicator matrix, one category will have zeroes on all indicator variables.
>>
>>
>No, and, sorry, I was confused with CA on indicator matrix, but this is "sort of" PCA.  See syntax below (object scores=component scores are equal to row scores CA, category quantifications equal to column scores CA).
>Regards,
>Anita.
>
>
>data list free/v1 v2 v3.
>begin data.
>1 2 3
>2 1 3
>2 2 2
>3 1 1
>2 3 4
>2 2 2
>1 2 4
>end data.
>
>Multiple Correspondence v1 v2 v3
> /analysis v1 v2 v3
> /dim=2
> /critit .0000001
> /print discrim quant obj
> /plot  none.
>
>catpca v1 v2 v3
> /analysis v1 v2 v3 (mnom)
> /dim=2
> /critit .0000001
> /print quant obj
> /plot  none.
>
>data list free/v1cat1 v1cat2 v1cat3 v2cat1 v2cat2 v2cat3 v3cat1 v3cat2 v3cat3 v3cat4 .
>begin data.
>1 0 0 0 1 0 0 0 1 0
>0 1 0 1 0 0 0 0 1 0
>0 1 0 0 1 0 0 1 0 0
>0 0 1 1 0 0 1 0 0 0
>0 1 0 0 0 1 0 0 0 1
>0 1 0 0 1 0 0 1 0 0
>1 0 0 0 1 0 0 0 0 1
>end data.
>
>CORRESPONDENCE
>  TABLE = all (7,10)
>  /DIMENSIONS = 2
>  /NORMALIZATION = cprin
>  /PRINT = RPOINTS CPOINTS
>  /PLOT = none .
>
>
>
>________________________________
>
>From: SPSSX(r) Discussion on behalf of Hector Maletta
>Sent: Thu 17/08/2006 19:56
>To: [hidden email]
>Subject: Re: Reference category for dummies in factor analysis
>
>
>
>Thank you, Anita. I will certainly look into your suggestion about CATPCA.
>However, I suspect some mathematical properties of the scores generated by
>CATPCA are not the ones I hope to have in our scale, because of the
>non-parametric nature of the procedure (too long to explain here, and not
>sure of understanding it myself).
>As for your second idea, I think if you try to apply PCA on dummies not
>omitting any category you'd run into trouble because any category of each
>original census question would be an exact linear function of the remaining
>categories of the question. In the indicator matrix, one category will have
>zeroes on all indicator variables, and that one is the "omitted" category.
>Hector
>
>
>-----Original Message-----
>From: SPSSX(r) Discussion [mailto:[hidden email]] On behalf of
>Kooij, A.J. van der
>Sent: Thursday, August 17, 2006 2:37 PM
>To: [hidden email]
>Subject: Re: Reference category for dummies in factor analysis
>
>CATPCA (in Data Reduction menu, under Optimal Scaling) is PCA for
>(ordered/ordinal and unordered/nominal) categorical variables; no need to
>use dummies then.
>Using PCA on dummies I think you should not omit dummies (for nominal
>variables you can do PCA on an indicator matrix (that has columns that can
>be regarded as dummy variables; a column for each category, thus without
>omitting one)).
>
>Regards,
>Anita van der Kooij
>Data Theory Group
>Leiden University.
>
Art Kendall
Social Research Consultants

Re: Reference category for dummies in factor analysis

Kooij, A.J. van der
In reply to this post by Hector Maletta
Hector,
when you use all the categories of a categorical variable, factor analysis
and regression are not impossible; they are only impossible when following
the mathematical approach that involves the inverse of the matrix X'X (X
being the matrix of dummies). But CATPCA and CATREG find solutions
(iteratively) from the data itself, not from the correlation matrix of the
data, so there is no problem.

If you have dummies for all categories and use CATREG for regression,
applying the numerical scaling level, you do not need to leave dummies out,
and the result is equal to linear regression on dummies omitting 1 dummy
for each variable.
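
The regression part of this can be checked outside CATREG with ordinary
least squares: with an intercept and dummies for all categories the design
matrix is rank-deficient (X'X has no ordinary inverse), but it spans the same
space as the design that omits one dummy per variable, so the fitted values
coincide. A minimal numpy sketch with made-up data, only to illustrate that
linear-algebra fact, not CATREG itself:

import numpy as np

rng = np.random.default_rng(1)
n = 200
v1 = rng.integers(0, 3, n)     # a categorical variable with 3 categories
v2 = rng.integers(0, 4, n)     # a categorical variable with 4 categories
y = rng.normal(size=n)

def dummies(codes, k, omit=None):
    """Indicator columns for one variable, optionally omitting one category."""
    cats = [c for c in range(k) if c != omit]
    return np.column_stack([(codes == c).astype(float) for c in cats])

ones = np.ones((n, 1))
X_full = np.hstack([ones, dummies(v1, 3), dummies(v2, 4)])                 # all categories kept
X_red = np.hstack([ones, dummies(v1, 3, omit=0), dummies(v2, 4, omit=0)])  # one omitted per variable

print(np.linalg.matrix_rank(X_full), X_full.shape[1])   # rank < columns, so X'X is singular

# Least squares still works for the full design (via the pseudoinverse),
# and the fitted values agree with those of the reduced design.
fit_full = X_full @ np.linalg.lstsq(X_full, y, rcond=None)[0]
fit_red = X_red @ np.linalg.lstsq(X_red, y, rcond=None)[0]
print(np.allclose(fit_full, fit_red))                    # True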
For PCA I really don't think you should omit a dummy. PCA on nominal
variables is Multiple Correspondence Analysis (that is, CATPCA applying the
multiple nominal (mnom) scaling level to all variables), which analyzes
the indicator matrix in the way I described in my previous mail, thus using
all columns of the indicator matrix = using all dummies.
 
About the max. number of components: we agree; I was not clear in writing
it. It should have been "The maximum number of dimensions is the number of
categories minus 1 for each variable, summed over variables".
 
In a previous mail I said the transformed variables are z-scores, but this is
not true with mnom scaling: then the variables are centered but not
standardized.
(With the mnom scaling level, the quantification for a category on a
dimension/component is the centroid of the component scores for the
cases that have scored in that category.)
 
Regards,
Anita
 


Re: Reference category for dummies in factor analysis

Hector Maletta
In reply to this post by Art Kendall
Art,

Thanks for your interesting response. We used PCA with communalities set to 1.00, i.e. extracting 100% of the variance. By "classical" I meant parametric factor analysis, not any form of optimal scaling or alternating-least-squares data reduction. I am now considering these other alternatives on the advice of Anita van der Kooij.



I share your uncertainty about why results change depending on which category is omitted; that was the original question starting this thread. Since nobody else seems to have an answer, I will offer a purely numerical hypothesis. The exercise with the varying results, it turns out, was done by my colleague not with the entire census but with a SAMPLE of the Peru census (about 200,000 households, still a lot, but perhaps not so much for so many variables and factors), and the contributions of the later factors were pretty small. SPSS, as is well known, carries no more than about 15 significant decimal digits. So it is just possible that some matrix entries for some of the minor factors differed only in the 15th or 16th decimal place (or further down) and were then taken as equal, and this may have caused some matrix to be singular or (more probably) near-singular, and the results to be still computable but unstable. Moreover, some of the census categories used as reference or omitted categories were populated by very few cases, which may have compounded the problem. Since running this on the entire census (which would enhance the statistical significance and stability of the results) takes a lot of computer time and has to be done several times with different reference categories, we have not done it yet, but will proceed soon and report back. But I wanted to know
whether some mathematical reason existed for the discrepancy.
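For readers who want to see the effect on a toy example, here is a minimal sketch (not from the thread; the data and variable names are made up): the same two nominal variables are dummy-coded twice, once omitting the first category of each and once omitting the last, a one-component PCA is run on each coding, and the two sets of saved component scores are correlated. If the choice of reference category were immaterial, that correlation would be exactly +1 or -1.

* Hypothetical toy data: two nominal variables with three categories each.
DATA LIST FREE / v1 v2.
BEGIN DATA
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
2 2
3 1
1 3
END DATA.
* Coding A: first category of each variable omitted.
COMPUTE a12 = (v1 = 2).
COMPUTE a13 = (v1 = 3).
COMPUTE a22 = (v2 = 2).
COMPUTE a23 = (v2 = 3).
* Coding B: last category of each variable omitted.
COMPUTE b11 = (v1 = 1).
COMPUTE b12 = (v1 = 2).
COMPUTE b21 = (v2 = 1).
COMPUTE b22 = (v2 = 2).
EXECUTE.
* One-component PCA on each coding, saving regression-method component scores.
FACTOR
  /VARIABLES a12 a13 a22 a23
  /CRITERIA FACTORS(1)
  /EXTRACTION PC
  /SAVE REG(ALL, scoreA).
FACTOR
  /VARIABLES b11 b12 b21 b22
  /CRITERIA FACTORS(1)
  /EXTRACTION PC
  /SAVE REG(ALL, scoreB).
* If the reference category did not matter this correlation would be +1 or -1.
CORRELATIONS /VARIABLES = scoreA1 scoreB1.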



About why I would want to create a single score out of multiple factors, let
us leave it for another occasion since it is a rather complicated story of a
project connecting factor analysis with index number theory and economic
welfare theory.



Hector



  _____

From: Art Kendall [mailto:[hidden email]]
Sent: Friday, August 18, 2006 9:34 AM
To: Kooij, A.J. van der; [hidden email]
CC: [hidden email]
Subject: Re: Reference category for dummies in factor analysis



This has been an interesting discussion. I don't know why the FA and scores would change depending on which category is omitted. Were there errors in recoding to dummies that could have created different missing values?

You also said classical FA, but then said PCA. What did you use for communality estimates? 1.00? Squared multiple correlations?

(I'm not sure why you would create a single score if you have multiple factors either, but that is another question.)

What I do know is that people who know a lot more about CA, MDS, and factor analysis than I do (like Joe Kruskal, Doug Carroll, Willem Heiser, Phipps Arabie, Shizuhiko Nishisato, et al.) follow the class-l and mpsych-l discussion lists; see

http://aris.ss.uci.edu/smp/mpsych.html

and

http://www.classification-society.org/csna/lists.html#class-l

Art Kendall
[hidden email]



Kooij, A.J. van der wrote:



> ... trouble because any category of each original census question would be an exact linear
> function of the remaining categories of the question.

Yes, but this gives trouble in regression, not in PCA, as far as I know.

> In the indicator matrix, one category will have zeroes on all indicator variables.

No; and sorry, I was confusing this with CA on an indicator matrix, but that is "sort of" PCA. See the syntax below (object scores = component scores are equal to the CA row scores, and category quantifications are equal to the CA column scores).
Regards,
Anita.


data list free/v1 v2 v3.
begin data.
1 2 3
2 1 3
2 2 2
3 1 1
2 3 4
2 2 2
1 2 4
end data.

Multiple Correspondence v1 v2 v3
 /analysis v1 v2 v3
 /dim=2
 /critit .0000001
 /print discrim quant obj
 /plot  none.

catpca v1 v2 v3
 /analysis v1 v2 v3 (mnom)
 /dim=2
 /critit .0000001
 /print quant obj
 /plot  none.

data list free/v1cat1 v1cat2 v1cat3 v2cat1 v2cat2 v2cat3 v3cat1 v3cat2 v3cat3 v3cat4 .
begin data.
1 0 0 0 1 0 0 0 1 0
0 1 0 1 0 0 0 0 1 0
0 1 0 0 1 0 0 1 0 0
0 0 1 1 0 0 1 0 0 0
0 1 0 0 0 1 0 0 0 1
0 1 0 0 1 0 0 1 0 0
1 0 0 0 1 0 0 0 0 1
end data.

CORRESPONDENCE
  TABLE = all (7,10)
  /DIMENSIONS = 2
  /NORMALIZATION = cprin
  /PRINT = RPOINTS CPOINTS
  /PLOT = none .
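(A note for anyone running the three jobs above on this toy data set: the point of the comparison is that the object scores from MULTIPLE CORRESPONDENCE, the object scores from CATPCA with the mnom scaling level, and the row scores from CORRESPONDENCE on the 7-by-10 indicator table should agree, apart from normalization and a possible sign flip of a dimension.)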



________________________________

From: SPSSX(r) Discussion on behalf of Hector Maletta
Sent: Thu 17/08/2006 19:56
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis



Thank you, Anita. I will certainly look into your suggestion about CATPCA. However, I suspect some mathematical properties of the scores generated by CATPCA are not the ones I hope to have in our scale, because of the non-parametric nature of the procedure (too long to explain here, and I am not sure I fully understand it myself).
As for your second idea, I think that if you apply PCA to dummies without omitting any category you run into trouble, because each category of every original census question would be an exact linear function of the remaining categories of that question. In the indicator matrix, one category will have zeroes on all indicator variables, and that one is the "omitted" category.
Hector


-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On behalf of
Kooij, A.J. van der
Sent: Thursday, August 17, 2006 2:37 PM
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis

CATPCA (in the Data Reduction menu, under Optimal Scaling) is PCA for (ordered/ordinal and unordered/nominal) categorical variables; there is no need to use dummies then.
Using PCA on dummies, I think you should not omit dummies (for nominal variables you can do PCA on an indicator matrix, which has columns that can be regarded as dummy variables: a column for each category, thus without omitting one).

Regards,
Anita van der Kooij
Data Theory Group
Leiden University.


Re: Reference category for dummies in factor analysis

Art Kendall
If you think that you might have been approaching a zero determinant, you might try a PFA (principal axis factoring) and see what the communalities look like. Obviously the determinant would be zero if all of the categories of a single variable had dummies.
Also, the 16 or so decimal digits apply only to the mantissa; in double precision the (binary) exponent
can go to something like 1022 or 1023 depending on the sign.
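A tiny sketch of the zero-determinant point, again with made-up data and illustrative names (not part of Art's message): when every category of a variable gets its own dummy, the dummies sum to 1 for each case, so the correlation matrix is singular and FACTOR reports a determinant of (essentially) zero.

* Hypothetical toy data: one nominal variable with three categories.
DATA LIST FREE / v1.
BEGIN DATA
1 2 3 1 2 3 2 1 3 2
END DATA.
* One dummy per category, none omitted, so d1 + d2 + d3 = 1 for every case.
COMPUTE d1 = (v1 = 1).
COMPUTE d2 = (v1 = 2).
COMPUTE d3 = (v1 = 3).
EXECUTE.
* DET prints the determinant of the correlation matrix, which is zero here.
FACTOR
  /VARIABLES d1 d2 d3
  /PRINT INITIAL DET
  /EXTRACTION PC.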

It really would be interesting to hear what some of the inventors of extensions to FA, like CA, INDSCAL, ALSCAL, and PROXSCAL, have to say about your problem. I would urge you to post a description of what you are trying to do and the problem you ran into on the lists I mentioned.

Just to stir the pot: is it possible that the relation among the variables would vary by market area, e.g. manufacturing zones vs. coastal fishing vs. field agriculture vs. herding? So that something like an INDSCAL with a correlation or other variable-similarity measure per "region" would be informative and fit the data even better?

Art
Social Research Consultants
[hidden email]

Art Kendall
Social Research Consultants

Re: Reference category for dummies in factor analysis

Hector Maletta
Art,
Interesting thoughts, as usual from you. I didn't know about the wider range of the exponent. About non-parametric factor analysis based on alternating least squares, I have received excellent advice from one of the major figures in the field, Anita van der Kooij (one of the developers of the SPSS Categories module, where these procedures are included). Of course, in those approaches you do not drop any category; all categories are quantified and you get a unique set of factor scores, so my problem disappears. Since our project involves using the scores in a model drawing on index number theory and neoclassical economics, we are now analyzing whether by using categorical factor analysis we might lose some useful mathematical properties of PCA.
Regarding your idea of differentiating by region or sector: of course, this kind of scale is computed in order to be used in analytical and practical applications involving geographical breakdowns (e.g. to improve or refine the targeting of social programs) and other analytical subdivisions (such as sector of employment). For some applications it makes sense to compute a region-based scale, and for other purposes a nationwide one. In our case we are working in the context of nationally adopted goals of equitable development within the so-called United Nations Millennium Development Goals, and therefore the standards for the standard of living should be set at the national level, because governments establish such goals for all the people of their nations. However, in equations where the scale is a predictor, region is certainly another likely predictor, along with other variables of interest (employment and education level of household adults, say), for predicting outcomes such as children dropping out of school or child mortality, which figure prominently in the Millennium Goals.
Hector


Re: Reference category for dummies in factor analysis

Hector Maletta
In reply to this post by Hector Maletta
Anita,
You are saying that the solution attained with CATPCA by assigning numeric values to all categories is the same solution attained by PCA run on the interval variables resulting from CATPCA's transformation of the categories. This I do not dispute.
Let me ask you this: can you say that the solution achieved by CATPCA on a set of categorical (nominal) variables is the same as that achieved by PCA after converting all categories (minus one) of each categorical variable into dummies? If so, it would follow that the choice of the omitted category should not matter: all the PCA solutions, with different omitted categories, should coincide with CATPCA. Is that so?
Hector

-----Original Message-----
From: Kooij, A.J. van der [mailto:[hidden email]]
Sent: Friday, August 18, 2006 1:02 PM
To: Hector Maletta
Subject: RE: Re: Reference category for dummies in factor analysis

> ... we are now analyzing whether by using categorical factor analysis we may
> lose some useful mathematical properties of PCA

I don't think you will lose anything (but if you do, let me know), because the CATPCA model is the linear PCA model for scaling levels other than mnom (and also with mnom for a 1-dimensional solution). To check this you can save the transformed data and use these as input to FACTOR PCA: the results are equal to the CATPCA results (sometimes slightly different with the default iteration options; in that case adjust the convergence criterion and/or the maximum number of iterations).
With the mnom scaling level and a multi-dimensional solution you have transformed data for each dimension; FACTOR PCA on the transformed variables
for a dimension is equal to CATPCA results for that dimension.
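As an illustrative sketch of this check (not part of Anita's message; the scaling level is arbitrary and the names of the saved transformed variables are placeholders, since the actual names have to be read from the Data Editor after the CATPCA run), the transformed data can be saved with CATPCA's SAVE subcommand and then fed to FACTOR:

catpca v1 v2 v3
 /analysis v1 v2 v3 (ordi)
 /dim=2
 /critit .0000001
 /print quant obj
 /plot none
 /save trdata.
* Plain PCA on the saved transformed variables; the names below are
* placeholders for whatever variables the CATPCA run actually creates.
FACTOR
  /VARIABLES trans1 trans2 trans3
  /CRITERIA FACTORS(2)
  /EXTRACTION PC.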

Another suggestion regarding regions or something like that:
To explore if regions differ, you can include a region variable as
supplementary. This variable is not used in the analysis, but is fitted
into the solution. If it does not fit well, regions do not differ with
respect to the solution; if it does fit well you can inspect the
categories plot to see how regions relate to the analysis variables.

Regards,
Anita



Re: Reference category for dummies in factor analysis

Kooij, A.J. van der
In reply to this post by Hector Maletta
> Can you say that the solution achieved by CATPCA on a set of categorical (nominal)
> variables is the same as that achieved by PCA by converting all categories
> (minus one) of each categorical variable into dummies?

No. The CATPCA solution with the multiple nominal scaling level (which is (multiple) correspondence analysis, and which equals CATPCA with the nominal scaling level if there is only one dimension) is the same solution as the one obtained from analyzing the indicator matrix (see previous mail). This solution is not equal to the PCA solution for dummies, with or without omitting categories. Maybe it is possible to compute the CATPCA results from the dummy solution and vice versa (as is possible for CATREG with nominal scaling versus linear regression on dummies), but I don't know how, and actually I don't think so.

Regards,
Anita

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Hector Maletta
Sent: 18 August 2006 18:16
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis


Anita,
You are saying that the solution attained with CATPCA by assigning
numeric values to all categories is the same solution attained by CPA
run on the interval variables resulting from the transformation of
categories by CATPCA. This I do not dispute. Let me ask you this: can
you say that the solution achieved by CATPCA on a set of categorical
(nominal) variables is the same achieved by CPA by converting all
categories (minus 1) of each categorical variable into dummies? If so,
it would transpire that the choice of the omitted category should not
matter: all solutions in PCA, with different omitted categories, should
all coincide with CATPCA. Is it so? Hector

-----Mensaje original-----
De: Kooij, A.J. van der [mailto:[hidden email]] Enviado el:
Friday, August 18, 2006 1:02 PM
Para: Hector Maletta
Asunto: RE: Re: Reference category for dummies in factor analysis

>... we are now analyzing whether by using categorical factor analysis
we may
> lose some useful mathematical properties of PCA

I don't think you will lose anything (but if you do, let me know),
because the CATPCA model is the linear PCA model for scaling levels
other than mnom (and also with mnom for a 1-dimensional solution). To
check this, you can save the transformed data and use these as input to
FACTOR PCA: the results are equal to the CATPCA results (sometimes
slightly different with the default iteration options; in that case
adjust the convergence criterion and/or the maximum number of
iterations). With the mnom scaling level and a multi-dimensional
solution you have transformed data for each dimension; then FACTOR PCA
on the transformed variables for a dimension is equal to the CATPCA
results for that dimension.
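
For concreteness, a minimal sketch of that check, using the small
v1 v2 v3 example that appears further down in this thread (the
/SAVE TRDATA keyword is how I recall the CATPCA syntax, and the names
of the saved transformed variables depend on the rootname CATPCA uses,
so check the active dataset and substitute the actual names in the
FACTOR step):

* Sketch: run CATPCA with an ordinal scaling level, save the optimally
* transformed data, then run a linear PCA (FACTOR with principal
* components extraction) on the transformed variables and compare the
* two solutions.
data list free/v1 v2 v3.
begin data.
1 2 3
2 1 3
2 2 2
3 1 1
2 3 4
2 2 2
1 2 4
end data.

catpca v1 v2 v3
 /analysis v1 v2 v3 (ordi)
 /dim=2
 /critit .0000001
 /print quant obj
 /save trdata
 /plot none.

* Replace tr_v1, tr_v2, tr_v3 below with the names CATPCA actually
* gives the saved transformed variables in the active dataset.
factor
 /variables = tr_v1 tr_v2 tr_v3
 /criteria = factors(2)
 /extraction = pc
 /rotation = norotate
 /print = extraction.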

Another suggestion regarding regions or something like that:
To explore if regions differ, you can include a region variable as
supplementary. This variable is not used in the analysis, but is fitted
into the solution. If it does not fit well, regions do not differ with
respect to the solution; if it does fit well you can inspect the
categories plot to see how regions relate to the analysis variables.

Regards,
Anita


-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Hector Maletta
Sent: 18 August 2006 17:19
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis


Art,
Interesting thoughts, as usual from you. I didn't know about the higher
precision of exponents. About non-parametric factor analysis based on
alternating least squares, I have received excellent advice from one of
the major figures in the field, Anita van der Kooij (one of the
developers of the Categories SPSS module where these procedures are
included). Of course, in those approaches you do not drop any category;
all categories are quantified and you get a unique set of factor
scores, so my problem disappears. Since our project involves using the
scores in a model involving index number theory and neoclassical
economics, we are now analyzing whether by using categorical factor
analysis we may lose some useful mathematical properties of PCA.

Regarding your idea of differentiating by region or sector: of course,
this kind of scale is computed in order to be used in analytical and
practical applications involving geographical breakdown (e.g. to
improve or refine targeting of social programs) and other analytical
subdivisions (such as sector of employment). For some applications it
makes sense to compute a region-based scale and for other purposes a
nationwide one. In our case we are working in the context of nationally
adopted goals of equitable development within the so-called United
Nations Millennium Development Goals, and therefore the standards for
the standard of living should be set at the national level, because
governments establish such goals for all the people of their nations.
However, in equations where the scale is a predictor, regions are
certainly another likely predictor along with other variables of
interest (employment and education level of household adults, say), to
predict outcomes such as children dropping out of school or child
mortality, which figure outstandingly in the Millennium Goals.

Hector

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Art Kendall
Sent: Friday, August 18, 2006 11:53 AM
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis

If you think that you might have been approaching a zero determinant,
you might try a PFA (principal axis factoring) and see what the
communalities look like. Obviously the determinant would be zero if all
of the categories for a single variable had dummies. Also, the 16 or so
decimal digits apply only to the mantissa; the exponent can go to
something like 1022 or 1023, depending on the sign.
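
A minimal sketch of such a check (dum1 to dum20 and the number of
factors are placeholders for whatever dummies and dimensionality are
actually in play, not the real census variables):

* Sketch: print the determinant of the correlation matrix and run a
* principal axis factoring so the estimated communalities can be
* inspected; a determinant at or very near zero points to the kind of
* (near-)singularity discussed in this thread.
factor
 /variables = dum1 to dum20
 /print = initial det kmo extraction
 /criteria = factors(5)
 /extraction = paf
 /rotation = norotate.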

It really would be interesting to hear what some of the inventors of
extensions to FA, like CA, INDSCAL, ALSCAL, and PROXSCAL, have to say
about your problem.  I would urge you to post a description of what you
are trying to do and the problem you ran into on the lists I mentioned.

Just to stir the pot: is it possible that the relation of the variables
would vary by market area, e.g., manufacturing zones vs. coastal
fishing vs. field agriculture vs. herding, etc.? So that something like
an INDSCAL with a correlation or other variable-similarity measure per
"region" would be informative and fit the data even better?

Art
Social Research Consultants
[hidden email]

Hector Maletta wrote:
>
> Art,
>
> Thanks for your interesting response. We used PCA, with 1.00
> communality, i.e. extracting 100% of the variance. By "classical" I
> meant parametric factor analysis and not any form of optimal scaling
> or alternating least squares forms of data reduction. I am now
> considering these other alternatives under the advice of Anita van
> der Kooij.
>
> I share your uncertainty about why results change depending on which
> category is omitted, and that was the original question starting this
> thread. Since nobody else seems to have an answer, I will offer one
> purely numerical hypothesis. The exercise with the varying results,
> it turns out, was done by my colleague not with the entire census but
> with a SAMPLE of the Peru census (about 200,000 households, still a
> lot, but perhaps not so much for so many variables and factors), and
> the contributions of the later factors were pretty small. SPSS
> provides, as is well known, a precision of no more than approximately
> 15 significant decimal digits. So it is just possible that some
> matrix figures for some of the minor factors differed only in the
> 15th or 16th decimal place (or further down), and were then taken as
> equal, and this may have caused some matrix to be singular or (most
> probably) near-singular, and the results to be still computable but
> unstable. Moreover, some of the categories in census questions used
> as reference or omitted categories were populated by very few cases,
> which may have compounded the problem. Since running this on the
> entire census (which would enhance statistical significance and
> stability of results) takes a lot of computer time and has to be done
> several times with different reference categories, we have not done
> it yet but will proceed soon and report back. But I wanted to know
> whether some mathematical reason existed for the discrepancy.
>
> About why I would want to create a single score out of multiple
> factors, let us leave it for another occasion, since it is a rather
> complicated story of a project connecting factor analysis with index
> number theory and economic welfare theory.
>
> Hector
>
> ------------------------------------------------------------------------
>
> From: Art Kendall [mailto:[hidden email]]
> Sent: Friday, August 18, 2006 9:34 AM
> To: Kooij, A.J. van der; [hidden email]
> CC: [hidden email]
> Subject: Re: Reference category for dummies in factor analysis
>
> This has been an interesting discussion.  I don't know why the FA and
> scores would change depending on which category is omitted.  Were
> there errors in recoding to dummies that could have created different
> missing values?
>
> You also said classical FA, but then said PCA.  What did you use for
> communality estimates? 1.00? Squared multiple correlations?
>
> (I'm not sure why you would create a single score if you have
> multiple factors either, but that is another question.)
>
> What I do know is that people who know a lot more about CA, MDS, and
> factor analysis than I do (like Joe Kruskal, Doug Carroll, Willem
> Heiser, Phipps Arabie, Shizuhiko Nishisato, et al.) follow the
> class-l and mpsych-l discussion lists. See
>
> http://aris.ss.uci.edu/smp/mpsych.html
>
> and
>
> http://www.classification-society.org/csna/lists.html#class-l
>
> Art Kendall
> [hidden email] <mailto:[hidden email]>
>
> Kooij, A.J. van der wrote:
>
>> ... trouble because any category of each original census question
>> would be an exact linear function of the remaining categories of the
>> question.
>
> Yes, but this gives trouble in regression, not in PCA, as far as I
> know.
>
>> In the indicator matrix, one category will have zeroes on all
>> indicator variables.
>
> No, and, sorry, I was confused with CA on the indicator matrix, but
> this is "sort of" PCA.  See the syntax below (object scores =
> component scores are equal to the CA row scores; category
> quantifications are equal to the CA column scores).
>
> Regards,
> Anita.
>
> data list free/v1 v2 v3.
> begin data.
> 1 2 3
> 2 1 3
> 2 2 2
> 3 1 1
> 2 3 4
> 2 2 2
> 1 2 4
> end data.
>
> Multiple Correspondence v1 v2 v3
>  /analysis v1 v2 v3
>  /dim=2
>  /critit .0000001
>  /print discrim quant obj
>  /plot none.
>
> catpca v1 v2 v3
>  /analysis v1 v2 v3 (mnom)
>  /dim=2
>  /critit .0000001
>  /print quant obj
>  /plot none.
>
> data list free/v1cat1 v1cat2 v1cat3 v2cat1 v2cat2 v2cat3 v3cat1 v3cat2 v3cat3 v3cat4 .
> begin data.
> 1 0 0 0 1 0 0 0 1 0
> 0 1 0 1 0 0 0 0 1 0
> 0 1 0 0 1 0 0 1 0 0
> 0 0 1 1 0 0 1 0 0 0
> 0 1 0 0 0 1 0 0 0 1
> 0 1 0 0 1 0 0 1 0 0
> 1 0 0 0 1 0 0 0 0 1
> end data.
>
> CORRESPONDENCE
>   TABLE = all (7,10)
>   /DIMENSIONS = 2
>   /NORMALIZATION = cprin
>   /PRINT = RPOINTS CPOINTS
>   /PLOT = none .
>
> ________________________________
>
> From: SPSSX(r) Discussion on behalf of Hector Maletta
> Sent: Thu 17/08/2006 19:56
> To: [hidden email] <mailto:[hidden email]>
> Subject: Re: Reference category for dummies in factor analysis
>
> Thank you, Anita. I will certainly look into your suggestion about
> CATPCA. However, I suspect some mathematical properties of the scores
> generated by CATPCA are not the ones I hope to have in our scale,
> because of the non-parametric nature of the procedure (too long to
> explain here, and I am not sure I understand it myself).
>
> As for your second idea, I think that if you try to apply PCA on
> dummies without omitting any category, you'd run into trouble because
> any category of each original census question would be an exact
> linear function of the remaining categories of the question. In the
> indicator matrix, one category will have zeroes on all indicator
> variables, and that one is the "omitted" category.
>
> Hector
>
> -----Original Message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
> Kooij, A.J. van der
> Sent: Thursday, August 17, 2006 2:37 PM
> To: [hidden email] <mailto:[hidden email]>
> Subject: Re: Reference category for dummies in factor analysis
>
> CATPCA (in the Data Reduction menu, under Optimal Scaling) is PCA for
> (ordered/ordinal and unordered/nominal) categorical variables; no
> need to use dummies then.
> Using PCA on dummies, I think you should not omit dummies (for
> nominal variables you can do PCA on an indicator matrix, which has
> columns that can be regarded as dummy variables: a column for each
> category, thus without omitting one).
>
> Regards,
> Anita van der Kooij
> Data Theory Group
> Leiden University.

Re: Reference category for dummies in factor analysis

Art Kendall
In reply to this post by Hector Maletta
In an INDSCAL approach one finds dimensions that are common to the
overall set of matrices.  The common matrix is the same as that found
in a simple MDS.  However, each matrix also has measures of how much
use is made of each dimension.  What I was thinking is that access to
fresh vegetables may be less defining in rural areas than in urban
areas, or the number of rooms less defining in temperate micro-climates
where much household activity can be done outdoors.  However, you might
also be able to get at that by clustering localities based on their
scores on the first so many factors.
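
A rough sketch of that clustering idea, assuming the household file
already has a locality identifier and saved factor scores (all names
below are placeholders, and the number of clusters is arbitrary):

* Sketch: average previously saved factor scores within each locality,
* then run k-means on the locality means; all variable names are
* placeholders for whatever the actual file contains.
aggregate
 /outfile = *
 /break = locality
 /fac1_mean = mean(fac1_1)
 /fac2_mean = mean(fac2_1).

quick cluster fac1_mean fac2_mean
 /criteria = cluster(4) mxiter(20)
 /save cluster
 /print initial.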

The Leiden people built on the work of the people I mentioned, and from
what I have seen at the Classification Society and other sources their
work is outstanding.  I haven't yet had an opportunity to use CATPCA or
CATREG, but they look promising for many purposes. My understanding is
that CATPCA produces interval-level scores that are interpreted much as
those in PCA.  It seems that if you have a solid basis to construct
your measure you should be OK.  If you come across anything that
compares/contrasts scores from PCA, scores from CATPCA, and traditional
scores from factor analysis using unit weights, I would appreciate it
if you would pass the citation along.
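
If anyone wants to run such a comparison on their own data, a rough
sketch might look like this (all names are placeholders; the CATPCA
score, called catpca_dim1 here, is assumed to have been saved to the
same file beforehand):

* Sketch: compare a saved principal component score, a CATPCA score
* saved earlier (placeholder name catpca_dim1), and a unit-weighted
* score built from the same items (placeholder names item1 to item10).
factor
 /variables = item1 to item10
 /criteria = factors(1)
 /extraction = pc
 /rotation = norotate
 /save = reg(all).
* FACTOR saves the component score under a default name such as FAC1_1.

compute unitscore = mean(item1 to item10).
execute.

correlations
 /variables = fac1_1 unitscore catpca_dim1.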

Art
Social Research Consultants
[hidden email]




Re: Reference category for dummies in factor analysis

Hector Maletta
Art,

As I told you before, and as you may have seen in this thread, I have
been exchanging messages on this topic with Anita van der Kooij, who
along with Jacqueline Meulman and others at Leiden authored CATPCA,
CATREG and other similar procedures based on optimal scaling and
alternating least squares. I have gained a lot of enlightenment on this
matter (as I hope others on the list have as well). Like you, I have no
experience myself with these procedures.

My original question remains, alas, unanswered: why would the SPSS
FACTOR procedure, applied to a number of categorical variables
converted into dummies, yield different results depending on which
category is used as the reference category in each variable? Not the
trivially different results arising from the different contrast, but
non-trivial ones such as the shape of the distribution, etc.

We are now tilting towards using CATPCA instead of FACTOR as the
starting point of our analysis, if only we can find out whether the
mathematical properties of the solution fit well with our analytical
purposes. The prospects in that regard are so far quite promising.
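
For what it is worth, a rough sketch of what that CATPCA starting point
might look like (the variable names and scaling levels below are
invented placeholders, not our actual census items, and the /SAVE
keyword for object scores should be double-checked against the CATPCA
documentation for your SPSS version):

* Sketch: CATPCA on census-type categorical variables, ordinal where
* the categories are ordered and nominal where they are not, saving
* the object scores on the first dimension for possible use as a
* standard-of-living scale; all variable names are placeholders.
catpca walltype flooring water sanitation electricity
 /analysis walltype (nomi) flooring (nomi) water (ordi)
   sanitation (ordi) electricity (nomi)
 /dim=1
 /critit .0000001
 /print quant
 /save object
 /plot none.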

Hector


