SPSSX Discussion

Reference category for dummies in factor analysis


statisticsdoc

Re: Reference category for dummies in factor analysis

Hector,

This has been a fascinating thread. I am just curious - have you computed
factor scores from the different solutions? Do the same cases get different
factor scores when different reference categories are used?

Best,

Stephen Brand

For personalized and professional consultation in statistics and research
design, visit
www.statisticsdoc.com

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]]On Behalf Of
Hector Maletta
Sent: Friday, August 18, 2006 2:16 PM
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis

Art,

As I told you before, and as you may have seen in this thread, I have been
exchanging messages on this topic with Anita van der Kooij, who along with
Jacqueline Meulman and others at Leiden authored CATPCA, CATREG and other
similar procedures based on optimal scaling and alternating least squares.
I have gained a lot of enlightenment on this matter (as I hope others on
the list have too). Like you, I have no experience with these procedures
myself.

My original question remains, alas, unanswered: why would the SPSS FACTOR
procedure, applied to a number of categorical variables converted into
dummies, yield different results depending on which category is used
as the reference category in each variable? I do not mean the trivial
differences that follow from the different contrast, but non-trivial ones
such as the shape of the distribution of the resulting scale.
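One ingredient of an answer can be sketched numerically. This is an illustration under stated assumptions, not a diagnosis of the actual census run. Principal components are computed from the correlation matrix of the dummies, and that matrix is itself different for each choice of reference category, so its eigenvalues (and hence the components) need not agree. The sketch below uses the three-category toy variable v1 posted elsewhere in this thread; the helper names `pearson` and `dummies` are made up for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def dummies(values, reference):
    """0/1 dummy columns for every category except the reference one."""
    cats = sorted(set(values) - {reference})
    return {c: [1 if v == c else 0 for v in values] for c in cats}

# Toy three-category variable (v1 from the small example posted in this thread).
v1 = [1, 2, 2, 3, 2, 2, 1]

for ref in (1, 3):
    a, b = dummies(v1, ref).values()   # the two dummies that remain
    r = pearson(a, b)
    # A 2x2 correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + |r| and 1 - |r|.
    print(f"reference={ref}: r={r:.4f}, "
          f"eigenvalues={1 + abs(r):.4f}, {1 - abs(r):.4f}")
```

With category 1 as reference the two remaining dummies correlate at about -0.47, with category 3 as reference at about -0.73, so the first eigenvalue is about 1.47 in one run and 1.73 in the other. Unlike regression, PCA is not invariant to this kind of recoding, which is at least consistent with the non-trivial differences described above.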

We are now leaning towards using CATPCA instead of FACTOR as the starting
point of our analysis, provided we can establish that the mathematical
properties of the solution fit our analytical purposes. The
prospects in that regard are so far quite promising.

Hector

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Art
Kendall
Sent: Friday, August 18, 2006 2:49 PM
To: [hidden email]
Subject: Re: Reference category for dummies in factor analysis

In an INDSCAL approach one finds dimensions that are common to the
overall set of matrices. The common matrix is the same as that found in
a simple MDS. However, each matrix has measures of how much use is
made of each dimension in that matrix. What I was thinking is that access
to fresh vegetables may be less defining in rural areas than in urban
areas, as may the number of rooms in temperate micro-climates where much
household activity can be done outdoors. However, you might also be
able to get at that by clustering localities based on their scores on the
first several factors.

The Leiden people built on the work of the people I mentioned, and from
what I have seen at the Classification Society and other sources their
work is outstanding. I haven't yet had an opportunity to use CATPCA or
CATREG, but they look promising for many purposes. My understanding is
that CATPCA produces interval-level scores that are interpreted much like
those in PCA. It seems that if you have a solid basis on which to construct
your measure you should be OK. If you come across anything that
compares or contrasts scores from PCA, scores from CATPCA, and traditional
scores from factor analysis using unit weights, I would appreciate it if
you would pass the citation along.

Art
Social Research Consultants
[hidden email]

Hector Maletta wrote:

>Art,
>Interesting thoughts, as usual from you. I didn't know about the wide
>range of the exponent. As for non-parametric factor analysis based on
>alternating least squares, I have received excellent advice from one of the
>major figures in the field, Anita van der Kooij (one of the developers of
>the SPSS Categories module, where these procedures are included). Of course,
>in those approaches you do not drop any category: all categories are
>quantified and you get a unique set of factor scores, so my problem
>disappears. Since our project involves using the scores in a model involving
>index number theory and neoclassical economics, we are now analyzing whether
>by using categorical factor analysis we may lose some useful mathematical
>properties of PCA.
>Regarding your idea of differentiating by region or sector: of course, this
>kind of scale is computed in order to be used in analytical and practical
>applications involving geographical breakdown (e.g. to improve or refine
>the targeting of social programs) and other analytical subdivisions (such as
>sector of employment). For some applications it makes sense to compute a
>region-based scale and for other purposes a nationwide one. In our case we
>are working in the context of nationally adopted goals of equitable
>development within the so-called United Nations Millennium Development Goals,
>and therefore the standards for the standard of living should be set at
>the national level, because governments establish such goals for all the people
>of their nations. However, in equations where the scale is a predictor,
>region is certainly another likely predictor, along with other variables of
>interest (employment and education level of household adults, say), to
>predict outcomes such as children dropping out of school or child mortality,
>which figure prominently in the Millennium Goals.
>Hector

>

>-----Original Message-----
>From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Art
>Kendall
>Sent: Friday, August 18, 2006 11:53 AM
>To: [hidden email]
>Subject: Re: Reference category for dummies in factor analysis

>

>If you think that you might have been approaching a zero determinant,
>you might try a PFA and see what the communalities look like.
>Obviously the determinant would be zero if all of the categories of a
>single variable had dummies.
>Also, the 16 or so decimal digits apply only to the mantissa; the exponent
>can go to something like 1022 or 1023, depending on the sign.
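The zero-determinant remark can be checked exactly on a toy case. The sketch below is an illustration only: it uses the three-category variable v1 from the small example posted in this thread, and it works with the covariance matrix in exact fractions rather than the correlation matrix FACTOR would use (one is singular exactly when the other is, as long as every dummy has non-zero variance). The helpers `cov` and `det3` are names invented for the example.

```python
from fractions import Fraction

v1 = [1, 2, 2, 3, 2, 2, 1]
cats = sorted(set(v1))

# Full indicator coding: one 0/1 column per category, none omitted.
cols = [[Fraction(1 if v == c else 0) for v in v1] for c in cats]

def cov(x, y):
    """Sample covariance in exact rational arithmetic."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

S = [[cov(x, y) for y in cols] for x in cols]

# Every case falls in exactly one category, so the columns sum to 1 in each row.
print(all(sum(col[i] for col in cols) == 1 for i in range(len(v1))))  # True

# With all three dummies present the matrix is exactly singular.
print(det3(S))                                    # 0

# Dropping one category (here the first) removes the dependency.
print(S[1][1] * S[2][2] - S[1][2] * S[2][1])      # 2/63
```

On real data the dependency is the same, but floating-point arithmetic may report a very small non-zero determinant instead of an exact zero.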

>

>It really would be interesting to hear what some of the inventors of
>extensions to FA, such as CA, INDSCAL, ALSCAL and PROXSCAL, have to say about
>your problem. I would urge you to post a description of what you are
>trying to do and the problem you ran into on the lists I mentioned.
>
>Just to stir the pot: is it possible that the relations among the variables
>vary by market area, e.g. manufacturing zones vs. coastal fishing
>vs. field agriculture vs. herding?
>If so, might something like an INDSCAL with a correlation or other variable
>similarity measure per "region" be informative and fit the data
>even better?

>

>Art
>Social Research Consultants
>[hidden email]
>
>Hector Maletta wrote:
>

>>Art,
>>
>>Thanks for your interesting response. We used PCA, with communalities of
>>1.00, i.e. extracting 100% of the variance. By "classical" I meant
>>parametric factor analysis and not any form of optimal scaling or
>>alternating least squares data reduction. I am now
>>considering these other alternatives on the advice of Anita van der Kooij.

>>

>>

>>

>>I share your uncertainty about why results change depending on which
>>category is omitted; that was the original question that started this
>>thread. Since nobody else seems to have an answer, I will offer one
>>purely numerical hypothesis. The exercise with the varying results, it
>>turns out, was done by my colleague not with the entire census but
>>with a SAMPLE of the Peru census (about 200,000 households, still a
>>lot but perhaps not so many for so many variables and factors), and
>>the contributions of the later factors were pretty small. SPSS provides,
>>as is well known, a precision of no more than about 15 significant digits.
>>So it is just possible that some matrix figures for some of the minor
>>factors differed only in the 15th or 16th digit (or further
>>down) and were then taken as equal, and this may have caused some
>>matrix to be singular or (most probably) near-singular, and the
>>results to be still computable but unstable. Moreover, some of the
>>categories in census questions used as reference or omitted categories
>>were populated by very few cases, which may have compounded the
>>problem. Since running this on the entire census (which would enhance
>>the statistical significance and stability of the results) takes a lot of
>>computer time and has to be done several times with different
>>reference categories, we have not done it yet, but we will proceed soon
>>and report back. But I wanted to know whether some mathematical reason
>>existed for the discrepancy.
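The precision hypothesis can be made concrete with a few lines of Python. This is only a sketch of how 8-byte floating-point numbers behave, assuming the values in question are held as IEEE 754 doubles (which is what a Python float is); it says nothing about the internals of FACTOR itself.

```python
import math
import sys

# Decimal digits guaranteed to survive, and binary digits in the mantissa.
print(sys.float_info.dig)        # 15
print(sys.float_info.mant_dig)   # 53

# Two numbers that differ only beyond the 16th significant digit
# are indistinguishable ...
print(1.0 + 1e-16 == 1.0)        # True

# ... but smallness alone costs nothing: the exponent carries the magnitude,
# and the relative precision of a tiny number is the same as that of a big one.
tiny = 1e-300
print(tiny * (1 + 1e-10) == tiny)   # False

# Binary exponent range of normal doubles: from 2**-1022 up to 2**1023.
print(math.ldexp(1.0, -1022) == sys.float_info.min)   # True
print(math.isfinite(math.ldexp(1.0, 1023)))           # True
```

So minor factors would not be lost merely because their contributions are small; trouble would need values that agree with each other to 15 or 16 significant digits, which is the near-singular situation described above.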

>>

>>

>>

>>About why I would want to create a single score out of multiple
>>factors, let us leave it for another occasion since it is a rather
>>complicated story of a project connecting factor analysis with index
>>number theory and economic welfare theory.
>>
>>Hector

>>

>>

>>

>>------------------------------------------------------------------------

>>

>>From: Art Kendall [mailto:[hidden email]]
>>Sent: Friday, August 18, 2006 9:34 AM
>>To: Kooij, A.J. van der; [hidden email]
>>CC: [hidden email]
>>Subject: Re: Reference category for dummies in factor analysis

>>

>>

>>

>>This has been an interesting discussion. I don't know why the FA and
>>scores would change depending on which category is omitted. Were
>>there errors in recoding to dummies that could have created different
>>missing values?
>>
>>You also said classical FA, but then said PCA. What did you use for
>>communality estimates? 1.00? Squared multiple correlations?
>>
>>(I'm not sure why you would create a single score if you have multiple
>>factors either, but that is another question.)
>>
>>What I do know is that people who know a lot more about CA, MDS, and
>>factor analysis than I do (like Joe Kruskal, Doug Carroll, Willem
>>Heiser, Phipps Arabie, Shizuhiko Nishisato, et al.) follow the
>>class-l and mpsych-l discussion lists.
>>See
>>
>>http://aris.ss.uci.edu/smp/mpsych.html
>>
>>and
>>
>>http://www.classification-society.org/csna/lists.html#class-l
>>
>>Art Kendall
>>[hidden email] <mailto:[hidden email]>

>>

>>

>>

>>Kooij, A.J. van der wrote:
>>
>>>... trouble because any category of each original census question would be
>>>an exact linear function of the remaining categories of the question.
>>
>>Yes, but this gives trouble in regression, not in PCA, as far as I know.
>>
>>>In the indicator matrix, one category will have zeroes on all indicator
>>>variables.
>>
>>No, and, sorry, I was confused with CA on an indicator matrix, but this is
>>"sort of" PCA. See the syntax below (object scores = component scores are equal
>>to row scores in CA; category quantifications are equal to column scores in CA).
>>
>>Regards,
>>
>>Anita.
>>

>>data list free/v1 v2 v3.
>>begin data.
>>1 2 3
>>2 1 3
>>2 2 2
>>3 1 1
>>2 3 4
>>2 2 2
>>1 2 4
>>end data.
>>
>>Multiple Correspondence v1 v2 v3
>>/analysis v1 v2 v3
>>/dim=2
>>/critit .0000001
>>/print discrim quant obj
>>/plot none.
>>
>>catpca v1 v2 v3
>>/analysis v1 v2 v3 (mnom)
>>/dim=2
>>/critit .0000001
>>/print quant obj
>>/plot none.
>>
>>data list free/v1cat1 v1cat2 v1cat3 v2cat1 v2cat2 v2cat3 v3cat1 v3cat2
>>v3cat3 v3cat4 .
>>begin data.
>>1 0 0 0 1 0 0 0 1 0
>>0 1 0 1 0 0 0 0 1 0
>>0 1 0 0 1 0 0 1 0 0
>>0 0 1 1 0 0 1 0 0 0
>>0 1 0 0 0 1 0 0 0 1
>>0 1 0 0 1 0 0 1 0 0
>>1 0 0 0 1 0 0 0 0 1
>>end data.
>>
>>CORRESPONDENCE
>> TABLE = all (7,10)
>> /DIMENSIONS = 2
>> /NORMALIZATION = cprin
>> /PRINT = RPOINTS CPOINTS
>> /PLOT = none .
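The second data set in the syntax above is the indicator coding of the first: one 0/1 column per category, with no category omitted. As a cross-check, the short Python sketch below rebuilds the 7 x 10 indicator matrix from the seven raw cases. The category counts (3, 3 and 4) are read off the column names v1cat1 to v3cat4; the function name `indicator_row` is invented for the example.

```python
# The seven raw cases (v1, v2, v3) from the first DATA LIST.
raw = [
    (1, 2, 3), (2, 1, 3), (2, 2, 2), (3, 1, 1),
    (2, 3, 4), (2, 2, 2), (1, 2, 4),
]
n_categories = (3, 3, 4)   # number of categories of v1, v2, v3

def indicator_row(case):
    """Expand one case into 0/1 indicators, one per category of each variable."""
    row = []
    for value, k in zip(case, n_categories):
        row.extend(1 if value == c else 0 for c in range(1, k + 1))
    return row

indicator = [indicator_row(case) for case in raw]
for row in indicator:
    print(*row)
```

Each printed row contains exactly three ones (one per variable), which is the linear dependency among the columns discussed in this thread, and the rows match the second data block line for line.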

>>

>>

>>

>>

>>

>>

>>

>>________________________________
>>
>>From: SPSSX(r) Discussion on behalf of Hector Maletta
>>Sent: Thu 17/08/2006 19:56
>>To: [hidden email] <mailto:[hidden email]>
>>Subject: Re: Reference category for dummies in factor analysis
>>

>>Thank you, Anita. I will certainly look into your suggestion about CATPCA.
>>However, I suspect some mathematical properties of the scores generated by
>>CATPCA are not the ones I hope to have in our scale, because of the
>>non-parametric nature of the procedure (too long to explain here, and I am not
>>sure I understand it myself).
>>As for your second idea, I think that if you try to apply PCA to dummies without
>>omitting any category you would run into trouble, because any category of each
>>original census question would be an exact linear function of the remaining
>>categories of the question. In the indicator matrix, one category will have
>>zeroes on all indicator variables, and that one is the "omitted" category.
>>Hector

>>

>>

>>

>>

>>

>>-----Original Message-----
>>From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
>>Kooij, A.J. van der
>>Sent: Thursday, August 17, 2006 2:37 PM
>>To: [hidden email] <mailto:[hidden email]>
>>Subject: Re: Reference category for dummies in factor analysis

>>

>>

>>

>>CATPCA (in the Data Reduction menu, under Optimal Scaling) is PCA for
>>categorical variables (ordered/ordinal and unordered/nominal); there is no
>>need to use dummies then.
>>When using PCA on dummies, I think you should not omit any dummies (for nominal
>>variables you can do PCA on an indicator matrix, whose columns can
>>be regarded as dummy variables: one column for each category, thus without
>>omitting one).
>>
>>Regards,
>>Anita van der Kooij
>>Data Theory Group
>>Leiden University.

>>

>>

>>

>>________________________________
>>
>>From: SPSSX(r) Discussion on behalf of Hector Maletta
>>Sent: Thu 17/08/2006 17:52
>>To: [hidden email] <mailto:[hidden email]>
>>Subject: Reference category for dummies in factor analysis

>>

>>

>>

>>

>>

>>

>>

>>Dear colleagues,
>>
>>I am re-posting (slightly rephrased for added clarity) a question I sent to
>>the list about a week ago without eliciting any response as yet. I hope some
>>factor analysis experts may be able to help.
>>
>>In a research project on which we work together, a colleague of mine
>>constructed a scale based on factor scores obtained through classical factor
>>analysis (principal components) of a number of categorical census variables,
>>all transformed into dummies. The variables concerned the standard of living
>>of households and included quality of dwelling and basic services such as
>>sanitation, water supply, electricity and the like. (The scale was not
>>simply the score on the first factor, but the average score over several
>>factors, weighted by their respective contributions to explaining the overall
>>variance of the observed variables, but this is, I surmise, beside the point.)

>>

>>

>>

>>Now, he found out that the choice of reference or "omitted" category for
>>defining the dummies has an influence on results. He first ran the analysis
>>using the first category of all categorical variables as the reference
>>category, and then repeated the analysis using the last category as the
>>reference or omitted category, whatever they might be. He found that the
>>resulting scale varied not only in absolute value but also in the shape of
>>its distribution.

>>

>>

>>

>>I can understand that the absolute value of the factor scores may change, and
>>that even the ranking of the categories of the various variables (in terms of
>>their average scores) may be different, since after all the list of
>>dummies used has varied and the categories are tallied each time against a
>>different reference category. But the shape of the scale distribution should
>>not change, I guess, especially not in a drastic manner. In this case the
>>shape of the scale frequency distribution did change. Both distributions
>>were roughly normal, with a kind of "hump" on one side, one of them on the
>>left and the other on the right, probably due to the change in reference
>>categories, but also with changes in the range of the scale and other
>>details.

>>

>>

>>

>>Also, he found that the two scales were not perfectly correlated and,
>>moreover, that their correlation was negative. That the correlation was
>>negative may be understandable: the first category in such census variables
>>is usually a "good" one (for instance, a home with walls made of brick or
>>concrete) and the last one is frequently a "bad" one (earthen floor) or a
>>residual heterogeneous one including bad options ("other" kinds of roof).
>>But since the two scales are just different combinations of the same
>>categorical variables based on the same statistical treatment of their given
>>covariance matrix, one should expect a closer, indeed a perfect, correlation,
>>even if a negative one is possible for the reasons stated above. Changing
>>the reference category should be like changing the unit of measurement or
>>the position of the zero point (like passing from Celsius to Fahrenheit), a
>>decision that does not affect the correlation coefficient with other
>>variables. In this case, instead, the two scales had r = -0.54, implying they
>>shared only 29% of their variance, even in the extreme case when ALL the
>>possible factors (as many as variables) were extracted and all their scores
>>averaged into the scale, and therefore the entire variance, common or
>>specific, of the whole set of variables was taken into account.
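The Celsius/Fahrenheit analogy and the 29% figure can both be checked with a small sketch. The numbers below are made up purely for illustration and the helper `pearson` is not from the thread. It shows that a change of unit and zero point leaves a correlation untouched, that reversing the direction of a scale only flips its sign, and that r = -0.54 corresponds to r squared of about 0.29.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Made-up values, for illustration only.
celsius = [12.0, 18.5, 21.0, 25.5, 30.0]
other = [3.1, 4.0, 4.4, 5.9, 6.2]

# Change of unit and zero point: an increasing linear transformation.
fahrenheit = [c * 9 / 5 + 32 for c in celsius]
r_c = pearson(celsius, other)
r_f = pearson(fahrenheit, other)
print(abs(r_c - r_f) < 1e-12)                            # True

# Reversing the direction of the scale flips the sign and nothing else.
reversed_scale = [-c for c in celsius]
print(abs(pearson(reversed_scale, other) + r_c) < 1e-12)  # True

# Shared variance implied by the reported correlation.
print(round((-0.54) ** 2, 4))                            # 0.2916
```

If swapping reference categories acted on the final scale like such a linear change, the two scales would correlate at exactly +1 or -1; a value of -0.54 indicates that the two runs weight the underlying categories differently.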

>>

>>

>>

>>I should add that the dataset was a large sample of census data, and all the
>>results were statistically significant.
>>
>>Any ideas why choosing different reference categories for dummy conversion
>>could have such an impact on results? I would greatly appreciate your thoughts
>>in this regard.
>>
>>Hector

>>

>>

>>

>>

>>

>>

>>


Hector Maletta

Re: Reference category for dummies in factor analysis

Stephen,

Yes, and apparently yes. As I explained before, the actual job of computing
the scale (based on factor analysis of categorical variables converted into
dummy variables with different reference categories) was carried out on a
census sample of his own country by a colleague of mine, who works with me on
the same project and lives, as it happens, abroad. He just sent me the
average scores for various groupings of cases (by region, sector, and so
on), and the overall frequency distribution of the scale. Those scores were,
as I already explained, not the scores on one individual factor but an average
over the scores on many factors, weighted by their contributions to overall
variance. He did not send me the actual dataset with the original variables
and the factor scores for individual households (too big to send over the
internet, I suppose), but I figure the only way those averages can possibly
differ when the reference category is changed is that the individual
scores also differ, isn't that so?
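For readers who want to see the construction of the scale spelled out, here is a minimal sketch. It assumes that "contribution to overall variance" means each factor's eigenvalue divided by the sum of the eigenvalues of the extracted factors; the thread does not give the exact formula, and the function name and all numbers below are hypothetical.

```python
def weighted_scale(scores, eigenvalues):
    """Average each case's factor scores, weighting every factor by its
    share of the total variance (eigenvalue / sum of eigenvalues)."""
    total = sum(eigenvalues)
    weights = [ev / total for ev in eigenvalues]
    return [sum(w * s for w, s in zip(weights, case)) for case in scores]

# Hypothetical eigenvalues for three extracted factors.
eigenvalues = [2.0, 1.0, 1.0]

# Hypothetical factor scores: one tuple per household, one entry per factor.
scores = [
    (1.0, 0.0, -1.0),
    (-0.5, 2.0, 0.5),
    (0.0, 0.0, 0.0),
]

print(weighted_scale(scores, eigenvalues))   # [0.25, 0.375, 0.0]
```

Because both the weights and the scores come out of the factor solution, anything that changes the eigenvalues or the components, such as a different set of dummies, will in general change the scale for individual households and therefore its group averages.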

I asked him to check on several issues (which factor scores are different,
whether the differences seem substantial or are just due to rounding error in
very small values close to the lower limit of SPSS precision, a low number of
cases in the reference category, and so on). He has made some checks but
others are still pending (the poor soul is too busy with other things and
hasn't got around to them yet). I have had no time these last few days
(besides sustaining this discussion and doing a number of other things) to run
the analysis myself with different reference categories on some of my own
datasets and see what happens.

If indeed the choice of reference category, which is essentially arbitrary,
causes serious indeterminacy in the resulting scale, we would consider
abandoning FACTOR and using CATPCA, which under the same conditions gives a
unique solution with (apparently) the same properties. However, since we
need the resulting scale to have certain mathematical properties consistent
with aspects of our project involving index number theory and some areas of
neoclassical economics (welfare theory), that shift to the non-parametric
realm of iterative algorithms in which CATPCA thrives may or may not be
suitable, and we will have to consider it carefully.

Hector

-----Original Message-----
From: Statisticsdoc [mailto:[hidden email]]
Sent: Friday, August 18, 2006 11:54 PM
To: Hector Maletta; [hidden email]
Subject: RE: Reference category for dummies in factor analysis

Hector,

This has been a fascinating thread. I am just curious - have you computed
factor scores from the different solutions? Do the same cases get different
factor scores when different reference categories are used?

Best,

Stephen Brand

For personalized and professional consultation in statistics and research
design, visit
www.statisticsdoc.com

-----Original Message-----

From: SPSSX(r) Discussion [mailto:[hidden email]]On Behalf Of

Hector Maletta

Sent: Friday, August 18, 2006 2:16 PM

To: [hidden email]

Subject: Re: Reference category for dummies in factor analysis

Art,

As I told you before, and you may have seen in this thread, I have been

exchanging messages on this topic with Anita van der Kooij, who along with

Jacqueline Meulman and others at Leiden authored CATPCA, CATREG and other

similar procedures based on optimal scaling and alternating least squares.

Now I have gained a lot of enlightenment on this matter (as I hope also

others in the list have done). As yourself, I have also no experience myself

with these procedures.

My original question remains, alas, unanswered: why the SPSS FACTOR

procedure, applied to a number of categorical variables converted into

dummies, would yield different results depending on which category is used

as the reference category in each variable. Not the trivially different

results resulting from the different contrast but non-trivial ones such as

the shape of the distribution etc.

Now we are tilting towards using CATPCA instead of FACTOR as the starting

point of our analysis, if only we could find out whether the mathematical

properties of the solution fit well with our analytical purposes. The

prospects in that regard are so far quite promising.

Hector

-----Mensaje original-----

De: SPSSX(r) Discussion [mailto:[hidden email]] En nombre de Art

Kendall

Enviado el: Friday, August 18, 2006 2:49 PM

Para: [hidden email]

Asunto: Re: Reference category for dummies in factor analysis

In an INDSCAL approach one finds dimensions that are common to the

overall set of matrices. The common matrix is the same as that found in

a simple MDS. However, each matrix has measures of how much use is

made of each dimension in each matrix. What I was thinking that access

to fresh vegetables may be less defining in rural areas than in urban

areas, or numbers of rooms in temperate micro-climates where much

household activity can be done outdoors. However, you might also be

able to get at that by clustering localities based on scores on the

first so many factors.

The Leiden people built on the work of the people I mentioned and from

what I have seen at the Classification society and other sources their

work is outstanding. I haven't yet had an opportunity to use CATPCA or

CATREG, but they look promising for many purposes. My understanding is

that CATPCA produces interval level scores that are interpreted much as

those in PCA. It seems that if you have a solid basis to construct your

measure you should be ok. If you come across anything that

compares/contrasts Scores from PCA, CATPCA, and traditional scores from

factor analysis using unit weights, I would appreciate it if you would

pass the citation along.

Art

Social Research Consultants

[hidden email]

Hector Maletta wrote:

>Art,

>Interesting thoughts, as usual from you. I didn't know about the higher

>precision of exponents. About non parametric factor analysis based on

>alternating least squares I have received excellent advice from one of the

>major figures in the field, Anita van der Kooij (one of the developers of

>the Categories SPSS module where these procedures are included). Of course

>in those approaches you do not drop any category, all categories are

>quantified and you get a unique set of factor scores, so my problem

>disappears. Since our project involves using the scores in a model

involving

>index number theory and neoclassical economics, we are now analyzing

whether

>by using categorical factor analysis we may lose some useful mathematical

>properties of PCA.

>Regarding your idea of differentiating by region or sector: Of course, this

>kind of scale is computed in order to be used in analytical and practical

>applications involving geographical breakdown (e.g. to improve or refine

>targeting of social programs) and other analytical subdivisions (such as

>sector of employment). For some applications it makes sense to compute a

>region-based scale and for other purposes a nationwide one. In our case we

>are working in the context of nationally adopted goals of equitable

>development within so-called United Nations Millennium Development Goals,

>and therefore the standards for of the standard of living should be set at

>national level, because governments establish such goals for all the people

>of their nations. However, in equations where the scale is a predictor,

>regions are certainly another likely predictor along with other variables

of

>interest (employment and education level of household adults, say), to

>predict outcomes such as children dropping out of school or child mortality

>which figure outstandingly in Millennium Goals.

>Hector

>

>-----Mensaje original-----

>De: SPSSX(r) Discussion [mailto:[hidden email]] En nombre de Art

>Kendall

>Enviado el: Friday, August 18, 2006 11:53 AM

>Para: [hidden email]

>Asunto: Re: Reference category for dummies in factor analysis

>

>If you think that you might have been approaching a zero determinant,

>you might try a pfa and see what the communalities look like.

>Obviously the determinant would be zero if all of the categories for a

>single variable had dummies.

>Also, the 16 or so decimal digits is only in the mantissa, the exponent

>can go to something like 1022 or 1023 depending on the sign.

>

>It really would be interesting to hear what some of the inventors of

>extensions to FA, like CA,INDSCAL, ALSCAL, PROXSCAL have to say about

>your problem. I would urge you to post a description of what you are

>trying to do and the problem you ran into on the lists I mentioned..

>

>Just to stir the pot. Is it possible that the relation of the variables

>would vary by market areas e.g., manufacturing zones, fishing coastal,

>vs field agricultural vs herding etc.?

>So that something like an INDSCAL with a correlation or other variable

>similarity measure per "region" would be informative and fit the data

>even better?

>

>Art

>Social Research Consultants

>[hidden email]

>

>Hector Maletta wrote:

>

>

>

>>Art,

>>

>>Thanks for your interesting response. We used PCA, with 1.00

>>commonality, i.e. extracting 100% variance. By "classical" I meant

>>parametric factor analysis and not any form of optimal scaling or

>>alternating least squares forms of data reduction. I am now

>>considering these other alternatives under advice of Anita van der Kooij.

>>

>>

>>

>>I share your uncertainty about why results change depending on which

>>category is omitted, and that was the original question starting this

>>thread. Since nobody else seems to have an answer I will offer one

>>purely numerical hypothesis. The exercise with the varying results, it

>>turns out, was done by my colleague not with the entire census but

>>with a SAMPLE of the Peru census (about 200,000 households, still a

>>lot but perhaps not so much for so many variables and factors), and

>>the contributions of latter factors were pretty small. SPSS provides,

>>as is well known, a precision no higher than 15 decimal places approx.

>>So it is just possible that some matrix figures for some of the minor

>>factors differed only on the 15th or 16th decimal place (or further

>>down), and then were taken as equal, and this may have caused some

>>matrix to be singular or (most probably) near singular, and the

>>results to be still computable but unstable. Moreover, some of the

>>categories in census questions used as reference or omitted categories

>>were populated by very few cases, which may have compounded the

>>problem. Since running this on the entire census (which would enhance

>>statistical significance and stability of results) takes a lot of

>>computer time and has to be done several times with different

>>reference categories, we have not done it yet but will proceed soon

>>and report back. But I wanted to know whether some mathematical reason

>>existed for the discrepancy.

>>

>>

>>

>>About why I would want to create a single score out of multiple

>>factors, let us leave it for another occasion since it is a rather

>>complicated story of a project connecting factor analysis with index

>>number theory and economic welfare theory.

>>

>>

>>

>>Hector

>>

>>

>>

>>------------------------------------------------------------------------

>>

>>De: Art Kendall [mailto:[hidden email]]

>>Enviado el: Friday, August 18, 2006 9:34 AM

>>Para: Kooij, A.J. van der; [hidden email]

>>CC: [hidden email]

>>Asunto: Re: Reference category for dummies in factor analysis

>>

>>

>>

>>This has been an interesting discussion. I don't know why the FA and

>>scores would change depending on which category is omitted. Were

>>there errors in recoding to dummies that could have created different

>>missing values?

>>

>>

>>You also said classical FA, but then said PCA. What did you use for

>>communality estimates.? 1.00? Squared multiple correlations?

>>

>>(I'm not sure why you would create a single score if you have multiple

>>factors either, but that is another question.)

>>

>>What I do know is that people who know a lot more about CA, MDS, and

>>factor analysis than I do ( Like Joe Kruskal, Doug Carroll, Willem

>>Heisser, Phipps Arabie, Shizuhiko Nishimoto, et al) follow the

>>class-l and mpsych-l discussion lists.

>>see

>>

>>http://aris.ss.uci.edu/smp/mpsych.html

>>

>>and

>>

>>http://www.classification-society.org/csna/lists.html#class-l

>>

>>Art Kendall

>>[hidden email] <mailto:[hidden email]>

>>

>>

>>

>>Kooij, A.J. van der wrote:

>>

>>

>>

>>>... trouble because any category of each original census question would

be

>>>

>>>

>an exact linear

>

>

>>>function of the remaining categories of the question.

>>>

>>>

>>>

>>>

>>>

>>Yes, but this gives trouble in regression, not in PCA, as far as I know.

>>

>>

>>

>>

>>

>>

>>

>>>In the indicator matrix, one category will have zeroes on all indicator
>>>variables.

>

>

>>>

>>>

>>>

>>No, and, sorry, I was confused with CA on the indicator matrix, but this is
>>"sort of" PCA. See syntax below (object scores = component scores are equal
>>to the row scores from CA; category quantifications are equal to the column
>>scores from CA).

>

>

>>Regards,

>>

>>Anita.

>>

>>

>>

>>

>>

>>data list free/v1 v2 v3.

>>

>>

>>

>>begin data.

>>

>>

>>

>>1 2 3

>>

>>

>>

>>2 1 3

>>

>>

>>

>>2 2 2

>>

>>

>>

>>3 1 1

>>

>>

>>

>>2 3 4

>>

>>

>>

>>2 2 2

>>

>>

>>

>>1 2 4

>>

>>

>>

>>end data.

>>

>>

>>

>>

>>

>>

>>

>>Multiple Correspondence v1 v2 v3

>>

>>

>>

>>/analysis v1 v2 v3

>>

>>

>>

>>/dim=2

>>

>>

>>

>>/critit .0000001

>>

>>

>>

>>/print discrim quant obj

>>

>>

>>

>>/plot none.

>>

>>

>>

>>

>>

>>

>>

>>catpca v1 v2 v3

>>

>>

>>

>>/analysis v1 v2 v3 (mnom)

>>

>>

>>

>>/dim=2

>>

>>

>>

>>/critit .0000001

>>

>>

>>

>>/print quant obj

>>

>>

>>

>>/plot none.

>>

>>

>>

>>

>>

>>

>>

>>data list free/v1cat1 v1cat2 v1cat3 v2cat1 v2cat2 v2cat3 v3cat1 v3cat2
>> v3cat3 v3cat4.

>

>

>>

>>begin data.

>>

>>

>>

>>1 0 0 0 1 0 0 0 1 0

>>

>>

>>

>>0 1 0 1 0 0 0 0 1 0

>>

>>

>>

>>0 1 0 0 1 0 0 1 0 0

>>

>>

>>

>>0 0 1 1 0 0 1 0 0 0

>>

>>

>>

>>0 1 0 0 0 1 0 0 0 1

>>

>>

>>

>>0 1 0 0 1 0 0 1 0 0

>>

>>

>>

>>1 0 0 0 1 0 0 0 0 1

>>

>>

>>

>>end data.

>>

>>

>>

>>

>>

>>

>>

>>CORRESPONDENCE

>>

>>

>>

>> TABLE = all (7,10)

>>

>>

>>

>> /DIMENSIONS = 2

>>

>>

>>

>> /NORMALIZATION = cprin

>>

>>

>>

>> /PRINT = RPOINTS CPOINTS

>>

>>

>>

>> /PLOT = none .
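
The indicator matrix in the second "data list" above can be derived mechanically from the raw v1 v2 v3 values in the first one. The following is only a minimal Python sketch of that recoding, not part of Anita's syntax; the names indicator_row and n_categories are made up for illustration, and the category counts (3, 3, 4) are read off the column names v1cat1 ... v3cat4.

```python
# Raw data from the "data list free/v1 v2 v3" listing in the thread.
rows = [
    (1, 2, 3),
    (2, 1, 3),
    (2, 2, 2),
    (3, 1, 1),
    (2, 3, 4),
    (2, 2, 2),
    (1, 2, 4),
]

# Assumed number of categories per variable: v1 has 3, v2 has 3, v3 has 4.
n_categories = (3, 3, 4)


def indicator_row(case, n_categories):
    """Expand one case into 0/1 indicators, one column per category (none omitted)."""
    out = []
    for value, k in zip(case, n_categories):
        out.extend(1 if value == cat else 0 for cat in range(1, k + 1))
    return out


indicator = [indicator_row(case, n_categories) for case in rows]

# Print in the same layout as the begin data / end data block.
for line in indicator:
    print(" ".join(str(x) for x in line))
```

Every row of the result contains exactly three ones, one per original variable, which is the property that makes the columns of each variable linearly dependent.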

>>

>>

>>

>>

>>

>>

>>

>>________________________________

>>

>>

>>

>>From: SPSSX(r) Discussion on behalf of Hector Maletta

>>

>>Sent: Thu 17/08/2006 19:56

>>

>>To: [hidden email] <mailto:[hidden email]>

>>

>>Subject: Re: Reference category for dummies in factor analysis

>>

>>

>>

>>

>>

>>

>>

>>Thank you, Anita. I will certainly look into your suggestion about CATPCA.

>>

>>However, I suspect some mathematical properties of the scores generated by

>>

>>CATPCA are not the ones I hope to have in our scale, because of the

>>

>>non-parametric nature of the procedure (too long to explain here, and not

>>

>>sure of understanding it myself).

>>

>>As for your second idea, I think if you try to apply PCA on dummies without
>>omitting any category you'd run into trouble, because any category of each
>>original census question would be an exact linear function of the remaining
>>categories of the question. In the indicator matrix, one category will have
>>zeroes on all indicator variables, and that one is the "omitted" category.

>>

>>Hector
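
The two codings being contrasted here can be told apart with a toy example. This is a hedged Python sketch using made-up values for a single three-category variable (nothing below comes from the census data): with the full indicator coding every case has exactly one 1, so the columns are exactly linearly dependent and no category is all zeroes; only after one column is dropped does the reference category show zeroes on all retained dummies.

```python
# One nominal variable with three categories, six made-up cases.
values = [1, 2, 2, 3, 1, 3]
categories = [1, 2, 3]

# Full indicator coding: one 0/1 column per category, none omitted.
full = [[1 if v == c else 0 for c in categories] for v in values]

# Each row sums to 1, so any column equals 1 minus the sum of the others:
# an exact linear dependence among the dummies of the same variable.
row_sums = [sum(r) for r in full]

# Reference coding: drop the column of the reference category (here category 1).
reference = 1
reduced = [[1 if v == c else 0 for c in categories if c != reference]
           for v in values]

# Cases in the reference category are those with zeroes on all retained dummies.
all_zero_cases = [v for v, r in zip(values, reduced) if sum(r) == 0]

print(row_sums)
print(all_zero_cases)
```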

>>

>>

>>

>>

>>

>>-----Original Message-----
>>From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
>>Kooij, A.J. van der
>>Sent: Thursday, August 17, 2006 2:37 PM
>>To: [hidden email] <mailto:[hidden email]>
>>Subject: Re: Reference category for dummies in factor analysis

>>

>>

>>

>>CATPCA (in the Data Reduction menu, under Optimal Scaling) is PCA for
>>categorical variables (ordered/ordinal and unordered/nominal); no need to
>>use dummies then.
>>
>>When using PCA on dummies, I think you should not omit any dummy: for nominal
>>variables you can do PCA on an indicator matrix, whose columns can be
>>regarded as dummy variables, one column for each category, thus without
>>omitting one.

>>

>>

>>

>>Regards,

>>

>>Anita van der Kooij

>>

>>Data Theory Group

>>

>>Leiden University.

>>

>>

>>

>>________________________________

>>

>>

>>

>>From: SPSSX(r) Discussion on behalf of Hector Maletta

>>

>>Sent: Thu 17/08/2006 17:52

>>

>>To: [hidden email] <mailto:[hidden email]>

>>

>>Subject: Reference category for dummies in factor analysis

>>

>>

>>

>>

>>

>>

>>

>>Dear colleagues,

>>

>>

>>

>>I am re-posting (slightly rephrased for added clarity) a question I sent to
>>the list about a week ago, which has not elicited any response as yet. I hope
>>some factor analysis experts may be able to help.

>>

>>

>>

>>In a research project on which we work together, a colleague of mine
>>constructed a scale based on factor scores obtained through classical factor
>>analysis (principal components) of a number of categorical census variables,
>>all transformed into dummies. The variables concerned the standard of living
>>of households and included quality of dwelling and basic services such as
>>sanitation, water supply, electricity and the like. (The scale was not
>>simply the score for the first factor, but the average score of several
>>factors, weighted by their respective contributions to explaining the overall
>>variance of the observed variables, but this is, I surmise, beside the point.)

>>

>>

>>

>>Now, he found out that the choice of reference or "omitted" category for
>>defining the dummies has an influence on the results. He first ran the
>>analysis using the first category of every categorical variable as the
>>reference category, and then repeated the analysis using the last category
>>as the reference or omitted category, whatever those categories might be. He
>>found that the resulting scale varied not only in absolute value but also in
>>the shape of its distribution.

>>

>>

>>

>>I can understand that the absolute value of the factor scores may change, and
>>that even the ranking of the categories of the various variables (in terms of
>>their average scores) may be different, since after all the list of dummies
>>used has varied and the categories are tallied each time against a different
>>reference category. But the shape of the scale distribution should not
>>change, I guess, especially not in a drastic manner. In this case the shape
>>of the scale frequency distribution did change. Both distributions were
>>roughly normal, with a kind of "hump" on one side, one of them on the left
>>and the other on the right, probably due to the change in reference
>>categories, but there were also changes in the range of the scale and other
>>details.

>>

>>

>>

>>Also, he found that the two scales were not perfectly correlated and,
>>moreover, that their correlation was negative. That the correlation was
>>negative may be understandable: the first category in such census variables
>>is usually a "good" one (for instance, a home with walls made of brick or
>>concrete) and the last one is frequently a "bad" one (earthen floor) or a
>>residual heterogeneous one including bad options ("other" kinds of roof).
>>But since the two scales are just different combinations of the same
>>categorical variables, based on the same statistical treatment of their given
>>covariance matrix, one should expect a closer, indeed a perfect correlation,
>>even if a negative one is possible for the reasons stated above. Changing
>>the reference category should be like changing the unit of measurement or
>>the position of the zero point (like passing from Celsius to Fahrenheit), a
>>decision that does not affect the correlation coefficient with other
>>variables. In this case, instead, the two scales had r = -0.54, implying they
>>shared only 29% of their variance. This held even in the extreme case when
>>ALL the possible factors (as many as variables) were extracted and all their
>>scores averaged into the scale, so that the entire variance, common or
>>specific, of the whole set of variables was taken into account.
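
The expectation stated in this paragraph rests on two small facts that can be checked numerically: a Pearson correlation is unchanged by a change of unit and zero point, it only flips sign when the scale is reversed, and r = -0.54 corresponds to r squared of about 0.29. Below is a minimal Python sketch with made-up numbers (not data from the census sample); the function name pearson and all the values are illustrative assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up illustrative values, not data from the thread.
celsius = [12.0, 15.5, 9.0, 21.0, 18.5, 14.0]
other = [3.1, 4.0, 2.2, 5.9, 4.8, 3.5]

fahrenheit = [c * 9 / 5 + 32 for c in celsius]  # new unit and new zero point
reversed_scale = [-c for c in celsius]          # same scale, opposite direction

r_c = pearson(celsius, other)
r_f = pearson(fahrenheit, other)
r_rev = pearson(reversed_scale, other)

# Share of variance implied by the correlation reported in the thread.
shared = (-0.54) ** 2

print(abs(r_c - r_f) < 1e-9)    # same correlation after the affine change
print(abs(r_rev + r_c) < 1e-9)  # only the sign flips when the scale is reversed
print(round(shared, 2))         # 0.29
```

If switching the reference category were only such a linear rescaling of the final scale, the two scales would correlate at exactly +1 or -1; a value of -0.54 indicates that something more than a rescaling took place.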

>>

>>

>>

>>I should add that the dataset was a large sample of census data, and all the
>>results were statistically significant.

>>

>>

>>

>>Any ideas why choosing different reference categories for dummy conversion
>>could have such an impact on results? I would greatly appreciate your
>>thoughts in this regard.

>>

>>

>>

>>Hector

>>

>>

>>

>>

>>

>>

>>
