http://spssx-discussion.165.s1.nabble.com/Reference-category-for-dummies-in-factor-analysis-tp1070339p1070347.html
overall set of matrices. The common matrix is the same as that found in
a simple MDS. However, each matrix has measures of how much use is
made of each dimension in each matrix. What I was thinking that access
household activity can be done outdoors. However, you might also be
first so many factors.
work is outstanding. I haven't yet had an opportunity to use CATPCA or
CATREG, but they look promising for many purposes. My understanding is
those in PCA. It seems that if you have a solid basis to construct your
measure you should be ok. If you come across anything that
pass the citation along.
>Art,
>Interesting thoughts, as usual from you. I didn't know about the higher
>precision of exponents. About non parametric factor analysis based on
>alternating least squares I have received excellent advice from one of the
>major figures in the field, Anita van der Kooij (one of the developers of
>the Categories SPSS module where these procedures are included). Of course
>in those approaches you do not drop any category, all categories are
>quantified and you get a unique set of factor scores, so my problem
>disappears. Since our project involves using the scores in a model involving
>index number theory and neoclassical economics, we are now analyzing whether
>by using categorical factor analysis we may lose some useful mathematical
>properties of PCA.
>Regarding your idea of differentiating by region or sector: Of course, this
>kind of scale is computed in order to be used in analytical and practical
>applications involving geographical breakdown (e.g. to improve or refine
>targeting of social programs) and other analytical subdivisions (such as
>sector of employment). For some applications it makes sense to compute a
>region-based scale and for other purposes a nationwide one. In our case we
>are working in the context of nationally adopted goals of equitable
>development within the so-called United Nations Millennium Development Goals,
>and therefore the standards for the standard of living should be set at the
>national level, because governments establish such goals for all the people
>of their nations. However, in equations where the scale is a predictor,
>regions are certainly another likely predictor along with other variables of
>interest (employment and education level of household adults, say), to
>predict outcomes such as children dropping out of school or child mortality
>which figure prominently in the Millennium Goals.
>Hector
>
>-----Original message-----
>From: SPSSX(r) Discussion [mailto:[hidden email]] On behalf of Art Kendall
>Sent: Friday, August 18, 2006 11:53 AM
>To: [hidden email]
>Subject: Re: Reference category for dummies in factor analysis
>
>If you think that you might have been approaching a zero determinant,
>you might try a pfa and see what the communalities look like.
>Obviously the determinant would be zero if all of the categories for a
>single variable had dummies.
>Also, the 16 or so decimal digits apply only to the mantissa; the (binary)
>exponent can go to something like -1022 or +1023, depending on the sign.
>
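The precision figures above can be checked against IEEE 754 double precision. The following is a minimal Python sketch (Python is used here only for illustration; it says nothing about how SPSS stores or prints values). It shows that roughly 15 to 16 significant decimal digits live in the mantissa, while the binary exponent separately runs from -1022 up to 1023 for normal numbers.

```python
import sys

info = sys.float_info

# 53-bit mantissa: about 15-16 significant decimal digits.
mantissa_bits = info.mant_dig
reliable_digits = info.dig

# Smallest and largest normal doubles, written with their binary exponents.
smallest_normal = 2.0 ** -1022
largest = (2 - 2.0 ** -52) * 2.0 ** 1023

# A difference beyond the mantissa's precision is lost when added to 1.0 ...
lost = (1.0 + 1e-16 == 1.0)
# ... while a slightly larger one survives.
kept = (1.0 + 1e-15 != 1.0)

print(mantissa_bits, reliable_digits)
print(smallest_normal == info.min, largest == info.max)
print(lost, kept)
```

So two matrix entries that differ only around the 16th significant digit can indeed be stored as equal, regardless of how large or small the numbers themselves are.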
>It really would be interesting to hear what some of the inventors of
>extensions to FA, like CA, INDSCAL, ALSCAL, PROXSCAL have to say about
>your problem. I would urge you to post a description of what you are
>trying to do and the problem you ran into on the lists I mentioned.
>
>Just to stir the pot. Is it possible that the relation of the variables
>would vary by market areas e.g., manufacturing zones, fishing coastal,
>vs field agricultural vs herding etc.?
>So that something like an INDSCAL with a correlation or other variable
>similarity measure per "region" would be informative and fit the data
>even better?
>
>Art
>Social Research Consultants
>
[hidden email]
>
>Hector Maletta wrote:
>
>>Art,
>>
>>Thanks for your interesting response. We used PCA, with 1.00
>>communality, i.e. extracting 100% variance. By "classical" I meant
>>parametric factor analysis and not any form of optimal scaling or
>>alternating least squares forms of data reduction. I am now
>>considering these other alternatives under advice of Anita van der Kooij.
>>
>>I share your uncertainty about why results change depending on which
>>category is omitted, and that was the original question starting this
>>thread. Since nobody else seems to have an answer I will offer one
>>purely numerical hypothesis. The exercise with the varying results, it
>>turns out, was done by my colleague not with the entire census but
>>with a SAMPLE of the Peru census (about 200,000 households, still a
>>lot but perhaps not so much for so many variables and factors), and
>>the contributions of latter factors were pretty small. SPSS provides,
>>as is well known, a precision no higher than 15 decimal places approx.
>>So it is just possible that some matrix figures for some of the minor
>>factors differed only on the 15th or 16th decimal place (or further
>>down), and then were taken as equal, and this may have caused some
>>matrix to be singular or (most probably) near singular, and the
>>results to be still computable but unstable. Moreover, some of the
>>categories in census questions used as reference or omitted categories
>>were populated by very few cases, which may have compounded the
>>problem. Since running this on the entire census (which would enhance
>>statistical significance and stability of results) takes a lot of
>>computer time and has to be done several times with different
>>reference categories, we have not done it yet but will proceed soon
>>and report back. But I wanted to know whether some mathematical reason
>>existed for the discrepancy.
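Besides rounding, one possible mathematical reason can be sketched as follows. This is a minimal Python illustration, not the colleague's actual procedure: it uses the seven-case toy data from the SPSS syntax quoted later in this thread, keeps only the first principal component, and the helper names (`dummy_matrix`, `first_pc_scores`, `centered`) are made up for the example. Switching the reference category is a linear recoding of the same information, but it is not an orthogonal one, so the correlation matrix of the dummies changes and PCA of that matrix need not give the same component scores.

```python
import numpy as np

# Toy data from the MULTIPLE CORRESPONDENCE / CATPCA syntax quoted in this thread.
data = np.array([[1, 2, 3], [2, 1, 3], [2, 2, 2], [3, 1, 1],
                 [2, 3, 4], [2, 2, 2], [1, 2, 4]])

def dummy_matrix(x, drop):
    """0/1 dummies for every column of x, omitting each variable's
    first or last category (drop = "first" or "last")."""
    cols = []
    for j in range(x.shape[1]):
        cats = np.unique(x[:, j])
        keep = cats[1:] if drop == "first" else cats[:-1]
        cols.extend((x[:, j] == c).astype(float) for c in keep)
    return np.column_stack(cols)

def first_pc_scores(x, drop):
    """Scores on the first principal component of the dummies' correlation matrix."""
    d = dummy_matrix(x, drop)
    z = (d - d.mean(axis=0)) / d.std(axis=0)
    eigval, eigvec = np.linalg.eigh(np.corrcoef(d, rowvar=False))
    return z @ eigvec[:, -1]          # eigh sorts eigenvalues in ascending order

def centered(x, drop):
    d = dummy_matrix(x, drop)
    return d - d.mean(axis=0)

# Both codings carry the same information: the centered dummies span the same space.
both = np.hstack([centered(data, "first"), centered(data, "last")])
same_space = (np.linalg.matrix_rank(both)
              == np.linalg.matrix_rank(centered(data, "first")))

# Yet the first-component scores need not be perfectly correlated.
# (The sign of a component is arbitrary, so only the absolute value matters.)
r_all = np.corrcoef(first_pc_scores(data, "first"),
                    first_pc_scores(data, "last"))[0, 1]
v1 = data[:, :1]                      # the first variable on its own
r_v1 = np.corrcoef(first_pc_scores(v1, "first"),
                   first_pc_scores(v1, "last"))[0, 1]
print(same_space, abs(r_all), abs(r_v1))
```

For the first variable alone, the two codings give first-component scores whose absolute correlation is roughly 0.6, so an imperfect correlation between the two scales does not by itself require a numerical precision problem.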
>>
>>About why I would want to create a single score out of multiple
>>factors, let us leave it for another occasion since it is a rather
>>complicated story of a project connecting factor analysis with index
>>number theory and economic welfare theory.
>>
>>Hector
>>
>>
>>
>>------------------------------------------------------------------------
>>
>>From: Art Kendall [mailto:[hidden email]]
>>Sent: Friday, August 18, 2006 9:34 AM
>>To: Kooij, A.J. van der; [hidden email]
>>CC: [hidden email]
>>Subject: Re: Reference category for dummies in factor analysis
>>
>>
>>
>>This has been an interesting discussion. I don't know why the FA and
>>scores would change depending on which category is omitted. Were
>>there errors in recoding to dummies that could have created different
>>missing values?
>>
>>
>>You also said classical FA, but then said PCA. What did you use for
>>communality estimates? 1.00? Squared multiple correlations?
>>
>>(I'm not sure why you would create a single score if you have multiple
>>factors either, but that is another question.)
>>
>>What I do know is that people who know a lot more about CA, MDS, and
>>factor analysis than I do (like Joe Kruskal, Doug Carroll, Willem
>>Heiser, Phipps Arabie, Shizuhiko Nishisato, et al.) follow the
>>class-l and mpsych-l discussion lists.
>>See
>>http://aris.ss.uci.edu/smp/mpsych.html
>>and
>>http://www.classification-society.org/csna/lists.html#class-l
>>
>>Art Kendall
>>[hidden email] <mailto:[hidden email]>
>>
>>
>>
>>Kooij, A.J. van der wrote:
>>
>>
>>
>>>... trouble because any category of each original census question would be
>>>an exact linear function of the remaining categories of the question.
>>
>>Yes, but this gives trouble in regression, not in PCA, as far as I know.
>>
>>>In the indicator matrix, one category will have zeroes on all indicator
>>>variables.
>>
>>No, and, sorry, I was confused with CA on indicator matrix, but this is
>>"sort of" PCA. See syntax below (object scores=component scores are equal
>>to row scores CA, category quantifications equal to column scores CA).
>>
>>Regards,
>>Anita.
>>
>>* Toy data: three nominal variables, seven cases.
>>data list free/v1 v2 v3.
>>begin data.
>>1 2 3
>>2 1 3
>>2 2 2
>>3 1 1
>>2 3 4
>>2 2 2
>>1 2 4
>>end data.
>>
>>* Multiple correspondence analysis of the three variables.
>>Multiple Correspondence v1 v2 v3
>>/analysis v1 v2 v3
>>/dim=2
>>/critit .0000001
>>/print discrim quant obj
>>/plot none.
>>
>>* CATPCA with the variables scaled as multiple nominal (mnom).
>>catpca v1 v2 v3
>>/analysis v1 v2 v3 (mnom)
>>/dim=2
>>/critit .0000001
>>/print quant obj
>>/plot none.
>>
>>* The same data as an indicator matrix, one column per category, none omitted.
>>data list free/v1cat1 v1cat2 v1cat3 v2cat1 v2cat2 v2cat3 v3cat1 v3cat2 v3cat3 v3cat4 .
>>begin data.
>>1 0 0 0 1 0 0 0 1 0
>>0 1 0 1 0 0 0 0 1 0
>>0 1 0 0 1 0 0 1 0 0
>>0 0 1 1 0 0 1 0 0 0
>>0 1 0 0 0 1 0 0 0 1
>>0 1 0 0 1 0 0 1 0 0
>>1 0 0 0 1 0 0 0 0 1
>>end data.
>>
>>* Correspondence analysis of the 7 x 10 indicator matrix.
>>CORRESPONDENCE
>> TABLE = all (7,10)
>> /DIMENSIONS = 2
>> /NORMALIZATION = cprin
>> /PRINT = RPOINTS CPOINTS
>> /PLOT = none .
>>
>>________________________________
>>
>>From: SPSSX(r) Discussion on behalf of Hector Maletta
>>Sent: Thu 17/08/2006 19:56
>>To: [hidden email] <mailto:[hidden email]>
>>Subject: Re: Reference category for dummies in factor analysis
>>
>>Thank you, Anita. I will certainly look into your suggestion about CATPCA.
>>However, I suspect some mathematical properties of the scores generated by
>>CATPCA are not the ones I hope to have in our scale, because of the
>>non-parametric nature of the procedure (too long to explain here, and not
>>sure of understanding it myself).
>>As for your second idea, I think if you try to apply PCA on dummies not
>>omitting any category you'd run into trouble because any category of each
>>original census question would be an exact linear function of the remaining
>>categories of the question. In the indicator matrix, one category will have
>>zeroes on all indicator variables, and that one is the "omitted" category.
>>Hector
>>
>>-----Original message-----
>>From: SPSSX(r) Discussion [mailto:[hidden email]] On behalf of
>>Kooij, A.J. van der
>>Sent: Thursday, August 17, 2006 2:37 PM
>>To: [hidden email] <mailto:[hidden email]>
>>Subject: Re: Reference category for dummies in factor analysis
>>
>>CATPCA (in Data Reduction menu, under Optimal Scaling) is PCA for
>>(ordered/ordinal and unordered/nominal) categorical variables; no need to
>>use dummies then.
>>Using PCA on dummies I think you should not omit dummies (for nominal
>>variables you can do PCA on an indicator matrix (that has columns that can
>>be regarded as dummy variables; a column for each category, thus without
>>omitting one)).
>>
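The linear dependence in a full indicator matrix, and the fact that it blocks matrix inversion but not an eigendecomposition, can be checked numerically. The following is a minimal Python sketch using the 7 x 10 indicator matrix from the CORRESPONDENCE example elsewhere in this thread; the `1e-10` cutoff for treating an eigenvalue as zero is an arbitrary assumption, and the sketch works on the covariance matrix rather than reproducing what any SPSS procedure does internally.

```python
import numpy as np

# Indicator matrix from the CORRESPONDENCE example in this thread:
# 7 cases, columns v1cat1-3, v2cat1-3, v3cat1-4 (no category omitted).
g = np.array([
    [1, 0, 0, 0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 0, 0, 1],
], dtype=float)

# Each variable's category columns add up to 1 in every row, so any one
# category is an exact linear function of the others.
blocks = [slice(0, 3), slice(3, 6), slice(6, 10)]
sums_to_one = all(np.all(g[:, b].sum(axis=1) == 1) for b in blocks)

cov = np.cov(g, rowvar=False)
eigval = np.linalg.eigvalsh(cov)              # ascending order
n_zero = int(np.sum(eigval < 1e-10))          # (numerically) zero eigenvalues
det = np.linalg.det(cov)

# The covariance matrix is singular, which is what breaks matrix inversion
# in regression, but the eigendecomposition itself still goes through.
print(sums_to_one, n_zero, abs(det) < 1e-12, eigval[-1] > 0)
```

The zero eigenvalues simply correspond to components with no variance; the leading components are still well defined.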
>>Regards,
>>Anita van der Kooij
>>Data Theory Group
>>Leiden University.
>>
>>________________________________
>>
>>From: SPSSX(r) Discussion on behalf of Hector Maletta
>>Sent: Thu 17/08/2006 17:52
>>To: [hidden email] <mailto:[hidden email]>
>>Subject: Reference category for dummies in factor analysis
>>
>>Dear colleagues,
>>
>>I am re-posting (slightly re-phrased for added clarity) a question I sent
>>the list about a week ago without eliciting any response as yet. I hope some
>>factor analysis experts may be able to help.
>>
>>In a research project on which we work together, a colleague of mine
>>constructed a scale based on factor scores obtained through classical factor
>>analysis (principal components) of a number of categorical census variables
>>all transformed into dummies. The variables concerned the standard of living
>>of households and included quality of dwelling and basic services such as
>>sanitation, water supply, electricity and the like. (The scale was not
>>simply the score for the first factor, but the average score of several
>>factors, weighted by their respective contribution to explaining the overall
>>variance of observed variables, but this is, I surmise, beside the point.)
>>
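One way to read the scale described above is sketched here in Python. Everything in it is an assumption made for illustration: the six-household dummy data are invented, `k = 2` components are kept, scores are scaled to unit variance, and the weights are each component's share of the variance explained by the retained components. The original computation may have differed in any of these choices.

```python
import numpy as np

# Hypothetical 0/1 dummies (not census data): 6 households, 3 dummy variables.
d = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)

z = (d - d.mean(axis=0)) / d.std(axis=0)          # standardized dummies
eigval, eigvec = np.linalg.eigh(np.corrcoef(d, rowvar=False))
order = np.argsort(eigval)[::-1]                  # largest eigenvalue first
eigval, eigvec = eigval[order], eigvec[:, order]

k = 2                                             # components kept (assumption)
scores = (z @ eigvec[:, :k]) / np.sqrt(eigval[:k])    # unit-variance component scores
weights = eigval[:k] / eigval[:k].sum()           # share of the retained variance
scale = scores @ weights                          # weighted average of the scores

print(weights)
print(scale)
```

Because the weights come from the eigenvalues of the dummies' correlation matrix, anything that changes that matrix, such as a different choice of omitted categories, also changes both the scores and the weights.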
>>Now, he found out that the choice of reference or "omitted" category for
>>defining the dummies has an influence on results. He first ran the analysis
>>using the first category of all categorical variables as the reference
>>category, and then repeated the analysis using the last category as the
>>reference or omitted category, whatever they might be. He found that the
>>resulting scale varied not only in absolute value but also in the shape of
>>its distribution.
>>
>>I can understand that the absolute value of the factor scores may change and
>>even the ranking of the categories of the various variables (in terms of
>>their average scores) may also be different, since after all the list of
>>dummies used has varied and the categories are tallied each time against a
>>different reference category. But the shape of the scale distribution should
>>not change, I guess, especially not in a drastic manner. In this case the
>>shape of the scale frequency distribution did change. Both distributions
>>were roughly normal, with a kind of "hump" on one side, one of them on the
>>left and the other on the right, probably due to the change in reference
>>categories, but also with changes in the range of the scale and other
>>details.
>>
>>Also, he found that the two scales were not perfectly correlated, and
>>moreover, that their correlation was negative. That the correlation was
>>negative may be understandable: the first category in such census variables
>>is usually a "good" one (for instance, a home with walls made of brick or
>>concrete) and the last one is frequently a "bad" one (earthen floor) or a
>>residual heterogeneous one including bad options ("other" kinds of roof).
>>But since the two scales are just different combinations of the same
>>categorical variables based on the same statistical treatment of their given
>>covariance matrix, one should expect a closer, indeed a perfect correlation,
>>even if a negative one is possible for the reasons stated above. Changing
>>the reference category should be like changing the unit of measurement or
>>the position of the zero point (like passing from Celsius to Fahrenheit), a
>>decision not affecting the correlation coefficient with other variables. In
>>this case, instead, the two scales had r = -0.54, implying they shared only
>>29% of their variance, even in the extreme case when ALL the possible
>>factors (as many as variables) were extracted and all their scores averaged
>>into the scale, and therefore the entire variance, common or specific, of
>>the whole set of variables was taken into account.
>>
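The arithmetic behind the 29% figure, and the Celsius-to-Fahrenheit analogy, can be sketched in a few lines of Python. The temperatures and the second variable are simulated and purely hypothetical; nothing here comes from the census data.

```python
import numpy as np

# Shared variance implied by the reported correlation.
r = -0.54
shared = r ** 2
print(round(shared, 4))        # 0.2916, i.e. about 29%

# A change of unit and zero point (Celsius to Fahrenheit) is a linear
# rescaling of a single variable, so its correlation with any other
# variable is unchanged.
rng = np.random.default_rng(0)
celsius = rng.normal(20, 5, size=1000)                 # simulated temperatures
other = 0.5 * celsius + rng.normal(0, 3, size=1000)    # hypothetical second variable
fahrenheit = celsius * 9 / 5 + 32

r_c = np.corrcoef(celsius, other)[0, 1]
r_f = np.corrcoef(fahrenheit, other)[0, 1]
print(np.isclose(r_c, r_f))    # True
```

The analogy holds for rescaling one variable at a time. Swapping the omitted category instead replaces one dummy by a combination of several others, which is a different kind of transformation as far as PCA is concerned.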
>>I should add that the dataset was a large sample of census data, and all the
>>results were statistically significant.
>>
>>Any ideas why choosing different reference categories for dummy conversion
>>could have such impact on results? I would greatly appreciate your thoughts
>>in this regard.
>>
>>Hector