In a research project we are working on together, a colleague of mine constructed an index from factor scores obtained through classical factor analysis of a number of categorical census variables, all transformed into dummies. The variables concern the standard of living and include quality of dwelling and basic services such as sanitation, water supply, electricity and the like. (The index was not simply the score on the first factor but the average score over several factors, each weighted by its contribution to explaining the overall variance of the observed variables, but this is beside the point.)

He then found that the choice of reference or "omitted" category for defining the dummies influences the results. He first ran the analysis using the first category of every categorical variable as the reference category, and then repeated it using the last category as the reference or omitted category. The resulting scores differed not only in absolute value but also in the shape of their distribution.

I can understand that the absolute value of the factor scores may change, and that even the ranking of the categories of the various variables (in terms of their average scores) may differ, since after all the list of dummies used is different. But I would not expect the shape of the distribution to change, especially not drastically. In this case the two distributions are roughly similar but not identical: each has one sharp density peak, but of different height and at a different place, one around -1 and the other around +1 on the z-score scale, with the rest of the two distributions approximately alike.
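To make the setup concrete, here is a minimal sketch of the comparison on simulated data. Everything in it is an assumption for illustration only: the two three-category variables (`dwelling`, `water`), the latent variable that generates them, and the helper names `dummies` and `first_pc_score` are hypothetical, and the score is just the first principal component of the dummies' correlation matrix, a simplified stand-in for the weighted multi-factor index my colleague actually built.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical latent deprivation level; higher values push a household
# towards the later (poorer or residual) categories of each variable.
latent = rng.normal(size=n)
dwelling = np.digitize(latent + rng.normal(size=n), [-0.5, 1.5])  # codes 0, 1, 2
water = np.digitize(latent + rng.normal(size=n), [0.0, 1.0])      # codes 0, 1, 2


def dummies(codes, omit):
    """One 0/1 column per category of `codes`, leaving out category `omit`."""
    levels = [k for k in np.unique(codes) if k != omit]
    return np.column_stack([(codes == k).astype(float) for k in levels])


def first_pc_score(X):
    """Standardized score on the first principal component of corr(X)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    R = np.corrcoef(Z, rowvar=False)
    _, vecs = np.linalg.eigh(R)   # eigenvalues come back in ascending order
    score = Z @ vecs[:, -1]       # eigenvector of the largest eigenvalue
    return score / score.std()


# Same data, two codings: first category omitted vs. last category omitted.
X_first = np.column_stack([dummies(dwelling, 0), dummies(water, 0)])
X_last = np.column_stack([dummies(dwelling, 2), dummies(water, 2)])

s_first = first_pc_score(X_first)
s_last = first_pc_score(X_last)

# The sign of an eigenvector is arbitrary, so only |r| is meaningful here.
r = np.corrcoef(s_first, s_last)[0, 1]
print(f"correlation between the two scores: {r:.3f}")
```

The point of the sketch is only that the two codings hand the analysis different sets of variables with different correlation matrices, so nothing forces the extracted components to coincide. It does not attempt to reproduce the figures from our census data.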
The two scores were inversely correlated. That is probably to be expected, since the first category in the original census variables often represented a "good" situation, like living in a home of brick or concrete, while the last category was often a poor or residual situation, like living in some "other" kind of nondescript dwelling, probably on the streets or suchlike. But they were far from perfectly correlated, which is what I would have expected given that the two scores are just different combinations of the same categorical variables: their linear correlation coefficient was -0.54, indicating they share only 29% of their variance. The dataset was a large sample of census data, and all the results were statistically significant.

Any ideas why choosing different reference categories for dummy conversion could have such an impact on the results?

Hector
