Dear List,
I have a clustering solution that seems overly dependent on a single variable. I am hoping someone on the list can help me understand what is going on, or suggest some analyses that would.

I am clustering 70,000 cases with the Two Step procedure in SPSS. I am using the default outlier detection, standardizing all continuous variables, and letting the procedure detect the number of clusters automatically.

When I cluster my cases on 35 variables, where 1 is categorical with 13 classes and the remaining 34 are continuous, I get a solution with 6 clusters plus an outlying cluster. When I cross-tab these six clusters by the included categorical variable, I find a very strong association between the two. For example, over 99% of the cases in each of three of the segments share a single value of the categorical variable (a different value for each segment).

When I remove the categorical variable, I get a two-cluster solution (plus an outlying cluster). To my surprise, there was very little relationship between the two-cluster and six-cluster assignments of my 70,000 cases.

It looks as though my one categorical variable is driving the overall solution despite the inclusion of 34 other variables. This seems like an unstable clustering solution to me.

Thank you for your thoughts,
Jason

Jason McNellis
Educator / Analyst
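A rough way to put numbers on the two comparisons described above (cluster membership against the categorical variable, and the six-cluster against the two-cluster assignment) is Cramér's V for the cross-tab and the adjusted Rand index for the pair of partitions. The sketch below is plain Python rather than SPSS syntax. The function names and the toy labels are invented for illustration, and it assumes the cluster memberships have been exported as equal-length lists.

```python
from collections import Counter
from math import comb, sqrt


def cramers_v(a, b):
    """Cramér's V for two label sequences: 0 = no association, 1 = perfect."""
    if len(a) != len(b):
        raise ValueError("label sequences must have the same length")
    n = len(a)
    joint = Counter(zip(a, b))   # observed cell counts of the cross-tab
    row, col = Counter(a), Counter(b)
    k = min(len(row), len(col))
    if k < 2:
        return 0.0               # a single row or column carries no association
    chi2 = 0.0
    for r, n_r in row.items():
        for c, n_c in col.items():
            expected = n_r * n_c / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


def adjusted_rand_index(a, b):
    """Chance-corrected agreement between two partitions of the same cases."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two label sequences of equal length >= 2")
    pairs_joint = sum(comb(v, 2) for v in Counter(zip(a, b)).values())
    pairs_a = sum(comb(v, 2) for v in Counter(a).values())
    pairs_b = sum(comb(v, 2) for v in Counter(b).values())
    expected = pairs_a * pairs_b / comb(len(a), 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0               # both partitions are trivial
    return (pairs_joint - expected) / (maximum - expected)


if __name__ == "__main__":
    # Hypothetical toy labels standing in for exported cluster memberships.
    six_cluster = [1, 1, 2, 2, 3, 3, 4, 4]
    category = ["a", "a", "b", "b", "c", "c", "d", "d"]
    two_cluster = [1, 2, 1, 2, 1, 2, 1, 2]
    print("V(six-cluster, category):", cramers_v(six_cluster, category))
    print("ARI(six-cluster, two-cluster):",
          adjusted_rand_index(six_cluster, two_cluster))
```

A Cramér's V near 1 would confirm that the clusters largely reproduce the categorical variable. An adjusted Rand index near 0 would confirm that the two solutions agree no better than chance.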
|
Jason,
Clustering is a heuristic tool, not an analytical or inferential statistical procedure. There is no "right" clustering solution in a statistical sense. You may produce (and use or discard) a number of clustering solutions, according to your research purposes and needs.

At first sight, your results suggest that the categories of your one categorical variable differ so distinctly on the other variables that each category ends up in its own cluster. Whether this is actually so could be ascertained by an analysis of variance, to determine whether the variance of the other variables BETWEEN categories of the categorical variable is far greater than the average variance WITHIN categories.

Another issue is the way two-step clustering proceeds. Its ability to re-compute cluster centres is relatively limited compared with, for instance, k-means clustering. Thus it is possible that it starts by assigning the various categories to different clusters, and that this initial allocation is only marginally affected by subsequent calculations based on the other 34 variables.

One possibility you may attempt is to use the other 34 variables (or perhaps other usable variables in your dataset) to assign numerical values to each category of the categorical variable, in effect converting the categorical variable into an interval one, and only then perform the clustering exercise. The allocation of numerical values to your categorical variable could be achieved, for instance, with categorical principal components analysis (CATPCA) using the 34+1 variables (plus perhaps other relevant variables that you consider good predictors of the categorical one). The final clustering may be done by two-step or by k-means.

Hope this helps.
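The between/within comparison suggested above can be sketched for one continuous variable at a time as a one-way analysis of variance. This is a minimal illustration in plain Python, not SPSS syntax. The function names and the toy data are assumptions made for the example, and in practice the check would be repeated for each of the 34 continuous variables.

```python
from collections import defaultdict


def sums_of_squares(values, groups):
    """Return (SS between groups, SS within groups, number of groups)."""
    if len(values) != len(groups) or not values:
        raise ValueError("need two non-empty sequences of equal length")
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    grand_mean = sum(values) / len(values)
    ss_between = 0.0
    ss_within = 0.0
    for members in by_group.values():
        group_mean = sum(members) / len(members)
        ss_between += len(members) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in members)
    return ss_between, ss_within, len(by_group)


def eta_squared(values, groups):
    """Share of total variance lying between category means (0 to 1)."""
    ss_between, ss_within, _ = sums_of_squares(values, groups)
    total = ss_between + ss_within
    return ss_between / total if total > 0 else 0.0


def f_statistic(values, groups):
    """One-way ANOVA F: between-group over within-group mean square."""
    ss_between, ss_within, k = sums_of_squares(values, groups)
    n = len(values)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 groups and more cases than groups")
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


if __name__ == "__main__":
    # Hypothetical values of one continuous variable and the category labels.
    x = [1.0, 2.0, 4.0, 5.0]
    cat = ["a", "a", "b", "b"]
    print("eta squared:", eta_squared(x, cat))
    print("F:", f_statistic(x, cat))
```

With 70,000 cases almost any F will be statistically significant, so the effect size (eta squared) is probably the more informative figure here: values close to 1 on many variables would support the idea that the categories really are that distinct.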
Hector

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Jason McNellis
Sent: 19 July 2007 18:53
To: [hidden email]
Subject: Two Step Clustering Confusion
