Two Step Clustering Confusion

Two Step Clustering Confusion

Jason McNellis
Dear List,



I have a clustering solution that seems overly dependent on a single
variable.  I am hoping someone on the list can help me understand, or
suggest some analyses that will help me understand what is going on.



I am clustering 70,000 cases with the Two Step procedure in SPSS.  I am
using the default outlier detection and standardizing all continuous
variables.  I am letting the procedure detect the number of clusters
automatically.  When I cluster my cases on 35 variables, of which 1 is
categorical with 13 classes and the remaining 34 are continuous, I get a
solution with 6 clusters plus an outlying cluster.  When I cross-tab these
six clusters by the included categorical variable I find a very strong
association between the two.  For example, in three of the clusters over
99% of the cases share a single value of the categorical variable (a
different value in each cluster).



When I remove the categorical variable I get a two-cluster solution (plus an
outlying cluster).  To my surprise there was very little relationship
between the two-cluster and six-cluster assignments of my 70,000 cases.



This feels like my one categorical variable is driving the overall solution
despite the inclusion of 34 other variables.  This seems like an unstable
clustering solution to me.



Thank you for your thoughts, Jason





Jason McNellis

Educator / Analyst

Re: Two Step Clustering Confusion

Hector Maletta
         Jason,
         Clustering is a heuristic tool, not an analytical or inferential
statistical procedure. There is no "right" clustering solution in a
statistical sense. You may produce (and use or discard) a number of
clustering solutions, according to your research purposes and needs.

         At first sight, your results would suggest that the differences (in
the other variables) between the categories of your one categorical variable
are so distinctive that they force each category into a different
cluster. Whether this is actually so could be ascertained by an analysis of
variance, to determine whether the variance of the other variables BETWEEN
categories of the categorical variable is far greater than the
average variance WITHIN categories.
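
         As a rough illustration of that between/within comparison, here is
a minimal Python sketch for a single continuous variable. The function name
one_way_anova and the toy data are assumptions made for the example, not
anything from SPSS; in practice the same check would be run for each of the
34 continuous variables. With 70,000 cases almost any difference will give a
large F, so the share of variance explained (eta squared) is probably the
more useful number to look at.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, eta_squared) for a dict mapping category -> list of values.

    F is the ratio of the between-category mean square to the
    within-category mean square; eta squared is the share of the total
    sum of squares that lies between categories.
    """
    all_values = [v for vals in groups.values() for v in vals]
    grand_mean = mean(all_values)
    k = len(groups)        # number of categories
    n = len(all_values)    # number of cases

    ss_between = sum(
        len(vals) * (mean(vals) - grand_mean) ** 2 for vals in groups.values()
    )
    ss_within = sum(
        (v - mean(vals)) ** 2 for vals in groups.values() for v in vals
    )

    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, eta_squared

# Toy data (hypothetical): one continuous variable split by three categories.
toy = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}
f_stat, eta_squared = one_way_anova(toy)
print(f_stat, eta_squared)
```

An eta squared close to 1 for many of the continuous variables would support
the idea that the categories really are that distinct; values near 0 would
point instead to the algorithm itself.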

         Another issue is the way two-step clustering proceeds. Its ability
to re-compute cluster centres is relatively limited, as compared for
instance with k-means clustering. Thus it is possible that it starts
assigning the various categories to different clusters, and this initial
allocation is only marginally affected by subsequent calculations based on
the other 34 variables.
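
         To show what is meant by re-computing cluster centres, here is a
minimal sketch of the k-means loop (it is k-means, not the two-step
algorithm itself, whose internals are not reproduced here). The function
name kmeans, the toy points and the deliberately poor starting centres are
all assumptions for the example.

```python
def kmeans(points, centres, max_iter=100):
    """Plain k-means: alternate assignment and centre re-computation."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre
        # (squared Euclidean distance).
        labels = [
            min(
                range(len(centres)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centres[j])),
            )
            for point in points
        ]
        # Update step: each centre becomes the mean of its members.
        new_centres = []
        for j in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centres.append(centres[j])  # keep an empty cluster's centre
        if new_centres == centres:
            break  # nothing moved, so the solution has converged
        centres = new_centres
    return labels, centres

# Two obvious groups, but both starting centres sit inside the first group.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centres = kmeans(points, [(0, 0), (0, 1)])
print(labels, centres)
```

Here the first pass puts three of the four points in one cluster, and the
second pass corrects it once the centres have moved. A procedure with less
scope for this kind of revision would be more likely to keep its initial
allocation.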

         One possibility you may attempt is using the other 34 variables (or
perhaps other usable variables in your dataset) to assign numerical values
to each category of the categorical variable, in effect converting the
categorical variable into an interval one, and only then performing the
clustering exercise. The allocation of numerical values to your categorical
variable could be achieved, for instance, with categorical principal
components analysis (CATPCA) using the 34+1 variables (plus perhaps other relevant variables
that you consider as good predictors of the categorical one). The final
clustering may be done by two step or by k-means.
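
         The general idea of replacing each category with a number derived
from the other variables can be sketched as follows. This is a crude
stand-in, not the optimal-scaling procedure suggested above: it simply
scores each category by the mean of an equally weighted composite of the
standardized continuous variables. The names standardize and
category_scores are hypothetical, and the sketch assumes no continuous
variable is constant.

```python
from statistics import mean, pstdev

def standardize(column):
    """Convert a list of values to z-scores (assumes a non-constant column)."""
    m, s = mean(column), pstdev(column)
    return [(x - m) / s for x in column]

def category_scores(categories, columns):
    """Map each category to the mean of a standardized composite.

    categories: one label per case.
    columns:    list of continuous variables, each a list with one value per case.
    """
    z_columns = [standardize(col) for col in columns]
    # Composite per case: the average of its z-scores (equal weights, an assumption).
    composite = [mean(row) for row in zip(*z_columns)]

    by_category = {}
    for cat, value in zip(categories, composite):
        by_category.setdefault(cat, []).append(value)
    return {cat: mean(vals) for cat, vals in by_category.items()}

# Toy data (hypothetical): four cases, two categories, two continuous variables.
cats = ["a", "a", "b", "b"]
cols = [[1, 2, 3, 4], [2, 4, 6, 8]]
scores = category_scores(cats, cols)
numeric_version = [scores[c] for c in cats]  # interval-level replacement variable
print(scores)
```

The resulting numeric variable can then be entered into the clustering
alongside the continuous variables. An equally weighted composite is only
sensible if the variables point in broadly the same direction, which is one
reason a proper scaling procedure is preferable.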

         Hope this helps.

         Hector

