Cluster analysis procedures in SPSS

Cluster analysis procedures in SPSS

Anthony Babinec
It is not sufficiently appreciated that K-means clustering
tacitly assumes equal variances and zero covariances among the
basis variables within clusters. For correlated basis variables,
you need to use finite mixture models. This flaw of K-means
is also a flaw of Two-Step Cluster. See Wedel and Kamakura's
"Market Segmentation." Also, see the presentation at

  http://www.statisticalinnovations.com/articles/kmeans2a.htm
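
To make the point concrete, here is a small simulation sketch (in Python
with scikit-learn rather than SPSS syntax; the data-generating values are
only illustrative assumptions, not taken from the presentation). One class
has strongly correlated basis variables, the other is a compact spherical
cloud. K-means, which implicitly fits equal-variance, zero-covariance
clusters, tends to cut the elongated class in two, while a Gaussian finite
mixture with full within-cluster covariance matrices can recover the
generating classes.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(42)

# Class 1: strongly correlated basis variables (elongated ellipse).
class1 = rng.multivariate_normal([0, 0], [[4.0, 3.8], [3.8, 4.0]], size=300)
# Class 2: small spherical cloud lying near the long axis of class 1.
class2 = rng.multivariate_normal([4, 4], [[0.4, 0.0], [0.0, 0.4]], size=300)

X = np.vstack([class1, class2])
truth = np.array([0] * 300 + [1] * 300)

# K-means implicitly fits equal-variance, zero-covariance clusters.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# A finite (Gaussian) mixture with full covariance matrices does not.
gm_labels = GaussianMixture(n_components=2, covariance_type="full",
                            random_state=0).fit(X).predict(X)

print("K-means vs. true classes: ", adjusted_rand_score(truth, km_labels))
print("Mixture  vs. true classes:", adjusted_rand_score(truth, gm_labels))

With draws like these, the adjusted Rand index for the mixture solution is
typically much closer to 1 than the one for K-means.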


Anthony Babinec
[hidden email]

Re: Cluster analysis procedures in SPSS

Hector Maletta
Anthony,
I am still unconvinced. The cited presentation assumes that cases belong a
priori to two different classes or datasets, and judges clustering
procedures by their ability to put each case into its "correct" cluster. But
K-means is not a procedure for doing that. It can only put cases together if
they are close to each other in the values of their basis variables, even if
they originally come from two different datasets. In this sense, the
supposed "misclassifications" of cases by K-means when one of the datasets
involves covariances are not due to the presence of covariance, but to the
overlap of cases from the two datasets. To see this, suppose you have two
datasets A and B, with the same variables (X, Y), each of which comprises
two distinct "clouds" of cases, say M and N, centered around the values
(3,4) and (7,1). Suppose each cloud has zero covariance. The two datasets
would overlap, and K-means would of course assign all cases in one mixed
cloud to the same cluster, regardless of whether they originally came from
dataset A or B, simply because all cases in cloud M are close to each other,
and the same for all cases in cloud N. K-means simply measures the distance
between points in the basis variables, regardless of their provenance
(membership in dataset A or B being a variable not included in the
analysis).
In the case of dataset 4 of the presentation cited in your posting, cases
with lower values of X and Y in class 1 will possibly be clustered with
cases in class 2, simply because they are close to them, while other cases
in class 2 will be put in another cluster. This may be unsatisfactory when
you know which class the cases come from, and you may wish to use a
procedure that reproduces their class membership, but that is not the
purpose of K-means clustering.
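
To make the example concrete, here is a small sketch of the two-cloud setup
(Python with scikit-learn rather than SPSS syntax; the within-cloud spreads
and sample sizes are assumed only for illustration). Both datasets A and B
contain clouds M and N, each cloud has zero covariance, and the A/B
provenance is never given to K-means, so the procedure can only recover the
clouds, not the datasets of origin.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def make_dataset(n_per_cloud=100):
    # Cloud M around (3, 4) and cloud N around (7, 1); zero covariance in each.
    m = rng.multivariate_normal([3, 4], [[1.0, 0.0], [0.0, 1.0]], size=n_per_cloud)
    n = rng.multivariate_normal([7, 1], [[1.0, 0.0], [0.0, 1.0]], size=n_per_cloud)
    return np.vstack([m, n]), np.array(["M"] * n_per_cloud + ["N"] * n_per_cloud)

XA, clouds_A = make_dataset()   # dataset A
XB, clouds_B = make_dataset()   # dataset B, generated the same way

X = np.vstack([XA, XB])         # only the basis variables (X, Y) are pooled
clouds = np.concatenate([clouds_A, clouds_B])
provenance = np.array(["A"] * len(XA) + ["B"] * len(XB))

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Cross-tabulate the K-means clusters against cloud (M/N) and against
# dataset provenance (A/B), which is not among the basis variables.
for name, group in (("cloud", clouds), ("dataset", provenance)):
    for value in np.unique(group):
        counts = np.bincount(labels[group == value], minlength=2)
        print(f"{name} {value}: cases per cluster {counts.tolist()}")

Each K-means cluster should line up with one cloud while splitting roughly
evenly between A and B, since the A/B label plays no part in the distances.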

Hector