It is not sufficiently appreciated that K-means clustering tacitly assumes equal variances and zero covariances among the basis variables within clusters. For correlated basis variables, you need to use finite mixture models. This K-means flaw is also a flaw in Two-Step Cluster. See Wedel and Kamakura's "Market Segmentation." Also see the presentation at http://www.statisticalinnovations.com/articles/kmeans2a.htm

Anthony Babinec
[hidden email]

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L. For a list of commands to manage subscriptions, send the command INFO REFCARD.
Anthony,
I am still unconvinced. The cited presentation assumes that cases belong a priori to two different classes or datasets, and judges clustering procedures by their ability to put each case into its "correct" cluster. But K-means is not a procedure for doing that: it can only put cases together if they are close to each other in their basis-variable values, even if they originally come from two different datasets. In this sense, the supposed "misclassifications" by K-means when one of the datasets involves covariances are due not to the presence of covariance, but to the overlap of cases from the two datasets.

To see this, suppose you have two datasets A and B with the same variables (X, Y), each comprising two distinct "clouds" of cases, say M and N, centered around the values (3,4) and (7,1), and suppose each cloud has zero covariance. The two datasets would overlap, and K-means would of course assign all cases in one mixed cloud to the same cluster, regardless of whether they originated in dataset A or B, simply because all cases in cloud M are close to each other, and likewise for all cases in cloud N. K-means simply measures the distance between points in the basis variables, regardless of their provenance (the "provenance" from dataset A or B being a variable not included in the analysis).

In the case of dataset 4 of the presentation cited in your posting, cases with lower values of X and Y in class 1 may be clustered with cases in class 2 simply because they are close to them, while other cases in class 2 will be put in another cluster. This may be unsatisfactory when you know which class they come from, and you may wish to use a procedure that reproduces their class membership, but that is not the purpose of K-means clustering.
Hector

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Anthony Babinec
Sent: 25 March 2008 16:48
To: [hidden email]
Subject: Cluster analysis procedures in SPSS
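[Editorial sketch:] Hector's two-cloud thought experiment can be made concrete. The plain-Python sketch below (the point coordinates are invented for illustration) builds clouds M and N around (3,4) and (7,1) from cases drawn from both datasets A and B, runs a bare-bones Lloyd's K-means, and confirms that the resulting clusters track the clouds, not the dataset of origin:

```python
# Each record: (x, y, source dataset, cloud). Coordinates are invented:
# cloud M sits near (3,4), cloud N near (7,1), and both clouds contain
# cases from datasets A and B.
points = [
    (2.8, 4.1, 'A', 'M'), (3.2, 3.9, 'A', 'M'),
    (2.9, 4.3, 'B', 'M'), (3.1, 3.8, 'B', 'M'),
    (6.9, 1.2, 'A', 'N'), (7.2, 0.8, 'A', 'N'),
    (6.8, 1.1, 'B', 'N'), (7.1, 0.9, 'B', 'N'),
]

def assign(pts, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)),
                key=lambda k: (x - centers[k][0])**2 + (y - centers[k][1])**2)
            for x, y in pts]

def kmeans(pts, centers, iters=10):
    """Bare-bones Lloyd's algorithm; centers is a list of (x, y) seeds."""
    for _ in range(iters):
        labels = assign(pts, centers)
        for k in range(len(centers)):
            members = [p for p, lab in zip(pts, labels) if lab == k]
            if members:  # keep the old center if a cluster empties out
                centers[k] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return assign(pts, centers)

xy = [(x, y) for x, y, _, _ in points]
labels = kmeans(xy, [(2.8, 4.1), (6.9, 1.2)])  # one seed per cloud

# Which source datasets ended up in each cluster?
datasets_per_cluster = {k: {ds for (_, _, ds, _), lab in zip(points, labels)
                            if lab == k}
                        for k in (0, 1)}
print(labels)                 # [0, 0, 0, 0, 1, 1, 1, 1]: clusters follow the clouds
print(datasets_per_cluster)   # each cluster mixes cases from both A and B
```

As the thread argues, the algorithm recovers the geometric clouds perfectly while being entirely indifferent to the A/B provenance, because provenance is not among the basis variables it measures distance on.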
