SPSSX Discussion

Output of criterion for selecting the number of clusters in agglomerative hierarchical cluster analysis

Brenda Zollitsch

Output of criterion for selecting the number of clusters in agglomerative hierarchical cluster analysis

Hi there, SPSS Thinkers,

I am trying to determine whether I have completed the proper SPSS analytical procedures to develop the criterion for selecting the number of clusters in agglomerative hierarchical cluster analysis. I am using squared Euclidean distances, and my output doesn't show any of the scores I expected to see. What output for the criterion should I be seeing?

Thanks!

Brenda

Hector Maletta

Re: Output of criterion for selecting the number of clusters in agglomerative hierarchical cluster analysis

In the meantime, I'll venture a thought: clusters can be based on any of various measures of distance, similarity, or dissimilarity, but choosing the right number of clusters is not about maximizing their mutual distance. It is about minimizing (as far as possible) intra-cluster variance while maximizing (also as far as possible) inter-cluster variance; that is, aiming for internally homogeneous clusters that are as distinct as possible from each other. As this is a problem of minimizing the one while maximizing the other, you can only aim at a balance of the two, i.e. a kind of saddle-point solution.

There is no quick and ready recipe for doing this. Other than running an ANOVA at each number of clusters, with some external (interval-scale) criterion variable, I cannot think of a quick solution right now. Between the extremes of n clusters (one case per cluster) and a single giant cluster with n members, there should be some intermediate point that provides a more satisfactory balance, and that perhaps also makes sense from a theoretical or practical point of view.
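As a rough illustration of the ANOVA idea above, here is a minimal Python sketch that computes a one-way F statistic for an external criterion variable under two candidate cluster solutions. The function name `anova_f`, the criterion values, and the cluster memberships are all made up for illustration; in practice the memberships would come from cutting the hierarchical solution at each number of clusters. It assumes at least two clusters, more cases than clusters, and some within-cluster variation.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic of `values` grouped by cluster `labels`."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n = len(values)
    k = len(groups)
    grand_mean = sum(values) / n
    # Between-groups sum of squares: group size times squared mean deviation.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Within-groups sum of squares: squared deviations from each group mean.
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical external criterion variable, one value per case.
criterion = [10, 12, 11, 20, 22, 21, 30, 31, 29]

# Hypothetical cluster memberships for a 2-cluster and a 3-cluster solution.
labels_k2 = [0, 0, 0, 1, 1, 1, 1, 1, 1]
labels_k3 = [0, 0, 0, 1, 1, 1, 2, 2, 2]

f_k2 = anova_f(criterion, labels_k2)
f_k3 = anova_f(criterion, labels_k3)
print("F at k=2:", f_k2)
print("F at k=3:", f_k3)
```

A larger F only says that the criterion variable separates better at that number of clusters; it is one piece of evidence to weigh, not a rule.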

If no external criterion variable comes to mind, one may indeed think of distances, as you apparently were thinking: minimizing the sum of distances (or squared distances) between the members of each cluster and that cluster's centroid, and at the same time maximizing the sum of distances (or squared distances) between cluster centroids, or between the cluster centroids and the overall centroid of all the cluster centroids. This can be attacked as an ordinary least squares problem.
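The distance-based version can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the points and memberships are invented, the helper names are hypothetical, and the between-cluster term is weighted by cluster size and measured from the centroid of all cases (a common variant, slightly different from the unweighted centroid of centroids mentioned above), so that the within and between sums add up to the total sum of squares.

```python
def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def within_between_ss(points, labels):
    """Return (within-cluster SS, between-cluster SS) for one partition."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    grand = centroid(points)  # centroid of all cases
    wss = 0.0
    bss = 0.0
    for members in clusters.values():
        c = centroid(members)
        wss += sum(sq_dist(p, c) for p in members)
        bss += len(members) * sq_dist(c, grand)
    return wss, bss

# Made-up data: two visibly separated groups of three cases each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

wss1, bss1 = within_between_ss(points, [0, 0, 0, 0, 0, 0])  # one giant cluster
wss2, bss2 = within_between_ss(points, [0, 0, 0, 1, 1, 1])  # two clusters
print("k=1: within =", wss1, "between =", bss1)
print("k=2: within =", wss2, "between =", bss2)
```

Since the total is fixed, every additional cluster lowers the within term and raises the between term, which is why neither can simply be optimized on its own; one looks for the point where further splits stop paying off.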

Hector

From: SPSSX(r) Discussion [mailto:[hidden email]] On behalf of Brenda Zollitsch
Sent: Saturday, March 31, 2012 18:36
To: [hidden email]
Subject: Output of criterion for selecting the number of clusters in agglomerative hierarchical cluster analysis

Jon K Peck

Re: Output of criterion for selecting the number of clusters in agglomerative hierarchical cluster analysis

As Hector has pointed out, clustering is more of an exploratory technique than a cut and dried algorithm. However, if you use the TWOSTEP procedure, it can choose the best number of clusters (as it sees it) under some strong assumptions. That might complement the hierarchical model. And k-means will give you some ANOVA statistics - useful if your variables are all continuous.

Cluster silhouette plots may also give you useful information for choosing the number of clusters. They are produced by the STATS CLUSTER SIL extension command, which can be downloaded from the SPSS Community website (www.ibm.com/developerworks/spssdevcentral). It may not work with very large datasets. None of these methods comes with hard and fast rules, but they may give you some insight into your data.
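For readers who want to see what a silhouette measures, here is a minimal Python sketch of the average silhouette width for a given partition. It is not the STATS CLUSTER SIL implementation; the function name, data, and memberships are invented for illustration. It assumes Euclidean distance, at least two clusters, and Python 3.8 or later (for `math.dist`), and it scores a one-member cluster as 0 by the usual convention.

```python
import math

def mean_silhouette(points, labels):
    """Average silhouette width over all cases for one partition."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: silhouette taken as 0
            continue
        # a: mean distance to the other members of the case's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Made-up data: two visibly separated groups of three cases each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

sil_k2 = mean_silhouette(points, [0, 0, 0, 1, 1, 1])
sil_k3 = mean_silhouette(points, [0, 0, 1, 2, 2, 2])  # first group split in two
print("mean silhouette, k=2:", sil_k2)
print("mean silhouette, k=3:", sil_k3)
```

Values near 1 indicate cases that sit well inside their cluster; comparing the average across candidate numbers of clusters is one way to read the plots, again as a guide rather than a rule.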

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621

From: Hector Maletta <[hidden email]>
To: [hidden email]
Date: 03/31/2012 04:29 PM
Subject: Re: [SPSSX-L] Output of criterion for selecting the number of clusters in agglomerative hierarchical cluster analysis
Sent by: "SPSSX(r) Discussion" <[hidden email]>

It would be useful if you told us in more detail what your analysis consists of, how many cases and variables are involved, what you are trying to achieve with cluster analysis, and what "procedures" and "development of criterion" you have reportedly completed.