SPSSX Discussion

Determine number of clusters for K-means cluster analysis


Rongjin Guan

Jan 08, 2016; 9:56pm

Determine number of clusters for K-means cluster analysis

We want to do a K-means cluster analysis and need to specify the number of clusters as input.

We first ran a hierarchical cluster analysis to determine the number of clusters. We copied the
agglomeration schedule output by the SPSS hierarchical cluster analysis into Excel
and drew a scree plot, hoping to see a clear gap, but the gap is not obvious.

In this situation, what can one do to determine the number of clusters?
In SAS, people can use statistics such as the pseudo F, t^2, and CCC, as well as the semipartial R-square,
to help judge the number of clusters. But in SPSS, I did not see such options.

Thanks and have a good weekend!

Rongjin Guan
Rutgers SSW
=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Kirill Orlov

Jan 08, 2016; 10:48pm

Re: Determine number of clusters for K-means cluster analysis

There are dozens of clustering criteria, both internal and external. The Wikipedia article on cluster analysis is a good place to start on this topic.

On the page http://www.spsstools.net/en/KO-spssmacros you'll find a collection of some of the most recommended internal clustering criteria; just use those macros. One of them, the silhouette statistic, is also available as an extension command (by Jon Peck) which, if I recall correctly, was added in the latest Statistics release.

The approach you describe (take the agglomeration schedule or the dendrogram and visually look for a "gap") is just one of the clustering criteria, and not a very good one. One reason is that you cannot use it to compare and choose between different agglomeration methods. Please read some warnings regarding the comparison of hierarchical cluster results: http://stats.stackexchange.com/a/63549/3277.
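
To make the idea of an internal criterion concrete, here is a minimal Python sketch of one of them, the pseudo F (Calinski-Harabasz) statistic mentioned in the question. It is not one of the SPSS macros from the page above; the function name, the toy data, and the candidate label vectors are all made up for illustration. The statistic is the ratio of between-cluster to within-cluster variance, and a larger value is usually read as a better-separated solution.

```python
def pseudo_f(points, labels):
    """Calinski-Harabasz pseudo F: (B / (k - 1)) / (W / (n - k))."""
    n, dim = len(points), len(points[0])
    grand = [sum(p[d] for p in points) / n for d in range(dim)]
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    between = within = 0.0
    for members in clusters.values():
        center = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        # Between-cluster sum of squares: cluster size times squared
        # distance from the cluster center to the grand mean.
        between += len(members) * sum((c - g) ** 2 for c, g in zip(center, grand))
        # Within-cluster sum of squares around the cluster center.
        within += sum((x - c) ** 2 for p in members for x, c in zip(p, center))
    return (between / (k - 1)) / (within / (n - k))

# Toy data (hypothetical): three well-separated groups in two dimensions.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

# Hypothetical cluster memberships for 2, 3 and 4 clusters, e.g. as saved
# from a hierarchical or K-means run.
candidates = {
    2: [0, 0, 0, 1, 1, 1, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],
    4: [0, 0, 0, 1, 1, 1, 2, 2, 3],
}
scores = {k: pseudo_f(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On real data the maximum is rarely this clear-cut, so it is safer to look at several criteria together than to trust one number.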


Jon Peck

Jan 09, 2016; 2:13am

Re: Determine number of clusters for K-means cluster analysis

Some people like to do this by using Ward's method with squared Euclidean distance in hierarchical clustering and using the cluster centers from that as the starting point for k-means.
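
The two-stage idea can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not SPSS syntax and not how SPSS implements either procedure: the function names and toy data are hypothetical, the Ward step is a naive implementation suitable only for small data, and the K-means step is ordinary Lloyd iteration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(members):
    n = len(members)
    return tuple(sum(p[d] for p in members) / n for d in range(len(members[0])))

def ward_centers(points, k):
    """Naive Ward agglomeration down to k clusters; returns their centroids."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ni, nj = len(clusters[i]), len(clusters[j])
                # Ward's criterion: the increase in the within-cluster sum
                # of squares caused by merging clusters i and j.
                cost = ni * nj / (ni + nj) * sq_dist(
                    centroid(clusters[i]), centroid(clusters[j]))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]

def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm started from the supplied centers."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assign each case to its nearest current center.
        labels = [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster happens to become empty.
            new_centers.append(centroid(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Toy data (hypothetical): three well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

start = ward_centers(points, 3)        # stage 1: hierarchical (Ward)
labels, centers = kmeans(points, start)  # stage 2: K-means from those centers
print(labels)
```

Starting K-means from the Ward centroids removes the dependence on random or arbitrary initial centers, but the number of clusters still has to be chosen beforehand.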

The STATS_CLUS_SIL extension command can be helpful in evaluating the clusters from any method. It is installed with the Python Essentials in recent releases, but if you don't already have it, you can get it from the SPSS Community site (old or new) via the Utilities menu. It appears on the menus as Analyze > Classify > Cluster Silhouettes.
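
For readers who want to see what a silhouette measures, here is a minimal Python sketch of the standard definition. It is not the code of the extension command, and its options and output are not reproduced here; the function name and toy data are hypothetical, Euclidean distance is assumed, and singleton clusters are given a value of 0 by convention. For each case, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the silhouette is (b - a) / max(a, b).

```python
import math

def silhouette_values(points, labels):
    """Per-case silhouette values; needs at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            values.append(0.0)  # convention for a cluster with one case
            continue
        # a: mean distance to the other cases in the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

# Toy data (hypothetical): three well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
two = [0, 0, 0, 1, 1, 1, 1, 1, 1]
three = [0, 0, 0, 1, 1, 1, 2, 2, 2]

mean_two = sum(silhouette_values(points, two)) / len(points)
mean_three = sum(silhouette_values(points, three)) / len(points)
print(round(mean_two, 3), round(mean_three, 3))
```

Values near 1 indicate cases that sit well inside their cluster, values near 0 indicate cases on a boundary, and negative values suggest a case may be in the wrong cluster. Comparing the mean silhouette across candidate solutions is one way to choose the number of clusters.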


Jon K Peck
[hidden email]
