determine number of clusters for K-means cluster analysis


determine number of clusters for K-means cluster analysis

Rongjin Guan
We want to run a K-means cluster analysis and need to specify the number of clusters as input.

We first ran a hierarchical cluster analysis to determine the number of clusters. We copied the
agglomeration schedule output by the SPSS hierarchical cluster analysis into Excel
and drew a scree plot, hoping to see a clear gap, but no obvious gap appears.
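
For readers who want to do the "look for a gap" step numerically rather than by eye, here is a minimal Python sketch. It assumes the agglomeration coefficients have been copied into a list in stage order; the function name `suggest_k_from_schedule` and the example numbers are made up for illustration and are not SPSS output. It only finds the largest single jump, so it has the same weakness as the visual scree plot when no jump stands out.

```python
def suggest_k_from_schedule(coefficients, n_cases):
    """Suggest a cluster count from the largest jump in agglomeration coefficients.

    coefficients[i] is the coefficient reported at stage i + 1 of the schedule,
    so a full schedule for n_cases cases has n_cases - 1 entries.
    """
    if len(coefficients) != n_cases - 1 or len(coefficients) < 2:
        raise ValueError("expected n_cases - 1 coefficients, at least 2")
    best_k, best_jump = None, float("-inf")
    for i in range(1, len(coefficients)):
        jump = coefficients[i] - coefficients[i - 1]
        # Before stage i + 1 runs, i merges have happened, so n_cases - i
        # clusters exist. A large jump means that merge joined dissimilar
        # clusters, which suggests stopping just before it.
        k_before_merge = n_cases - i
        if jump > best_jump:
            best_k, best_jump = k_before_merge, jump
    return best_k, best_jump


# Hypothetical coefficients for 8 cases (7 stages).
k, jump = suggest_k_from_schedule([1.0, 1.2, 1.5, 1.9, 2.4, 9.0, 15.0], n_cases=8)
print(k, jump)  # suggests 3 clusters: the jump from 2.4 to 9.0 is the largest
```

If several jumps are of similar size, the result is not trustworthy and another criterion is needed.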

In this situation, what can one do to determine the number of clusters?
In SAS, one can use statistics such as the pseudo F, pseudo t^2, and CCC, as well as the semipartial R-squared,
to help judge the number of clusters. But in SPSS, I did not see such options.
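
As an illustration of what the pseudo F measures, here is a small Python sketch of the Calinski-Harabasz ratio: between-cluster sum of squares per degree of freedom divided by within-cluster sum of squares per degree of freedom. The function name `pseudo_f` and the toy data are assumptions for the example; this is not SAS or SPSS code. Larger values indicate tighter, better separated clusters, so one would compute it for several candidate values of k and compare.

```python
def pseudo_f(points, labels):
    """Pseudo F (Calinski-Harabasz): (B / (k - 1)) / (W / (n - k))."""
    n = len(points)
    dims = len(points[0])
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 clusters and fewer clusters than cases")
    grand = [sum(p[d] for p in points) / n for d in range(dims)]
    between = 0.0
    within = 0.0
    for members in clusters.values():
        center = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        # Between: cluster size times squared distance of its center to the grand mean.
        between += len(members) * sum((center[d] - grand[d]) ** 2 for d in range(dims))
        # Within: squared distances of members to their own cluster center.
        within += sum((p[d] - center[d]) ** 2 for p in members for d in range(dims))
    if within == 0.0:
        return float("inf")  # every case sits exactly on its cluster center
    return (between / (k - 1)) / (within / (n - k))


# Toy data: two well separated groups of three cases each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(pseudo_f(data, [0, 0, 0, 1, 1, 1]))  # sensible partition
print(pseudo_f(data, [0, 1, 0, 1, 0, 1]))  # arbitrary partition, much lower
```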

Thanks and have a good weekend!

Rongjin Guan
Rutgers SSW
=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: determine number of clusters for K-means cluster analysis

Kirill Orlov
There are dozens of clustering criteria, both internal and external. To begin with, read the Wikipedia article on cluster analysis.

On the page http://www.spsstools.net/en/KO-spssmacros you'll find a collection of some of the most recommended internal clustering criteria; just use those macros. One of them, the Silhouette statistic, is also available as an extension command (by Jon Peck) which, if I recall correctly, was added in the latest Statistics release.

The approach you describe (taking the agglomeration schedule or the dendrogram and visually looking for a "gap") is just one clustering criterion, and not a very good one. One reason is that you cannot use it to compare and choose between different agglomeration methods. Please read some warnings regarding the comparison of hierarchical cluster results: http://stats.stackexchange.com/a/63549/3277.


Re: determine number of clusters for K-means cluster analysis

Jon Peck
Some people like to do this by using Ward's method with squared Euclidean distance in hierarchical clustering and using the cluster centers from that as the starting point for k-means.
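
The two-step idea above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the hierarchical (for example, Ward) cluster memberships are taken as already computed and saved as labels, and the helper names `centroids_from_labels` and `kmeans_from_seeds` are hypothetical. The k-means part is a basic Lloyd iteration, not the SPSS implementation.

```python
def centroids_from_labels(points, labels):
    """Mean of each hierarchical cluster, ordered by sorted label."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    dims = len(points[0])
    return [
        [sum(p[d] for p in groups[lab]) / len(groups[lab]) for d in range(dims)]
        for lab in sorted(groups)
    ]


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_cases(points, centers):
    """Index of the nearest center for every case."""
    return [min(range(len(centers)), key=lambda j: sqdist(p, centers[j])) for p in points]


def kmeans_from_seeds(points, seeds, max_iter=100):
    """Lloyd's k-means started from the given centers instead of random ones."""
    centers = [list(s) for s in seeds]
    dims = len(points[0])
    for _ in range(max_iter):
        assign = assign_cases(points, centers)
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                new_centers.append([sum(p[d] for p in members) / len(members) for d in range(dims)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return centers, assign_cases(points, centers)


# Toy data with hierarchical memberships 1 and 2 (assumed, not real output).
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds = centroids_from_labels(data, [1, 1, 1, 2, 2, 2])
centers, membership = kmeans_from_seeds(data, seeds)
print(centers, membership)
```

Starting from the hierarchical centers makes the k-means result reproducible and usually close to the hierarchical solution, which k-means then refines by reassigning cases.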

The STATS_CLUS_SIL extension command can be helpful in evaluating the clusters from any method. It is installed with the Python Essentials in recent releases, but if you don't already have it, you can get it from the SPSS Community site (old or new) via the Utilities menu. It appears on the Analyze > Classify > Cluster Silhouettes menu.
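
To show what a silhouette measures, here is a minimal Python sketch. It is not the code of the extension command; it assumes Euclidean distance, and the function name `silhouette_scores` and the toy data are made up for the example. For each case, a is the mean distance to the other cases in its own cluster, b is the smallest mean distance to the cases of any other cluster, and the silhouette is (b - a) / max(a, b).

```python
import math


def silhouette_scores(points, labels):
    """Per-case silhouette values using Euclidean distance."""
    n = len(points)
    scores = []
    for i in range(n):
        dist_by_cluster = {}
        for j in range(n):
            if i != j:
                d = math.dist(points[i], points[j])
                dist_by_cluster.setdefault(labels[j], []).append(d)
        own = dist_by_cluster.pop(labels[i], None)
        if not own or not dist_by_cluster:
            # Singleton cluster or only one cluster overall: use 0 by convention.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(ds) / len(ds) for ds in dist_by_cluster.values())
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Toy data: compare the mean silhouette of two candidate partitions.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for candidate in ([0, 0, 0, 1, 1, 1], [0, 1, 0, 1, 0, 1]):
    s = silhouette_scores(data, candidate)
    print(candidate, sum(s) / len(s))
```

Values near 1 mean a case sits well inside its cluster, values near 0 mean it lies between clusters, and negative values suggest it may be in the wrong cluster. Comparing the mean silhouette across solutions with different numbers of clusters is one way to choose k.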

On Fri, Jan 8, 2016 at 3:48 PM, Kirill Orlov <[hidden email]> wrote:
There are dozens of clustering criteria, both internal and external. To begin with, read the Wikipedia article on cluster analysis.

On the page http://www.spsstools.net/en/KO-spssmacros you'll find a collection of some of the most recommended internal clustering criteria; just use those macros. One of them, the Silhouette statistic, is also available as an extension command (by Jon Peck) which, if I recall correctly, was added in the latest Statistics release.

The approach you describe (taking the agglomeration schedule or the dendrogram and visually looking for a "gap") is just one clustering criterion, and not a very good one. One reason is that you cannot use it to compare and choose between different agglomeration methods. Please read some warnings regarding the comparison of hierarchical cluster results: http://stats.stackexchange.com/a/63549/3277.




--
Jon K Peck
[hidden email]
