SPSSX Discussion

two step clustering is SPSS

Rongjin Guan

Jul 15, 2016; 10:20pm

two step clustering is SPSS

Today, while reading Stack Overflow, I found some interesting comments:
-------------------------------------------
I'm doing the same set of analyses for a project. Just for your information, the two-step clustering process offered by SPSS is more robust than K-means (Punj & Stewart 1983). In K-means, how are you going to choose the K?! You can also use the clValid package to get the optimal K if you insist on using K-means.

Punj, G., & Stewart, D. W. (1983). Cluster analysis in marketing research: Review and suggestions for application. Journal of Marketing Research, 134-148.
-------------------------------------------

Is two-step clustering indeed more robust than K-means? I thought I should use two-step
clustering to get the number of clusters (K) and then feed this K into K-means clustering as the last step.

SPSS two-step clustering uses either BIC (the default) or AIC to determine the number of clusters (K),
and also reports a silhouette indicator (poor, fair, or good). It seems to me that BIC and AIC are not very
good indicators of the number of clusters, and the silhouette information provided is very limited.
I recently switched to SAS for cluster analysis, but I would like to know whether others have made good use
of two-step clustering in SPSS.

Any comments are welcome.

Rongjin Guan

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Kirill Orlov

Jul 16, 2016; 12:45pm

Re: two step clustering is SPSS

--> Is the two step clustering indeed more robust than K-means?

What do you mean specifically by "robust"? And how is this tied to the issue of "best K", the number of clusters?

TwoStep has an "automatic K selection" option, which, however, is not the core of the clustering algorithm. The automatic selection tries to suggest the optimal K partway through the clustering, i.e. before the clusters are built in full. There are now modifications of K-means (not in SPSS) that attempt a similar "automatic K selection".

Note that "automatic K selection" is only advantageous when you have a huge dataset to cluster, with many thousands or millions of cases. If your dataset isn't big, you can cluster "till the end" for a range of K values in modest time and then, after the clusterings are done, apply any clustering criterion you see as suitable to select the best K, the best solution. (A number of such criteria implemented for SPSS can be found at http://www.spsstools.net/en/KO-spssmacros.)
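
The "cluster for a range of K, then pick by a criterion" workflow described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the SPSS macros mentioned above: the toy data, the function names (farthest_first, kmeans, mean_silhouette), the farthest-first starting centers, and the choice of the mean silhouette as the criterion are all picked for the example.

```python
import math
import random

def farthest_first(points, k):
    """Deterministic starting centers: each new center is the point farthest from those chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=100):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def mean_silhouette(points, labels):
    """Average silhouette width over all points (singleton clusters contribute 0)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        total += (b - a) / max(a, b)
    return total / len(points)

# Toy data (an assumption for this sketch): three well-separated round blobs.
rng = random.Random(0)
true_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
points = [(cx + rng.gauss(0, 0.6), cy + rng.gauss(0, 0.6))
          for cx, cy in true_centers for _ in range(20)]

# Cluster "till the end" for each candidate K, then compare the solutions afterwards.
scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

With blobs this well separated, the criterion should favor K = 3. On real data the curve is usually flatter, and other criteria can be swapped in for mean_silhouette without changing the loop.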

TwoStep cluster was originally based on the BIRCH clustering algorithm and is designed specifically for big data. If the data isn't big, other clustering methods, including hierarchical CA, are more flexible and preferable. Be aware that TwoStep CA isn't free of assumptions, "biases", or even shortcomings. My preliminary probes, for example, showed that it assumes "round" (rather than "ellipsoid") clusters, as K-means does, although perhaps to a lesser degree.
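
To make the "round" versus "ellipsoid" point concrete, here is a small plain-Python sketch of the K-means side of that remark only. It does not reproduce TwoStep, and the toy data, the function names, and the farthest-first starting centers are assumptions chosen for the illustration. Two strongly elongated clusters are laid out as parallel rows, and K-means with K = 2 is run on them.

```python
import math

def kmeans(points, k, iters=100):
    """Lloyd's k-means with deterministic farthest-first starting centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def within_ss(points, labels):
    """Total within-cluster sum of squared distances to the cluster means."""
    total = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        mean = tuple(sum(x) / len(members) for x in zip(*members))
        total += sum(math.dist(p, mean) ** 2 for p in members)
    return total

# Two elongated "true" clusters: long horizontal rows at y = 0 and y = 2.
points = [(float(x), float(y)) for y in (0, 2) for x in range(-10, 11)]
true_rows = [0 if y == 0 else 1 for _, y in points]

found = kmeans(points, 2)

# For each K-means cluster, list which true rows it contains.
mixed = {lab: {row for row, f in zip(true_rows, found) if f == lab} for lab in set(found)}
print(mixed)

# K-means minimizes within-cluster sum of squares, which rewards compact, round groups.
print(within_ss(points, found), within_ss(points, true_rows))
```

On this layout K-means cuts the data into a left half and a right half, so each cluster it finds mixes both rows, and that split has a lower within-cluster sum of squares than the true row partition. That is the sense in which the method "assumes" round clusters.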

Jon Peck

Jul 16, 2016; 1:28pm

Re: two step clustering is SPSS

In addition to the built-in clustering procedures in Statistics, there are several extension commands pertinent to this area under the Analyze > Classify menu:

Density-Based Clustering with Noise (STATS DBSCAN, and STATS DBPRED for prediction)

Support Vector Machines (STATS SVM)

Cluster Silhouette (STATS CLUS SIL)

If these are not already installed (all but CLUS SIL require the R Essentials and are included with some versions of it), they can be added from the Utilities menu in V22-23 or the Extensions menu in V24. For older versions, they can be downloaded from the SPSS Community website.

Jon K Peck
[hidden email]

Anthony Babinec

Jul 16, 2016; 2:01pm

twostep clustering

In reply to this post by Rongjin Guan

Do an internet search for the paper/presentation by Johann Bacher, Knut Wenzig, and Melanie Vogler entitled "SPSS TwoStep Cluster – A First Evaluation". Note that an internet search on TwoStep Cluster will also turn up some other comparison papers.

Tony Babinec

[hidden email]