Fitindices to determine optimal clustersolution


MaaikeSmits
Hello,

My question is about selecting the optimal cluster solution. I have used SPSS version 21.0 to perform a cluster analysis in which the cluster means from a hierarchical cluster analysis using Ward’s method and squared Euclidean distance are used as starting points in a k-means cluster analysis.

Several cluster solutions (2-, 3-, 4- and 5-cluster solutions) have been established in this manner.

Does anyone have guidance (and SPSS syntax) on the use of fit indices such as explained variance/eta-squared/adjusted R-squared, the Calinski-Harabasz index, and the Akaike Information Criterion to determine the optimal cluster solution?

Thank you for your help in advance.
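
For readers who want to see what two of these indices measure, here is a minimal sketch in Python rather than SPSS syntax. It assumes a hard clustering with at least two clusters and fewer clusters than cases; the function name `fit_indices` and the toy data are illustrative and do not come from the thread. Eta-squared is taken as the between-cluster share of the total sum of squares, and the Calinski-Harabasz index as the ratio of between-cluster to within-cluster mean squares.

```python
# Sketch: explained variance (eta-squared) and Calinski-Harabasz index
# for one cluster solution. Function name and data are illustrative only.

def fit_indices(rows, labels):
    """Return (eta_squared, calinski_harabasz) for a hard clustering."""
    n = len(rows)
    dims = len(rows[0])
    clusters = sorted(set(labels))
    k = len(clusters)

    grand_mean = [sum(r[d] for r in rows) / n for d in range(dims)]

    # Total sum of squares around the grand centroid.
    ss_total = sum((r[d] - grand_mean[d]) ** 2 for r in rows for d in range(dims))

    # Within-cluster sum of squares around each cluster centroid.
    ss_within = 0.0
    for c in clusters:
        members = [r for r, lab in zip(rows, labels) if lab == c]
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dims)]
        ss_within += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dims))

    ss_between = ss_total - ss_within
    eta_squared = ss_between / ss_total
    calinski_harabasz = (ss_between / (k - 1)) / (ss_within / (n - k))
    return eta_squared, calinski_harabasz


# Toy data: two tight, well-separated clusters.
rows = [(0.0, 0.0), (0.0, 1.0), (4.0, 4.0), (4.0, 5.0)]
labels = [1, 1, 2, 2]
eta_sq, ch = fit_indices(rows, labels)
print(eta_sq, ch)
```

Running the function once per saved membership variable (one each for the 2- to 5-cluster solutions) gives values that can be compared across solutions. Note that eta-squared cannot decrease when a k-means solution is refined with more clusters, so it is usually read for an "elbow" rather than a maximum.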

Re: Fitindices to determine optimal clustersolution

Art Kendall
How many cases do you have in the whole data set?

How were the cases selected?

Are your variables reasonably uncorrelated?

Am I reading correctly that you used the cluster profiles from the Ward method to start the k-means?

How many samples from the whole set of cases did you use for the Ward method?

How large were those samples?
Art Kendall
Social Research Consultants

Re: Fitindices to determine optimal clustersolution

MaaikeSmits
In reply to this post by Art Kendall
Hello,

Thank you for taking an interest in my question. I will try to answer your questions.

Of a total of 225 cases, 187 were included in the cluster analysis (20 cases were lost because of missing data on one or more of the 10 input variables, and another 18 were excluded because they proved to be extreme outliers on one or more of the input variables).

I started with a hierarchical cluster analysis on these 187 cases, and the cluster means that resulted from this procedure were used as non-random starting points in the k-means cluster analysis, which was run on the same 187 cases. So I did not select subsamples for either the hierarchical or the k-means procedure, but ran both on the whole sample.
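
The second stage of this two-stage approach can be sketched as follows. This is a plain Lloyd-style k-means in Python that starts from supplied centers instead of random ones; it is an illustration of the idea, not a reproduction of what SPSS does internally. The function name `kmeans_from_centers`, the data, and the starting centers are all hypothetical.

```python
# Sketch: k-means (Lloyd's algorithm) seeded with supplied starting centers,
# e.g. the cluster means from a prior hierarchical run. Illustrative data only.

def kmeans_from_centers(rows, centers, max_iter=100):
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(len(centers)),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(row, centers[j])))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


rows = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
start = [(1, 1), (4, 4)]  # stand-ins for the hierarchical cluster means
labels, centers = kmeans_from_centers(rows, start)
print(labels, centers)
```

Because the starting centers are fixed, the result is deterministic, which is the main practical benefit of seeding k-means from a hierarchical solution.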

The 10 (standardized) dimensional scores that were used as input variables for the cluster analysis were fairly unrelated: most correlations were below .1, with a few of .3 or .4.

I hope this gives you the information needed to provide some guidance on my question. Of course I will be happy to provide more detail if necessary.

Kind Regards
Maaike






Re: Fitindices to determine optimal clustersolution

Jon K Peck
One tool that might give you added insight into your clustering solutions is cluster silhouettes.  These show the distribution of silhouette values for each cluster.  They can be produced by the STATS CLUS SIL extension command (Analyze > Classify > Cluster Silhouettes).  If you don't have that already installed and have V22 or later, you can install it from the Utilities menu.  For older versions you would need to get it from the SPSS Community website (www.ibm.com/developerworks/spssdevcentral) in the Extension Commands collection.  It requires the Python Essentials, which are integrated into the Statistics install as of V22.
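
As a rough illustration of what a silhouette value is, the following Python sketch computes s(i) = (b - a) / max(a, b) for each case, where a is the mean distance to the other cases in its own cluster and b is the smallest mean distance to any other cluster. It assumes Euclidean distance and at least two clusters. The function name `silhouette_values` and the data are illustrative; this is not the code of the STATS CLUS SIL extension.

```python
import math

def silhouette_values(rows, labels):
    """Silhouette s(i) = (b - a) / max(a, b) for each case, Euclidean distance."""
    def dist(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

    clusters = set(labels)
    values = []
    for i, (row, own) in enumerate(zip(rows, labels)):
        # Mean distance from case i to each cluster, excluding case i itself.
        mean_dist = {}
        for c in clusters:
            others = [dist(row, r) for j, (r, lab) in enumerate(zip(rows, labels))
                      if lab == c and j != i]
            if others:
                mean_dist[c] = sum(others) / len(others)
        if own not in mean_dist:
            values.append(0.0)  # singleton cluster: s(i) is conventionally 0
            continue
        a = mean_dist[own]                                    # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)  # separation
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Toy data: case index 2 sits next to cluster 1 but is labelled cluster 2.
rows = [(0.0, 0.0), (0.0, 1.0), (0.5, 0.5), (4.0, 4.0), (4.0, 5.0)]
labels = [1, 1, 2, 2, 2]
values = silhouette_values(rows, labels)
print(values)
```

In the toy data the misplaced case gets a negative silhouette value, which is the pattern to look for in the per-cluster silhouette distributions.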


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621





Re: Fitindices to determine optimal clustersolution

Kirill Orlov
In reply to this post by MaaikeSmits
A collection of macros to compute various internal clustering criteria: see "Clustering criterions" on my page http://www.spsstools.net/KO-spssmacros.htm.



Re: Fitindices to determine optimal clustersolution

Art Kendall
In reply to this post by Jon K Peck
Jon's suggestion is right on.

*****
Of course, a lot depends on what you are going to use the cluster membership variable for, what the nature of a case is, and the meaning of the variables in the raw profiles.
*****

However, it has been my practice since the mid-70s to use several clustering algorithms and to use "core clusters" based on the consensus of several techniques.

Clustering techniques are exploratory/heuristic techniques. The various combinations of distance measure and agglomeration algorithm grab different aspects of potential clusters.



(1) Retain the cluster memberships from several techniques, distance measures, and levels of agglomeration, interpreting each so that it makes some sense.
(2) Crosstab the memberships and find bunches of cases that are put together by several runs. Call those bunches "core clusters".
(3) Use the classification phase of discriminant to see how well the core clusters are separated.
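
Step (2) can be sketched in Python as below. The sketch makes a stricter assumption than the text: a bunch counts as a core cluster only if every run puts its cases together, whereas the description above allows agreement among "several" runs. The function name `core_clusters`, the `min_size` threshold, and the case ids are hypothetical.

```python
from collections import defaultdict

def core_clusters(memberships, min_size=2):
    """Group cases that receive the same label combination in every run.

    memberships: dict mapping case id -> tuple of cluster labels, one per run.
    Returns lists of case ids; combinations smaller than min_size are dropped.
    """
    cells = defaultdict(list)
    for case_id, combo in memberships.items():
        cells[tuple(combo)].append(case_id)  # one cell of the multi-way crosstab
    kept = (sorted(ids) for ids in cells.values() if len(ids) >= min_size)
    return sorted(kept, key=lambda ids: (-len(ids), ids))


# Labels from three hypothetical runs (e.g. Ward, average linkage, k-means).
memberships = {
    "a": (1, 1, 2), "b": (1, 1, 2), "c": (1, 2, 2),
    "d": (2, 2, 1), "e": (2, 2, 1), "f": (2, 2, 1),
}
cores = core_clusters(memberships)
print(cores)
```

Working on label combinations avoids having to match cluster numbers across runs, since cluster "1" in one run need not correspond to cluster "1" in another. Cases that fall in small cells (here case "c") are left ungrouped.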

Now that STATS CLUS SIL is available, use that to look at the core clusters.


I have not tried this but it would be interesting to see what happens when you run TWOSTEP on the set of cluster membership variables.

-----

For the cases with suspicious values, do they make sense as valid measures?
I am leery of trimming variables in general, and especially so when looking for profiles. In clustering, extremes might be particularly interesting; e.g., looking at counties in 5 western US states, Los Angeles remained a singleton through everything, and that makes tremendous sense.

----

Have you examined why some of the variables in the raw profiles are missing? Since your data set is small, you might consider whether you can make reasonable substitutions for the missing values, e.g., use variables not in the profile to do MVA, or substitute zero, or use means of valid scale items rather than totals if you have summative scores, or ...
Art Kendall
Social Research Consultants
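
One of the substitutions mentioned above, using the mean of the valid scale items instead of a total, can be sketched in Python as follows. It is similar in spirit to the SPSS MEAN.n function. Representing missing values as `None`, the function name `mean_of_valid`, and the threshold of 3 valid items are assumptions of this sketch.

```python
def mean_of_valid(items, min_valid):
    """Mean of the non-missing items, or None if too few items are valid.

    Missing values are represented by None (an assumption of this sketch).
    """
    valid = [x for x in items if x is not None]
    if len(valid) < min_valid:
        return None
    return sum(valid) / len(valid)


score_ok = mean_of_valid([3, None, 4, 5], min_valid=3)          # 3 valid items
score_missing = mean_of_valid([3, None, None, 5], min_valid=3)  # only 2 valid
print(score_ok, score_missing)
```

A case with one missing item then keeps a scale score on the same metric as the other cases, rather than being dropped from the cluster analysis.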

Re: Fitindices to determine optimal clustersolution

MaaikeSmits
In reply to this post by Kirill Orlov
@ Kirill Orlov. I found your macros at an earlier stage of my search and tried the command for the Calinski-Harabasz index and AIC/BIC. However, the command doesn't seem to run on my data. It seems that a temporary file cannot be found, as I get the following warnings (for each of the cluster solutions). Do you have any idea how I could fix this problem?

>Error # 34 in column 36.  Text: temp_aicbic_QCL_1.sav
>SPSS Statistics cannot access a file with the given file specification.  The
>file specification is either syntactically invalid, specifies an invalid
>drive, specifies a protected directory, specifies a protected file, or
>specifies a non-sharable file.
>Execution of this command stops

>Error # 34 in column 35.  Text: temp_calharv_QCL_1.sav
>SPSS Statistics cannot access a file with the given file specification.  The
>file specification is either syntactically invalid, specifies an invalid
>drive, specifies a protected directory, specifies a protected file, or
>specifies a non-sharable file.
>Execution of this command stops.



Re: Fitindices to determine optimal clustersolution

MaaikeSmits
In reply to this post by Art Kendall
@ Jon Peck and Art Kendall

I have now used silhouettes to look at the cluster solutions. I am not sure whether the use of Euclidean distance is justified, since I used squared Euclidean distance as the measure in the original clustering procedure. I also cannot find information on choosing the Minkowski power, so I set it to 2 by default. When I continue under these two assumptions (Minkowski power of 2 and Euclidean distance), I find the overall average s for all four cluster solutions to be rather low: s(2) = .117, s(3) = .090, s(4) = .058, s(5) = .125. Maybe it is a good sign that none of the clusters shows a negative mean s value (in any of the cluster solutions); however, every cluster contains cases with negative s values. Is there an absolute way to interpret the s values, or only relative to each other?
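
On the Minkowski power itself, a small Python sketch of the distance formula may help: with a power of 2 the Minkowski distance is the Euclidean distance, and with a power of 1 it is the city-block distance. The function name `minkowski` and the two points are illustrative. The sketch shows only the mathematics and makes no claim about how STATS CLUS SIL uses the MINKOWSKIPOWER setting when DISSIMILARITY=EUCLID.

```python
def minkowski(p, q, power):
    """Minkowski distance; power=1 is city-block, power=2 is Euclidean."""
    return sum(abs(x - y) ** power for x, y in zip(p, q)) ** (1.0 / power)


p, q = (0.0, 0.0), (3.0, 4.0)
euclid = minkowski(p, q, 2)      # 5.0
squared_euclid = euclid ** 2     # 25.0, squared Euclidean distance
city_block = minkowski(p, q, 1)  # 7.0
print(euclid, squared_euclid, city_block)
```

Squared Euclidean distance is a monotone transformation of Euclidean distance, so the nearest-neighbour ordering of cases is the same under both, although averages of distances (and therefore silhouette values) are not identical.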


Regarding the suggestions of Art Kendall:
Thank you for your advice on the stepwise approach of working through the clustering with various procedures and then crosstabbing to decide which cases belong to core clusters. I will try to work out which other distance measures would be suitable for my data, apart from the one I already used. Can you refer me to an article in which the stepwise procedure you described above is used and outlined?

We did give a lot of consideration to the handling of missing data and outliers. The silhouettes procedure refers to the handling of missing data, but in my procedure all cases with missing values on any of the input variables are automatically excluded. Are there clustering procedures in which this is not the case (without manually correcting for missing values or imputing them)?
When I do not exclude the extreme outliers, clusters are formed with only 1 or 2 cases, which makes the interpretation quite hard; that is why we chose to exclude extreme outliers (but keep potential or probable outliers).


Re: Fitindices to determine optimal clustersolution

Kirill Orlov
In reply to this post by MaaikeSmits
Did you read the page named "About SPSS macros" on the site? It says please DO READ.


Re: Fitindices to determine optimal clustersolution

MaaikeSmits
In reply to this post by Art Kendall
I am sorry, but I forgot to ask one last question about the two-step procedure. You suggested running TWOSTEP on the set of cluster membership variables. I am not sure I grasp what I would be doing then; could you refer me to more information on this method? Do you mean clustering with the 4 cluster membership variables as input, i.e., those established via the k-means clustering procedure (started from the hierarchical solution) for the 2-, 3-, 4- and 5-cluster solutions, respectively?

Your help is very much appreciated.

Re: Fitindices to determine optimal clustersolution

Art Kendall
In reply to this post by MaaikeSmits
Look at the PROXIMITIES documentation and see which measures are for continuous data.

Unfortunately, publications from the US Census Bureau do not seem to be available online.  My files with my write ups on that procedure were not returned from sanitizing at the time of the 2001 DC area anthrax events.

TWOSTEP can use continuous variables, categorical variables, or a mixture of those. It is certainly one that I would try on the continuous data.

I have not yet had an opportunity to try using TWOSTEP with the cluster membership variables as the profile elements. The idea is:
(1) Applying (a) a set of hierarchical CLUSTER algorithms, e.g., Ward, single-linkage, average-linkage, (b) TWOSTEP, and (c) k-means would give you variables showing cluster membership.
(2) Use the cluster memberships as the elements of a profile. IFF it gives meaningful grouping, using TWOSTEP would be a way to find core clusters.
(3) As in the other post about core clusters, iteratively run DISCRIMINANT, using the probabilities of membership to move some cases into the "ungrouped" value of the variable that contains the core cluster membership.
Art Kendall
Social Research Consultants

Re: Fitindices to determine optimal clustersolution

MaaikeSmits
In reply to this post by Jon K Peck
Hi Jon Peck,

Some time ago you referred me to the silhouettes procedure as a way to evaluate and compare several cluster solutions that I obtained from a k-means cluster analysis. I performed the cluster analysis on z-scores (10 variables). In the silhouettes procedure I think I should use the z-scores as well (alongside the cluster membership variable); is that right? So option 1 instead of option 2? QCL_1 is the cluster membership variable that is saved from the k-means procedure. The rBPS to rSZOID variables are the variables I performed the k-means analyses on (in option 1 as z-scores, in option 2 as the original non-standardized scores). I do not understand enough of the mathematics of silhouettes to understand why the silhouette mean scores differ between standardized and non-standardized scores, but which one should I use? I run the syntax below for each cluster membership variable, and then the one with the highest overall mean silhouette score would indicate the best fit to the data, right?

Option 1
STATS CLUS SIL CLUSTER=QCL_1 VARIABLES=ZrBPSdim ZrTHEAdim ZrNARCdim ZrANTdimA ZrAFHdim ZrONTdim
    ZrOBSdim ZrPARAdim ZrSTYPdim ZrSZOIDdim
NEXTBEST=clusdrienextbest SILHOUETTE=clusdriesilval DISSIMILARITY=EUCLID MINKOWSKIPOWER=2
/OPTIONS MISSING=RESCALE RENUMBERORDINAL=NO
/OUTPUT HISTOGRAM=YES ORIENTATION=HORIZONTAL THREEDBAR=YES THREEDCOUNTS=YES.

Option 2
STATS CLUS SIL CLUSTER=QCL_3 VARIABLES=rBPSdim rTHEAdim rNARCdim rANTdimA rAFHdim rONTdim
    rOBSdim rPARAdim rSTYPdim rSZOIDdim
NEXTBEST=clusdrienextbest SILHOUETTE=clusdriesilval DISSIMILARITY=EUCLID MINKOWSKIPOWER=2
/OPTIONS MISSING=RESCALE RENUMBERORDINAL=NO
/OUTPUT HISTOGRAM=YES ORIENTATION=HORIZONTAL THREEDBAR=YES THREEDCOUNTS=YES.

Re: Fitindices to determine optimal clustersolution

Jon K Peck
You should be using the variables the way they were used when you did the clustering, so if you clustered using standardized variables, use those same variables for the silhouette plots.
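
A small Python sketch of why the scaling matters: standardizing changes the distances between cases, so silhouettes computed on raw scores describe a different geometry from the one the clustering used. The variable values and the helper names `z_scores` and `euclid` are made up for illustration.

```python
import math
import statistics

def z_scores(values):
    """Standardize to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
    return [(v - mean) / sd for v in values]

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Two variables on very different scales, four cases A, B, C, D.
x = [0.0, 3.0, 0.0, 1.0]
y = [100.0, 100.0, 110.0, 150.0]
raw = list(zip(x, y))
std = list(zip(z_scores(x), z_scores(y)))

raw_ab, raw_ac = euclid(raw[0], raw[1]), euclid(raw[0], raw[2])
std_ab, std_ac = euclid(std[0], std[1]), euclid(std[0], std[2])
print(raw_ab, raw_ac, std_ab, std_ac)
```

On the raw scores case A is closer to B than to C; after standardizing, A is closer to C. A silhouette computed on one scaling can therefore judge an assignment made on the other scaling as poor even when it is sensible.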

The silhouette plots are showing you how comfortable, if you will, the points are in their assigned clusters. Silhouette values range from -1 to 1; values near 1 are good.

This link gives you a concise description without too much math.
https://en.wikipedia.org/wiki/Silhouette_(clustering)

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621





Re: Fitindices to determine optimal clustersolution

Art Kendall
In reply to this post by MaaikeSmits
I am generally leery of the idea of removing extreme values. It is important to look for extreme values in order to check that the data was correctly entered. However, depending on the nature of your data, extreme values may be particularly interesting.

What is the nature of the cases? Are they a population, a genuine sample, an available set of cases, etc.?

What is the nature of the set of variables?

Why is data missing? Do some variables account for much of the missing data? I.e., could you retain more cases by dropping variables?  
Art Kendall
Social Research Consultants
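
The last question, whether more cases could be retained by dropping variables, can be explored with a sketch like the following in Python. It counts complete cases under listwise deletion and then recounts with each variable left out in turn. Missing values are represented as `None`, and the function names and the three-variable toy data set are hypothetical.

```python
def complete_cases(rows, variables):
    """Count rows with no missing (None) value on the given variables."""
    return sum(all(row[v] is not None for v in variables) for row in rows)

def retention_by_dropped_variable(rows, variables):
    """For each variable, the complete-case count if that variable were dropped."""
    return {
        dropped: complete_cases(rows, [v for v in variables if v != dropped])
        for dropped in variables
    }


rows = [
    {"v1": 1, "v2": 2, "v3": 3},
    {"v1": 1, "v2": None, "v3": 3},
    {"v1": 1, "v2": None, "v3": 3},
    {"v1": None, "v2": 2, "v3": 3},
]
variables = ["v1", "v2", "v3"]
baseline = complete_cases(rows, variables)
retained = retention_by_dropped_variable(rows, variables)
print(baseline, retained)
```

In the toy data, dropping `v2` raises the number of usable cases the most, which is the kind of pattern that would show one variable accounting for much of the missing data.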