Hello,
My question is about the selection of the optimal cluster solution. I have used SPSS version 21.0 to perform a cluster analysis in which the cluster means from a hierarchical cluster analysis using Ward’s method and squared Euclidean distance are used as starting points in a k-means cluster analysis. Several cluster solutions (2-, 3-, 4- and 5-cluster solutions) have been established in this manner. Does anyone have guidance (and SPSS syntax) concerning the use of fit indices such as explained variance/eta-squared/adjusted R-squared, the Calinski-Harabasz index, and the Akaike Information Criterion to determine the optimal cluster solution? Thank you for your help in advance. |
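For the explained-variance/eta-squared part of the question, a minimal sketch: assuming the k-means membership was saved under its default name QCL_1 and the input variables are named var1 to var10 (placeholder names), MEANS with the ANOVA statistics reports eta and eta squared for each input broken down by cluster:

MEANS TABLES=var1 TO var10 BY QCL_1
  /STATISTICS=ANOVA.

Running this for each saved membership variable (the 2- through 5-cluster solutions) gives a per-variable picture of how much variance each solution explains. |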
How many cases do you have in the whole data set?
How were the cases selected? Are your variables reasonably uncorrelated? Am I reading correctly that you used the cluster profiles from the Ward method to start the k-means? How many samples from the whole set of cases did you use for the Ward method? How large were those samples?
Art Kendall
Social Research Consultants |
In reply to this post by Art Kendall
Hello, Thank you for taking an interest in my question. I will try to provide additional information on your questions. From a total of 225 cases, 187 were included in the cluster analysis (20 cases were lost as a result of missing data on one or more of the 10 input variables, and another 18 were excluded because they turned out to be extreme outliers on one or more of the input variables). I started with a hierarchical cluster analysis on these 187 cases, and the cluster means that resulted from this procedure were used as non-random starting points in the k-means cluster analysis, which was also done on these same 187 cases. So I did not select subsamples for either the hierarchical or the k-means procedure, but ran both on the whole sample. The 10 (standardized) dimensional scores that were used as input variables for the cluster analysis were fairly uncorrelated; most correlations were below .1, with a few of .3 or .4. I hope I have given you the relevant answers so you can provide some guidance on my question. Of course I will be happy to provide more detailed information if necessary. Kind regards, Maaike |
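A minimal syntax sketch of that Ward-then-k-means workflow, with var1 to var10 as placeholder names and the 4-cluster run as an example (the exact layout the QUICK CLUSTER FILE subcommand expects may vary by version, so check its documentation):

* Hierarchical step: Ward's method, squared Euclidean distance.
CLUSTER var1 TO var10
  /METHOD WARD
  /MEASURE=SEUCLID
  /SAVE=CLUSTER(4).
* CLUSTER saves membership as CLU4_1; aggregate to one case per cluster.
DATASET DECLARE centers.
AGGREGATE /OUTFILE=centers
  /BREAK=CLU4_1
  /var1 TO var10=MEAN(var1 TO var10).
* If QUICK CLUSTER objects to the extra CLU4_1 variable,
* delete it from the centers dataset first.
* k-means step, seeded with the Ward cluster means.
QUICK CLUSTER var1 TO var10
  /CRITERIA=CLUSTER(4) MXITER(20)
  /FILE=centers
  /SAVE=CLUSTER. |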
One tool that might give you added insight into your clustering solutions is cluster silhouettes. These show the distribution of silhouette values for each cluster. They can be produced by the STATS CLUS SIL extension command (Analyze > Classify > Cluster Silhouettes). If you don't have that already installed and have V22 or later, you can install it from the Utilities menu. For older versions you would need to get it from the SPSS Community website (www.ibm.com/developerworks/spssdevcentral) in the Extension Commands collection. It requires the Python Essentials, which are integrated into the Statistics install as of V22.
Jon Peck (no "h") aka Kim Senior Software Engineer, IBM [hidden email] phone: 720-342-5621 From: MaaikeSmits <[hidden email]> To: [hidden email] Date: 04/29/2015 11:06 AM Subject: Re: [SPSSX-L] Fitindices to determine optimal clustersolution Sent by: "SPSSX(r) Discussion" <[hidden email]> Hello, Thank you for taking interest in my question. I will try to provide you with additional information on your questions. From a total of 225 cases, 187 were included in the cluster analysis (20 cases were lost as a result of missing data on one or more of the 10 input variables and another 18 were excluded because they showed to be extreme outliers on one or more of the input variables). I started with a hierarchical cluster analysis on this 187 cases and the cluster means that resulted from this procedure were used as non-random starting points in the k-means cluster analysis, which was also done on these same 187 cases. So, I did not select subsamples for the hierarchical nor the k-means procedure, but ran both on the whole sample. The 10 (standardized) dimensional scores that were used as input variables for the cluster analysis were fairly unrelated, most below .1, a few of .3 or .4. I hope I have given you the relevant answers to be able to provide some guidance on my question. Of course I will be happy to provide more detailed information if necessary. Kind Regards Maaike 2015-04-29 17:08 GMT+02:00 Art Kendall [via SPSSX Discussion] <[hidden email]>: How many cases do you have in the whole data set? How were the cases selected? Are you variables reasonably uncorrelated? Am I reading correctly that you used the cluster profiles from the Ward method to start the k-means? How many samples from the whole set of cases did you use for the Ward method? How large were those samples? Art Kendall If you reply to this email, your message will be added to the discussion below: http://spssx-discussion.1045642.n5.nabble.com/Fitindices-to-determine-optimal-clustersolution-tp5729419p5729431.html To unsubscribe from Fitindices to determine optimal clustersolution, click here. NAML View this message in context: Re: Fitindices to determine optimal clustersolution Sent from the SPSSX Discussion mailing list archive at Nabble.com. ===================== To manage your subscription to SPSSX-L, send a message to LISTSERV@... (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
In reply to this post by MaaikeSmits
A collection of macros to compute various internal clustering criteria ("Clustering criterions") is available on my page http://www.spsstools.net/KO-spssmacros.htm. |
In reply to this post by Jon K Peck
Jon's suggestion is right on.
***** Of course, a lot depends on what you are going to use the cluster membership variable for, what the nature of a case is, and the meaning of the variables in the raw profiles. *****

However, it has been my practice since the mid-70s to use several clustering algorithms and to form "core clusters" based on the consensus of several techniques. Clustering techniques are exploratory/heuristic techniques; the various distance measure * agglomeration algorithm combinations grab different aspects of potential clusters. (1) Retain the cluster memberships from several techniques, distance measures, and levels of agglomeration, interpreting each so that it makes some sense. (2) Crosstab the memberships and find bunches of cases that are put together by several runs; call those bunches "core clusters". (3) Use the classification phase of DISCRIMINANT to see how well the core clusters are separated. Now that STATS CLUS SIL is available, use that to look at the core clusters as well. (A syntax sketch of steps 2 and 3 follows this post.) I have not tried this, but it would be interesting to see what happens when you run TWOSTEP on the set of cluster membership variables.

For the cases with suspicious values, do they make sense as valid measures? I am leery of trimming extreme values in general and especially so when looking for profiles. In clustering, extremes might be particularly meaningful. E.g., looking at counties in 5 western US states, Los Angeles remained a singleton through everything, and that makes tremendous sense.

Have you examined why some of the variables in the raw profiles are missing? Since your data set is small, you might consider whether you can make reasonable substitutions for the missing values, e.g., use variables not in the profile to do MVA, or substitute zero, or use means of valid scale items rather than totals if you have summative scores, or ...
Art Kendall
Social Research Consultants |
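A rough sketch of steps (2) and (3), with hypothetical membership variables ward4 and km4 saved from two different runs, var1 to var10 as the profile variables, and core a variable holding the agreed-on cluster (1-4) with 0 for cases the runs disagree on:

* (2) Which bunches of cases do the two methods put together?
CROSSTABS /TABLES=ward4 BY km4.
* (3) Classification phase: how well separated are the core clusters?
* Ungrouped cases (core = 0) are classified along with the rest.
DISCRIMINANT /GROUPS=core(1,4)
  /VARIABLES=var1 TO var10
  /SAVE=CLASS PROBS
  /STATISTICS=TABLE. |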
In reply to this post by Kirill Orlov
@ Kirill Orlov: I found your macros in an earlier stage of my search and tried the command for the Calinski-Harabasz index and AIC/BIC. However, the command doesn't seem to run on my data. It seems that a temporary file cannot be found, as I get the following warnings (for each of the cluster solutions). Do you have any idea how I could fix this problem?
Error # 34 in column 36. Text: temp_aicbic_QCL_1.sav
SPSS Statistics cannot access a file with the given file specification. The file specification is either syntactically invalid, specifies an invalid drive, specifies a protected directory, specifies a protected file, or specifies a non-sharable file. Execution of this command stops.

Error # 34 in column 35. Text: temp_calharv_QCL_1.sav
SPSS Statistics cannot access a file with the given file specification. The file specification is either syntactically invalid, specifies an invalid drive, specifies a protected directory, specifies a protected file, or specifies a non-sharable file. Execution of this command stops. |
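One hedged guess at the cause: macros like these often write their temporary .sav files to the current working directory, and the error suggests that directory is not writable. Pointing the session at a writable folder before running the macro may help (the path below is of course a placeholder):

SHOW DIRECTORY.
CD 'C:/Users/yourname/Documents'. |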
In reply to this post by Art Kendall
@ Jon Peck and Art Kendall
I have now used silhouettes to look at the cluster solutions. I am not sure whether the use of Euclidean distance is justified; I used squared Euclidean distance as the measure in the original clustering procedure. Also, I cannot find information on choosing the value of the Minkowski power, so I left it at the default of 2. However, when I continue under these two assumptions (Minkowski power of 2 and use of Euclidean distance), I find the overall average S for all four clustering solutions to be rather low: s(2) = .117, s(3) = .090, s(4) = .058, s(5) = .125. Maybe it is a good sign that none of the clusters shows a negative mean value of S (in none of the cluster solutions); however, in all of the clusters there are cases with negative s values. Is there an absolute manner in which to interpret the S values, or only relative to each other?

Considering the suggestions of Art Kendall: thank you for your advice on the stepped manner of working through the clustering via various procedures, deciding via crosstabbing which cases belong to core clusters. I will try to work out which other distance measures would be suitable for my data, apart from the procedure I already used. Can you refer me to an article in which the stepped procedure as you described it is used and outlined?

We did give a lot of consideration to the handling of missing data and outliers. The silhouettes steps refer to the handling of missing data, but in my procedure all cases which have missings on one of the input variables are automatically excluded. Are there clustering procedures in which this is not the case (without manually correcting for missing values or imputing them)? When I do not exclude the extreme outliers, clusters are formed with only 1 or 2 cases, which makes the interpretation quite hard, so that is why we chose to exclude extreme outliers (but keep potential or probable outliers). |
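One way to see where those negative s values sit, assuming the membership and per-case silhouette variables are the QCL_1 and clusdriesilval used in the syntax later in this thread:

* Silhouette summary per cluster.
MEANS TABLES=clusdriesilval BY QCL_1
  /CELLS=MEAN MIN MAX COUNT.
* Flag and count cases with negative silhouette values per cluster.
COMPUTE negsil = (clusdriesilval < 0).
CROSSTABS /TABLES=negsil BY QCL_1. |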
In reply to this post by MaaikeSmits
Did you read the page named "About SPSS macros" on the site? It says: please DO READ. |
In reply to this post by Art Kendall
I am sorry, but I forgot to mention one last question on the TWOSTEP procedure. You hinted at running the TWOSTEP procedure on the set of cluster membership variables. I am not sure I grasp the concept of what I would be doing then; could you refer me to more information on this method? Do you mean clustering by means of the input of the 4 cluster membership variables that were established via the (first hierarchical, then optimized by k-means) clustering procedure for, respectively, the 2-, 3-, 4- and 5-cluster solutions?
Your help is very much appreciated. |
In reply to this post by MaaikeSmits
Look at the PROXIMITIES documentation and see which measures are for continuous data.
Unfortunately, publications from the US Census Bureau do not seem to be available online. My files with my write-ups on that procedure were not returned from sanitizing at the time of the 2001 DC-area anthrax events.

TWOSTEP can use continuous variables, categorical variables, or a mixture of those. It is certainly one that I would try on the continuous data. I have not yet had an opportunity to try using TWOSTEP with the cluster membership variables as the profile elements. The idea is (1) to apply (a) a set of hierarchical CLUSTER methods, e.g., Ward, single-linkage, average-linkage, (b) TWOSTEP, and (c) k-means (QUICK CLUSTER), each of which would give you a variable showing cluster membership; (2) to use the cluster memberships as the elements of a profile. IFF it gives a meaningful grouping, using TWOSTEP on those memberships would be a way to find core clusters (a sketch follows this post). (3) As in the other post about core clusters, iteratively run DISCRIMINANT, using the probabilities of membership to move some cases into the "ungrouped" value of the variable that contains the core cluster membership.
Art Kendall
Social Research Consultants |
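A minimal sketch of step (2), assuming membership variables ward4, km4, and ts4 (placeholder names) were saved from three of the runs; the memberships go in as categorical inputs (check the TWOSTEP CLUSTER syntax reference for your version):

TWOSTEP CLUSTER
  /CATEGORICAL VARIABLES=ward4 km4 ts4
  /DISTANCE LIKELIHOOD
  /NUMCLUSTERS AUTO 15 BIC
  /SAVE VARIABLE=coreclu. |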
In reply to this post by Jon K Peck
Hi Jon Peck,
Some time ago you referred me to the silhouettes procedure as a way to assess and compare the several cluster solutions that I got from a k-means cluster analysis. I performed the cluster analysis on z-scores (10 variables). In the silhouettes procedure I think I should use the z-scores as well (next to the cluster solution/membership variable); is that right? So option 1 instead of option 2? QCL_1 is the cluster membership variable that was saved from the k-means procedure. The rBPS to rSZOID variables are the variables I performed the k-means analyses on (in option 1 as z-scores, in option 2 as original non-standardized scores). I do not understand enough of the mathematical computation of silhouettes to understand the difference between silhouette mean scores that arises if I use standardized versus non-standardized scores, but which one should I use? I run the syntax below on all cluster variables, and then the one with the highest total silhouette mean score would indicate the best fit to the data, right?

Option 1
STATS CLUS SIL CLUSTER=QCL_1
VARIABLES=ZrBPSdim ZrTHEAdim ZrNARCdim ZrANTdimA ZrAFHdim ZrONTdim ZrOBSdim ZrPARAdim ZrSTYPdim ZrSZOIDdim
NEXTBEST=clusdrienextbest SILHOUETTE=clusdriesilval
DISSIMILARITY=EUCLID MINKOWSKIPOWER=2
/OPTIONS MISSING=RESCALE RENUMBERORDINAL=NO
/OUTPUT HISTOGRAM=YES ORIENTATION=HORIZONTAL THREEDBAR=YES THREEDCOUNTS=YES.

Option 2
STATS CLUS SIL CLUSTER=QCL_3
VARIABLES=rBPSdim rTHEAdim rNARCdim rANTdimA rAFHdim rONTdim rOBSdim rPARAdim rSTYPdim rSZOIDdim
NEXTBEST=clusdrienextbest SILHOUETTE=clusdriesilval
DISSIMILARITY=EUCLID MINKOWSKIPOWER=2
/OPTIONS MISSING=RESCALE RENUMBERORDINAL=NO
/OUTPUT HISTOGRAM=YES ORIENTATION=HORIZONTAL THREEDBAR=YES THREEDCOUNTS=YES. |
You should be using the variables the way they were used when you did the clustering, so if you clustered using standardized variables, use those same variables for the silhouette plots.
The silhouette plots are showing you how comfortable, if you will, the points are in their assigned clusters. Silhouette values near 1 are good. This link gives you a concise description without too much math: https://en.wikipedia.org/wiki/Silhouette_(clustering)
Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM |
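For reference, the computation behind those plots is short: for each case i, a(i) is the average distance to the other cases in its own cluster, b(i) is the lowest average distance to the cases of any other cluster, and s(i) = (b(i) - a(i)) / max(a(i), b(i)). An overall average around .1, as reported earlier in the thread, therefore means b(i) is typically only about 10% larger than a(i); for example, a(i) = 0.90 and b(i) = 1.00 give s(i) = 0.10, i.e., the clusters barely pull their members closer than the nearest neighboring cluster does. |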
In reply to this post by MaaikeSmits
I am generally leery of the idea of removing extreme values. It is important to look for extreme values in order to check that the data was correctly entered. However, depending on the nature of your data extreme values may be particularly interesting.
What is the nature of the cases? Are they a population, a genuine sample, an available set of cases, etc.? What is the nature of the set of variables? Why are data missing? Do some variables account for much of the missing data? I.e., could you retain more cases by dropping variables? (A quick check on this is sketched after this post.)
Art Kendall
Social Research Consultants |
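A quick check on both questions, with var1 to var10 standing in for the 10 profile variables:

* Missing inputs per case; the frequency table shows how many cases
* would be regained by tolerating 1 or 2 missing values.
COUNT nmiss = var1 TO var10 (MISSING).
FREQUENCIES VARIABLES=nmiss.
* Valid and missing N per variable, tables suppressed.
FREQUENCIES VARIABLES=var1 TO var10
  /FORMAT=NOTABLE. |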