Hi all,
I'm trying to figure out which clustering method I should be using for my analysis. So far I've tried both K-means and hierarchical clustering on the same data and have ended up with entirely different clusters. With K-means I got three clusters that are very close in size, whereas with hierarchical clustering almost all the cases ended up in one cluster. Is this even possible? (I'm not entirely clear on how Ward's algorithm works.) Which one should I be using? My dataset has about 64 cases, and 10 variables were used in the clustering.

Any advice would be great.

Alina Sheyman, Family Office Exchange
We use Ward's (1963) minimum-variance method. This is a hierarchical method that groups cases so as to maximize between-group differences and minimize within-group differences (i.e., it optimizes an F statistic). It keeps merging the most similar pair of cases/clusters until there is just one cluster.

Melissa
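For readers who want to see the mechanics outside SPSS, here is a minimal sketch of Ward's agglomerative process in Python/scipy. The 64 x 10 data matrix is simulated purely for illustration; substitute your own variables.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))   # stand-in for the real 64-case, 10-variable data

# Ward's method: at each step, merge the pair of clusters whose union gives
# the smallest increase in the total within-cluster sum of squares.
Z = linkage(X, method="ward")

# Cut the tree at three clusters and inspect the resulting group sizes.
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(labels)[1:])  # fcluster labels start at 1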
In reply to this post by Alina Sheyman
Stephen Brand
www.statisticsdoc.com

Alina,

It is usually very good to examine the results of both K-means and a hierarchical clustering algorithm (and Ward's is a good one). Important considerations in choosing between the results of a Ward's and a K-means solution include interpretability and utility. Here is a very rough guide to how the algorithms work and how to look at the results.

If you ask K-means to form K clusters, it will try to find K centroids that are furthest apart. The process starts with K cases that are furthest apart and forms clusters of the cases that are closest to these initial cases. The location of the centroids does not depend on the clusters found with K+1 clusters, and does not influence the clusters found with K-1 clusters. Hence, K-means is not hierarchical - it tries to differentiate the cases into K clusters regardless of how they were combined in more differentiated cluster structures. It is not constrained by solutions with more clusters.

Hierarchical clustering algorithms start by assigning each case to its own cluster, and then combine clusters. If your sample size is N, the process starts with N clusters. There are a variety of algorithms and criteria for combining clusters; Ward's is one of the more widely used, and useful, hierarchical clustering methods. A key point is that, in hierarchical methods, the solution for K clusters depends on the solution for K+1 clusters, because two of the clusters in the K+1 solution are the ones that are combined. The clustering process operates on the clusters that were found with more clusters.

In some applications, the results of Ward's or another hierarchical method are more interpretable, because the hierarchical structure has some inherent taxonomic meaning. The fact that some clusters are subsumed under other, higher-order clusters is inherently interesting (think speciation). Sometimes, the results of Ward's are just more interpretable in terms of how the cases are grouped on the clustering or other variables.

To evaluate cluster solutions, you may find it helpful to conduct a discriminant function analysis to differentiate the clusters according to the clustering variables. You might also consider running a discriminant function analysis differentiating the clusters according to exogenous variables (ones that were not used to form the clusters, but which should differ meaningfully between them). Do some of the clusters appear to break out interesting and useful patterns of variables? Is there some inherent utility in finding small clusters of cases with atypical patterns of responding, or are you looking for a number of reasonably large clusters?

A whole other topic, of course, is deciding on the number of clusters to select with each method. Again, utility and interpretability are key issues, and discriminant function analysis can be a useful tool.

HTH,

Stephen Brand

For personalized and professional consultation in statistics and research design, visit www.statisticsdoc.com
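To see how differently the two approaches can split the same 64 cases, one quick check (a Python/scikit-learn sketch with simulated, illustrative data) is to run both with the same number of clusters and compare the group sizes:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 10))            # placeholder for the real data
Xz = StandardScaler().fit_transform(X)   # cluster on z scores

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(Xz)
ward = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(Xz)

# How many cases does each method put in each cluster?
print("k-means sizes:", np.bincount(km.labels_))
print("Ward sizes:   ", np.bincount(ward.labels_))

Very different size distributions from the two methods on the same data are entirely possible, especially when the cluster boundaries are fuzzy.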
In reply to this post by Alina Sheyman
Different methods have different strengths and weaknesses. Ward's method tends to give equal-sized clusters, while single linkage (nearest neighbor) tends to chain cases into long strings within a cluster. I think it is best to try several methods and examine the clusters for interpretability.

K-means is sensitive to the starting values. What I do is try several hierarchical methods and see which gives the most interpretable clusters. Then I use k-means (with the hierarchical cluster centroids as starting points) to clean up the hierarchical clusters. This can be necessary because sometimes hierarchical clusters can drift away from their starting point.

Paul R. Swank, Ph.D.
Professor and Director of Research, Children's Learning Institute
University of Texas Health Science Center at Houston
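A sketch of that two-stage approach in Python (scipy + scikit-learn; the data are simulated in place of the real file, and k = 3 is just an example):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(64, 10))    # placeholder data
k = 3

# Step 1: Ward's hierarchical clustering, cut at k clusters.
labels = fcluster(linkage(X, method="ward"), t=k, criterion="maxclust")

# Step 2: use the hierarchical cluster centroids as k-means starting points,
# letting k-means reassign any cases that drifted.
centroids = np.vstack([X[labels == g].mean(axis=0) for g in range(1, k + 1)])
km = KMeans(n_clusters=k, init=centroids, n_init=1).fit(X)

print(np.bincount(km.labels_))   # cleaned-up cluster sizes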
In reply to this post by Alina Sheyman
Alina,
Both methods are applicable, but there are differences between them.

K-means computes a definite number of clusters given by you, so you did not "get" three clusters: you "told" the QUICK CLUSTER procedure to assign the cases to three clusters. Besides, k-means works with interval-level variables, which you should standardize beforehand (i.e., convert into z scores yourself). Also, the results of k-means clustering may be affected by the choice of initial centres for the clusters, i.e. the starting points for the iteration.

On the other hand, hierarchical clustering accepts any kind of variable, provided you choose an adequate measure of proximity, and the CLUSTER procedure then forms successive groupings, from the initial situation of N clusters with one member each to the final situation of one giant cluster with N members. This is achieved in steps. In the first step, the two closest cases are joined into one cluster, resulting in N-1 clusters: one cluster of two members and N-2 clusters of one member. In the second step, a third case is joined either to the two-member cluster or to another solitary case, whichever is closer, resulting in N-2 clusters, and so on. Hierarchical clustering lets you choose which of the N-1 steps gives you the most convenient number of clusters for your analysis.

Hector
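A small sketch of both points - standardizing first, and scanning the agglomeration steps for a convenient cut - in Python, on simulated data used purely for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(64, 10))            # placeholder for the 64 x 10 data

# Standardize before clustering (the analogue of saving z scores in SPSS).
Xz = StandardScaler().fit_transform(X)

# Build the full hierarchy once, then inspect several cut levels and pick the
# number of clusters that is most convenient for the analysis.
Z = linkage(Xz, method="ward")
for k in range(2, 6):
    sizes = np.bincount(fcluster(Z, t=k, criterion="maxclust"))[1:]
    print(k, "clusters ->", sizes)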
In reply to this post by statisticsdoc
At 04:32 PM 12/4/2006, Statisticsdoc wrote:
>It is usually very good to examine the results of both K-means and a
>hierarchical clustering algorithm (and Ward's is a good one)
>[...]
>To evaluate cluster solutions, you may find it helpful to conduct
>discriminant function analysis to differentiate the clusters according
>to the clustering variables.

Well, when displaying ignorance, might as well do it in public.

Stephen, why discriminant function? Without knowing either deeply, I've come to think of logistic regression as usually superior for modelling group differentiation. Among other things, do I recall that discriminant is a little unforgiving of non-normality?
In reply to this post by Alina Sheyman
Alina,
There is no right or wrong approach here, there is just a well-thought-out, logical rationale for one choice over another, plus some basic investigation of your data to explain what is happening. From your outline so far, what I know is that you have ten variables and 64 cases.

Typically I would defer to hierarchical clustering (HC) given your sample size, since this is the only piece of information you have provided in your posting, apart from the disparate findings across the two algorithms (HC and k-means). I have read somewhere that HC produces more stable solutions than k-means with small sample sizes (I cannot remember where); you may be able to find some published peer-reviewed literature to substantiate your choice of one algorithm over another based in part (not the only consideration!) on your sample size. That does not seem to settle things in your case (k-means gives seemingly more balanced solutions), so here are some other things to look at, using SPSS (a small sketch of these checks, outside SPSS, follows this post).

1. Multicollinearity (is this stuffing up your solutions?)

Before you run your clustering process again, run a correlation analysis on your variables and develop a correlation matrix to assess collinearity between them. Variables that are highly collinear (i.e., have high correlations) should be omitted from the analysis unless there are theoretical grounds for keeping them. You could also run a quick-and-dirty PCA on your variables (before I get shot down for running PCA on 64 cases: you are doing this just to see which items load together, looking for general patterns, and not reading too much into the factor-analytic results). Then run and rerun your cluster analyses: develop different solutions (HC and k-means) with all the variables included, then eliminate any collinear variables and rerun your solutions, and see whether this has an impact on the differences between the two solutions. That way you can identify or discount multicollinearity as affecting your solutions.

2. Are your clusters an artefact of the algorithm, and really not "true" clusters? That could explain the disparate results across the two algorithms you used.

Given the way the SPSS clustering procedures work, and their shortcomings, here is a little test to run on your solutions. Depending on how you sort the file and the order of cases, your solutions can vary (oh dear!). Here is what I would do if you have time; it is a quick way to test cluster "reproducibility". Generate a set of random ID variables at the end of your dataset (assign different ID numbers to each case). Sort your dataset by each of these variables (ascending and descending) and then rerun your cluster analyses repeatedly. Save the cluster memberships and then run crosstabs on the different memberships. If your clusters are stable, no matter how you sort the dataset you should see similar membership patterns across the differently sorted solutions. If they are not, you have a clue that the algorithm is not picking up real and reproducible clusters. (In ClustanGraphics, I can seed 5000 k-means solutions and it generates a reproducibility index, based on the Euclidean sum of squares, that tells me that for different random starting points my solution is reproduced 75% of the time.)

3. What else might be causing the different results (some real and actual patterns in the data)? Are the two algorithms tapping different patterns across the variables?

Profile the clusters on the 10 variables (look at means and standard deviations) by running a series of ANOVAs using cluster membership and all ten variables, for both the k-means and hierarchical solutions (make sure you have the same number of clusters). This will give you a picture of which variables your clusters differ on (do not look for significance; look for general patterns here). It may be that the two algorithms are linking your cases differently, and your profiles will give you some idea about whether this is occurring. Look at mean differences and standard deviation sizes. A general rule of thumb is that variables with smaller SDs and large differences between means are better discriminators between clusters (this depends on the algorithm).

I have identified some simple practical things you can do; hope this helps.

Paul
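Not SPSS, but a compact way to try points 1-3 in Python (the variable names, the 0.8 correlation cutoff, and the simulated data frame are all illustrative assumptions):

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(4)
X = pd.DataFrame(rng.normal(size=(64, 10)),
                 columns=[f"v{i}" for i in range(1, 11)])   # illustrative names

# 1. Collinearity: flag pairs of clustering variables with high correlations.
corr = X.corr()
high = [(a, b, round(corr.loc[a, b], 2))
        for i, a in enumerate(corr.columns) for b in corr.columns[i + 1:]
        if abs(corr.loc[a, b]) > 0.8]
print(high)

# 2. Reproducibility: rerun k-means from many random starts and compare the
# memberships (the analogue of re-sorting the file before each SPSS run).
base = KMeans(n_clusters=3, n_init=1, random_state=0).fit_predict(X)
agreement = [adjusted_rand_score(
                 base,
                 KMeans(n_clusters=3, n_init=1, random_state=s).fit_predict(X))
             for s in range(1, 51)]
print(np.mean(agreement))   # near 1.0 = stable; much lower = start-dependent

# 3. Profile: cluster means and standard deviations on each variable.
print(X.groupby(base).agg(["mean", "std"]).round(2))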
In reply to this post by Richard Ristow
I have a very simple question. If you have repeated-measures data but you cannot match up data points across testing occasions, what is the problem with just doing an independent-samples t-test? What do you lose? Is it completely unsound practice, or do you just lose power?

Thanks,
Matt

Matthew Pirritano, Ph.D.
Assistant Professor of Psychology
Smith Hall 116C
Chapman University, Department of Psychology
One University Drive
Orange, CA 92866
Telephone (714) 744-7940
FAX (714) 997-6780
Matt,
When you conduct an independent-samples t-test, you lose power. To the extent that the pre-test and post-test scores are correlated, the paired-samples test is more powerful.

HTH,

Stephen Brand

For personalized and professional consultation in statistics and research design, visit www.statisticsdoc.com
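A quick simulation (Python/scipy sketch; the sample size, effect size, and pre/post correlation are made-up values) shows the power given up by ignoring the pairing when pairing is actually available:

import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(5)
n, effect, rho = 30, 0.5, 0.7   # assumed n, mean change, pre/post correlation

# Simulate correlated pre/post scores with a true mean difference of `effect`.
pre = rng.normal(0.0, 1.0, n)
post = rho * pre + np.sqrt(1 - rho**2) * rng.normal(0.0, 1.0, n) + effect

print("independent-samples p:", ttest_ind(pre, post).pvalue)
print("paired-samples p:     ", ttest_rel(pre, post).pvalue)

The paired test typically gives the smaller p-value because the correlation between occasions shrinks the error term; with unmatchable data you are stuck with the independent-samples test and simply forgo that gain.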
In reply to this post by Alina Sheyman
--On 04 December 2006 15:34 -0500 Alina Sheyman <[hidden email]> wrote:

> I'm trying to figure out what clustering mechanism I should be using for
> my analysis. For now I've tried both K-means and hierarchical clustering
> on the same data and have ended up with entirely different clusters.

As usual in statistics, it is a good idea to think about the data and how it arose - what mechanism in nature generated it. You can approach this by theory, or by examining the data.

Ignoring clustering for the moment, suppose that you are just given two sets of numbers and asked whether their locations differ. Before using a t-test, you look at the data and consider whether it is normally distributed. If all of the numbers are whole numbers and, after examining the distribution, you decide that the data probably arose from a Poisson process rather than from a normal distribution, you select your analysis accordingly. On the other hand, if you know that the numbers came from counting cars driven past a fixed point, this would give you some clues about the appropriate analysis.

Now for clustering. Different clustering methods try to find different kinds of clusters. What kind of clusters should you be looking for?

Hierarchical clustering is the sort that you might apply when there is a "tree" structure to the data. Think of the classification of living things. At the top, all of them, then splitting into plants, animals and other things such as fungi. Once you are on the animal branch, this splits into mammals, reptiles, etc., and you can keep going until you get down to individual species. AT NO TIME, once things have been split off from the rest of the data onto one of the branches, do subsets ever move to other branches. You might think about whether this is appropriate for your data. Once you have split your data up into two sets, this split is final, and the process only subdivides further - nothing from set one ever moves back into set two.

K-means clustering does not assume a tree structure. In its pure form you might ask the computer to split the data values into three groups or four groups, but you can't guarantee that merging two groups from the four-group solution will produce the three-group solution.

If you have only two or three dimensions (or can sensibly reduce your data by factor analysis) you can plot the data and see what sort of relationships you have. Are you looking for nice spherical clusters, or are long chains more suitable? You might consider that your data values were generated from multivariate normal random variables from groups with different means, and you might consider how best to identify these groups and their means. Sometimes data values fall into such clear groups that almost all clustering methods will find the same clusters. Where the boundaries are fuzzy, the solutions may be very different.

I'll end with a little parable. Suppose I have a very willing idiot working for me, and I ask him to arrange my books nicely. He might do this by author, or by subject, or by the colour of the cover, or the size of the book, or by weight, or by date of publication. If I simply ask for a "nice arrangement" I ought not to complain about any of these, and I might find one or more useful. If you just ask SPSS to use cluster analysis to produce a "nice arrangement" then, according to the method chosen, the order of the data and a possible random element, you might get one of many rather different nice arrangements, and the "best" of these depends on what you want the clustering for.

David Hitchin
In reply to this post by Richard Ristow
Stephen Brand
www.statisticsdoc.com

Richard,

In my experience, discriminant functions usually provide a useful framework for differentiating clusters. DFA gives a sense of the dimensions that differentiate the clusters, where the clusters are located within this framework, and how the variables are associated with the dimensions. (It is also possible to look at the unique contributions of the variables to the dimensions, but in many instances the variables you cluster on will have a fairly high degree of collinearity, so be careful.) For example, when looking at health and adjustment data, it can be useful to speak of a function that relates to substance-use items, another that relates to emotional adjustment, and a third that relates to academic adjustment, and to consider where the centroids of each cluster fall on each function (with and without rotation).

You raise an interesting point - a multinomial logistic regression would be another potentially useful way to relate the clustering variables to cluster membership. This method could also be used to relate exogenous variables to cluster membership, particularly when the set of predictor variables has relatively low collinearity. The regression approach will give you more information about the variables that make a unique contribution to the likelihood of membership in one cluster versus the alternatives. I see discriminant function analysis as perhaps more helpful for developing a dimensional framework.

Best,

Stephen Brand

For personalized and professional consultation in statistics and research design, visit www.statisticsdoc.com
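Once cluster membership has been saved, the multinomial-logistic option is easy to try; a hedged Python/scikit-learn sketch (simulated data, three clusters assumed purely for illustration):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.normal(size=(64, 10))               # placeholder clustering variables
Xz = StandardScaler().fit_transform(X)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xz)

# Multinomial logistic regression of cluster membership on the clustering
# variables: which variables make a unique contribution to the odds of
# belonging to each cluster?
logit = LogisticRegression(max_iter=1000).fit(Xz, clusters)
print(np.round(logit.coef_, 2))             # one row of coefficients per cluster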
Previous responders have made several very good points.
In addition: a major reason to do clustering is data reduction. Clustering creates a new nominal-level variable that can be used in any further analysis.

Iterative use of discriminant analysis provides other outputs that are useful in refining and interpreting a cluster solution suggested by the cluster procedures. The classification phase has a table of cluster memberships by the memberships that would be assigned by the DFA. DFA can also save the probability of membership for each case in each of the clusters, and the probability that a case would be as far from the centroid of the cluster it is assigned to as it is. Iteratively treating cases with ambiguous cluster assignment, or with extreme distance from their centroids, as "ungrouped" in the classification phase can be very useful in reaching a working solution.

Logistic regression produces predicted scores that can only take on the values of the raw variable, which can be very useful. DFA creates continuous variables along which the cases are arrayed; assigning membership is then analogous to making cuts on those dimensions. The farther a case is from the cutpoint, the more strongly it is a member of the group to which it is assigned.

Art Kendall
Social Research Consultants
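A minimal sketch of that iterative check in Python/scikit-learn (simulated data; the 0.6 probability cutoff is an arbitrary illustration, not a rule):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(7)
X = rng.normal(size=(64, 10))                 # placeholder clustering variables
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

lda = LinearDiscriminantAnalysis().fit(X, clusters)

# Classification table: cluster membership vs. DFA-assigned membership.
print(confusion_matrix(clusters, lda.predict(X)))

# Posterior probability of membership in each cluster. Cases whose largest
# probability is low have ambiguous assignments and are candidates to set
# aside as "ungrouped" before reclassifying.
probs = lda.predict_proba(X)
ambiguous = np.where(probs.max(axis=1) < 0.6)[0]
print("ambiguous cases:", ambiguous)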
In reply to this post by Alina Sheyman
Thanks to all who've responded to my post. You've been incredibly helpful.