Hi all,
I'm trying to figure out which clustering method I should be using for my analysis. So far I've tried both K-means and hierarchical clustering on the same data and have ended up with entirely different clusters. With K-means I got three clusters that are very close in size, whereas with hierarchical clustering almost all the cases ended up in one cluster. Is this even possible? (I'm not entirely clear on how Ward's algorithm works.) Which one should I be using? My dataset has about 64 cases, and 10 variables were used in the clustering.

Any advice would be great.

Alina Sheyman, Family Office Exchange
We use Ward's (1963) minimum-variance method. This is a hierarchical method that groups cases so as to maximize between-group differences and minimize within-group differences (i.e., it optimizes an F statistic). It keeps merging the most similar pair of cases/clusters until there is just one cluster.

Melissa
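For readers who want to see the mechanics outside SPSS, here is a minimal sketch of Ward's agglomerative process in Python/scipy. The 64 x 10 data matrix is simulated purely for illustration; substitute your own variables.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))   # stand-in for the real 64-case, 10-variable data

# Ward's method: at each step, merge the pair of clusters whose union gives
# the smallest increase in the total within-cluster sum of squares.
Z = linkage(X, method="ward")

# Cut the tree at three clusters and inspect the resulting group sizes.
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(labels)[1:])  # fcluster labels start at 1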
In reply to this post by Alina Sheyman
Stephen Brand
www.statisticsdoc.com

Alina,

It is usually very good to examine the results of both K-means and a hierarchical clustering algorithm (and Ward's is a good one). Important considerations in choosing between the results of a Ward's and a K-means solution include interpretability and utility. Here is a very rough guide to how the algorithms work and how to look at the results.

If you ask K-means to form K clusters, it will try to find K centroids that are furthest apart. The process starts with K cases that are furthest apart and forms clusters of the cases that are closest to these initial cases. The location of the centroids does not depend on the clusters found with K+1 clusters, and does not influence the clusters found with K-1 clusters. Hence, K-means is not hierarchical - it tries to differentiate the cases into K clusters regardless of how they were combined in more differentiated cluster structures. It is not constrained by solutions with more clusters.

Hierarchical clustering algorithms start by assigning each case to its own cluster, and then combine clusters. If your sample size is N, the process starts with N clusters. There are a variety of algorithms and criteria for combining clusters; Ward's is one of the more widely used, and useful, hierarchical clustering methods. A key point is that, in hierarchical methods, the solution for K clusters depends on the solution for K+1 clusters, because two of the clusters in the K+1 solution are the ones that are combined. The clustering process operates on the clusters that were found with more clusters.

In some applications, the results of Ward's or another hierarchical method are more interpretable, because the hierarchical structure has some inherent taxonomic meaning. The fact that some clusters are subsumed under other, higher-order clusters is inherently interesting (think speciation). Sometimes, the results of Ward's are just more interpretable in terms of how the cases are grouped on the clustering or other variables.

To evaluate cluster solutions, you may find it helpful to conduct a discriminant function analysis to differentiate the clusters according to the clustering variables. You might also consider running a discriminant function analysis differentiating the clusters according to exogenous variables (ones that were not used to form the clusters, but which should differ meaningfully between them). Do some of the clusters appear to break out interesting and useful patterns of variables? Is there some inherent utility in finding small clusters of cases with atypical patterns of responding, or are you looking for a number of reasonably large clusters?

A whole other topic, of course, is deciding on the number of clusters to select with each method. Again, utility and interpretability are key issues, and discriminant function analysis can be a useful tool.

HTH,

Stephen Brand

For personalized and professional consultation in statistics and research design, visit www.statisticsdoc.com
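To see how differently the two approaches can split the same 64 cases, one quick check (a Python/scikit-learn sketch with simulated, illustrative data) is to run both with the same number of clusters and compare the group sizes:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 10))            # placeholder for the real data
Xz = StandardScaler().fit_transform(X)   # cluster on z scores

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(Xz)
ward = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(Xz)

# How many cases does each method put in each cluster?
print("k-means sizes:", np.bincount(km.labels_))
print("Ward sizes:   ", np.bincount(ward.labels_))

Very different size distributions from the two methods on the same data are entirely possible, especially when the cluster boundaries are fuzzy.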
In reply to this post by Alina Sheyman
Different methods have different strengths and weaknesses. Ward's method tends to give equal-sized clusters, while single linkage (nearest neighbor) tends to chain cases into long strings within a cluster. I think it is best to try several methods and examine the clusters for interpretability.

K-means is sensitive to the starting values. What I do is try several hierarchical methods and see which gives the most interpretable clusters. Then I use k-means (with the hierarchical cluster centroids as starting points) to clean up the hierarchical clusters. This can be necessary because sometimes hierarchical clusters can drift away from their starting point.

Paul R. Swank, Ph.D.
Professor and Director of Research, Children's Learning Institute
University of Texas Health Science Center at Houston
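A sketch of that two-stage approach in Python (scipy + scikit-learn; the data are simulated in place of the real file, and k = 3 is just an example):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(64, 10))    # placeholder data
k = 3

# Step 1: Ward's hierarchical clustering, cut at k clusters.
labels = fcluster(linkage(X, method="ward"), t=k, criterion="maxclust")

# Step 2: use the hierarchical cluster centroids as k-means starting points,
# letting k-means reassign any cases that drifted.
centroids = np.vstack([X[labels == g].mean(axis=0) for g in range(1, k + 1)])
km = KMeans(n_clusters=k, init=centroids, n_init=1).fit(X)

print(np.bincount(km.labels_))   # cleaned-up cluster sizes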
In reply to this post by Alina Sheyman
Alina,
Both methods are applicable, but there are differences between them.

K-means computes a definite number of clusters given by you, so you did not "get" three clusters: you "told" the QUICK CLUSTER procedure to assign the cases to three clusters. Besides, k-means works with interval-level variables, which you should standardize beforehand (i.e., convert into z scores yourself). Also, the results of k-means clustering may be affected by the choice of initial centres for the clusters, i.e. the starting points for the iteration.

On the other hand, hierarchical clustering accepts any kind of variable, provided you choose an adequate measure of proximity, and the CLUSTER procedure then forms successive groupings, from the initial situation of N clusters with one member each to the final situation of one giant cluster with N members. This is achieved in steps. In the first step, the two closest cases are joined into one cluster, resulting in N-1 clusters: one cluster of two members and N-2 clusters of one member. In the second step, a third case is joined either to the two-member cluster or to another solitary case, whichever is closer, resulting in N-2 clusters, and so on. Hierarchical clustering lets you choose which of the N-1 steps gives you the most convenient number of clusters for your analysis.

Hector
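A small sketch of both points - standardizing first, and scanning the agglomeration steps for a convenient cut - in Python, on simulated data used purely for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(64, 10))            # placeholder for the 64 x 10 data

# Standardize before clustering (the analogue of saving z scores in SPSS).
Xz = StandardScaler().fit_transform(X)

# Build the full hierarchy once, then inspect several cut levels and pick the
# number of clusters that is most convenient for the analysis.
Z = linkage(Xz, method="ward")
for k in range(2, 6):
    sizes = np.bincount(fcluster(Z, t=k, criterion="maxclust"))[1:]
    print(k, "clusters ->", sizes)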
In reply to this post by statisticsdoc
At 04:32 PM 12/4/2006, Statisticsdoc wrote:
>It is usually very good to examine the results of both K-means and a
>hierarchical clustering algorithm (and Ward's is a good one)
>[...]
>To evaluate cluster solutions, you may find it helpful to conduct
>discriminant function analysis to differentiate the clusters according
>to the clustering variables.

Well, when displaying ignorance, might as well do it in public.

Stephen, why discriminant function? Without knowing either deeply, I've come to think of logistic regression as usually superior for modelling group differentiation. Among other things, do I recall that discriminant is a little unforgiving of non-normality?
In reply to this post by Alina Sheyman
Alina,
There is no right or wrong approach here, there is just a well-thought-out, logical rationale for one choice over another, plus some basic investigation of your data to explain what is happening. From your outline so far, what I know is that you have ten variables and 64 cases.

Typically I would defer to hierarchical clustering (HC) given your sample size, since this is the only piece of information you have provided in your posting, apart from the disparate findings across the two algorithms (HC and k-means). I have read somewhere that HC produces more stable solutions than k-means with small sample sizes (I cannot remember where); you may be able to find some published peer-reviewed literature to substantiate your choice of one algorithm over another based in part (not the only consideration!) on your sample size. That does not seem to settle things in your case (k-means gives seemingly more balanced solutions), so here are some other things to look at, using SPSS (a small sketch of these checks, outside SPSS, follows this post).

1. Multicollinearity (is this stuffing up your solutions?)

Before you run your clustering process again, run a correlation analysis on your variables and develop a correlation matrix to assess collinearity between them. Variables that are highly collinear (i.e., have high correlations) should be omitted from the analysis unless there are theoretical grounds for keeping them. You could also run a quick-and-dirty PCA on your variables (before I get shot down for running PCA on 64 cases: you are doing this just to see which items load together, looking for general patterns, and not reading too much into the factor-analytic results). Then run and rerun your cluster analyses: develop different solutions (HC and k-means) with all the variables included, then eliminate any collinear variables and rerun your solutions, and see whether this has an impact on the differences between the two solutions. That way you can identify or discount multicollinearity as affecting your solutions.

2. Are your clusters an artefact of the algorithm, and really not "true" clusters? That could explain the disparate results across the two algorithms you used.

Given the way the SPSS clustering procedures work, and their shortcomings, here is a little test to run on your solutions. Depending on how you sort the file and the order of cases, your solutions can vary (oh dear!). Here is what I would do if you have time; it is a quick way to test cluster "reproducibility". Generate a set of random ID variables at the end of your dataset (assign different ID numbers to each case). Sort your dataset by each of these variables (ascending and descending) and then rerun your cluster analyses repeatedly. Save the cluster memberships and then run crosstabs on the different memberships. If your clusters are stable, no matter how you sort the dataset you should see similar membership patterns across the differently sorted solutions. If they are not, you have a clue that the algorithm is not picking up real and reproducible clusters. (In ClustanGraphics, I can seed 5000 k-means solutions and it generates a reproducibility index, based on the Euclidean sum of squares, that tells me that for different random starting points my solution is reproduced 75% of the time.)

3. What else might be causing the different results (some real and actual patterns in the data)? Are the two algorithms tapping different patterns across the variables?

Profile the clusters on the 10 variables (look at means and standard deviations) by running a series of ANOVAs using cluster membership and all ten variables, for both the k-means and hierarchical solutions (make sure you have the same number of clusters). This will give you a picture of which variables your clusters differ on (do not look for significance; look for general patterns here). It may be that the two algorithms are linking your cases differently, and your profiles will give you some idea about whether this is occurring. Look at mean differences and standard deviation sizes. A general rule of thumb is that variables with smaller SDs and large differences between means are better discriminators between clusters (this depends on the algorithm).

I have identified some simple practical things you can do; hope this helps.

Paul
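Not SPSS, but a compact way to try points 1-3 in Python (the variable names, the 0.8 correlation cutoff, and the simulated data frame are all illustrative assumptions):

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(4)
X = pd.DataFrame(rng.normal(size=(64, 10)),
                 columns=[f"v{i}" for i in range(1, 11)])   # illustrative names

# 1. Collinearity: flag pairs of clustering variables with high correlations.
corr = X.corr()
high = [(a, b, round(corr.loc[a, b], 2))
        for i, a in enumerate(corr.columns) for b in corr.columns[i + 1:]
        if abs(corr.loc[a, b]) > 0.8]
print(high)

# 2. Reproducibility: rerun k-means from many random starts and compare the
# memberships (the analogue of re-sorting the file before each SPSS run).
base = KMeans(n_clusters=3, n_init=1, random_state=0).fit_predict(X)
agreement = [adjusted_rand_score(
                 base,
                 KMeans(n_clusters=3, n_init=1, random_state=s).fit_predict(X))
             for s in range(1, 51)]
print(np.mean(agreement))   # near 1.0 = stable; much lower = start-dependent

# 3. Profile: cluster means and standard deviations on each variable.
print(X.groupby(base).agg(["mean", "std"]).round(2))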
In reply to this post by Richard Ristow
I have a very simple question. If you have repeated-measures data but you cannot match up data points across testing occasions, what is the problem with just doing an independent-samples t-test? What do you lose? Is it completely unsound practice, or do you just lose power?

Thanks,
Matt

Matthew Pirritano, Ph.D.
Assistant Professor of Psychology
Smith Hall 116C
Chapman University, Department of Psychology
One University Drive
Orange, CA 92866
Telephone (714) 744-7940
FAX (714) 997-6780
Matt,
When you conduct an independent-samples t-test, you lose power. To the extent that the pre-test and post-test scores are correlated, the paired-samples test is more powerful.

HTH,

Stephen Brand

For personalized and professional consultation in statistics and research design, visit www.statisticsdoc.com
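A quick simulation (Python/scipy sketch; the sample size, effect size, and pre/post correlation are made-up values) shows the power given up by ignoring the pairing when pairing is actually available:

import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(5)
n, effect, rho = 30, 0.5, 0.7   # assumed n, mean change, pre/post correlation

# Simulate correlated pre/post scores with a true mean difference of `effect`.
pre = rng.normal(0.0, 1.0, n)
post = rho * pre + np.sqrt(1 - rho**2) * rng.normal(0.0, 1.0, n) + effect

print("independent-samples p:", ttest_ind(pre, post).pvalue)
print("paired-samples p:     ", ttest_rel(pre, post).pvalue)

The paired test typically gives the smaller p-value because the correlation between occasions shrinks the error term; with unmatchable data you are stuck with the independent-samples test and simply forgo that gain.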
In reply to this post by Alina Sheyman
--On 04 December 2006 15:34 -0500 Alina Sheyman <[hidden email]> wrote:

> I'm trying to figure out what clustering mechanism I should be using for
> my analysis. For now I've tried both K-means and hierarchical clustering
> on the same data and have ended up with entirely different clusters.

As usual in statistics, it is a good idea to think about the data and how it arose - what mechanism in nature generated it. You can approach this by theory, or by examining the data.

Ignoring clustering for the moment, suppose that you are just given two sets of numbers and asked whether their locations differ. Before using a t-test, you look at the data and consider whether it is normally distributed. If all of the numbers are whole numbers and, after examining the distribution, you decide that the data probably arose from a Poisson process rather than from a normal distribution, you select your analysis accordingly. On the other hand, if you know that the numbers came from counting cars driven past a fixed point, this would give you some clues about the appropriate analysis.

Now for clustering. Different clustering methods try to find different kinds of clusters. What kind of clusters should you be looking for?

Hierarchical clustering is the sort that you might apply when there is a "tree" structure to the data. Think of the classification of living things. At the top, all of them, then splitting into plants, animals and other things such as fungi. Once you are on the animal branch, this splits into mammals, reptiles, etc., and you can keep going until you get down to individual species. AT NO TIME, once things have been split off from the rest of the data onto one of the branches, do subsets ever move to other branches. You might think about whether this is appropriate for your data. Once you have split your data up into two sets, this split is final, and the process only subdivides further - nothing from set one ever moves back into set two.

K-means clustering does not assume a tree structure. In its pure form you might ask the computer to split the data values into three groups or four groups, but you can't guarantee that merging two groups from the four-group solution will produce the three-group solution.

If you have only two or three dimensions (or can sensibly reduce your data by factor analysis) you can plot the data and see what sort of relationships you have. Are you looking for nice spherical clusters, or are long chains more suitable? You might consider that your data values were generated from multivariate normal random variables from groups with different means, and you might consider how best to identify these groups and their means. Sometimes data values fall into such clear groups that almost all clustering methods will find the same clusters. Where the boundaries are fuzzy, the solutions may be very different.

I'll end with a little parable. Suppose I have a very willing idiot working for me, and I ask him to arrange my books nicely. He might do this by author, or by subject, or by the colour of the cover, or the size of the book, or by weight, or by date of publication. If I simply ask for a "nice arrangement" I ought not to complain about any of these, and I might find one or more useful. If you just ask SPSS to use cluster analysis to produce a "nice arrangement" then, according to the method chosen, the order of the data and a possible random element, you might get one of many rather different nice arrangements, and the "best" of these depends on what you want the clustering for.

David Hitchin
In reply to this post by Richard Ristow
Stephen Brand
www.statisticsdoc.com

Richard,

In my experience, discriminant functions usually provide a useful framework for differentiating clusters. DFA gives a sense of the dimensions that differentiate the clusters, where the clusters are located within this framework, and how the variables are associated with the dimensions. (It is also possible to look at the unique contributions of the variables to the dimensions, but in many instances the variables you cluster on will have a fairly high degree of collinearity, so be careful.) For example, when looking at health and adjustment data, it can be useful to speak of a function that relates to substance-use items, another that relates to emotional adjustment, and a third that relates to academic adjustment, and to consider where the centroids of each cluster fall on each function (with and without rotation).

You raise an interesting point - a multinomial logistic regression would be another potentially useful way to relate the clustering variables to cluster membership. This method could also be used to relate exogenous variables to cluster membership, particularly when the set of predictor variables has relatively low collinearity. The regression approach will give you more information about the variables that make a unique contribution to the likelihood of membership in one cluster versus the alternatives. I see discriminant function analysis as perhaps more helpful for developing a dimensional framework.

Best,

Stephen Brand

For personalized and professional consultation in statistics and research design, visit www.statisticsdoc.com
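Once cluster membership has been saved, the multinomial-logistic option is easy to try; a hedged Python/scikit-learn sketch (simulated data, three clusters assumed purely for illustration):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.normal(size=(64, 10))               # placeholder clustering variables
Xz = StandardScaler().fit_transform(X)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xz)

# Multinomial logistic regression of cluster membership on the clustering
# variables: which variables make a unique contribution to the odds of
# belonging to each cluster?
logit = LogisticRegression(max_iter=1000).fit(Xz, clusters)
print(np.round(logit.coef_, 2))             # one row of coefficients per cluster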
Previous responders have made several very good points.
In addition: a major reason to do clustering is data reduction. Clustering creates a new nominal-level variable that can be used in any further analysis.

Iterative use of discriminant analysis provides other outputs that are useful in refining and interpreting a cluster solution suggested by the cluster procedures. The classification phase has a table of cluster memberships by the memberships that would be assigned by the DFA. DFA can also save the probability of membership for each case in each of the clusters, and the probability that a case would be as far from the centroid of the cluster it is assigned to as it is. Iteratively treating cases with ambiguous cluster assignment, or with extreme distance from their centroids, as "ungrouped" in the classification phase can be very useful in reaching a working solution.

Logistic regression produces predicted scores that can only take on the values of the raw variable, which can be very useful. DFA creates continuous variables along which the cases are arrayed; assigning membership is then analogous to making cuts on those dimensions. The farther a case is from the cutpoint, the more strongly it is a member of the group to which it is assigned.

Art Kendall
Social Research Consultants
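A minimal sketch of that iterative check in Python/scikit-learn (simulated data; the 0.6 probability cutoff is an arbitrary illustration, not a rule):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(7)
X = rng.normal(size=(64, 10))                 # placeholder clustering variables
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

lda = LinearDiscriminantAnalysis().fit(X, clusters)

# Classification table: cluster membership vs. DFA-assigned membership.
print(confusion_matrix(clusters, lda.predict(X)))

# Posterior probability of membership in each cluster. Cases whose largest
# probability is low have ambiguous assignments and are candidates to set
# aside as "ungrouped" before reclassifying.
probs = lda.predict_proba(X)
ambiguous = np.where(probs.max(axis=1) < 0.6)[0]
print("ambiguous cases:", ambiguous)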
In reply to this post by Alina Sheyman
Thanks to all who've responded to my post. You've been incredibly helpful.