SPSSX Discussion

K-means Cluster Analysis in SPSS: Why it produces different solutions?


Juanito Talili

K-means Cluster Analysis in SPSS: Why it produces different solutions?

Hello Everyone,
 
Have you encountered this problem with the K-means cluster analysis in SPSS? The scenario goes like this. We ran the k-means cluster analysis on seven variables (x1, x2, ..., x7) with n=2540. We then ran it again on the same data (seven variables and n=2540), but before the second run the cases were sorted by one of the variables (say, x1). This means that the k-means cluster analysis was run twice on the same dataset; the only difference is that the cases were sorted before the second run. What puzzled us is that SPSS provides two different solutions! For example, the first run has two cluster solutions while the second provides three solutions. Out of curiosity, we ran the procedure several more times and found that a different solution comes out for every run. Moreover, the same pattern occurred when we tried two other procedures, TwoStep and hierarchical clustering, on the same dataset. Our question is: why is this so? We believe that only the order of the cases changed from run to run because of the sorting, and sorting the cases does not distort the dataset, right?
 
Thank you.
J.Talili

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Reutter, Alex

Re: K-means Cluster Analysis in SPSS: Why it produces different solutions?

From the help (K-Means Cluster Analysis Data Considerations):
"
Case and initial cluster center order. The default algorithm for choosing initial cluster centers is not invariant to case ordering. The Use running means option in the Iterate dialog box makes the resulting solution potentially dependent on case order, regardless of how initial cluster centers are chosen. If you are using either of these methods, you may want to obtain several different solutions with cases sorted in different random orders to verify the stability of a given solution. Specifying initial cluster centers and not using the Use running means option will avoid issues related to case order. However, ordering of the initial cluster centers may affect the solution if there are tied distances from cases to cluster centers. To assess the stability of a given solution, you can compare results from analyses with different permutations of the initial center values.
"

The K-Means (QUICK CLUSTER) algorithm documentation, also in the help (from the menus, Help > Topics; then Algorithms > QUICK CLUSTER Algorithms in the Contents pane), has further details.
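
To make the case-order point concrete, here is a minimal Python sketch. It is not SPSS's QUICK CLUSTER algorithm: it is a plain Lloyd-style k-means, and it assumes (purely for illustration) that the "default" initial centers are simply the first k cases in the file. The data, the function names, and the fixed centers are all made up for the example.

```python
# Toy illustration only: plain Lloyd-style k-means, NOT SPSS's QUICK CLUSTER.
# Assumption: "default" initial centers are simply the first k cases.

def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(pts)
    return tuple(sum(col) / n for col in zip(*pts))

def kmeans(points, centers, max_iter=100):
    """Alternate assignment and center updates until labels stop changing."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = mean_point(members)
    return labels, centers

def partition(points, labels):
    """Cluster membership as a set of sets, so cluster numbering is ignored."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, set()).add(p)
    return frozenset(frozenset(g) for g in groups.values())

# Nine one-variable cases forming three obvious groups, in "file" order.
cases = [(0.0,), (10.0,), (20.0,), (1.0,), (11.0,), (21.0,),
         (2.0,), (12.0,), (22.0,)]
cases_sorted = sorted(cases)
k = 3

# Initial centers taken from the first k cases: depends on case order.
p_file = partition(cases, kmeans(cases, cases[:k])[0])
p_sorted = partition(cases_sorted, kmeans(cases_sorted, cases_sorted[:k])[0])
print(p_file == p_sorted)   # False

# Explicitly specified initial centers: same partition in either order.
fixed = [(0.0,), (10.0,), (20.0,)]
q_file = partition(cases, kmeans(cases, fixed)[0])
q_sorted = partition(cases_sorted, kmeans(cases_sorted, fixed)[0])
print(q_file == q_sorted)   # True
```

In the sorted file the first three cases all sit in the same group, so the toy algorithm gets stuck in a poor local solution ({0}, {1, 2}, everything else). With specified initial centers the case order no longer matters, which is the behaviour the help text describes.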

Cheers,
Alex

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Juanito Talili
Sent: Thursday, May 29, 2008 1:52 AM
To: [hidden email]
Subject: K-means Cluster Analysis in SPSS: Why it produces different solutions?


Wim Beyers

Re: K-means Cluster Analysis in SPSS: Why it produces different solutions?

Hi clustering fans,

K-means is a great top-down iterative clustering procedure (particularly
for large samples). It has one big problem, however: it needs good starting
values. To avoid the problem of bad starting values (which lead to unstable
solutions), you can do several things:
(1) run, say, 15 k-means analyses with different orderings of your data, and
see which solution is most stable (this is what Alex from SPSS suggested)
(2) better: get good starting values, e.g., the final centers from a
hierarchical (bottom-up) clustering procedure (e.g., obtained by Ward's
method if you're using social science data). References for the latter
excellent approach to clustering (which is not the same as the two-step
clustering in SPSS!!!; the latter has the same problem as k-means) can be
found in:

Gore, P. A. Jr. (2000). Cluster analysis. In H. E. A. Tinsley & S. D. Brown
(Eds.), Handbook of applied multivariate statistics and mathematical
modeling (pp. 297-321). San Diego, CA: Academic Press.

Hair, J. F., Anderson, R. E., Tatham, R. L., & Black, W. C. (1998).
Multivariate data analysis. Upper Saddle River, NJ: Prentice Hall.

Tan, P. N., Steinbach, M., & Kumar, V. (2006). Introduction to data mining.
Boston, MA: Addison-Wesley. (Chapter 8 in this book: Cluster Analysis: Basic
Concepts and Algorithms, freely available on the web at
http://www-users.cs.umn.edu/~kumar/dmbook/index.php).

If you're not convinced, this approach (a combination of hierarchical and
k-means clustering) is what is used by default in the exact sciences in which
clustering is often used (e.g., biology). They have their own software for
it. For instance, check out http://biodiver.bio.ub.es/ginkgo/Ginkgo.htm,
with many thanks to the people from the Unitat de Botànica at the University
of Barcelona, Spain, for the many years of use of their great software!
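
For readers who want to see the idea in (2) outside SPSS, here is a rough Python sketch: a naive Ward-style agglomeration provides the starting centers, and a plain Lloyd-style k-means then refines them. This is only a toy under stated assumptions: the function names and the nine-case dataset are invented for illustration, the Ward step is a brute-force loop suitable for small data only, and none of it reproduces the SPSS procedures.

```python
# Toy sketch: Ward-style hierarchical clustering supplies k-means start values.
# Brute-force and for small data only; not the SPSS CLUSTER / QUICK CLUSTER code.

def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(pts)
    return tuple(sum(col) / n for col in zip(*pts))

def ward_centers(points, k):
    """Merge the pair of clusters with the smallest increase in within-cluster
    sum of squares until k clusters remain; return their centroids."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                # Ward's criterion: increase in error sum of squares on merging
                cost = (len(a) * len(b) / (len(a) + len(b))
                        * sq_dist(mean_point(a), mean_point(b)))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [mean_point(c) for c in clusters]

def kmeans(points, centers, max_iter=100):
    """Lloyd-style k-means started from the given centers."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = mean_point(members)
    return labels, centers

# Hypothetical one-variable cases, sorted (a bad order for "first k" starts).
cases = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,),
         (20.0,), (21.0,), (22.0,)]

start = ward_centers(cases, 3)
labels, final_centers = kmeans(cases, start)
print(sorted(start))
print(labels)
```

Because the starting centers come from the data structure rather than from whichever cases happen to be first in the file, the k-means step here begins at the three group centroids and has nothing left to fix.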

Best,

---

Dr. Wim Beyers

Dept. of Developmental, Personality and Social Psychology

Ghent University

Henri Dunantlaan 2

9000 Gent

Belgium

www.vopspsy.ugent.be

----- Original Message -----
From: "Reutter, Alex" <[hidden email]>
To: <[hidden email]>
Sent: Thursday, May 29, 2008 1:23 PM
Subject: Re: K-means Cluster Analysis in SPSS: Why it produces different
solutions?


Art Kendall

Re: K-means Cluster Analysis in SPSS: Why it produces different solutions?

Any final clustering solution should use a consensus from several
clusterings: varying distance measures and algorithms for the
hierarchical approaches, and varying case order for TwoStep, k-means, and
k-medians for the single-slice solutions.
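
One simple way to quantify how well two clusterings agree is the Rand index: the share of case pairs that are either together in both solutions or apart in both. The Python sketch below is only an illustration; the function name and the three label lists are hypothetical, and it assumes the labels from each run have been matched back to the same cases in the same order (e.g., by a case ID) before comparing.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Share of case pairs on which two clusterings agree (together in both
    or apart in both). Cluster numbering does not matter."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same cases")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total

# Hypothetical cluster memberships for the same nine cases from three runs.
run_1 = [0, 0, 0, 1, 1, 1, 2, 2, 2]
run_2 = [1, 1, 1, 0, 0, 0, 2, 2, 2]   # same grouping, clusters renumbered
run_3 = [0, 1, 1, 2, 2, 2, 2, 2, 2]   # a genuinely different grouping

print(rand_index(run_1, run_2))   # 1.0
print(rand_index(run_1, run_3))   # 25 of 36 pairs agree, about 0.69
```

Values near 1 across many reorderings or methods suggest a stable solution; low values flag the cases and clusters that deserve a closer look.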

However you derive centers, the overall process applies results
from a subset to a larger set.
That is likely to result in some uncertainty in the assignment of the cases
that were not in the "training" set. The two steps in TWOSTEP are: 1) derive
a hierarchical tree, and 2) apply it to the other cases.
Some advantages of TwoStep are:
it provides info on where to slice the tree
it can cluster categorical and/or continuous variables
it has provisions for adjusting for outliers while growing the tree
it provides info on variable importance.

Art Kendall
Social Research Consultants
