: Cluster Analysis - Seeds needed for K-Means

Mike P-5
Hi Aaron,

In answer to your first question,

Yes, the matrix does hold the within-cluster mean of each of your
cluster variates.
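
As a rough illustration of what that matrix holds, here is a minimal
Python sketch (not SPSS syntax) that computes within-cluster means from a
saved cluster membership. The data values and the function name
`cluster_means` are made up for the example; only the `cluster_`, `var1`,
`var2` column layout comes from this thread.

```python
from collections import defaultdict

def cluster_means(rows, labels):
    """Return {cluster: [mean of each variate]} for rows grouped by label."""
    sums = {}
    counts = defaultdict(int)
    for row, lab in zip(rows, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(row)
        for j, value in enumerate(row):
            sums[lab][j] += value
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sorted(sums)}

# Hypothetical data: 6 cases, 2 variates, membership from a 3-cluster solution.
cases = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0],
         [12.0, 14.0], [20.0, 1.0], [22.0, 3.0]]
membership = [1, 1, 2, 2, 3, 3]
centres = cluster_means(cases, membership)

# Print in the layout of the seed file: one row per cluster.
print("cluster_", "var1", "var2")
for lab, means in centres.items():
    print(lab, *means)
```

Each printed row is one cluster centre, which is the shape of file the
K-means step expects as its starting centres.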

The reason for running on a small sample is time and memory
constraints. Anything over 1000 cases will take a long time to run using
Hierarchical CA. Before two-step cluster analysis was available, the
technique that you're trying to use was a way to allow 'potentially'
better solutions to the clustering problem.

Remember, as Hector said yesterday:

"K-means uses only Euclidean distances, whereas Hierarchical Clustering
uses a full array of distance measures. A solution that seems adequate
with some fancy distance function may lead to nonsense, or at least to
some surprising results, when applied to K-means with Euclidean
distances."
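
A small Python sketch (not SPSS syntax) of why the distance measure
matters: the same case can be nearest to different centres under Euclidean
and city-block distance. The points are invented purely to show the
effect, and city-block is just one example of an alternative measure.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.dist(p, q)

def city_block(p, q):
    """Sum of absolute differences (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical case and two candidate cluster centres.
case = (0.0, 0.0)
centres = {"A": (3.0, 3.0), "B": (5.0, 0.0)}

nearest_euclid = min(centres, key=lambda k: euclidean(case, centres[k]))
nearest_block = min(centres, key=lambda k: city_block(case, centres[k]))
print(nearest_euclid, nearest_block)  # A B
```

Here the case is closest to A by Euclidean distance (about 4.24 against 5)
but closest to B by city-block distance (5 against 6), so a solution built
with one measure need not carry over cleanly to the other.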

So if your dataset has, say, 100,000 cases, run the hierarchical
analysis on 1000 of them to form the cluster centres, and then use these
centres as the starting point for the quicker k-means method on the
whole dataset.
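
To make the idea concrete, here is a minimal Python sketch (not SPSS
syntax) of k-means started from supplied centres. It is a plain batch
k-means (Lloyd's algorithm) and may differ in detail from what SPSS does;
the seed values stand in for centres obtained from a hierarchical run on a
sample, and all data and names are hypothetical.

```python
import math

def nearest(row, centres):
    """Index of the centre closest to row by Euclidean distance."""
    return min(range(len(centres)), key=lambda k: math.dist(row, centres[k]))

def kmeans_from_seeds(rows, seeds, max_iter=10):
    """Batch k-means that starts from the given seed centres."""
    centres = [list(c) for c in seeds]
    for _ in range(max_iter):
        assign = [nearest(r, centres) for r in rows]
        new_centres = []
        for k in range(len(centres)):
            members = [r for r, a in zip(rows, assign) if a == k]
            if members:
                new_centres.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                new_centres.append(centres[k])  # keep an empty cluster's centre
        if new_centres == centres:
            break  # centres stopped moving
        centres = new_centres
    # Final membership, consistent with the returned centres.
    assign = [nearest(r, centres) for r in rows]
    return centres, assign

# Hypothetical "whole data set" and seeds taken from a small-sample solution.
full_data = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
             [9.0, 9.0], [10.0, 9.0], [9.0, 10.0],
             [20.0, 1.0], [21.0, 2.0], [20.0, 2.0]]
seeds = [[1.5, 1.5], [9.5, 9.5], [20.5, 1.5]]

centres, assign = kmeans_from_seeds(full_data, seeds)
print(assign)
```

Because the seeds already sit close to the final centres, the loop
settles after very few passes, which is the point of seeding from the
small-sample solution.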

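For your second point, there is no need to run the analysis twice. The
second set of steps (classify only) applies only if you later want to
assign new cases using the saved centres.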
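
As I read the "classify only" option, it assigns each new case to the
nearest saved centre without moving the centres. A minimal Python sketch
(not SPSS syntax) under that assumption, with made-up centres and cases
and a hypothetical function name:

```python
import math

def classify_only(rows, centres):
    """Assign each row to the nearest fixed centre; centres are never updated."""
    labels = sorted(centres)
    return [min(labels, key=lambda lab: math.dist(row, centres[lab]))
            for row in rows]

# Hypothetical saved final centres, keyed by cluster number.
saved_centres = {1: [2.0, 3.0], 2: [11.0, 12.0], 3: [21.0, 2.0]}
# Hypothetical new cases that were not part of the original analysis.
new_cases = [[2.5, 2.5], [19.0, 4.0], [10.0, 13.0]]

print(classify_only(new_cases, saved_centres))  # [1, 3, 2]
```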

Remember:
"These steps allow you to do your clustering analysis, although any
results that you generate are for your interpretation, use them at your
own risk!"

HTH

Mike

-----Original Message-----
From: Aaron Eakman [mailto:[hidden email]]
Sent: 09 August 2006 08:37
To: [hidden email]; Michael Pearmain
Cc: Aaron Eakman
Subject: Re: Cluster Analysis - Seeds needed for K-Means

Mike,

Would my cluster .sav file have as row labels 1, 2, 3, given that I had
identified a three-cluster solution in the hierarchical approach, and
would the column labels be cluster_, var1, var2, ... varX (with "varX"
representing my cluster variates)?

If so, might the values in this matrix that I would submit to a K-means
approach be the mean (average) within-cluster values of the varX
cluster variates derived from my hierarchical approach?  As an FYI, my
cluster variates are all on the same ratio scale.

Finally, (1) why would I run the hierarchical approach on a small sample
of my total sample rather than on the total sample, and (2) why would I
need to run the K-means twice rather than just once?

Thanks much for your reply,

Aaron



On Tue, 8 Aug 2006 09:21:17 +0100, Michael Pearmain
<[hidden email]> wrote:

>Morning Aaron,
>
>Try the following steps
>
>*       Steps:
>1.      Run a Hierarchical Cluster analysis on a small sample
>2.      Choose a solution
>3.      Aggregate the variables used in the Cluster Analysis according
>to the cluster variable
>
>**Change the names of the variables in the aggregated file to be the
>same as the original names
>
>4.      Name the first variable 'cluster_' in the aggregated file
>5.      The aggregated file will be used as centres in the K-Means
>procedure
>6.      Use the aggregated file as centres when running a K-means on
>the whole data set
>*       Clustering new cases using a previous cluster analysis
>o       Save the final centre points.
>o       Use them as centres for the new file
>o       Choose as method: classify only
>HTH
>
>Mike
>
>
>-----Original Message-----
>From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf
>Of Aaron Eakman
>Sent: 07 August 2006 18:50
>To: [hidden email]
>Subject: Cluster Analysis - Seeds needed for K-Means
>
>I am using SPSS 12 for my clustering procedures.  I started with
>hierarchical clustering using Ward's method with squared Euclidean
>distance.  I have identified a three-cluster solution as the best
>option from a possible range of 2-4 that I established a priori.
>
>Here is my problem: I next want to run a K-means clustering procedure.
>More specifically, I want to use the centroids of the three clusters
>from my hierarchical procedure as "seed" or starting values for the
>K-means clustering procedure.  Unfortunately, SPSS does not generate
>this output from the hierarchical procedure.  And I do not know 1) how
>to generate cluster centroids from the cluster assignment information
>provided by the SPSS hierarchical procedure, and 2) even if I did, I do
>not know how to generate an SPSS .sav file with that information for
>use by the K-means approach.  A further problem: I am a point and
>clicker and not savvy with command syntax; I AM WILLING TO LEARN IF IT
>CAN GET ME OUT OF MY MESS!!
>
>Any persons that are SPSS  - Cluster Analysis savvy, or know others
>that might lend a hand would be met with gratitude for any assistance.
>
>Take care,
>
>Aaron Eakman
>
>_______________________________________________________________________
>_ This e-mail has been scanned for all viruses by Star. The service is
>powered by MessageLabs. For more information on a proactive anti-virus
>service working around the clock, around the globe, visit:
>http://www.star.net.uk
>_______________________________________________________________________
>_
>
>______________________________________________________________________
>This email has been scanned by the MessageLabs Email Security System.
>For more information please visit http://www.messagelabs.com/email
>______________________________________________________________________
