Re: Cluster Analysis - Seeds needed for K-Means

Re: Cluster Analysis - Seeds needed for K-Means

sabine.kleinsasser
Hello Aaron and Hector

I am in the same situation you were in some months ago. I have run my
Ward's analysis and now want to use its results for a k-means analysis. I
have already created a new file containing cluster_ and the means (from
Ward) of the variables (a1-a22). But whenever I run k-means I get a warning,
and SPSS cannot run the analysis. So I would like to ask: how did you solve
your problem?

To make my problem clearer: I have 22 variables and 4 clusters generated by
Ward's method. Ward created the clusters and saved in the SPSS working file
which case belongs to which cluster. I then calculated the mean of each
variable within each cluster and used these numbers as cluster centers to
run k-means. But I cannot run it; I always get a warning/error!
Thanks so much!
Greetings, Sabine
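
For readers following along, the centroid step described above can be
sketched outside SPSS. This is a minimal Python illustration, not SPSS
syntax: the data, the two variable names a1 and a2 (standing in for a1-a22),
and the helper cluster_means are all hypothetical. It averages each variable
within each Ward cluster and prints one row per cluster with cluster_ first,
which is the layout the aggregated seed file in this thread is meant to have.

```python
from collections import defaultdict


def cluster_means(rows, assignments):
    """Return {cluster label: list of per-variable means} for the given cases."""
    sums = {}
    counts = defaultdict(int)
    for row, label in zip(rows, assignments):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        for i, value in enumerate(row):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [total / counts[label] for total in sums[label]]
        for label in sorted(sums)
    }


# Hypothetical data: 6 cases, 2 variables, cluster membership saved by Ward
rows = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0],
        [12.0, 22.0], [2.0, 3.0], [11.0, 21.0]]
ward = [1, 1, 2, 2, 1, 2]

centers = cluster_means(rows, ward)
for label, means in centers.items():
    # One seed row per cluster: cluster_ first, then the original variable names
    print({"cluster_": label, "a1": means[0], "a2": means[1]})
```

If the seeds built this way are rejected, it may be worth comparing the
variable names and the measurement scale of the seed rows against the
variables actually passed to k-means.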


On Wed, 9 Aug 2006 11:01:56 -0300, Hector Maletta <[hidden email]>
wrote:

>Aaron,
>I did not imply that specifically in your case the results would be
>misleading, but rather to state a general principle. In your case you are
>using Euclidean distance (or squared Euclidean distance) in both procedures,
>and this (IMHO) creates no problem. If an object C is at distances 4 and 5
>from two other objects A and B, and is therefore closer to A, it will still
>be closer to A if any monotonically increasing transformation of the
>distance is used, such as the squared distance (16 and 25 in this example).
>Therefore if C is assigned to cluster A using squared Euclidean distance,
>it would also be assigned to cluster A by simple Euclidean distance,
>especially if the METHOD in CLUSTER is chosen in a sensible way.
>In your case you used WARD's method, which is suitable for this situation.
>However, the METHOD most similar to the one used in k-means is the CENTROID
>method. I do not think this would have any implication in your case, but
>remember that k-means assigns a case to one cluster or another depending on
>the distance of the case to their respective centroids.
>On the other hand CLUSTER admits a large variety of distance or similarity
>specifications, and various methods, some of which may lead to different
>results than Euclidean distance and Ward method, and therefore results found
>with some of these (I called them "fancy") distance measures (and methods)
>may lead to odd results when combined with k-means.
>So in your case I guess you may forget about my comment, which was only
>intended as general advice.
>
>Hector
>
>
>
>-----Original Message-----
>From: Aaron Eakman [mailto:[hidden email]]
>Sent: Wednesday, August 09, 2006 4:02 AM
>To: [hidden email]; Hector Maletta
>CC: Aaron Eakman
>Subject: Re: Cluster Analysis - Seeds needed for K-Means
>
>I would appreciate a further discussion of your statement:
>
>"A solution that seems adequate with some fancy distance function may lead
>to nonsense, or at least to some surprising results, when applied to K-
>means with Euclidean distances."
>
>First(1), are you suggesting that the squared Euclidean distance used in
>the hierarchical clustering (Ward's method) I reported, and the Euclidean
>distance which I intend to employ in K-means will have substantial
>differences in cluster resolution?  I do understand, in general, how
>hierarchical differs from K-means clustering.
>
>If the answer would be "yes" to (1)... I would ask...(2) If I were to run
>my clustering, comparing Euclidean to squared Euclidean in hierarchical
>clustering (Ward's method) in SPSS should I expect substantial differences
>in cluster solutions when reviewing the dendrograms?
>
>If the answer to (1) were "no", could you please let me know what you were
>referring to...
>
>If the answer to (2) were "yes", would you recommend that I employ
>Euclidean (rather than squared Euclidean) for the hierarchical analyses in
>my intended progression of : hierarchical Ward's (HW) clustering -to- use
>of HW cluster centers as seeds for K-means clustering?
>
>And if yes, a very brief explanation as to why you believe this...
>
>Thank you very much in advance for your reply.
>
>
>On Tue, 8 Aug 2006 11:08:05 -0300, Hector Maletta
><[hidden email]> wrote:
>
>>One comment: K-means uses only Euclidean distances, whereas Hierarchical
>>Clustering uses a full array of distance measures. A solution that seems
>>adequate with some fancy distance function may lead to nonsense, or at
>>least to some surprising results, when applied to K-means with Euclidean
>>distances.
>>Hector
>>
>>-----Original Message-----
>>From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
>>Michael Pearmain
>>Sent: Tuesday, August 08, 2006 5:21 AM
>>To: [hidden email]
>>Subject: Re: Cluster Analysis - Seeds needed for K-Means
>>
>>Morning Aaron,
>>
>>Try the following steps
>>
>>*  Steps:
>>1. Run a hierarchical cluster analysis on a small sample.
>>2. Choose a solution.
>>3. Aggregate the variables used in the cluster analysis by the cluster
>>   variable. (Rename the variables in the aggregated file so that they
>>   have the same names as the original variables.)
>>4. Name the first variable 'cluster_' in the aggregated file.
>>5. The aggregated file will be used as the centres in the K-Means
>>   procedure.
>>6. Use the aggregated file as centres when running K-means on the
>>   whole data set.
>>*  Clustering new cases using a previous cluster analysis:
>>   o  Save the final centre points.
>>   o  Use them as centres for the new file.
>>   o  Choose as method: classify only.
>>HTH
>>
>>Mike
>>
>>
>>-----Original Message-----
>>From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
>>Aaron Eakman
>>Sent: 07 August 2006 18:50
>>To: [hidden email]
>>Subject: Cluster Analysis - Seeds needed for K-Means
>>
>>I am using SPSS 12 for my clustering procedures.  I started with
>>hierarchical clustering using Ward's method with squared Euclidean
>>distance.  I have identified a three-cluster solution as the best option
>>from a possible range of 2-4 that I established a priori.
>>
>>Here is my problem: I want to next run a K-means clustering procedure.
>>More specifically, I want to use the centroids of the three clusters
>>from my hierarchical procedure as "seed" or starting values for the
>>K-means clustering procedure.  Unfortunately, SPSS does not generate
>>this output from the hierarchical procedure.  And I do not know 1) how
>>to generate cluster centroids from the cluster assignment information
>>provided by the SPSS hierarchical procedure, and 2) even if I did, I do
>>not know how to generate an SPSS .sav file with that information for use
>>by the K-means approach.  A further problem: I am a point-and-clicker
>>and not savvy with command syntax; I AM WILLING TO LEARN IF IT CAN GET
>>ME OUT OF MY MESS!!
>>
>>Any persons that are SPSS  - Cluster Analysis savvy, or know others that
>>might lend a hand would be met with gratitude for any assistance.
>>
>>Take care,
>>
>>Aaron Eakman
>>

Re: Cluster Analysis - Seeds needed for K-Means

Hector Maletta
I cannot fathom what is going on in your analysis. However, one possible reason may be related to the use of z-scores. Perhaps you are giving SPSS the variables in standardized form (z-scores, with mean zero and unit standard deviation, as is usual for k-means) while providing initial cluster means computed from the raw values, which would be out of range for the standardized variables.
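
To illustrate the mismatch, here is a small sketch in Python rather than SPSS, with made-up numbers for a single variable and hypothetical names throughout (raw, ward, centroid). It assumes the z-scores are computed with the sample standard deviation. Seeds computed from raw values fall far outside the range of the standardized variable, while seeds computed from the z-scores fall inside it.

```python
from statistics import mean, stdev

# Hypothetical raw values of one variable and the Ward cluster of each case
raw = [150.0, 160.0, 170.0, 180.0, 190.0]
ward = [1, 1, 1, 2, 2]

m, s = mean(raw), stdev(raw)      # sample standard deviation (n - 1), an assumption
z = [(v - m) / s for v in raw]    # the standardized variable given to k-means


def centroid(values, labels, k):
    """Mean of the values whose cluster label equals k."""
    return mean([v for v, lab in zip(values, labels) if lab == k])


raw_seeds = {k: centroid(raw, ward, k) for k in (1, 2)}
z_seeds = {k: centroid(z, ward, k) for k in (1, 2)}

print("range of z:", min(z), max(z))
print("raw seeds :", raw_seeds)   # far outside the range of z
print("z seeds   :", z_seeds)     # inside the range of z

# Raw seeds can be converted using the same mean and standard deviation
converted = {k: (c - m) / s for k, c in raw_seeds.items()}
```

Because the mean is linear, converting the raw seeds with the same mean and standard deviation gives the same numbers as averaging the z-scores directly, so either route should yield consistent seeds.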

When using k-means, z-scores are convenient because otherwise the variables with larger absolute values will have greater influence on the clustering. For example, if a variable originally expressed in kilometers is converted into meters, its values become one thousand times larger, and its influence on the results grows correspondingly. The choice of units of measurement would thus affect the results. To avoid this, it is ordinarily recommended that variables be standardized first. But in that case the initial cluster centroids must be expressed in terms of the standardized variables, not the original units of measurement. I suggest you check whether you are being consistent in this matter before searching further for explanations. SPSS may not be able to proceed if the initial cluster centers lie outside the range of variation of your variables.
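
The effect of units can be seen in a toy example. The sketch below is Python, not SPSS, and the case, the two centers, and the function names are invented for illustration. The same case is assigned to a different center depending on whether the first variable is in kilometers or meters, whereas z-scores of that variable are identical in both units.

```python
from math import dist
from statistics import mean, stdev


def nearest(case, centers):
    """Label of the center with the smallest Euclidean distance to the case."""
    return min(centers, key=lambda label: dist(case, centers[label]))


def to_meters(point):
    """Convert the first coordinate from kilometers to meters."""
    return (point[0] * 1000.0, point[1])


# Hypothetical case and centers: (distance in km, some score)
case_km = (1.0, 50.0)
centers_km = {"A": (1.5, 20.0), "B": (4.0, 48.0)}

case_m = to_meters(case_km)
centers_m = {label: to_meters(c) for label, c in centers_km.items()}

print(nearest(case_km, centers_km))  # B: the score dominates the distance
print(nearest(case_m, centers_m))    # A: the distance in meters dominates


def zscores(values):
    """Standardize to mean 0 and unit sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


km = [1.0, 1.5, 4.0]
meters = [v * 1000.0 for v in km]
# zscores(km) and zscores(meters) agree up to rounding error
```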

Hector

----- Original Message -----
From: [hidden email]
Date: Tuesday, April 24, 2007 1:32 pm
Subject: Re: Cluster Analysis - Seeds needed for K-Means

> Hello Aaron and Hector
>
> I am in the same situation you were in some months ago. I have run my
> Ward's analysis and now want to use its results for a k-means analysis.
> I have already created a new file containing cluster_ and the means
> (from Ward) of the variables (a1-a22). But whenever I run k-means I get
> a warning, and SPSS cannot run the analysis. So I would like to ask: how
> did you solve your problem?
>
> To make my problem clearer: I have 22 variables and 4 clusters generated
> by Ward's method. Ward created the clusters and saved in the SPSS
> working file which case belongs to which cluster. I then calculated the
> mean of each variable within each cluster and used these numbers as
> cluster centers to run k-means. But I cannot run it; I always get a
> warning/error!
> Thanks so much!
> Greetings, Sabine