Cluster Analysis - Seeds needed for K-Means


Cluster Analysis - Seeds needed for K-Means

Aaron Eakman
I am using SPSS 12 for my clustering procedures.  I started with
hierarchical clustering using Ward's method with squared Euclidean
distance.  I have identified a three-cluster solution as the best option
from a possible range of 2-4 that I established a priori.

Here is my problem: I next want to run a K-means clustering procedure.
More specifically, I want to use the centroids of the three clusters from
my hierarchical procedure as "seed" or starting values for the K-means
clustering procedure.  Unfortunately, SPSS does not generate this output
from the hierarchical procedure.  I do not know (1) how to generate
cluster centroids from the cluster assignment information provided by the
SPSS hierarchical procedure, and (2) even if I did, how
to generate an SPSS .sav file with that information for use by the K-means
approach.  A further problem: I am a point-and-clicker and not savvy with
command syntax; I AM WILLING TO LEARN IF IT CAN GET ME OUT OF MY MESS!!

Anyone who is savvy with SPSS cluster analysis, or who knows others that
might lend a hand, would be met with gratitude for any assistance.

Take care,

Aaron Eakman

Re: Cluster Analysis - Seeds needed for K-Means

Mike P-5
Morning Aaron,

Try the following steps:

*       Steps:
1.      Run a hierarchical cluster analysis on a small sample.
2.      Choose a solution.
3.      Aggregate the variables used in the cluster analysis according
to the cluster variable.
        (Change the names of the variables in the aggregate file to match
        the original names.)
4.      Name the first variable 'cluster_' in the aggregated file.
5.      The aggregated file will serve as the centres in the K-means
procedure.
6.      Use the aggregated file as centres when running K-means on the
whole data set.
*       Clustering new cases using a previous cluster analysis:
o       Save the final centre points.
o       Use them as centres for the new file.
o       Choose as method: classify only.
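A rough sketch of the steps above in command syntax (every name here is hypothetical: clu3_1 for the saved cluster-membership variable, var01 to var03 for the clustering variables, and the file names are placeholders):

```spss
* Step 3: aggregate the clustering variables by the saved cluster
* variable, keeping the original variable names.
AGGREGATE OUTFILE='centers.sav'
  /BREAK=clu3_1
  /var01 TO var03 = MEAN(var01 TO var03).
* Step 4: rename the break variable to cluster_ in the aggregated file.
GET FILE='centers.sav' /RENAME=(clu3_1=cluster_).
SAVE OUTFILE='centers.sav'.
* Steps 5-6: reopen the full data set and use the aggregated file
* as initial centres for K-means.
GET FILE='mydata.sav'.
QUICK CLUSTER var01 TO var03
  /CRITERIA=CLUSTERS(3)
  /FILE='centers.sav'
  /SAVE=CLUSTER.
* For the "classify only" variant, add /METHOD=CLASSIFY.
```

This is only a sketch of Mike's recipe, not tested output; paste your own variable and file names in before running.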
HTH

Mike


-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Aaron Eakman
Sent: 07 August 2006 18:50
To: [hidden email]
Subject: Cluster Analysis - Seeds needed for K-Means


Re: Cluster Analysis - Seeds needed for K-Means

Hector Maletta
One comment: K-means uses only Euclidean distances, whereas Hierarchical
Clustering uses a full array of distance measures. A solution that seems
adequate with some fancy distance function may lead to nonsense, or at least
to some surprising results, when applied to K-means with Euclidean
distances.
Hector

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Michael Pearmain
Sent: Tuesday, August 08, 2006 5:21 AM
To: [hidden email]
Subject: Re: Cluster Analysis - Seeds needed for K-Means


Re: Cluster Analysis - Seeds needed for K-Means

Aaron Eakman
In reply to this post by Aaron Eakman
I would appreciate a further discussion of your statement:

"A solution that seems adequate with some fancy distance function may lead
to nonsense, or at least to some surprising results, when applied to K-
means with Euclidean distances."

First, (1) are you suggesting that the squared Euclidean distance used in
the hierarchical clustering (Ward's method) I reported, and the Euclidean
distance which I intend to employ in K-means, will produce substantial
differences in cluster resolution?  I do understand, in general, how
hierarchical clustering differs from K-means clustering.

If the answer to (1) would be "yes"... I would ask... (2) If I were to run
my clustering, comparing Euclidean to squared Euclidean in hierarchical
clustering (Ward's method) in SPSS, should I expect substantial differences
in cluster solutions when reviewing the dendrograms?

If the answer to (1) were "no", could you please let me know what you were
referring to...

If the answer to (2) were "yes", would you recommend that I employ
Euclidean (rather than squared Euclidean) distance for the hierarchical
analyses in my intended progression of: hierarchical Ward's (HW) clustering
-to- use of HW cluster centers as seeds for K-means clustering?

And if yes, a very brief explanation as to why you believe this...

Thank you much in advance for your reply


On Tue, 8 Aug 2006 11:08:05 -0300, Hector Maletta
<[hidden email]> wrote:


Re: Cluster Analysis - Seeds needed for K-Means

Aaron Eakman
In reply to this post by Aaron Eakman
Mike,

Would my cluster .sav file have as row labels: 1, 2, 3, given that I had
identified a three cluster solution in the hierarchical approach; and
would the column labels be: cluster_, var1, var2, ... varX. (with "varX"
representing my cluster variates)?

If so, might the values in this matrix that I would submit to a K-means
approach be the mean (average) within-cluster values of the varX
cluster variates derived from my hierarchical approach?  As an FYI, my
cluster variates are all on the same ratio scale.

Finally, (1) why would I run the hierarchical approach on a small sample
of my total sample rather than on the total sample?; and (2) why would I
need to run the K-means twice rather than just once?

Thanks much for your reply,

Aaron



On Tue, 8 Aug 2006 09:21:17 +0100, Michael Pearmain
<[hidden email]> wrote:


Re: Cluster Analysis - Seeds needed for K-Means

Hector Maletta
In reply to this post by Aaron Eakman
Aaron,
I did not mean to imply that in your case specifically the results would be
misleading; I was rather stating a general principle. In your case you are
using Euclidean distance (or squared Euclidean distance) in both procedures,
and this (IMHO) creates no problem. If an object C is at distances 4 and 5
from two other objects A and B, and is therefore closer to A, it will still
be closer to A if any monotonic transformation of the distance is used, such
as the squared distance (16 and 25 in this example). Therefore if C is
assigned to cluster A using squared Euclidean distance, it would also be
assigned to cluster A by simple Euclidean distance, especially if the METHOD
in CLUSTER is chosen in a sensible way.
In your case you used WARD's method, which is suitable for this situation.
However, the METHOD most similar to the one used in k-means is the CENTROID
method. I do not think this would have any implication in your case, but
remember that k-means assigns a case to one cluster or another depending on
the distance of the case to their respective centroids.
On the other hand CLUSTER admits a large variety of distance or similarity
specifications, and various methods, some of which may lead to different
results than Euclidean distance and Ward method, and therefore results found
with some of these (I called them "fancy") distance measures (and methods)
may lead to odd results when combined with k-means.
So in your case I guess you may forget about my comment, which was only
intended as general advice.

Hector



-----Original Message-----
From: Aaron Eakman [mailto:[hidden email]]
Sent: Wednesday, August 09, 2006 4:02 AM
To: [hidden email]; Hector Maletta
CC: Aaron Eakman
Subject: Re: Cluster Analysis - Seeds needed for K-Means


Re: Cluster Analysis - Seeds needed for K-Means

Art Kendall
In reply to this post by Aaron Eakman
It is some time since I used version 12, but the hierarchical
clustering part has been around since the 70's.
If you used the SAVE specification, you should have a new variable that
indicates, for each case, the cluster to which it is assigned. Say you called
it Kluster3 and the variables on which to base the clustering Var01 to Var12.


To get the centroids
(I'm not sure how you would have interpreted the cluster meanings
without already using DISCRIMINANT or MEANS):

discriminant groups=kluster3(1,3) /variables=var01 to var12.

or

means tables=var01 to var12 by kluster3 /cells=count mean.

Once you type one of the above commands into a syntax window, highlight
(select) the procedure name with your mouse and click the syntax button to
see other possibilities for the procedure.

In DFA, I recommend closely examining the probabilities of assignment to
each cluster for each case, and the probability that a member of a
cluster would be as far away from the centroid as this particular case
is. This is a very old but very useful aid in interpreting a clustering.
The classification phase of DFA should provide insight into the
reliability of the cluster assignments.
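DISCRIMINANT can save those per-case quantities directly. A sketch, using the same hypothetical names as in the syntax above (the rootnames of the saved variables are whatever DISCRIMINANT assigns on your system):

```spss
* Save predicted group membership and the posterior probabilities
* of membership for each case, and print group means (the centroids)
* plus the classification results table.
DISCRIMINANT GROUPS=kluster3(1,3)
  /VARIABLES=var01 TO var12
  /SAVE=CLASS PROBS
  /STATISTICS=MEAN TABLE.
```

The saved probabilities are what you would inspect for ambiguous or borderline cluster assignments.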

The GUI in SPSS is very useful for the first draft of your syntax.
Simply exit the menus via the "paste" button.  This shows you the syntax
that will do what you specified in the menu.  As you look at your
results, and as you develop your approach you can simply edit the pasted
syntax.

To get your means into a .sav file (there are more automated ways to
get the centroids into K-means, but this is straightforward):
open a new data file,
name the variables kluster3 and var01 ... var12,
key in the centroids,
and save the file.
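Keyed in via syntax rather than the data editor, the same manual steps might look like this (a sketch with three clusters and invented values for just two variables; substitute the means from your MEANS or DISCRIMINANT output):

```spss
* Hypothetical centroid values - replace with your own.
DATA LIST LIST /kluster3 var01 var02.
BEGIN DATA
1  2.10  3.40
2  5.70  1.20
3  0.90  4.80
END DATA.
SAVE OUTFILE='centroids.sav'.
```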



You might also want to consider applying the TWOSTEP procedure.
It will produce AIC and BIC to check on the number of clusters to retain.

Art Kendall
Social Research Consultants



Re: Cluster Analysis - Seeds needed for K-Means

Hector Maletta
I agree with Art Kendall's opinion that "In DFA, I recommend closely examining
the probabilities of assignment to each cluster for each case, and the
probability that a member of a cluster would be as far away from the
centroid as this particular case is. This is a very old but very useful aid
in interpreting a clustering. The classification phase of DFA should provide
insight into the reliability of the cluster assignments."
Besides using or not using DFA for this purpose, cases far away from the
centroid are often of doubtful usefulness. In one exercise I did with a
large sample some time ago, I applied clustering to create a certain number
of clusters, but there were a lot of cases of borderline membership. We
figured a small amount of measurement error would land those cases in
another cluster altogether.
For certain research purposes it proved useful to divide each cluster into a
"core" and a "periphery", the core being a relatively small area around the
centroid. This is only useful when many cases are near the centroid, and few
are in the no-man's-land or borderline area between clusters, far away from
the centroid.
I do not remember all the details, but I do remember I tried several ways of
defining the core, including the following: (1) all cases situated within
the minimum distance from the centroid that encompassed, say, 25% of all
cases in the cluster; (2) all cases, whatever their number or proportion, as
long as they were at least 30, located within a Euclidean distance of, say,
one cluster-specific standard deviation from the centroid.
The "core" of the cluster is usually quite homogeneous, and proved a very
useful tool to define the "typical" features of the cluster, and to select
typical cases for frequent follow-up, at least for means if not for
variability around the mean.
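One reading of definition (2) could be sketched in syntax roughly as follows. Everything here is hypothetical: a cluster variable clu3, only two clustering variables for brevity, placeholder file names, and "one standard deviation" taken to mean the within-cluster SD of the case-to-centroid distance:

```spss
* Attach each cluster's centroid means to every case.
AGGREGATE OUTFILE='centroids.sav'
  /BREAK=clu3
  /m01 m02 = MEAN(var01 var02).
SORT CASES BY clu3.
MATCH FILES FILE=* /TABLE='centroids.sav' /BY clu3.
* Euclidean distance of each case from its own cluster centroid.
COMPUTE dist = SQRT((var01 - m01)**2 + (var02 - m02)**2).
* Flag the "core": cases within one cluster-specific SD of dist.
AGGREGATE OUTFILE='diststat.sav'
  /BREAK=clu3
  /sddist = SD(dist).
MATCH FILES FILE=* /TABLE='diststat.sav' /BY clu3.
COMPUTE core = (dist <= sddist).
```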
In fact, what we did was create a "model" (a "model farm-household" in
that experience) defined by the centroid values of all variables,
periodically re-evaluating those values by following up a small rotational
sample of cases randomly selected from the core. Since the centroid was
supposed to be defined by the mean of those variables for the entire cluster
(core+periphery), we boldly multiplied the updated centroid means times the
clusters' total membership to obtain updated population means and totals in
an economical way (this was done in order to monitor rural development at
farm/household level in a poor developing country, where large sample
surveys cannot be carried out with the necessary frequency, and casual
visits by extension workers are not enough).
Hope this helps.

Hector Maletta

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Art
Kendall
Sent: Monday, August 14, 2006 2:13 PM
To: [hidden email]
Subject: Re: Cluster Analysis - Seeds needed for K-Means


Re: Cluster Analysis - Seeds needed for K-Means

Art Kendall
Hector Maletta's approach sounds very useful.
I'll keep it in mind the next time I'm using DFA to interpret or refine
a clustering.

-- some elaboration --
In the '70s I started calling sets of cases "core clusters" when several
different agglomeration methods and/or distance measures placed those
cases together.
I then used DFA to refine assignments/interpretation.  A case was
considered unclassified for the first phase of a DFA if it was far
from the centroid or if it was a "splitter" across the probabilities.
What constitutes a splitter is subjective, and you might want to try
different approaches.  Obviously, (.98, .01, .01) is a very definite
assignment, while (.33, .34, .33) is a very ambiguous one.  You
might want to try different criteria, such as best at least .55 with
next best no more than .4, or best at least .1 better than second best.
The DFA was run iteratively until the table of "original" and "assigned"
groups was as stable as it could be.
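A splitter rule like the one just described can be written as a small check. The thresholds (.55 and .4) are the illustrative values from the text; the function name is mine, and this is a plain-Python sketch, not part of the SPSS workflow:

```python
# One possible "splitter" rule: an assignment is definite when the best
# posterior probability is at least .55 and the runner-up is at most .4.
# Thresholds are the illustrative values from the post, not fixed rules.

def is_definite(probs, best_min=0.55, second_max=0.40):
    ranked = sorted(probs, reverse=True)
    return ranked[0] >= best_min and ranked[1] <= second_max

print(is_definite([0.98, 0.01, 0.01]))  # very definite assignment -> True
print(is_definite([0.33, 0.34, 0.33]))  # very ambiguous "splitter" -> False
```

Cases flagged as splitters would be left unclassified for the first phase of the DFA.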

Another reason to use DFA is that, although the "tests" in the first
phase should not be interpreted in the conventional way, they can be
useful in interpreting what distinguishes the cluster profiles.

Another way to word Hector's point about sampling is that the clusters,
in exploratory terminology, can be very useful as strata, in sampling
terminology.


Art

Social Research Consultants

Hector Maletta wrote:

>I agree with Art Kendall's opinion that "In DFA, I recommend closely examining
>the probabilities of assignment to each cluster for each case, and the
>probability that a member of a cluster would be as far away from the
>centroid as this particular case is. This is a very old but very useful aid
>in interpreting a clustering. The classification phase of DFA should provide
>insight into the reliability of the cluster assignments."
>Whether or not one uses DFA for this purpose, cases far away from the
>centroid are often of doubtful usefulness. In one exercise I did with a
>large sample some time ago, I applied clustering to create a certain number
>of clusters, but there were a lot of cases of borderline membership. We
>figured a small amount of measurement error would land those cases in
>another cluster altogether.
>For certain research purposes it proved useful to divide each cluster into a
>"core" and a "periphery", the core being a relatively small area around the
>centroid. This is only useful when many cases are near the centroid, and few
>are in the no-man's land or borderline area between clusters, far away from
>the centroid.
>I do not remember all the details, but I do remember I tried several ways of
>defining the core, including the following: (1) all cases situated within
>the minimum distance from the centroid that encompassed, say, 25% of all
>cases in the cluster; (2) all cases, whatever their number or proportion, as
>long as there were at least 30 of them, located within a Euclidean distance
>of, say, one cluster-specific standard deviation from the centroid.
>The "core" of the cluster is usually quite homogeneous, and proved a very
>useful tool to define the "typical" features of the cluster, and to select
>typical cases for frequent follow-up, at least for means if not for
>variability around the mean.
>In fact, what we did was create a "model" (a "model farm-household" in
>that experience) defined by the centroid values of all variables,
>periodically re-evaluating those values by following up a small rotational
>sample of cases randomly selected from the core. Since the centroid was
>supposed to be defined by the mean of those variables for the entire cluster
>(core+periphery), we boldly multiplied the updated centroid means by the
>clusters' total membership to obtain updated population means and totals in
>an economical way (this was done in order to monitor rural development at
>the farm/household level in a poor developing country, where large sample
>surveys cannot be carried out with the necessary frequency, and casual
>visits by extension workers are not enough).
>Hope this helps.
>
>Hector Maletta
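Hector's definition (1) of a cluster "core" is concrete enough to sketch; definition (2) is given here under one possible reading (distance within one standard deviation of the within-cluster distances, which is an assumption, since "cluster-specific standard deviation" could be read other ways). Plain Python with made-up data, not from the original exercise:

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def core_by_share(members, centroid, share=0.25):
    """Definition (1): the `share` of cluster members closest to the centroid."""
    ranked = sorted(members, key=lambda m: dist(m, centroid))
    return ranked[:max(1, int(len(ranked) * share))]

def core_by_sd(members, centroid):
    """One reading of definition (2): members whose distance to the centroid
    is at most one standard deviation of the within-cluster distances."""
    d = [dist(m, centroid) for m in members]
    mean = sum(d) / len(d)
    sd = math.sqrt(sum((x - mean) ** 2 for x in d) / len(d))
    return [m for m, dm in zip(members, d) if dm <= sd]

members = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
print(core_by_share(members, [0.0, 0.0]))  # closest 25%: [[0.0, 0.0]]
print(core_by_sd(members, [0.0, 0.0]))     # drops the outlying [5.0, 5.0]
```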
>
>-----Original Message-----
>From: SPSSX(r) Discussion [mailto:[hidden email]] On behalf of Art
>Kendall
>Sent: Monday, August 14, 2006 2:13 PM
>To: [hidden email]
>Subject: Re: Cluster Analysis - Seeds needed for K-Means
>
>It is some time since I used version 12, but the hierarchical
>clustering part has been around since the '70s.
>If you used the SAVE specification, you should have a new variable that
>indicates, for each case, the cluster to which it is assigned. Say you
>called it kluster3, and the variables the clustering was based on var01
>to var12.
>
>
>to get the centroids
>(I'm not sure how you would have interpreted the cluster meanings
>without using DISCRIMINANT or MEANS already.)
>discriminant  groups= kluster3 (1,3)/  variables = var01 to var12 . . ..
>
>or
>means tables= var01 to var12 by kluster3 /cells= count means . . . .
>
>once you type the above command into a syntax window, highlight (select)
>the procedure name with your mouse and click the syntax button to see
>other possibilities for the procedure.
>

Re: Cluster Analysis - Seeds needed for K-Means

Hector Maletta
Art Kendall wrote:

Another way to word Hector's point about sampling is that the clusters,
in exploratory terminology, can be very useful as strata, in sampling
terminology.



That's all right, provided you have prior data on all the clustering
variables for the whole population, e.g. if they are all census variables.
Otherwise, you may have to conduct a census before taking your stratified
sample based on clusters constructed on those variables. More often, what
you get is a large sample (say, the baseline survey of an area in a large
development project), which may or may not be based on a previous census. On
this large baseline sample survey, clusters of cases (e.g. farmers,
beneficiary or not) are formed, and then small homogeneous samples of "core"
farmers are extracted from each cluster for frequent follow-up (including
beneficiaries and controls), until the next big sample is taken some years
later to assess the overall impact of the development project. This is a
frequent setup in developing countries for internationally financed
projects, where a limited amount of money is available for monitoring
purposes; moreover, local people are expected to continue the frequent
monitoring when the international money runs out, so that monitoring
methodology should be kept cheap.



Of course, Art would be right if the big sample were considered a
population. But it is normally just a sample, on which strata are defined,
e.g. by clustering.
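The clusters-as-strata idea amounts to a per-cluster random draw. A plain-Python sketch (the IDs, labels, and 5-per-stratum size are made up for illustration; nothing here is from Hector's actual design):

```python
import random

def stratified_sample(case_ids, labels, per_stratum, seed=0):
    """Draw a fixed-size simple random sample from each cluster (stratum)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = {}
    for cid, k in zip(case_ids, labels):
        strata.setdefault(k, []).append(cid)
    return {k: sorted(rng.sample(ids, min(per_stratum, len(ids))))
            for k, ids in strata.items()}

# 100 cases assigned to 3 clusters; follow up 5 cases per cluster.
picked = stratified_sample(range(100), [i % 3 for i in range(100)], per_stratum=5)
print({k: len(v) for k, v in picked.items()})  # {0: 5, 1: 5, 2: 5}
```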



Hector