I am using SPSS 12 for my clustering procedures. I started with hierarchical clustering using Ward's method with squared Euclidean distance. I have identified a three-cluster solution as the best option from a possible range of 2-4 that I established a priori.

Here is my problem: I want to run a K-means clustering procedure next. More specifically, I want to use the centroids of the three clusters from my hierarchical procedure as "seed" or starting values for the K-means clustering procedure. Unfortunately, SPSS does not generate this output from the hierarchical procedure. I do not know (1) how to generate cluster centroids from the cluster assignment information provided by the SPSS hierarchical procedure, and (2) even if I did, how to generate an SPSS .sav file with that information for use by the K-means approach. A further problem: I am a point-and-clicker and not savvy with command syntax; I AM WILLING TO LEARN IF IT CAN GET ME OUT OF MY MESS!!

Any persons who are SPSS cluster analysis savvy, or who know others that might lend a hand, would be met with gratitude for any assistance.

Take care,

Aaron Eakman
Morning Aaron,
Try the following steps:

1. Run a hierarchical cluster analysis on a small sample.
2. Choose a solution.
3. Aggregate the variables used in the cluster analysis according to the saved cluster-membership variable. (Change the names of the variables in the aggregated file to match the original names.)
4. Name the first variable 'cluster_' in the aggregated file.
5. The aggregated file will be used as the centres file in the K-means procedure.
6. Use the aggregated file as centres when running K-means on the whole data set. (A syntax sketch of these steps follows below.)

To cluster new cases using a previous cluster analysis:

o Save the final centre points.
o Use them as centres for the new file.
o Choose as method: classify only.

HTH

Mike
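A minimal syntax sketch of steps 1-6 above, in case it helps. The variable names (var01 to var12), the saved-membership name clu3_1 (the default name SPSS gives a saved three-cluster solution), and the file names are illustrative placeholders, not anything specified above:

* Steps 1-2: hierarchical clustering (Ward, squared Euclidean),
* saving the three-cluster membership (created as clu3_1).
CLUSTER var01 TO var12
  /METHOD WARD
  /MEASURE SEUCLID
  /SAVE CLUSTER(3) .

* Steps 3-4: one row per cluster holding the within-cluster means;
* the break variable is then renamed cluster_.
AGGREGATE OUTFILE='centres.sav'
  /BREAK=clu3_1
  /var01 TO var12 = MEAN(var01 TO var12) .
GET FILE='centres.sav' .
RENAME VARIABLES (clu3_1 = cluster_) .
SAVE OUTFILE='centres.sav' .

* Steps 5-6: K-means on the full data set, seeded from the centres file.
GET FILE='fulldata.sav' .
QUICK CLUSTER var01 TO var12
  /FILE='centres.sav'
  /CRITERIA=CLUSTER(3)
  /SAVE CLUSTER
  /OUTFILE='final_centres.sav' .

For the "classify only" variant (assigning new cases to previously saved centres without re-estimating them), point /FILE= at the saved final centres and add /METHOD=CLASSIFY to QUICK CLUSTER.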
One comment: K-means uses only Euclidean distances, whereas Hierarchical
Clustering uses a full array of distance measures. A solution that seems adequate with some fancy distance function may lead to nonsense, or at least to some surprising results, when applied to K-means with Euclidean distances.

Hector
In reply to this post by Aaron Eakman
I would appreciate a further discussion of your statement:
"A solution that seems adequate with some fancy distance function may lead to nonsense, or at least to some surprising results, when applied to K- means with Euclidean distances." First(1), are you suggesting that the squared Euclidean distance used in the hierarchical clustering (Ward's method) I reported, and the Euclidean distance which I intend to employ in K-means will have substantial differences in cluster resolution? I do understand, in general, how hierarchical differs from K-means clustering. If the answer would be "yes" to (1)... I would ask...(2) If I were to run my clustering, comparing Euclidean to squared Euclidean in hierarchical clustering (Ward's method) in SPSS should I expect substantial differences in cluster solutions when reviewing the dendograms? If the answer to (1) were "no", could you please let me know what you were referring to... If the answer to (2) were "yes", would you recommend that I employ Euclidian (rather than squared Euclidean) for the hierarchical analyses in my intended progression of : hierarchical Ward's (HW) clustering -to- use of HW cluster centers as seeds for K-means clustering? And if yes, a very brief explanation as to why you believe this... Thank you much in advance for you replay On Tue, 8 Aug 2006 11:08:05 -0300, Hector Maletta <[hidden email]> wrote: >One comment: K-means uses only Euclidean distances, whereas Hierarchical >Clustering uses a full array of distance measures. A solution that seems >adequate with some fancy distance function may lead to nonsense, or at least >to some surprising results, when applied to K-means with Euclidean >distances. >Hector > >-----Mensaje original----- >De: SPSSX(r) Discussion [mailto:[hidden email]] En nombre de >Michael Pearmain >Enviado el: Tuesday, August 08, 2006 5:21 AM >Para: [hidden email] >Asunto: Re: Cluster Analysis - Seeds needed for K-Means > >Morning Aaron, > >Try the following steps > >* Steps: >1. Run a Hierarchical Cluster analysis on a small sample >2. Choose a solution >3. Aggregate the variables used in the Cluster Analysis according >to the cluster variable > >**Change the name of variables in the aggregate file to be the same as >originally > >4. Name the first variable 'cluster_' in the aggregated file >5. The aggregated file will be used as centre in the K-Means >procedure >6. Use the aggregated file as centres when running a K-means on the >whole data set >* Clustering new cases using a previous cluster analysis >o Save the final centre points. >o Use them a centres for the new file >o Choose as method: classify only >HTH > >Mike > > >-----Original Message----- >From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of >Aaron Eakman >Sent: 07 August 2006 18:50 >To: [hidden email] >Subject: Cluster Analysis - Seeds needed for K-Means > >I am using SPSS 12 for my clustering procedures. I started with >heirarchical clustering using Wards method with squared euclidean >distance. I have identified a three cluster solution as the best option >from a possible range of 2-4 that I established a priori. > >Here is my problem, I want to next run a K-means clustering procedure. >More specifically, I want to use the centroids of the three clusters >from my heirarchical procedure as "seed" or starting values for the >K-means clustering procedure. Unfortunately, SPSS does not generate >this output from the heirarchical procedure. 
And I do not know 1) how >to generate cluster centroids from the cluster assignment information >provided by SPSS heirarchical procedure, and 2) even if I did, I do not >know how to generate an SPSS.sav file with that information for use by >the K-means approach. A further problem, I am a point and clicker and >not savvy with command syntax; I AM WILLING TO LEARN IF IT CAN GET ME >OUT OF MY MESS!! > >Any persons that are SPSS - Cluster Analysis savvy, or know others that >might lend a hand would be met with gratitude for any assistance. > >Take care, > >Aaron Eakman > >________________________________________________________________________ >This e-mail has been scanned for all viruses by Star. The service is >powered by MessageLabs. For more information on a proactive anti-virus >service working around the clock, around the globe, visit: >http://www.star.net.uk >________________________________________________________________________ > >______________________________________________________________________ >This email has been scanned by the MessageLabs Email Security System. >For more information please visit http://www.messagelabs.com/email >______________________________________________________________________ |
In reply to this post by Aaron Eakman
Mike,
Would my cluster .sav file have as row labels 1, 2, 3, given that I had identified a three-cluster solution in the hierarchical approach, and would the column labels be cluster_, var1, var2, ... varX (with "varX" representing my cluster variates)? If so, would the values in this matrix that I would submit to the K-means approach be the mean (average) within-cluster values of the varX cluster variates derived from my hierarchical approach? (See the illustrative layout below.) As an FYI, my cluster variates are all on the same ratio scale.

Finally, (1) why would I run the hierarchical approach on a small sample of my total sample rather than on the total sample? And (2) why would I need to run the K-means twice rather than just once?

Thanks much for your reply,

Aaron
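The layout being asked about — one row per cluster, the cluster_ variable first, and the within-cluster means in the variate columns, matching what the aggregation steps above produce — would look like this (made-up values for illustration):

cluster_   var1    var2   ...   varX
    1      2.40   11.00   ...   5.10
    2      6.70    8.20   ...   9.90
    3      1.30   14.50   ...   2.00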
In reply to this post by Aaron Eakman
Aaron,
I did not mean that the results would be misleading in your case specifically; I was stating a general principle. In your case you are using Euclidean distance (or squared Euclidean distance) in both procedures, and this (IMHO) creates no problem. If an object C is at distances 4 and 5 from two other objects A and B, and is therefore closer to A, it will still be closer to A if any monotonic transformation of the distance is used, such as the squared distance (16 and 25 in this example). Therefore if C is assigned to cluster A using squared Euclidean distance, it would also be assigned to cluster A by simple Euclidean distance, especially if the METHOD in CLUSTER is chosen in a sensible way. In your case you used Ward's method, which is suitable for this situation.

However, the METHOD most similar to the one used in K-means is the CENTROID method. I do not think this has any implication in your case, but remember that K-means assigns a case to one cluster or another depending on the distance of the case to the respective centroids. On the other hand, CLUSTER admits a large variety of distance or similarity specifications, and various methods, some of which may lead to different results than Euclidean distance and Ward's method; therefore results found with some of those (I called them "fancy") distance measures (and methods) may lead to odd results when combined with K-means.

So in your case I guess you may forget about my comment, which was only intended as general advice.

Hector
In reply to this post by Aaron Eakman
It is some time since I used version 12, but the hierarchical clustering part has been around since the 70s. If you used the SAVE specification, you should have a new variable that indicates, for each case, the cluster to which it is assigned. Say you called it kluster3, and the variables the clustering is based on var01 to var12.

To get the centroids (I'm not sure how you would have interpreted the cluster meanings without using DISCRIMINANT or MEANS already):

discriminant groups = kluster3 (1,3) /variables = var01 to var12 .

or

means tables = var01 to var12 by kluster3 /cells = count means .

Once you type one of the above commands into a syntax window, highlight (select) the procedure name with your mouse and click the syntax button to see other possibilities for the procedure.

In DFA, I recommend closely examining the probabilities of assignment to each cluster for each case, and the probability that a member of a cluster would be as far away from the centroid as this particular case is. This is a very old but very useful aid in interpreting a clustering. The classification phase of DFA should provide insight into the reliability of the cluster assignments.

The GUI in SPSS is very useful for the first draft of your syntax. Simply exit the menus via the "Paste" button. This shows you the syntax that will do what you specified in the menu. As you look at your results, and as you develop your approach, you can simply edit the pasted syntax.

To get your means into a .sav file (there are more automated ways to get the centroids into K-means, but this is straightforward): open a new data file, label the variables kluster3 and var01 ... var12, key in the centroids, and save the file. (See the syntax sketch below.)

You might also want to consider applying the TWOSTEP procedure. It will produce AIC and BIC to check on the number of clusters to retain.
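A sketch of that last manual step in syntax form, assuming three clusters and (for brevity) just three clustering variables. The numeric values are made-up placeholders for the centroid means read off the MEANS or DISCRIMINANT output, and the first variable is named cluster_ (as in Mike's earlier steps) so that QUICK CLUSTER will accept the file as centres:

* Key in the centroids (hypothetical values) and save them as a seeds file.
DATA LIST LIST /cluster_ var01 var02 var03 .
BEGIN DATA
1  2.4  11.0  5.1
2  6.7   8.2  9.9
3  1.3  14.5  2.0
END DATA .
SAVE OUTFILE='seeds.sav' .

* QUICK CLUSTER can then read these rows as initial centres.
QUICK CLUSTER var01 var02 var03
  /FILE='seeds.sav'
  /CRITERIA=CLUSTER(3) .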
Art Kendall
Social Research Consultants
I agree with Art Kendall's opinion that "In DFA, I recommend closely examining the probabilities of assignment to each cluster for each case, and the probability that a member of a cluster would be as far away from the centroid as this particular case is. This is a very old but very useful aid in interpreting a clustering. The classification phase of DFA should provide insight into the reliability of the cluster assignments."

Whether or not one uses DFA for this purpose, cases far away from the centroid are often of doubtful usefulness. In one exercise I did with a large sample some time ago, I applied clustering to create a certain number of clusters, but there were a lot of cases of borderline membership. We figured a small amount of measurement error would land those cases in another cluster altogether.

For certain research purposes it proved useful to divide each cluster into a "core" and a "periphery", the core being a relatively small area around the centroid. This is only useful when many cases are near the centroid, and few are in the no-man's-land or borderline area between clusters, far away from the centroid.

I do not remember all the details, but I do remember I tried several ways of defining the core, including the following: (1) all cases situated within the minimum distance from the centroid that encompassed, say, 25% of all cases in the cluster; (2) all cases, whatever their number or proportion (as long as they were at least 30), located within a Euclidean distance of, say, one cluster-specific standard deviation from the centroid. (A syntax sketch for computing such distances follows below.)

The "core" of the cluster is usually quite homogeneous, and it proved a very useful tool to define the "typical" features of the cluster, and to select typical cases for frequent follow-up, at least for means if not for variability around the mean. In fact, what we did was to create a "model" (a "model farm-household" in that experience) defined by the centroid values of all variables, periodically re-evaluating those values by following up a small rotational sample of cases randomly selected from the core. Since the centroid was supposed to be defined by the mean of those variables for the entire cluster (core + periphery), we boldly multiplied the updated centroid means by the clusters' total membership to obtain updated population means and totals in an economical way. (This was done in order to monitor rural development at the farm/household level in a poor developing country, where large sample surveys cannot be carried out with the necessary frequency, and casual visits by extension workers are not enough.)

Hope this helps.

Hector Maletta
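A minimal sketch of how those centroid distances might be computed in syntax. The membership variable kluster3 and the two illustrative variates are placeholders, and the "core" flag shows only one possible reading of option (2):

SORT CASES BY kluster3 .
* One row of within-cluster means per cluster.
AGGREGATE OUTFILE='means.sav'
  /BREAK=kluster3
  /m01 m02 = MEAN(var01 var02) .
* Attach each cluster's means to its own cases.
MATCH FILES FILE=* /TABLE='means.sav' /BY kluster3 .
* Euclidean distance of each case from its cluster centroid.
COMPUTE dist = SQRT((var01 - m01)**2 + (var02 - m02)**2) .
* One reading of option (2): the core lies within one
* cluster-specific standard deviation of the distances.
AGGREGATE OUTFILE='sd.sav' /BREAK=kluster3 /sddist = SD(dist) .
MATCH FILES FILE=* /TABLE='sd.sav' /BY kluster3 .
COMPUTE core = (dist <= sddist) .
EXECUTE .

For option (1), RANK VARIABLES=dist BY kluster3 would give within-cluster rankings of the distance, from which the closest 25% of each cluster can be selected.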
Hector Maletta's approach sounds very useful.
I'll keep it in mind the next time I'm using DFA to interpret or refine a clustering.

Some elaboration: in the 70s I started calling sets of cases "core clusters" when several different agglomeration methods and/or distance measures placed those cases together. I then used DFA to refine assignments and interpretation. A case was considered unclassified for the first phase of a DFA if it was far from the centroid or if it was a "splitter" across the probabilities. What constitutes a splitter is subjective, and you might want to try different approaches. Obviously, (.98, .01, .01) is a very definite assignment, while (.33, .34, .33) is a very ambiguous one. You might want to try different criteria, such as: best at least .55 with the next best no more than .4, or best at least .1 better than second best. The DFA was run iteratively until the table of "original" and "assigned" groups was as stable as it could be. (A syntax sketch for flagging splitters follows below.)

Another reason to use DFA is that, although the "tests" in its first phase should not be interpreted in the conventional way, they can be useful in interpreting what distinguishes the cluster profiles.

Another way to word Hector's point about sampling is that the clusters, in exploratory terminology, can be very useful as strata, in sampling terminology.
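A minimal sketch of one such splitter criterion in syntax, assuming a three-cluster solution saved as kluster3; the saved-variable rootnames and the .55/.40 cut-offs are just one of the criteria suggested above:

DISCRIMINANT GROUPS=kluster3 (1,3)
  /VARIABLES=var01 TO var12
  /SAVE=CLASS=predclus PROBS=prob .

* prob1 to prob3 hold the posterior probability of membership in each cluster.
COMPUTE best = MAX(prob1, prob2, prob3) .
COMPUTE secnd = SUM(prob1, prob2, prob3) - best - MIN(prob1, prob2, prob3) .
* A "splitter": the best assignment is weak, or the runner-up is too close.
COMPUTE splitter = (best < .55 OR secnd > .40) .
EXECUTE .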
Art Kendall
Social Research Consultants
Art Kendall wrote:
"Another way to word Hector's point about sampling is that the clusters, in exploratory terminology, can be very useful as strata, in sampling terminology."

That's all right, provided you have prior data on all the clustering variables for the whole population, e.g. if they are all census variables. Otherwise, you may have to conduct a census before taking your stratified sample based on clusters constructed from those variables. More often, what you get is a large sample (say, the baseline survey of an area in a large development project), which may or may not be based on a previous census. On this large baseline sample survey, clusters of cases (e.g. farmers, beneficiary or not) are formed, and then small homogeneous samples of "core" farmers are extracted from each cluster for frequent follow-up (including beneficiaries and controls), until the next big sample is taken some years later to assess the overall impact of the development project.

This is a frequent setup in developing countries for internationally financed projects, where a limited amount of money is available for monitoring purposes; moreover, the local people are supposed to continue the frequent monitoring when the international money runs out, and therefore that monitoring methodology should be kept cheap.

Of course Art would be right if the big sample were considered a population. But it is normally just a sample, on which strata are defined, e.g. by clustering.

Hector