SPSSX Discussion

Quick cluster ('K means'): how are initial cluster centers chosen?

Ruben Geert van den Berg

Quick cluster ('K means'): how are initial cluster centers chosen?

Dear all,

Could anyone please tell me how these are chosen? I tried to decipher the explanation from Algorithms -> Quick cluster but I just can't figure it out. Is the end result of the initial cluster center selection those K observations that have the largest average Euclidean distance among them? Is a proximity matrix among all observations not needed in order to detect those K observations? Are all distances computed between the first case and all the others, in order to proceed to case #2, case #3 etcetera? Could anyone please shed some more light on this?

TIA and a nice weekend to all!

Ruben van den Berg

Swank, Paul R

Re: Quick cluster ('K means'): how are initial cluster centers chosen?

I usually use a hierarchical agglomerative technique to get the initial cluster centroids and then plug those in. Another method is to try random starts and then do several of them to ensure you arrive at similar solutions.
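
As an illustration of the random-starts idea, here is a minimal Python sketch (not SPSS syntax, and not QUICK CLUSTER's own algorithm). It assumes a plain batch k-means with Euclidean distance; the function names kmeans and random_starts and the toy data are hypothetical.

```python
import math
import random


def kmeans(points, centers, max_iter=100):
    """Plain batch k-means from the given starting centers.

    Returns the sorted final centers and the within-cluster sum of squares.
    """
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its members
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)
    return sorted(centers), sse


def random_starts(points, k, n_starts=10, seed=0):
    """Run k-means from several random starts so the solutions can be compared."""
    rng = random.Random(seed)
    return [kmeans(points, rng.sample(points, k)) for _ in range(n_starts)]


# Hypothetical toy data: two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
runs = random_starts(points, k=2)
for centers, sse in runs:
    print(centers, round(sse, 3))
```

If most runs report the same centers and a similar sum of squares, the solution is probably stable; runs that disagree widely suggest the result depends on the starting centers.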

Dr. Paul R. Swank,

Professor and Director of Research

Children's Learning Institute

University of Texas Health Science Center-Houston

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Ruben van den Berg
Sent: Friday, August 21, 2009 8:25 AM
To: [hidden email]
Subject: Quick cluster ('K means'): how are initial cluster centers chosen?

Hector Maletta

Re: Quick cluster ('K means'): how are initial cluster centers chosen?

In reply to this post by Ruben Geert van den Berg

Ruben,

The procedure is explained in the SPSS Algorithms documentation. In short, it works as follows:

1. The first k cases with no missing values are initially selected as cluster centers.
2. The remaining cases are then examined in the order in which they appear in the file.
3. If the distance between xk (a specific case in the file) and its closest cluster mean is greater than the distance between the two closest cluster means, then xk replaces one of those two means, whichever is closer to it.
4. If xk has not replaced any cluster mean in the preceding test, a second test is made: take Mn and Mm as the closest and second closest cluster means to xk; if xk is further from Mm than Mn is from any other cluster center, then xk replaces Mn.

The result of these operations, performed on the first pass, is the set of initial cluster centers.

If desired, the keyword NOINITIAL skips this selection pass and simply takes the first k cases as initial cluster centers. This lets you choose the starting centers yourself by placing suitable cases at the beginning of the data file, for instance k cases with clearly different values.
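
The selection pass described above can be sketched in Python roughly as follows. This is only my reading of the description, not SPSS's actual code: the function name initial_centers is hypothetical, missing values are assumed to be coded as None, k is assumed to be at least 2, and "Mn is from any other cluster center" is interpreted as the distance from Mn to its nearest other center.

```python
import math


def initial_centers(cases, k):
    """One pass over the cases to choose k initial cluster centers (k >= 2)."""
    # Only cases without missing values (None, by assumption) are considered.
    complete = [c for c in cases if None not in c]
    centers = list(complete[:k])  # start from the first k complete cases

    for x in complete[k:]:
        # Centers ordered by distance from x: Mn is closest, Mm second closest.
        order = sorted(range(k), key=lambda i: math.dist(x, centers[i]))
        n, m = order[0], order[1]

        # The two centers that are closest to each other.
        pair = min(((i, j) for i in range(k) for j in range(i + 1, k)),
                   key=lambda p: math.dist(centers[p[0]], centers[p[1]]))
        pair_dist = math.dist(centers[pair[0]], centers[pair[1]])

        if math.dist(x, centers[n]) > pair_dist:
            # Test 1: x replaces whichever member of the closest pair is nearer to x.
            nearer = min(pair, key=lambda i: math.dist(x, centers[i]))
            centers[nearer] = x
        else:
            # Test 2: compare x's distance to Mm with Mn's distance
            # to its nearest other center (assumed interpretation).
            nearest_other = min(math.dist(centers[n], centers[j])
                                for j in range(k) if j != n)
            if math.dist(x, centers[m]) > nearest_other:
                centers[n] = x
    return centers


# One-dimensional toy example with k = 2: the case at 10 displaces the case at 1,
# and the case at 5 passes neither test.
print(initial_centers([(0,), (1,), (10,), (5,)], 2))  # [(0,), (10,)]
```

Note that only distances from the current case to the k centers, and among the k centers themselves, are needed, so no proximity matrix over all cases is built. With NOINITIAL, the equivalent would simply be to stop after taking the first k cases.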

Hector

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Ruben van den Berg
Sent: 21 August 2009 09:25
To: [hidden email]
Subject: Quick cluster ('K means'): how are initial cluster centers chosen?

Art Kendall

Re: Quick cluster ('K means'): how are initial cluster centers chosen?

In reply to this post by Ruben Geert van den Berg

IN GENERAL
What to recommend depends a lot on how many cases you have and the capabilities of your system.
I would not rely on a solution from a single run. I retain solutions based on what is found in common by several clustering methods and distance measures.

It is often a good idea to use some form of factor analysis (PCA, PAF, CATPCA) to obtain variables that are fairly independent of each other.
If you use variables that you feel are already independent, standardizing will remove the influence of differences in scaling.
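
To illustrate what standardizing does, here is a small Python sketch that converts each variable to z-scores. The function name standardize and the income/age data are hypothetical, and the sample standard deviation (n - 1) is an assumption on my part.

```python
import statistics


def standardize(rows):
    """Convert every column to z-scores (mean 0, standard deviation 1).

    Assumes no column is constant; a zero standard deviation would
    cause a division by zero.
    """
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]  # sample SD (n - 1)
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds))
            for row in rows]


# Hypothetical cases: (income in dollars, age in years).
rows = [(20000, 25), (40000, 35), (60000, 45)]
z = standardize(rows)
print(z)
```

On the raw values, Euclidean distances are driven almost entirely by income because its numbers are so much larger; after standardizing, both variables contribute on the same scale.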

I would suggest that you look at TWOSTEP, which starts with a hierarchical clustering of a subset of the data. Its advantage is that it provides AIC or BIC for different numbers of clusters. This helps in deciding how many clusters to keep.

If you have the machine resources, try different hierarchical methods and distance measures on all of the data. If not, try using samples as large as your machine can handle.

SPECIFIC replies interspersed below

Art Kendall
Social Research Consultants

Ruben van den Berg wrote:

Dear all,

Could anyone please tell me how these are chosen? I tried to decipher the explanation from Algorithms -> Quick cluster but I just can't figure it out. Is the end result of the initial cluster center selection those K observations that have the largest average Euclidean distance among them?

Since TWOSTEP has been available I have not been using QUICK CLUSTER much. If memory serves, the initial clusters are single cases first chosen from the first k cases. As the file is passed, cases are assigned to clusters so that the within-cluster variance is minimized.

Is a proximity matrix among all observations not needed in order to detect those K observations?

That is correct. This is an advantage only in terms of machine resources.

Are all distances computed between the first case and all the others, in order to proceed to case #2, case #3 etcetera? Could anyone please shed some more light on this?

If memory serves, the distance of each case from each of the k cluster centers is used.

TIA and a nice weekend to all!

Ruben van den Berg


William Dudley WNDUDLEY

Accessing data through URL

In reply to this post by Swank, Paul R

Is it possible (through syntax) for SPSS to access data stored at a web location like Google Docs?

William N. Dudley, PhD
Associate Dean for Research
The School of Health and Human Performance Office of Research
The University of North Carolina at Greensboro
126 HHP Building, PO Box 26170
Greensboro, NC 27402-6170
VOICE 336.2562475
FAX 336.334.3238

Peck, Jon

Re: Accessing data through URL

Yes, if you use the extension command SPSSINC GETURI DATA. This can be downloaded from SPSS Developer Central (www.spss.com/devcentral). It requires at least V17 and the Python programmability plugin, but no Python knowledge is needed in order to use it. It also has a dialog box that will appear on the File menu after installation.

HTH,

Jon Peck

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of William Dudley WNDUDLEY
Sent: Friday, August 21, 2009 8:53 AM
To: [hidden email]
Subject: [SPSSX-L] Accessing data through URL