k-means clustering


k-means clustering

Matthew Pirritano
Am I just doing something wrong or do the k-means cluster method results
differ depending on how the data are sorted? I keep running the same k-means
analysis and getting different cluster centers each time.

Also, can someone tell me what the cluster centers are exactly. If I use raw
scores are they just the mean on that variable within the particular
cluster. Are they just means?

Thanks,
Matt

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: k-means clustering

Matthew Pirritano
Paul,

Sounds good. I used Ward's to identify the number of clusters and to create the initial cluster centers for the K-means procedure. From what I'm hearing from you, what I've done is kosher, yes?

Thanks from a novice clusterer.
Matt

Matthew Pirritano, Ph.D.
Assistant Professor of Psychology
Smith Hall 116C
Chapman University
Department of Psychology
One University Drive
Orange, CA 92866
Telephone (714)744-7940
FAX (714)997-6780

----- Original Message ----
From: "Swank, Paul R" <[hidden email]>
To: Matt <[hidden email]>
Sent: Monday, November 19, 2007 8:31:10 AM
Subject: RE: k-means clustering


Typically, K-means clustering starts by randomly selecting cases as
seeds for the clusters, so if you re-sort, then the seeds are different.
What this may indicate is a problem with clustering. If the clusters
are inherent in the data, then it shouldn't matter where the seeds start.
However, I usually start with a hierarchical method to identify the
number of clusters and use the cluster means as the seeds for the
K-means method.


Paul R. Swank, Ph.D.
Professor and Director of Research
Children's Learning Institute
University of Texas Health Science Center - Houston
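
The seeding issue Paul describes can be illustrated outside SPSS. What follows is a minimal pure-Python sketch, not the SPSS procedures used in this thread: `kmeans`, `ward_seeds`, and the six toy cases are hypothetical names and data made up for illustration. It compares seeding from the first k cases in the file (which depends on sort order) with seeding from the cluster means of a small Ward-style hierarchical pass.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Mean of each variable (coordinate) over a list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans(cases, seeds, max_iter=100):
    """Basic k-means: assign each case to its nearest center, then
    recompute every center as the mean of its assigned cases."""
    centers = [tuple(s) for s in seeds]
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for case in cases:
            nearest = min(range(len(centers)),
                          key=lambda j: sq_dist(case, centers[j]))
            groups[nearest].append(case)
        # An empty cluster keeps its previous center.
        new_centers = [mean_point(g) if g else c
                       for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def ward_seeds(cases, k):
    """Agglomerative pass with Ward's criterion: repeatedly merge the two
    clusters whose merger adds the least within-cluster sum of squares,
    stop at k clusters, and return the cluster means as seeds."""
    clusters = [[case] for case in cases]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                cost = (len(a) * len(b) / (len(a) + len(b))
                        * sq_dist(mean_point(a), mean_point(b)))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [mean_point(c) for c in clusters]


# Hypothetical data: two tight groups, one near x = 0 and one near x = 10.
cases = [(0, 0), (0, 2), (1, 1), (10, 0), (10, 2), (9, 1)]
resorted = sorted(cases, key=lambda p: (p[1], p[0]))  # same cases, new order

# Seeds taken from the first k cases depend on the sort order.
first_k_a = sorted(kmeans(cases, cases[:2]))
first_k_b = sorted(kmeans(resorted, resorted[:2]))

# Seeds taken from the hierarchical cluster means do not.
ward_a = sorted(kmeans(cases, ward_seeds(cases, 2)))
ward_b = sorted(kmeans(resorted, ward_seeds(resorted, 2)))

print(first_k_a != first_k_b)  # True: different centers after re-sorting
print(ward_a == ward_b)        # True: same centers in either order
```

In this toy example the first-k seeding converges to a poor solution for one sort order, while the hierarchical seeds give the same centers both times. Each center returned by `kmeans` is, by construction, the mean of each variable over the cases assigned to that cluster, which is what a k-means cluster center is when raw scores are used.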



-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Matt
Sent: Sunday, November 18, 2007 8:28 PM
To: [hidden email]
Subject: k-means clustering

Am I just doing something wrong or do the k-means cluster method results
differ depending on how the data are sorted? I keep running the same k-means
analysis and getting different cluster centers each time.

Also, can someone tell me what the cluster centers are exactly. If I use raw
scores are they just the mean on that variable within the particular
cluster. Are they just means?

Thanks,
Matt


Re: k-means clustering

Swank, Paul R
As far as I'm concerned, yes. But if that's what you did, why did you get
different clusters from different sorts?

Paul R. Swank, Ph.D.
Professor and Director of Research
Children's Learning Institute
University of Texas Health Science Center - Houston

From: Matthew Pirritano [mailto:[hidden email]]
Sent: Monday, November 19, 2007 10:37 AM
To: Swank, Paul R; Matt; [hidden email]
Subject: Re: k-means clustering

 

Paul,

Sounds good. I used Ward's to identify the number of clusters and to
create the initial cluster centers for the K-means procedure. From what
I'm hearing from you what I've done is kosher, yes?

Thanks from a novice clusterer.
Matt

 


Re: k-means clustering

Pirritano, Matthew
The different clusters for different sorts is what I was getting before
I used the Ward's centers as starting seeds. I was lost and now I'm
found.

Thanks a bunch,
Matt

Matthew Pirritano, Ph.D.
Assistant Professor of Psychology
Smith Hall 116C
Chapman University
Department of Psychology
One University Drive
Orange, CA 92866
Telephone (714)744-7940
FAX (714)997-6780

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Swank, Paul R
Sent: Monday, November 19, 2007 9:45 AM
To: [hidden email]
Subject: Re: k-means clustering

As far as I'm concerned, but if that's what you did, why did you get
different clusters from different sorts?






Durbin-Watson test

David Hitchin
In reply to this post by Matthew Pirritano
Recent contributions to this list have stated the conditions for a
Durbin-Watson test to be used, but haven't explained exactly what it
tests for.

The quality of a regression, in terms of standard errors and significance
tests, depends partly on the number of observations, and the more
observations the better, PROVIDED that they are independent. Simply
making another copy of the data to double the sample size doesn't help,
because there is no new information.
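
As a rough illustration of the point about copied data, here is a minimal Python sketch using the simplest possible case, the standard error of a mean, rather than a full regression. The `heights` values and the `standard_error` helper are made up for this example.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


heights = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
doubled = heights + heights         # every case copied once, nothing new

# The estimate itself does not change ...
print(statistics.mean(heights) == statistics.mean(doubled))  # True

# ... but the reported standard error shrinks by roughly 1/sqrt(2).
print(round(standard_error(heights), 3))  # 0.756
print(round(standard_error(doubled), 3))  # 0.516
```

The smaller standard error for the doubled data is spurious: the formula assumes 16 independent observations, but only 8 carry information.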

One of the assumptions for multiple regression is that the observations
have a certain kind of independence. Consider an experiment in which
seedlings are grown under a number of different conditions (X variables
indicate fertiliser, watering, temperature, etc., which according to the
curious fiction that we adopt with regression, are assumed to be
measured with complete accuracy). Seedlings grown under the same X
values do not all achieve the same height Y; they vary about the
predicted value of Y, and these deviations are known as the errors or
residuals. They are assumed to have a zero mean, and to be independent
of each other.

If the seedlings are grown by different people in different laboratories
there is a reasonable chance that the errors are independent. If they
are all grown in the same pot, then the errors are likely to be
correlated - they are NOT independent.

When the observations come from a time sequence a new problem arises.
You might have a nice equation that predicts the output of a factory in
terms of season, number of employees, cost of materials, etc, which make
the predictable part of the regression. There are unpredictable factors
which cause the "errors", such as late deliveries of materials, machine
breakdowns, strikes. Over a long period these might average out, but if
observations are taken on a short-term basis, perhaps every week, then
the unpredictable factors hang over for several time periods, and the
errors for observations close in time are not independent.

When observations are independent, knowing the error of one observation
tells you nothing about the likely error of the next one. When there is
lack of independence, then the error in one observation is likely to
affect the next one. This is known as autocorrelation or serial correlation.

The Durbin-Watson test examines the errors of nearby observations to see
if there is a pattern, and if it is significant this indicates that you
have less information than if the observations were independent; the
estimated standard errors are too small, and significance is not as good
as it appears.
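
For readers who want to see the arithmetic, the Durbin-Watson statistic is usually defined as the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below is a minimal Python version of that definition; the function name `durbin_watson` and the three residual series are made up for illustration, and it does not perform the significance test itself, which requires comparing d with tabulated bounds.

```python
def durbin_watson(residuals):
    """d = sum((e[t] - e[t-1])**2 for t >= 2) / sum(e[t]**2).

    Residuals must be in time order. Values near 2 suggest no first-order
    autocorrelation, values toward 0 suggest positive autocorrelation,
    and values toward 4 suggest negative autocorrelation.
    """
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual series, listed in time order.
persistent = [1, 1, 1, -1, -1, -1]    # errors hang over several periods
alternating = [1, -1, 1, -1, 1, -1]   # errors flip sign every period
mixed = [1, -1, -1, 1, 1, -1, -1, 1]  # no first-order pattern

print(round(durbin_watson(persistent), 3))   # 0.667
print(round(durbin_watson(alternating), 3))  # 3.333
print(durbin_watson(mixed))                  # 2.0
```

The `persistent` series corresponds to the factory example above, where an unpredictable factor carries over from one week to the next and pulls d well below 2.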

David Hitchin
