SPSSX Discussion

Distance from cluster centre query.

Mark Webb-3

Distance from cluster centre query.

In K-Means it's possible to save this information as a variable.
Is this possible in any of the hierarchical methods offered in SPSS?
They offer a proximity matrix, which I see as different, as this shows distances between individual respondents, NOT the distance from the cluster mean.
Am I missing something?

Regards

paulandpen

Re: Distance from cluster centre query.

Mark

I do not think this is possible using K-means due to the algorithm used, but I may be wrong. One way round it might be to work out which variables are contributing to the cluster solution, then build a classification model with CHAID, CART, discriminant analysis or logistic regression and use its rules to assign each case a score. For the cases assigned to a cluster with a high probability (say 80%), you could then compute means and standard deviations of those scores, treating the means as cluster centres and the standard deviations as your index of dispersion (i.e. distance from cluster centre).

Cheers Paul

paulandpen

Re: Distance from cluster centre query.

In reply to this post by Mark Webb-3

Mark

Apologies. My post should have said I do not think this is possible using hierarchical clustering due to the algorithm used, but I may be wrong.

regards Paul

Spousta Jan

Re: Distance from cluster centre query.

In reply to this post by Mark Webb-3

Hi Mark,

While K-Means operates in a metric Euclidean space or something similar,
and can therefore easily define the centroids (and uses them during the
computation), the hierarchical algorithm can be used in more general
topological spaces where there are no well-defined centroids. Imagine
clustering species; take a cluster {baboon, human, chimpanzee} - what is
the centroid here? Michael Jackson? Really hard to say. And that is
perhaps the reason why SPSS does not offer to save the
centroid-derived statistics.

Otherwise, if you think that they really do make sense, you can
compute the centroid coordinates easily using Aggregate and add them to
the file. And then you can compute the case-to-centroid distance using
the familiar formula for the Euclidean distance.

Unfortunately, my SPSS 14 is broken now, so I will draft the example
syntax in SPSS 12, which is more cumbersome because of the lack of the
ADDVARIABLES mode in Aggregate.

GET FILE='C:\Program Files\SPSS\Cars.sav'.
*Keep a random sample of about 20% of the complete cases.
SELECT IF nmiss(mpg to cylinder)=0 and uniform(1) < 0.2.
*Save the z-scores (Zmpg to Zaccel) and the 5-cluster membership (CLU5_1).
DESCRIPTIVES mpg to accel /SAVE.
CLUSTER Zmpg to Zaccel /SAVE CLUSTER(5).

*Save the coordinates of the centroids.
AGGREGATE /OUTFILE='C:\Program Files\SPSS\aggr.sav' /BREAK=CLU5_1
/Cmpg Cengine Chorse Cweight Caccel = MEAN(Zmpg Zengine Zhorse Zweight
Zaccel).

*Add them to the file.
SORT CASES BY CLU5_1 (A) .
MATCH FILES /FILE=* /TABLE='C:\Program Files\SPSS\aggr.sav' /BY CLU5_1.
EXECUTE.

*Compute the Euclidean distance case-centroid.
COMPUTE distance = 0.
DO REPEAT centr = Cmpg to Caccel /case = Zmpg to Zaccel.
- COMPUTE distance = distance + (centr-case)**2.
END REPEAT.
COMPUTE distance = sqrt(distance).
VARIABLE LABELS distance "Distance case-centroid".
EXECUTE.

*End of the example.

Greetings

Jan
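
In releases where AGGREGATE supports the ADDVARIABLES mode that Jan mentions, the aggregate, sort and match steps of the example might collapse into a single command. The following is only a sketch under that assumption, not tested against any particular release. It assumes the z-scores Zmpg to Zaccel and the membership variable CLU5_1 already exist, and that the centroid variables Cmpg to Caccel have not been added to the file yet. The stand-in name zval is an arbitrary choice.

```spss
* Append the cluster means (centroid coordinates) to every case in one step.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=CLU5_1
  /Cmpg Cengine Chorse Cweight Caccel = MEAN(Zmpg Zengine Zhorse Zweight Zaccel).

* Euclidean distance from each case to the centroid of its own cluster.
COMPUTE distance = 0.
DO REPEAT centr = Cmpg TO Caccel /zval = Zmpg TO Zaccel.
COMPUTE distance = distance + (centr - zval)**2.
END REPEAT.
COMPUTE distance = SQRT(distance).
VARIABLE LABELS distance "Distance case-centroid".
EXECUTE.
```

With this mode there is no intermediate aggr.sav file and no sort is needed, which is presumably why the SPSS 12 version was described as more cumbersome.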

Mark Webb-3

Re: Distance from cluster centre query.

Thanks for this, Jan.
I may well use your suggestion and compute the centroids, BUT I would like to
discuss the idea of a cluster centroid in the context of what I'm trying to
do.
I'm finding that discriminant analysis [DA], with the clusters as the
dependent variable and the statements used to make the clusters as the
independent variables, is not working well in practice.
I would like to remove "weakly" associated respondents from each cluster and
put them into an additional cluster representing "unclassifiable".
I was hoping to define these weak respondents using the distance-from-centroid
idea, but I use hierarchical methods [Ward's] most often - hence my
initial query.
Do you think what I'm suggesting is feasible?
I would then run DA on the original clusters plus 1.

Regards

Mark

Spousta Jan

Re: Distance from cluster centre query.

In reply to this post by Mark Webb-3

Hi Mark,

A slightly better idea would be to drop the unclassifiable cluster from
the analysis. These unclassifiable cases are hard to separate and will
ruin your DA. Clusters with a small number of cases can also create
similar problems. I suspect that your problems with DA may be caused by
such a fragmented CA solution.

Try to find a good, stable CA solution first, eliminate the outliers
(small clusters, plus you can use standard diagnostics to find the unusual
cases), and DA will probably work better.

Jan
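
One way the advice above could be tried out, as a rough sketch only: flag the cases that lie unusually far from their own centroid, filter them out, and rerun DA on the rest. It assumes the variable distance and the z-scores from the earlier example are in the file and that AGGREGATE supports MODE=ADDVARIABLES. The cutoff of the cluster mean plus 2 standard deviations and the names dist_mean, dist_sd and keep are illustrative assumptions, not something proposed in this thread.

```spss
* Per-cluster mean and standard deviation of the case-centroid distance.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=CLU5_1
  /dist_mean = MEAN(distance)
  /dist_sd = SD(distance).

* Hypothetical rule: keep cases within mean + 2 SD of their own cluster.
COMPUTE keep = (distance <= dist_mean + 2*dist_sd).

* Run the discriminant analysis on the retained cases only.
FILTER BY keep.
DISCRIMINANT
  /GROUPS=CLU5_1(1 5)
  /VARIABLES=Zmpg TO Zaccel
  /STATISTICS=TABLE.
FILTER OFF.
```

The filter leaves the weak cases in the file, so the same flag could instead be used to move them into a sixth "unclassifiable" group, as Mark proposed, and the two classification tables compared.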

Steve Peck

Re: Distance from cluster centre query.

In reply to this post by Mark Webb-3

A methodological framework designed to handle these (and many other
related) issues can be found here:
http://www.psychology.su.se/sleipner/
(e.g., you can remove multivariate outliers prior to clustering and
work directly with the centroids after clustering).
