K-means clustering


K-means clustering

Guerrero, Rodrigo
Hi all,

I was wondering if it is appropriate to include rating scale data (1 to 7) of attitudes with other types of data such as practice size, physician age, and other practice descriptors in a k-means clustering procedure. I am not sure if you can mix data types.

Thanks.

Rodrigo.

The information transmitted is intended only for the addressee(s) and may contain confidential or privileged material, or both.  Any review, receipt, dissemination or other use of this information by non-addressees is prohibited.   If you received this in error or are a non-addressee, please contact the sender and delete the transmitted information.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: K-means clustering

Art Kendall
There is no clear-cut answer.  A lot depends on what you want to ask of
the data. If you have a mix of categorical and scale data, TWOSTEP is
designed for all scale variables, all categorical variables, or a mix of
scale and categorical data. TwoStep also provides some help in deciding
on the number of clusters to retain. I would not rely on a single method
of clustering, but would base my retained clusters on consensus among
very different clustering methods and proximity measures.
Also, k-means is very sensitive to the order of cases. You would want to
sort the cases into a few random orders to see how good the consensus is
among k-means runs.
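
The order check described above can be sketched roughly as follows. This is only an illustration in Python, not SPSS syntax: it assumes a simplified k-means that takes the first k cases as starting centers (a stand-in for whatever initialization your software uses), and the data, the function names, and the choice of the Rand index as the agreement measure are all made up for the example.

```python
import random
from itertools import combinations

def nearest(row, centers):
    # Index of the center with the smallest squared Euclidean distance.
    return min(range(len(centers)),
               key=lambda c: sum((x - m) ** 2 for x, m in zip(row, centers[c])))

def kmeans(rows, k, max_iter=100):
    # Simplified k-means: the FIRST k cases are the starting centers,
    # so the result can depend on case order (an assumption of this sketch).
    centers = [list(r) for r in rows[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(r, centers) for r in rows]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def rand_index(a, b):
    # Share of case pairs on which two partitions agree (together vs. apart).
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def order_consensus(rows, k, n_orders=5, seed=0):
    rng = random.Random(seed)
    runs = []
    for _ in range(n_orders):
        order = list(range(len(rows)))
        rng.shuffle(order)
        shuffled_labels = kmeans([rows[i] for i in order], k)
        # Map labels back to the original case order so runs are comparable.
        labels = [None] * len(rows)
        for pos, i in enumerate(order):
            labels[i] = shuffled_labels[pos]
        runs.append(labels)
    return [rand_index(a, b) for a, b in combinations(runs, 2)]

# Hypothetical data: three loose groups of 20 cases on two variables.
_rng = random.Random(1)
data = [(_rng.gauss(cx, 1.0), _rng.gauss(cy, 1.0))
        for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(20)]
scores = order_consensus(data, k=3)
print(scores)  # one agreement score per pair of random orders
```

Scores near 1 would suggest the solution does not depend much on case order; noticeably lower scores would be a reason to distrust any single run.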

K-means is for scale data (not very discrepant from interval level), so
if you are worried about level of measurement, using attitude scale
scores would be OK on that basis.

Substantively, without knowing the details of your situation, it seems
unusual to have attitudes and practice characteristics in the same
clustering.  Without knowing more about your application, my knee-jerk
reaction would be to see if there were clusters of practices, and then
see if those clusters differed on attitudes.  An additional exploration
would be to cluster practices and to cluster attitude scale scores
separately.

Art Kendall
Social Research Consultants



Re: K-means clustering

Steve Simon, P.Mean Consulting
In reply to this post by Guerrero, Rodrigo
Guerrero, Rodrigo wrote:

> I was wondering if it is appropriate to include rating scale data (1
> to 7) of attitudes with other types of data such as practice size,
> physician age, and other practice descriptors in a k-means clustering
> procedure. I am not sure if you can mix data types.

This question comes up with many statistical procedures. There
is no consensus in the research community about it, so you have to be
prepared for a peer-reviewer to complain, no matter which approach you take.

PASW Statistics/SPSS will let you include an ordinal scale variable in
k-means clustering, and there are several reasons why you would want to
do this.

First, k-means is a descriptive procedure rather than an inferential
procedure. You can't screw up something like the Type I error rate,
because there is no null hypothesis that you can mistakenly reject.

Second, k-means does not have a distributional requirement for the input
variables. You can't blithely ignore things like extreme outliers, but
the skewed pattern for much ordinal data caused by data piling up at one
of the extremes is no more an issue than skewed data from a ratio scale
measurement.

Third, the quality of the clusters produced is likely to be better when
you include more information. You could examine this by clustering with
and without your ordinal variable. Keep in mind that most measures of
the value of the information produced by a cluster analysis are rather
subjective in nature.

There are "purists" who will point out that ordinal data can never
satisfy certain assumptions that would, for example, make the mean a
meaningful measure. If a mean is meaningless, then k-means clustering is
also meaningless. I am an "impurist" (pragmatist would be a more
flattering term). I find that the mean for ordinal variables usually
behaves reasonably well and provides almost as good a summary as
measures like the median, which are well defined even for ordinal data.
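
As a rough illustration of the mean-versus-median comparison above, here is a small Python sketch. The two groups of 1-to-7 ratings are invented for the example (they are not from any real survey), and the variable names are hypothetical; it simply computes both summaries for ratings that pile up at one end of the scale.

```python
from statistics import mean, median

# Hypothetical 1-to-7 attitude ratings for two groups of respondents.
groups = {
    "a": [7, 7, 7, 7, 6, 6, 7, 5, 7, 6, 2, 7],  # piled up at the top
    "b": [4, 5, 3, 4, 6, 4, 5, 2, 4, 5, 3, 4],  # centered mid-scale
}

# Compute both summaries so they can be compared side by side.
summary = {
    name: {"mean": mean(values), "median": median(values)}
    for name, values in groups.items()
}

for name, stats in summary.items():
    print(name, round(stats["mean"], 2), stats["median"])
```

In this made-up case both summaries rank the groups the same way, although the mean is pulled down by the single low rating in group "a" while the median is not.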

The question that distinguishes "purists" from "pragmatists" is whether
you believe in grade point averages. I like them, but they do make the
rather questionable assumption that a student with an A and an F is
comparable to a student with a B and a D, and to a student with two Cs.
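
The grade point average arithmetic can be spelled out in a few lines of Python. The 4-point coding (A=4 through F=0) is the conventional one and is an assumption here, as are the names used in the sketch.

```python
from statistics import mean

# Conventional 4-point coding of letter grades (an assumption of this sketch).
POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

students = {
    "A and F": ["A", "F"],
    "B and D": ["B", "D"],
    "two Cs": ["C", "C"],
}

gpa = {}
gap = {}
for name, grades in students.items():
    pts = [POINTS[g] for g in grades]
    gpa[name] = mean(pts)          # the grade point average
    gap[name] = max(pts) - min(pts)  # distance between best and worst grade
    print(name, gpa[name], gap[name])
```

All three students come out with the same average of 2, even though the gap between their best and worst grades is 4, 2, and 0 points respectively. Treating them as comparable is exactly the assumption in question.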

My new website has a discussion of a similar question in the context of
ANOVA at
  * http://www.pmean.com/08/LikertSum.html

I hope this helps.
--
Steve Simon, Standard Disclaimer
"The first three steps in a descriptive
data analysis, with examples in PASW/SPSS"
Thursday, January 21, 2010, 11am-noon, CST.
Details at www.pmean.com/webinars


Re: K-means clustering

Jon K Peck

You might also consider TWOSTEP CLUSTER, which treats categorical and scale variables differently, although it makes a different set of assumptions about the variables.

Regards,
Jon Peck
SPSS, an IBM Company
[hidden email]
312-651-3435




Re: K-means clustering

Wilhelm Landerholm | Queue
In reply to this post by Steve Simon, P.Mean Consulting
Steve Simon, P.Mean Consulting wrote:

> There are "purists" who will point out that ordinal data can never
> satisfy certain assumptions that would, for example, make the mean a
> meaningful measure. If a mean is meaningless, then k-means clustering is
> also meaningless. I am an "impurist" (pragmatist would be a more
> flattering term). I find that the mean for ordinal variables usually
> behaves reasonably well and provides almost as good a summary as
> measures like the median, which are well defined even for ordinal data.

I do not care whether people call me a "purist" or "practical".
I believe that it is more practical to do right than wrong, and calculating the mean (and SD) on ordinal data is nothing but wrong.

All the best

Wilhelm (Wille) Landerholm

Queue/STATB
BOX 92
162 12 Vallingby
Sweden

+46-735-460000
http://www.qsweden.com
http://www.statb.com

QUEUE/STATB - your partner in data analysis, data modeling and data mining.



Re: K-means clustering

Guerrero, Rodrigo

List,

I would like to thank everyone for their input on my k-means clustering data questions.

RG

Rodrigo A. Guerrero | Director Of Marketing Research and Analysis | The Scooter Store | 830.627.4317
