Weighted Cluster Analysis in SPSS


Weighted Cluster Analysis in SPSS

Catharine Liddicoat
We are performing a cluster analysis where we want to weight the clustering
variables.

1.  Can this be done directly in SPSS?  If so, how?

2.  Has anyone had experience weighting a cluster analysis by the
standardized regression coefficients from a multiple regression model used
to identify the clustering variables?  What are the advantages or
disadvantages of weighting the clustering variables this way?

Thank you for any information provided.

C. Liddicoat
California Community Colleges
[hidden email]

Re: Weighted Cluster Analysis in SPSS

Hector Maletta
        1. I do not know of any explicit weighting device in the clustering
procedures available in SPSS. However, you can achieve the same effect.
Cluster analysis is ordinarily run on STANDARDIZED variables, i.e. variables
converted into z scores with zero mean and unit standard deviation; you may
multiply the z scores by the desired weights, in effect using each weight as
the unit of measurement (i.e. in lieu of the standard deviation). A syntax
sketch is given below.
        2. The result of using different units of measurement for the
clustering variables is that some variables will have greater influence on
the assignment of a case to a cluster. Changing those units of measurement
will therefore move some cases to different clusters.
        3. If the weights are the BETAS, i.e. the standardized regression
coefficients of the clustering variables (which apply to the standardized
version of the variables, i.e. the z scores) in a regression equation
predicting a dependent variable Y that is possibly not involved in the
clustering exercise, then the clustering will reflect the relative importance
of the variables as predictors of Y, but may not be useful for other
purposes. Recall that cluster analysis ordinarily does not involve an
external dependent variable, but seeks only to put together cases that are
similar on the clustering variables themselves. It seems doable, but I do not
know of any specific example, and furthermore, I have not given much thought
to the statistical implications of the proposed approach.
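        A minimal syntax sketch of this approach (the variable names v1 to
v3, the weights, and the 4-cluster request are all hypothetical; substitute
your own):

* Standardize the clustering variables; /SAVE creates Zv1, Zv2, Zv3.
DESCRIPTIVES VARIABLES=v1 v2 v3 /SAVE.
* Multiply each z score by its desired weight.
COMPUTE zw1 = 0.30 * Zv1.
COMPUTE zw2 = 0.20 * Zv2.
COMPUTE zw3 = 0.10 * Zv3.
EXECUTE.
* K-means on the weighted z scores, saving cluster membership.
QUICK CLUSTER zw1 zw2 zw3
  /CRITERIA=CLUSTER(4) MXITER(20)
  /SAVE CLUSTER.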

        Hector


Re: Weighted Cluster Analysis in SPSS

paulandpen
In reply to this post by Catharine Liddicoat
Hi Catharine,

My first concern with weighting in the way you propose is that importance weights from regression have little or nothing to do with the importance of the variables that 'drive' a cluster analysis.  If you use a regression model (OLS, the standard regression in SPSS), one of your assumptions is that the group is homogeneous (similar) and that the importance of the drivers is uniform across the entire group.  Weighting your cluster analysis (and driving your clusters) with variables multiplied by their regression weights is therefore counter-intuitive to me on this basis.

The second issue is that the SPSS clustering algorithm (QUICK CLUSTER) is distance based, using Euclidean distance.  I think (I use a different package) that an important variable in SPSS is one that either minimises the Euclidean distance within groups (generates groups that are similar) or maximises the distance between groups (makes the clusters different), and you can get a proxy (not a great one) for the relative importance of the variables that drive the segments using ANOVA, as in the sketch below.
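
A rough illustration of that ANOVA proxy in SPSS syntax (v1 to v3 and the 4-cluster request are hypothetical, just for the sketch):

QUICK CLUSTER v1 v2 v3
  /CRITERIA=CLUSTER(4)
  /SAVE CLUSTER
  /PRINT ANOVA.
* The F values in the ANOVA table give a crude ranking of how strongly each
* variable separates the clusters; they are descriptive only, since the
* clusters were formed to maximise those very differences.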

I think that what you might need is latent class regression from a package called Latent GOLD.  This program segments cases (recovers heterogeneity) while simultaneously computing importance weights for each of the segments.

HTH Paul




Re: Weighted Cluster Analysis in SPSS

Hector Maletta
In reply to this post by Catharine Liddicoat
A few comments on Paul's post:
1. Indeed, latent class analysis does group cases into clusters by taking the correlations among variables into account, all in one pass. But Catharine is asking how to do it in two steps: giving some differential weight to the variables, and then applying ordinary cluster analysis.

2. Using regression coefficients as weights makes sense in a way. In fact, not using any weights is equivalent to using unit weights, which is a form of weighting after all, so one cannot avoid weighting the clustering variables anyway. The idea of using a criterion variable to give more weight to the variables more closely related to it (or carrying more weight in a predictive equation) makes sense to me at first glance. The finer point, which I have not yet thought through, is whether the regression assumptions (such as homogeneity of variance, etc.) have any influence on the results, especially for cases well away from the means of the predictor variables. My intuitive fear is that coefficients for predictors with lower significance will have wider confidence intervals, and that small errors in the independent variables may entail large errors in the allocation of cases to clusters, errors which would be magnified when cases are not close to the mean. (A syntax sketch for obtaining the betas appears below.)

3. I should refer listers to my previous message to this thread.
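
        If one does go that route, a minimal sketch for obtaining the standardized betas in SPSS (y is a hypothetical criterion variable and v1 to v3 are hypothetical clustering variables):

REGRESSION
  /STATISTICS COEFF R
  /DEPENDENT y
  /METHOD=ENTER v1 v2 v3.
* The Beta column of the Coefficients table holds the standardized
* coefficients, which could then be used as the weights described above.
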
Hector


Re: Weighted Cluster Analysis in SPSS

paulandpen
In reply to this post by Catharine Liddicoat
Hi Hector and Catharine,

For a start, let us assume the importance weights are derived 'effectively'; by this I mean that you adopt an appropriate derived-importance modelling process, not just a single regression where a set of predictors is entered once with no account taken of collinearity, etc. See http://www.tfh-berlin.de/~groemp/rpack.html for what I mean here; although the package is in R, I believe some of these approaches could be replicated with the correct programming in SPSS (see a previous post of mine for this request).

With that said, I am assuming you are using either a set of scale questions or possibly mixed data types.  To cope with mixed data types, the common practice, at least in k-means, is to convert all the variables to z-scores, and I would think at the outset that if you were using standardised beta-weights as your weights, you would be applying them to data already transformed to z-scores (i.e. standardised at both levels).

Why I think this approach is problematic is as follows:

Say you have a set of beta-weights such as the following, and multiply the variables by them:
(.305 * v1), scored on a 1-10 scale
(.205 * v2), scored on a 1-10 scale
(.110 * v3), scored on a binary scale

Say v1 is converted to z-scores.  Z-scores of cases close to the mean, multiplied by .305, will not be influential (such cases may be clustered together anyway), while extreme outliers will have an undue influence on the segmentation and the ultimate grouping of cases.  You cannot control for the influence of outliers on the final solution in SPSS (you can in Clustan).  Even worse, to my mind, is if the binaries are converted to z-scores and then given a weight.  They will drive the segmentation far too much, because you are up-weighting a gap that takes only two possible values between cases (only 2 distinct z-scores, as opposed to the many z-scores of a scale question), whereas with scale questions the distances are spread across a range of scores.
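
You can see the binary problem quickly in SPSS (v3 is the hypothetical binary from the list above):

* Standardizing a binary yields only two distinct z scores.
DESCRIPTIVES VARIABLES=v3 /SAVE.
FREQUENCIES VARIABLES=Zv3.
* The frequency table shows just two values, so any weight applied to Zv3
* scales a single large gap rather than a spread of distances.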

I think your solution will be one where the true groups are pulled apart and heavily skewed in favour of outliers.  With that said, at the end of all this you are still left with the task of determining the 'actual influence' of the variables on the segmentation after the weighting and after the segmentation has been done (very challenging with or without weighting).

You may actually find that some variables you intended to weight higher and drive the segments are not driving the segmentation, while others you did not intend to be influential are in fact driving it.  Even with ClustanGraphics, where the algorithms let you assign weights to variables (a lot more control than you have in SPSS), evaluating the variables' influence on the cluster solutions showed me that my intended weights and the variables' overall influence were very discordant.

What is used as input (weights and attempts to influence a solution) may not produce the balance of influence in the outcome that was first intended.

As a final afterthought, you could always assess the influence of the variables prior to and after weighting, to test the impact of the weighting on the solutions; a sketch follows below.
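
A minimal sketch of that before/after check, assuming the plain z scores (Zv1 to Zv3) and their weighted versions (zw1 to zw3) from the earlier hypothetical sketch have already been computed:

* Unweighted solution.
QUICK CLUSTER Zv1 Zv2 Zv3
  /CRITERIA=CLUSTER(4)
  /SAVE CLUSTER(clu_unw)
  /PRINT ANOVA.
* Weighted solution.
QUICK CLUSTER zw1 zw2 zw3
  /CRITERIA=CLUSTER(4)
  /SAVE CLUSTER(clu_wtd)
  /PRINT ANOVA.
* Compare the two ANOVA tables for each variable's influence, and
* cross-tabulate the memberships to see how many cases the weighting moved.
CROSSTABS /TABLES=clu_unw BY clu_wtd.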

Regards Paul



