Missing values and cluster analysis

Missing values and cluster analysis

Carsten Pauck
Hello,

I have a question concerning hierarchical cluster analysis. I want to
cluster a group of persons according to their responses to, let's say,
twenty questions, q1 to q20. The questions all use the same scale, from 1
to 10, and measure certain kinds of attitudes.

Some of the respondents did not answer one question (="dn/na"), while
they answered all the others. After the cluster analysis has been run, no
cluster is assigned to these cases, because one (or more) of the
variables was "missing" for them.

How can I deal with the problem?

- Is it better to fill in the missing values before running the cluster
  analysis? Which method/algorithm can be used?

- Or is there a clustering algorithm that tolerates some missing values
  in the set of variables used for clustering?

Thank you for any advice.
Carsten.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Missing values and cluster analysis

Hector Maletta
As far as I know, clustering procedures in SPSS either exclude all cases
with at least one missing value in any of the relevant variables, or
include them all and treat all values as valid. On the other hand, the
Missing Values component of SPSS can assign valid values to the cases
with missing values, based on their responses to other questions (not
only their other attitudinal responses in your case, but also background
variables such as age, sex, education, occupation, etc.). I think you
have the following options:
1. Leave out those cases for good. If they are very few, that would
probably do no harm to your research, nor reduce your sample by much.
2. Use the non-valid values (give them some non-system-missing numerical
code, such as 99; not 9, which is a valid answer on your 1-to-10 scale)
as if they were valid. Remember, however, that the values in those cases
do not represent an amount or quantity, so the variable (if it was an
interval scale) would become a categorical variable. Some clustering
procedures do not tolerate categorical variables. All in all, I do not
like this option.
3. Assign estimated valid values to cases with non-valid ones, using the
Missing Values component, and proceed as if missing values never existed.
Assigning them the mean can be a very crude solution, because (given the
other responses a subject has given) the mean may not be her most likely
answer to the missing question.
4. If the valid responses range from positive to negative as in a Likert
scale, you may have the option of treating missing values as an
indifferent response, like "I don't care", and assigning them a middle
value. This is, however, dangerous because it assumes a meaning for the
missing values, and perhaps they do not have that meaning in all cases.
Perhaps the subjects did care, and had a definite opinion, but somehow
omitted the answer.
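
Options 1 and 3 can be compared on a toy data set. The sketch below is
plain Python, not SPSS syntax, and it is not what the Missing Values
component does: the respondent data, the function names and the
"copy from the most similar respondent" rule are all assumptions made
for illustration. It shows listwise deletion, mean imputation, and
filling a gap from the respondent whose other answers are closest.

```python
from statistics import mean

# Hypothetical answers on a 1-10 scale; None marks an unanswered question.
respondents = {
    "A": [9, 8, 9, 9],
    "B": [8, 9, None, 9],
    "C": [2, 1, 2, 1],
    "D": [1, 2, 1, 2],
}


def listwise_delete(data):
    """Option 1: keep only respondents who answered every question."""
    return {k: v for k, v in data.items() if None not in v}


def mean_impute(data):
    """Crude option 3: replace each gap with that question's overall mean."""
    n_questions = len(next(iter(data.values())))
    col_means = [
        mean(v[j] for v in data.values() if v[j] is not None)
        for j in range(n_questions)
    ]
    return {
        k: [col_means[j] if x is None else x for j, x in enumerate(v)]
        for k, v in data.items()
    }


def distance(a, b):
    """Mean absolute difference over the questions both respondents answered.

    Assumes the two respondents share at least one answered question.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return sum(abs(x - y) for x, y in pairs) / len(pairs)


def donor_impute(data):
    """Option 3 using the other responses: copy each gap from the most
    similar respondent who did answer that question."""
    filled = {}
    for k, v in data.items():
        row = list(v)
        for j, x in enumerate(v):
            if x is None:
                donors = [d for name, d in data.items()
                          if name != k and d[j] is not None]
                nearest = min(donors, key=lambda d: distance(v, d))
                row[j] = nearest[j]
        filled[k] = row
    return filled


complete_cases = listwise_delete(respondents)
mean_filled = mean_impute(respondents)
donor_filled = donor_impute(respondents)

print("complete cases:", sorted(complete_cases))
print("B, question 3, mean fill: ", mean_filled["B"][2])
print("B, question 3, donor fill:", donor_filled["B"][2])
```

On this toy data the mean fill gives respondent B a 4 on the third
question, although B otherwise answers like A; the donor fill gives 9.
That is the sense in which the mean may not be the most likely answer.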
Hope this helps.
Hector

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Carsten Pauck
Sent: 07 August 2008 11:32
To: [hidden email]
Subject: Missing values and cluster analysis



Re: Missing values and cluster analysis

David Hitchin
In reply to this post by Carsten Pauck
Quoting Carsten Pauck <[hidden email]>:

> Hello,
>
> I have a question concerning a hierarchical cluster analysis. I want
> to cluster a group of persons according to their responses ...
>
> Now, some of those respondents did not answer one question (="dn/na")
> while they answered the other ones.

Hector Maletta has offered some good advice, but there is an additional
point that you might consider. You are trying to group "similar" people
together, and if there were several people who left the same question or
questions unanswered this might indicate that they are similar.

Consider, for example, a study of people's attitudes to sexuality.
Those who were too embarrassed to answer some questions might well be
similar in some respect, and their answers to other questions might help
to identify both the reason for the refusal and the similarity.
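
One way to look for this, sketched in Python under the assumption that
unanswered questions are coded as None (the respondent data and the
function names are hypothetical), is to tabulate which questions each
person skipped and see whether several people share the same pattern.

```python
from collections import defaultdict

# Hypothetical data: None marks a question the respondent left unanswered.
respondents = {
    "A": [7, None, None, 4],
    "B": [3, None, None, 8],
    "C": [5, 6, 2, 9],
    "D": [6, 1, None, 2],
    "E": [8, 3, 7, 5],
}


def skipped_questions(answers):
    """Return the 1-based numbers of the questions left unanswered."""
    return tuple(j + 1 for j, x in enumerate(answers) if x is None)


def group_by_pattern(data):
    """Map each pattern of skipped questions to the respondents sharing it."""
    groups = defaultdict(list)
    for name, answers in data.items():
        groups[skipped_questions(answers)].append(name)
    return dict(groups)


patterns = group_by_pattern(respondents)
for pattern, names in sorted(patterns.items()):
    if pattern:
        label = "skipped " + ", ".join(f"q{q}" for q in pattern)
    else:
        label = "none skipped"
    print(f"{label}: {names}")
```

Here A and B skipped the same two questions, which may be worth
treating as information about them rather than only as a gap to fill.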

In any study where there are missing values you need to consider WHY
they are missing, and there are dozens of possible reasons. However, the
important distinction is between values that are missing purely by
chance and other kinds of missing response.

Imagine that you have collected all of the data, and that someone
randomly deletes numbers here and there. There is no relationship
between the characteristics of the person and the fact that they have
missing data. This is usually called "missing completely at random"
(the term "missing at random" strictly refers to a weaker condition, in
which missingness may depend on the answers that were observed). Such
numbers can often be replaced quite usefully with the mean values or,
even better, with values "predicted" from the other variables - see the
Missing Values module in SPSS if you have it.

When values are not missing at random you need to think very
carefully about WHY they might be missing, and your conclusion about
this will guide the next stage of your analysis.
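
A crude first check on the question of WHY, sketched in Python with
made-up data and hypothetical names: compare the answers to another
question between people who skipped a given question and people who
answered it. A clear difference is evidence that the values are not
missing purely by chance; the absence of a difference does not prove
that they are.

```python
from statistics import mean

# Hypothetical data: (q1, q2) per respondent; None marks an unanswered q2.
answers = [
    (9, None), (8, None), (9, None), (3, 4),
    (2, 5), (4, 3), (3, 6), (8, 7),
]


def compare_by_missingness(rows, observed=0, target=1):
    """Mean of the `observed` question, split by whether `target` was skipped."""
    skipped = [r[observed] for r in rows if r[target] is None]
    answered = [r[observed] for r in rows if r[target] is not None]
    return mean(skipped), mean(answered)


mean_if_skipped, mean_if_answered = compare_by_missingness(answers)
print(f"mean q1 when q2 skipped:  {mean_if_skipped:.2f}")
print(f"mean q1 when q2 answered: {mean_if_answered:.2f}")
```

In this invented example the people who skipped q2 score much higher on
q1 than those who answered it, so filling q2 with its overall mean would
ignore something systematic about them.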

David Hitchin
