SPSSX Discussion

outliers??

Samuel Solomon

outliers??

Hi List,

I am clueless on how to handle outliers (especially when it comes to
prices) through the help of SPSS. Is there any surefire way? I heard the
HB method will do the job well, but I remain clueless on this issue too.
Would anyone enlighten me?

Thanks,

Samuel

Richard Ristow

Re: outliers??

At 04:19 AM 8/14/2006, Samuel Solomon wrote:

>I am clueless on how to handle outliers (especially when it comes to
>prices) through the help of SPSS. Is there any surefire way?

OK, there are better statisticians than I on this list, but to start
with:

There is not, and never will be, a surefire way, in SPSS or anywhere
else. 'Outlier' values may be the most important data, and you may
distort your analysis very badly by dropping them. I take the liberty
of re-posting an essay on outliers I wrote some time ago (*); I hope it is
at least partly germane to your needs.

*Whether*, and *by what standard*, to identify outliers, is at least as
important as 'how'.

. Outliers that fail consistency checks: For example, an event date
prior to the beginning of the study, or later than the present, can be
rejected as wrong. (I've got 'event dates' on the brain from a project
I'm working on. And, of course, I'm assuming that 'event dates' in the
past or future aren't valid; in some study designs, they may be.) Those
should be made missing; or they should be checked against the primary
data source and corrected, if that is feasible.

. Outliers that can't be rejected *a priori*: First, you shouldn't even
try to look at those until you reject any demonstrable errors.

Second, I would say a good way to look for them is to look at the
high-percentile cutpoints in the distribution. Depending on the size of
your dataset, 'high-percentile' could be 99%, 99.9%, and 99.99%. (These
are not alternatives. If you use, say, 99.9%, you should look at 99% as
well. Consider also looking at the 90% or 95% cutpoint, for a sense of
the 'normal' range of the distribution. 5% outliers are NOT outliers.
And, of course, look at both ends: 1%, 0.1%, 0.01% percentile
cutpoints, as well.)
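
A minimal Python sketch of the percentile check described above. It is an illustration only, not SPSS syntax: the price data are simulated, and percentile_cutpoints is a hypothetical helper that interpolates linearly between order statistics (one of several common percentile definitions).

```python
import random

def percentile_cutpoints(values, percents):
    """Return {percent: cutpoint}, interpolating linearly between order statistics."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = {}
    for p in percents:
        # Position of the p-th percentile on a 0..n-1 index scale.
        pos = (p / 100.0) * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        cuts[p] = ordered[lo] + frac * (ordered[hi] - ordered[lo])
    return cuts

# Hypothetical price data: mostly ordinary values plus a few very large ones.
random.seed(1)
prices = [random.lognormvariate(3.0, 0.5) for _ in range(10_000)]
prices += [5_000.0, 12_000.0, 40_000.0]

# Look at both tails, and at 5%/95% for a sense of the 'normal' range.
percents = [0.01, 0.1, 1, 5, 95, 99, 99.9, 99.99]
cutpoints = percentile_cutpoints(prices, percents)
for p in percents:
    print(f"{p:>6}%  {cutpoints[p]:12.2f}")
```

Reading the cutpoints side by side shows where the distribution stops looking like the 'cloud' and starts jumping; it does not by itself say what to do with the values beyond that point.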

Third, I think I'm seeing a trend in the statistics community against
removing 'outliers' by internal criteria (n standard deviations, 1st
and 99th percentiles). The rationale, and it's a strong one, is that
those are observed values of whatever it is that you're measuring. If
you eliminate them, you'll get a model based on their rarity; and that
model, itself, can become an argument for eliminating them (because
they don't fit it), and you can talk yourself into a model that's quite
unrepresentative of reality.

Fourth, however, the largest values will have a disproportionate,
possibly dominant, effect on most linear models -- regression, ANOVA,
even taking the arithmetic mean. Depending on your study, you can

- Go ahead. In this case, the model's fit will be weighted toward
predicting the largest values, and may show little discrimination
within the 'cloud' of more-typical values. That, however, may be the
right insight to be gained from the data.

- If available, use a non-parametric method. That's often favored,
because it neither rejects the large values nor gives them
disproportionate weight. By the same token, however, if much of the
useful information is in the largest values, non-parametric methods can
unduly DE-emphasize these values.

- There are reasons to reject this as heresy, but if you're doing
linear modelling, I'd probably try it both with the largest values
retained and with them eliminated. (I'd only do this if the 'largest
values' look very far from the 'cloud' of regular values. A scatter
plot can be an invaluable tool for this.) If the two models are closely
similar, you have an argument that there's a single process going on,
with the largest values being part of the same process. If they're very
different, you may have two processes, one of which operates
occasionally to produce the largest values, the other of which operates
'normally' but is swamped when the larger process happens. And if the
run without the large values produces a poor R^2, you may have an
argument that the observable process is represented by the largest
values, and the variation in the 'normal cloud' is mostly noise.

- [ADDED] Investigate carefully using a 'bootstrap' method - sampling
from your sample. Very large data values occurring in very small
proportion can give you a huge variance in your estimates, that won't
be detected by standard analytic methods. With a very small proportion
of large values, the expected number in a sample may be very small,
with a large variance. (Let's see - Poisson distributed for all
practical purposes, I think.) Particularly, because of the 'leverage'
of the very large values, the estimates in a sample that includes one
or more may be drastically different from those in a sample that
happens not to include any. You'll have to see whether that's a problem
in your data. If it is, it may help to do a stratified sample in which
the large values are over-represented, and then assigned lower weights
in proportion. If, that is, you can identify a subgroup of 'large
values,' and have access to enough of them to get a significant sample.
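
The bootstrap point can be sketched in Python as follows. This is an illustration under stated assumptions, not SPSS syntax: the sample is simulated (999 ordinary values and one very large one), and bootstrap_split is a hypothetical helper. With one large value among 1000 cases, the number of copies in a resample of 1000 is approximately Poisson with mean 1, so roughly 37% of resamples contain none.

```python
import random
import statistics

def bootstrap_split(sample, n_resamples, rng):
    """Bootstrap the mean, splitting resamples by whether they drew the largest value."""
    largest = max(sample)
    with_largest, without_largest = [], []
    for _ in range(n_resamples):
        # Sampling from the sample, with replacement, at the original size.
        resample = rng.choices(sample, k=len(sample))
        mean = statistics.fmean(resample)
        if largest in resample:
            with_largest.append(mean)
        else:
            without_largest.append(mean)
    return with_largest, without_largest

rng = random.Random(42)
# Hypothetical data: 999 ordinary values and a single very large one.
sample = [rng.gauss(100.0, 10.0) for _ in range(999)] + [100_000.0]
with_largest, without_largest = bootstrap_split(sample, 2_000, rng)

share_without = len(without_largest) / 2_000
print(f"share of resamples without the large value: {share_without:.2f}")
print(f"mean estimate, large value absent:  {statistics.fmean(without_largest):10.1f}")
print(f"mean estimate, large value present: {statistics.fmean(with_largest):10.1f}")
```

If the two groups of estimates are far apart, as they are in this constructed example, the estimate depends heavily on whether the rare value happens to be drawn.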

Onward, and good luck,
Richard

.........................................
(*) Ristow, Richard, "Re: Outliers", on list "SAS(r) Discussion"
<[hidden email]>, Mon, 15 Nov 2004 14:26:51; reposted as "Re:
Question on excluding extremes", SPSSX-L Fri, 17 Feb 2006 11:21:29

Hector Maletta

Re: outliers??

"Outliers" have never been defined satisfactorily, and the concept is seldom
used in a consistent way. Outliers are not "impossible" values, such as a
widower who is 4 years old, or a mother who is younger than her daughter.
Those are most likely data-entry or data-taking errors.
Outliers are, most properly, extreme values. In a sample about heights they
are individuals measuring over seven feet, or dwarfs. In an income sample
they are people like Bill Gates. They are not impossible, they are simply
rare.
Of course, an extreme value may also be a simple mistake: a person 112
years old may be just 12, and someone who is 7'11" tall may be a more common
5'11" just wrongly written or typed. But they just might exist, extremely
old, extremely tall, extremely wealthy.
Now, what is wrong with finding rare cases? If they exist, they should be
dutifully recorded in your data, not hidden under the carpet. The problem is
that they may distort your sample results if you are not careful in their
treatment. If you have a 1/10,000 sample of a certain area, in order to
estimate the distribution of heights, and stumble on the one and only dwarf
in the neighborhood, you may end up estimating that the area is populated by
10,000 small people, or (in another example) by 10,000 people with the
income of Bill Gates. They may alter the shape of your curve, or disfigure
your mean or standard deviation.
From another point of view, if you start again and draw your sample anew,
chances are you won't stumble again on the only giant or the only dwarf in
town. Of all possible random samples of the same size, just very few will
include them, precisely because such subjects are rare, perhaps unique. If
you have some grounds to know that they are extremely rare in the general
population from which your sample comes, you may decide to exclude them from
the sample, though this is seldom advisable without careful statistical
analysis.
One interesting exercise is considering the impact of their removal on the
mean and standard deviation of important variables, and on the slope of key
regression coefficients in your research. Suppose you are investigating the
relationship between capital and technology, and discover a very strong
relationship: more money, more high tech, but then you discover that the
whole thing crumbles down when you withdraw Bill Gates from the sample: he
was that solitary point to the Northeast of your scatterplot, while all
other capitalists in your sample made their money in old fashioned low-tech
businesses. Out goes Bill, and your money-tech beta falls into
non-significance (just a fictional example, of course). SPSS Regression, for
instance, lets you see the impact of removing each case on the overall fit
of a regression model. A high-impact case is probably an outlier worth
considering for closer inspection (and possible removal if suspect).
Even if SPSS may identify high-impact cases, all this requires human
intelligence. No surefire statistical device can do it for you.
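
The idea of checking the impact of removing each case can be sketched in plain Python. This is not how SPSS Regression computes its diagnostics; it is a simple leave-one-out refit on made-up data, with hypothetical helper names, meant only to show the effect of one solitary point.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs (simple regression)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def leave_one_out_slopes(xs, ys):
    """Refit the line once per case, each time with that case removed."""
    return [
        ols_slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        for i in range(len(xs))
    ]

# Fictional data: no relationship within the 'cloud', plus one extreme case.
capital = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
tech = [2.0, 1.0, 3.0, 1.0, 2.0, 90.0]

full_slope = ols_slope(capital, tech)
loo_slopes = leave_one_out_slopes(capital, tech)
print(f"slope with all cases: {full_slope:.3f}")
for i, slope in enumerate(loo_slopes):
    print(f"slope without case {i}: {slope:.3f}")
```

In this constructed example the slope stays near 0.9 when any ordinary case is dropped and falls to zero when the extreme case is dropped; deciding what that means still takes judgment.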
Hope this helps.
Hector

Art Kendall

Re: outliers??

I left the previous two excellent responses in this message.
I have pasted one of my soapbox statements below which gives another
perspective.

Art Kendall
Social Research Consultants

<soapbox>

"Outliers" is a very problematic concept. There are a wide variety of
meanings ascribed to the term.

Based on consulting on stat and methodology for over 30 years, I believe
the usual explanation when there are suspicious values is failure of the
quality assurance procedure. I think of a potential outlier as a
surprising or suspicious value for a variable (including residuals).
In my experience, in the vast majority of instances, they indicate data
gathering or data entry errors, i.e., insufficient attention in quality
assurance in data gathering or data entry. In my experience, rechecking
QA typically eliminates over 80% of suspicious data values. This is one
reason I advocate thorough exploration of a set of data before doing
the analysis. By thorough exploration I mean things like frequencies,
multi-way crosstabs, scatterplots, box plots, rechecking scale keys and
reliability, etc.

Derived variables, such as residuals and rates, should be subjected to
the same thorough examination and understanding-seeking as raw
variables. This identifies suspicious values.

Unusual values may be "real". They should not be simply tossed. In
cluster analysis, sometimes there are singleton clusters, e.g., Los
Angeles county is distinct from other counties in the western states.
Sometimes there are 500 lb persons. There might be a rose growing in a
cornfield. There may be strong interaction (synergy) effects.

The first thing to do about outliers is to prevent them by careful
quality assurance procedures in data gathering and handling.

A thorough search for suspect data values and potentially treating them
as outliers in analysis is an important part of data quality assurance.
Values for a variable are suspect and in need of further review when
they are unusual given the subject matter area, outside the legitimate
range of the response scale, show as isolated on scattergrams, have
subjectively extreme residuals, when the data shows very high order
interaction on ANOVA analyses, when they result in a case being
extremely influential in a regression, etc. Recall that researchers
consider Murphy a Pollyanna.

The detection of odd/peculiar/suspicious values late in the data
analysis process is one reason to ensure that you can go all the way
back and redo the process. Keeping all of the data gathering
instruments, and preserving the syntax for all data transformation are
important parts of going back and checking on "outliers". The
occurrence of many outliers suggests the data entry was sloppy. There
are likely to be incorrectly entered values that are not "outliers".
Although it is painful, another round of data entry and verification may
be in order.

Correcting the data.

Sometimes you can actually go back to redo the measurements. (Is there
really a 500 pound 9 year old?). You should always have all the paper
from which data were transcribed.
On the rare occasions when there are very good reasons, you might modify
the value for a particular case, e.g., percent correct entered as 1000%
==> 100%.

Modifying the data.
Values of variables should be trimmed or recoded to "missing" only when
there is a clear rationale. And then only when it is not possible to
redo the measurement process. (Maybe there really is a six year old who
weighs 400 lbs. Go back and look if possible.)

If suspected outliers are recoded or trimmed, the analysis should be
done as is and as modified to see what the effect of the modification
is. Changing the values of variables suspected to be outliers frequently
leads to misleading results. These procedures should be used very
sparingly.

Math criteria can identify suspects. There should be a trial before
there is a verdict and the presumption should be against outlier status
for a value.

I don't recommend undesirable practices such as cavalierly trimming to 3
SDs. Having a value beyond 3 SD can be reason to examine a case more
thoroughly.

It is advisable to consult with a statistician before changing the
values of suspected outliers.

Multiple analyses.

If you have re-entered the data, or re-run the experiment, and done very
thorough exploration of the data, you are stuck as a last resort with
doing multiple analyses: including vs excluding the case(s); changing
the values for the case(s) to hotdeck values, to some central tendency
value, or to max or min on the response scale (e.g., for achievement,
personality, or attitude measures), modeling the specialness of the
particular value, etc.

In the small minority of occasions where the data cannot be cleaned up,
the analysis should be done in three or more ways (include the
outliers as is, trim the values, treat the values as missing, transform
to ranks, include in the model variables that flag those cases, or
...). The reporting becomes much more complex. Consider yourself very
lucky if the conclusions do not vary substantially.
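
Running the same analysis several ways can be sketched in Python as below. This is an illustration with made-up data, not a recommended procedure: the analysis is a simple correlation, the suspect case and the cap used for trimming are hypothetical choices, and only four of the treatments listed above are shown.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def analyse_four_ways(xs, ys, suspect, cap):
    """suspect: indices of suspect cases; cap: hypothetical upper bound for trimming."""
    keep = [i for i in range(len(ys)) if i not in suspect]
    trimmed = [min(y, cap) if i in suspect else y for i, y in enumerate(ys)]
    return {
        "as_is": pearson(xs, ys),
        "trimmed": pearson(xs, trimmed),
        "missing": pearson([xs[i] for i in keep], [ys[i] for i in keep]),
        "ranks": pearson(ranks(xs), ranks(ys)),  # Spearman-style correlation
    }

# Made-up data: a near-linear relationship with one suspect value in y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 7.9, 80.0]
results = analyse_four_ways(x, y, suspect={7}, cap=10.0)
for name, r in results.items():
    print(f"{name:>8}: r = {r:.3f}")
```

When the four numbers disagree, as they do here, the report has to say so and explain which treatment is defensible for the data at hand.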

Art Kendall
Social Research Consultants

Peck, Jon

Re: outliers??

To add to all this good advice,

In many cases whether something is an outlier or not depends on a model. It may be an extreme value not explained by the model. It may be much more complicated than a univariate extreme. The new Anomaly Detection procedure in SPSS can help to find these in a multivariate framework, although it is still up to you to decide what to do about it.

In a context such as regression, it is good to look at the leverage statistics to see whether potential outliers actually affect your results much or not.
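
For a regression with a single predictor, leverage can be computed by hand, as in this Python sketch. It is an illustration on made-up data, not the SPSS procedure. The formula h_i = 1/n + (x_i - mean)^2 / Sxx is the standard one for simple regression; the 2p/n cutoff is only a common rule of thumb, and high leverage says a case has an unusual predictor value, not that it necessarily changes the results.

```python
def leverages(xs):
    """Hat values for simple regression with an intercept: 1/n + (x - mean)^2 / Sxx."""
    n = len(xs)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    return [1.0 / n + (x - mx) ** 2 / sxx for x in xs]

# Made-up predictor values: five ordinary cases and one far from the rest.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
h = leverages(x)

# Rule-of-thumb cutoff (an assumption, not a test): 2 * p / n, with p = 2
# parameters (intercept and slope).
cutoff = 2 * 2 / len(x)
flagged = [i for i, value in enumerate(h) if value > cutoff]

for i, value in enumerate(h):
    print(f"case {i}: leverage = {value:.3f}")
print(f"cases above cutoff {cutoff:.3f}: {flagged}")
```

The leverages sum to the number of parameters (2 here), so one case holding nearly all of that total, as the last case does in this example, is worth a closer look together with its residual.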

Finally, consider the process assumed to be generating the data. It is commonly observed that stock market prices follow a random walk model, which means that the variance is not finite. Such a fat-tailed distribution will intrinsically have more outliers than we are probably accustomed to seeing, but that is part of the phenomenon to model.

Note that outlier and unusual are not quite the same thing. You might have a very lonely and suspicious value buried in the middle of your data in a sparse region. Is that an outlier? It might be equally suspicious.

Regards,
Jon Peck
SPSS

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Art Kendall
Sent: Sunday, August 27, 2006 8:06 AM
To: [hidden email]
Subject: Re: [SPSSX-L] outliers??

I left the previous two excellent responses in this message.
I have pasted one of my soapbox statements below which gives another
perspective.

Art Kendall
Social Research Consultants

<soapbox>

"Outliers" is a very problematic concept. There are a wide variety of
meanings ascribed to the term.

Based on consulting on stat and methodology for over 30 years, I believe
the usual explanation when there are suspicious values is failure of the
quality assurance procedure. . I think of a potential outlier as a
surprising or suspicious value for a variable (including residuals).
In my experience, in the vast majority of instances, they indicate data
gathering or data entry errors, i.e., insufficient attention in quality
assurance in data gathering or data entry. In my experience, rechecking
qa typically eliminates over 80% of suspicious data values. This is one
reason I advocate thorough exploration of a set of data before doing
the analysis. By thorough exploration I mean things like frequencies,
multi-way crosstabs, scatterplots, box plots, rechecking scale keys and
reliability, etc.

Derived variables such as residuals and rates, should be subjected to
the same thorough examination and understanding-seeking as raw
variables. This identifies suspicious values.

Unusual values may be "real". They should not be simply tossed. In
cluster analysis, sometimes there are singleton clusters, e.g., Los
Angeles county is distinct from other counties in the western states.
Some times there are 500 lb persons. There might be a rose growing in a
cornfield. There may be strong interaction (synergy) effects.

The first thing to do about outliers is to prevent them by careful
quality assurance procedures in data gathering and handling.

A thorough search for suspect data values and potentially treating them
as outliers in analysis is an important part of data quality assurance.
Values for a variable are suspect and in need of further review when
they are unusual given the subject matter area, outside the legitimate
range of the response scale, show as isolated on scattergrams, have
subjectively extreme residuals, when the data shows very high order
interaction on ANOVA analyses, when they result in a case being
extremely influential in a regression, etc. Recall that researchers
consider Murphy a Pollyanna.

The detection of odd/peculiar/suspicious values late in the data
analysis process is one one reason to assure that you can go all the way
back and redo the process. Keeping all of the data gathering
instruments, and preserving the syntax for all data transformation are
important parts of going back and checking on "outliers". The
occurrence of many outliers suggests the data entry was sloppy. There
are likely to be incorrectly entered values that are not "outliers".
Although it is painful, another round of data entry and verification may
be in order.

Correcting the data.

Sometimes you can actually go back to redo the measurements. (Is there
really a 500 pound 9 year old?) You should always have all the paper
from which data were transcribed.
On the rare occasions when there are very good reasons, you might modify
the value for a particular case, e.g., percent correct entered as 1000%
==> 100%.

Modifying the data.
Values of variables should be trimmed or recoded to "missing" only when
there is a clear rationale. And then only when it is not possible to
redo the measurement process. (Maybe there really is a six year old who
weighs 400 lbs. Go back and look if possible.)

If suspected outliers are recoded or trimmed, the analysis should be
done as is and as modified to see what the effect of the modification
is. Changing the values of variables suspected to be outliers frequently
leads to misleading results. These procedures should be used very
sparingly.

Math criteria can identify suspects. There should be a trial before
there is a verdict and the presumption should be against outlier status
for a value.

I don't recommend undesirable practices such as cavalierly trimming to 3
SDs. Having a value beyond 3 SD can be reason to examine a case more
thoroughly.
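
To illustrate the point about 3 SDs: the cutoff can be used to build a list of cases to look at while leaving the data untouched. This is only a minimal Python sketch, assuming a plain list of numbers; the function name `flag_for_review` and the sample weights are invented for illustration.

```python
from statistics import mean, stdev

def flag_for_review(values, n_sd=3.0):
    """Return (index, value, z) for values more than n_sd SDs from the mean.

    Nothing is trimmed or recoded; the result is only a list of cases to check.
    """
    m = mean(values)
    s = stdev(values)
    if s == 0:
        return []  # no spread, nothing can be flagged
    return [(i, v, (v - m) / s)
            for i, v in enumerate(values)
            if abs(v - m) > n_sd * s]

# Hypothetical weights in lb; the 500 is the kind of value to go back and check.
weights = [62, 58, 71, 66, 60, 64, 69, 57, 63, 65, 61, 68, 59, 67, 70, 500]
suspects = flag_for_review(weights)
for index, value, z in suspects:
    print(f"case {index}: value {value}, z = {z:.2f} -> check the source record")
```

In small samples a single extreme value inflates the SD so much that no case can reach 3 SDs (with 10 or fewer cases it is arithmetically impossible), so an empty list is not proof of clean data.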

It is advisable to consult with a statistician before changing the
values of suspected outliers.

Multiple analyses.

If you have re-entered the data, or re-run the experiment, and done very
thorough exploration of the data, you are stuck as a last resort with
doing multiple analyses: including vs excluding the case(s); changing
the values for the case(s) to hotdeck values, to some central tendency
value, or to max or min on the response scale (e.g., for achievement,
personality, or attitude measures), modeling the specialness of the
particular value, etc.

In the small minority of occasions where the data cannot be cleaned up,
the analysis should be done in three or more ways (include the
outliers as is, trim the values, treat the values as missing, transform
to ranks, include in the model variables that flag those cases, or
...). The reporting becomes much more complex. Consider yourself very
lucky if the conclusions do not vary substantially.
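
The "three or more ways" idea above can be sketched as one function that runs the same statistic under each treatment so the results sit side by side. This is a rough Python illustration, not a recommended procedure: the data are invented, a correlation stands in for whatever analysis is actually planned, and the names `sensitivity`, `suspect` and `cap` are hypothetical.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def sensitivity(x, y, suspect, cap):
    """Same statistic four ways: as is, suspects dropped, x capped, ranks."""
    keep = [i for i in range(len(x)) if i not in suspect]
    return {
        "as_is": pearson(x, y),
        "excluded": pearson([x[i] for i in keep], [y[i] for i in keep]),
        "trimmed": pearson([min(v, cap) for v in x], y),
        "ranks": pearson(ranks(x), ranks(y)),
    }

# Invented data: eight ordinary cases plus one far-out case (index 8).
x = [1, 2, 3, 4, 5, 6, 7, 8, 50]
y = [5, 3, 6, 2, 7, 4, 6, 3, 40]
results = sensitivity(x, y, suspect={8}, cap=8)
for label, r in results.items():
    print(f"{label:>8}: r = {r:.3f}")
```

If the four numbers tell the same story, the reporting stays simple; if they do not, each version has to be reported and explained.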

Art Kendall
Social Research Consultants

Art Kendall

Re: outliers??

Very well put. Outlier status can depend on a case selection model, a
measurement operationalization model, and an analytic model. The concept
of an "inlier" is also critical to understanding.

SPSS is to be commended for including the anomaly detection (AD)
procedure and find duplicate cases (FDC). Inclusion of these helps to
reinforce the idea that the data needs to be cleaned, checked and
explored. AD and FDC facilitate quality assurance.

Although it is possible to work around this and compare files that are
supposed to be double keyed, it is not as straightforward as it should
be. I strongly urge SPSS to implement a single syntax command procedure
that compares 2 files and reports differences 1) in the dictionary 2) in
the data.
Double keying is a venerable QA procedure. When SPSS was run on card
images in 1972, it was routine to compare the input data cards and the
output from WRITE FILEINFO using routines from the operating system.

Whereas FDC looks for situations where there is duplication and should
not be, the procedure I am urging looks for situations where there is
NOT duplication but should be.
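
Outside SPSS, the kind of two-file comparison described above might look roughly like the following Python sketch. It assumes both keyings have been exported as CSV text with a shared case identifier column; the column names, the sample data and the function names `read_keyed` and `compare_keyings` are all made up for illustration, and "dictionary" is reduced here to the list of column names.

```python
import csv
import io

def read_keyed(text, id_field):
    """Parse CSV text into (field names, {case id: row dict})."""
    reader = csv.DictReader(io.StringIO(text))
    rows = {row[id_field]: row for row in reader}
    return list(reader.fieldnames), rows

def compare_keyings(text_a, text_b, id_field="id"):
    """Report differences in the field lists and in the data of two keyings."""
    fields_a, rows_a = read_keyed(text_a, id_field)
    fields_b, rows_b = read_keyed(text_b, id_field)
    report = {
        "fields_only_in_a": [f for f in fields_a if f not in fields_b],
        "fields_only_in_b": [f for f in fields_b if f not in fields_a],
        "ids_only_in_a": sorted(set(rows_a) - set(rows_b)),
        "ids_only_in_b": sorted(set(rows_b) - set(rows_a)),
        "cell_differences": [],
    }
    shared_fields = [f for f in fields_a if f in fields_b]
    for case_id in sorted(set(rows_a) & set(rows_b)):
        for field in shared_fields:
            a, b = rows_a[case_id][field], rows_b[case_id][field]
            if a != b:
                report["cell_differences"].append((case_id, field, a, b))
    return report

# Two hypothetical keyings of the same three paper forms.
first_keying = "id,age,weight\n1,9,62\n2,6,400\n3,12,88\n"
second_keying = "id,age,weight\n1,9,62\n2,6,40\n3,21,88\n"
report = compare_keyings(first_keying, second_keying)
for case_id, field, a, b in report["cell_differences"]:
    print(f"case {case_id}, {field}: first keying {a!r}, second keying {b!r}")
```

Every line of the output is a place where the two keyings disagree, so each one sends somebody back to the paper form.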

Art Kendall
Social Research Consultants

Peck, Jon wrote:

>To add to all this good advice,
>
>In many cases whether something is an outlier or not depends on a model. It may be an extreme value not explained by the model. It may be much more complicated than a univariate extreme. The new Anomaly Detection procedure in SPSS can help to find these in a multivariate framework, although it is still up to you to decide what to do about it.
>
>In a context such as regression, it is good to look at the leverage statistics to see whether potential outliers actually affect your results much or not.
>
>Finally, consider the process assumed to be generating the data. It is commonly observed that stock market prices follow a random walk model, which means that the variance is not finite. Such a fat-tailed distribution will intrinsically have more outliers than we are probably accustomed to seeing, but that is part of the phenomenon to model.
>
>Note that outlier and unusual are not quite the same thing. You might have a very lonely and suspicious value buried in the middle of your data in a sparse region. Is that an outlier? It might be equally suspicious.
>
>Regards,
>Jon Peck
>SPSS
>
>-----Original Message-----
>From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Art Kendall
>Sent: Sunday, August 27, 2006 8:06 AM
>To: [hidden email]
>Subject: Re: [SPSSX-L] outliers??
>
>I left the previous two excellent responses in this message.
>I have pasted one of my soapbox statements below which gives another
>perspective.
>
>Art Kendall
>Social Research Consultants
>
>Hector Maletta wrote:
>
>
>
>>"Outliers" have never been defined satisfactorily, and the concept is seldom
>>used in a consistent way. Outliers are not "impossible" values, such as a
>>widower who is 4 years old, or a mother who is younger than her daughter.
>>Those are most likely data-entry or data-taking errors.
>>Outliers are, most properly, extreme values. In a sample about heights they
>>are individuals measuring over seven feet, or dwarfs. In an income sample
>>they are people like Bill Gates. They are not impossible, they are simply
>>rare.
>>Of course, an extreme value may also be a simple mistake: a person 112
>>years old may be just 12, and someone who is 7'11" tall may be a more common
>>5'11" just wrongly written or typed. But they just might exist, extremely
>>old, extremely tall, extremely wealthy.
>>Now, what is wrong with finding rare cases? If they exist, they should be
>>dutifully recorded in your data, not hidden under the carpet. The problem is
>>that they may distort your sample results if you are not careful in their
>>treatment. If you have a 1/10,000 sample of a certain area, in order to
>>estimate the distribution of heights, and stumble on the one and only dwarf
>>in the neighborhood, you may end up estimating that the area is populated by
>>10,000 small people, or (in another example) by 10,000 people with the
>>income of Bill Gates. They may alter the shape of your curve, or disfigure
>>your mean or standard deviation.
>>From another point of view, if you start again and draw your sample anew,
>>chances are you won't stumble again on the only giant or the only dwarf in
>>town. Of all possible random samples of the same size, just very few will
>>include them, precisely because such subjects are rare, perhaps unique. If
>>you have some grounds to know that they are extremely rare in the general
>>population from which your sample comes, you may decide to exclude them from
>>the sample, though this is seldom advisable without careful statistical
>>analysis.
>>One interesting exercise is considering the impact of their removal on the
>>mean and standard deviation of important variables, and on the slope of key
>>regression coefficients in your research. Suppose you are investigating the
>>relationship between capital and technology, and discover a very strong
>>relationship: more money, more high tech, but then you discover that the
>>whole thing crumbles down when you withdraw Bill Gates from the sample: he
>>was that solitary point to the Northeast of your scatterplot, while all
>>other capitalists in your sample made their money in old fashioned low-tech
>>businesses. Out goes Bill, and your money-tech beta falls into
>>non-significance (just a fictional example, of course). SPSS Regression, for
>>instance, lets you see the impact of removing each case on the overall fit
>>of a regression model. A high-impact case is probably an outlier worth
>>considering for closer inspection (and possible removal if suspect).
>>Even if SPSS may identify high-impact cases, all this requires human
>>intelligence. No surefire statistical device can do it for you.
>>Hope this helps.
>>Hector
>>
>>
>>
>>
>>-----Original Message-----
>>From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
>>Richard Ristow
>>Sent: Sunday, August 20, 2006 5:31 PM
>>To: [hidden email]
>>Subject: Re: outliers??
>>

Art Kendall
Social Research Consultants

Hector Maletta

Re: outliers??

In reply to this post by Peck, Jon

Jon Peck wrote:
"Note that outlier and unusual are not quite the same thing. You might have
a very lonely and suspicious value buried in the middle of your data in a
sparse region. Is that an outlier? It might be equally suspicious."
Indeed. I have stumbled on some recently. And the case leads me to one kind
of unusual data not always specifically recognized, those only revealed in
ratios and relationships. In my case it was a seed rate (amount of seed sown
per hectare). Farmers tend to use very definite and separate amounts
depending on the technology they apply (type of seed, use of irrigation,
type of sowing device, etc.) resulting in discrete seed rates like, say, 50,
100 or 200 kg/Ha. Farmers were not asked to report the seed rate, only the
area planted and the total amount of seed used, and nothing abnormal was
initially detected there. But when the ratio was computed, and most cases
fell on the expected values (such as 50, 100 or 200) or within rounding
error of them, a few cases fell in between, usually at fractional values
(say at 72.8333, 87.42 or 176.5 kg/Ha). On closer inspection most of them
were revealed as data entry mistakes. They were unusual, worthy of
inspection, worth correcting, but not outliers in the usual sense of "out of
range", though outliers in the more subtle sense of "a legitimate value, not
necessarily out of range but not in the list of usual or expected values".
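
The seed-rate check described above can be sketched in a few lines of Python: compute the ratio and flag cases that do not land near any of the expected rates. The expected rates follow the 50/100/200 kg/Ha example in the text; the tolerance, the record layout and the function name `off_grid_rates` are assumptions made for illustration.

```python
def off_grid_rates(records, expected=(50, 100, 200), tolerance=2.0):
    """Flag cases whose seed rate is not within `tolerance` of an expected rate.

    records: iterable of (case_id, seed_kg, area_ha).
    Returns a list of (case_id, rate); rate is None when area is not positive.
    """
    flagged = []
    for case_id, seed_kg, area_ha in records:
        if area_ha <= 0:
            flagged.append((case_id, None))  # ratio cannot be computed
            continue
        rate = seed_kg / area_ha
        if min(abs(rate - e) for e in expected) > tolerance:
            flagged.append((case_id, round(rate, 2)))
    return flagged

# Hypothetical farm records: (case id, total seed in kg, area planted in Ha).
farms = [
    (1, 150, 3),   # 50 kg/Ha
    (2, 437, 6),   # falls between the usual rates
    (3, 400, 2),   # 200 kg/Ha
    (4, 101, 1),   # 101 kg/Ha, within rounding tolerance of 100
    (5, 353, 2),   # falls between the usual rates
]
to_inspect = off_grid_rates(farms)
print(to_inspect)
```

Neither the seed totals nor the areas look odd on their own here; only the ratio exposes cases 2 and 5, which is the point of checking derived variables.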
Hector

Hector Maletta

Re: outliers??

In reply to this post by Samuel Solomon

Yes, but "outliers" only by this definition of mine, not by the old
fashioned definition of the outlier as a value that is extreme or
out-of-range, i.e. lower than the acceptable minimum or higher than the
acceptable maximum. The seed rates were neither, but they were "outliers",
sort of, by my newfangled definition including unusual intermediate values.
Since the "usual" or "acceptable" values may have a margin of tolerance, it
is always a matter of discretion to define where outlier territory begins,
both in the case of out-of-range or extreme outliers and in the case of
intermediate unusual values. Cases near the cutoff point will always be
doubtful, but their frequency is also a criterion: when you have a judicious
cutoff point of, say, 55, but you have a bunch of, say, two dozen cases at
60, with only solitary cases at higher values, perhaps your inclusive cutoff
point should be shifted to 60, and only those above 60 be regarded as
outliers (on the assumption that it is unlikely that two dozen people make
the same mistake). But then again, you may be wrong in doing so if there is
some mechanism at play producing the mistake systematically (I will send a
separate message telling the story of some tribulations of mine with
numerals in the English transcription of surveys using Arabic script, as a
cautionary tale in this regard).
Hector

-----Original Message-----
From: Peck, Jon [mailto:[hidden email]]
Sent: Monday, August 28, 2006 10:33 AM
To: Hector Maletta
Subject: RE: outliers??

So applying your model, simple though it was, exposed these as outliers with
regard to the model, in a sense.

-Jon

-----Original Message-----
From: Hector Maletta [mailto:[hidden email]]
Sent: Monday, August 28, 2006 8:26 AM
To: Peck, Jon; [hidden email]
Subject: RE: outliers??


Scott Terry

Re: outliers??

In reply to this post by Samuel Solomon

My $.02 ...

It also depends on what your modeling objective is. If, for example,
you're a direct marketer and you're trying to predict responsiveness,
your objective may be to predict the MOST responses. So given that
extremes or outliers can skew predictiveness, suppressing extremes or
outliers may be in order.

Another method I've used over the years (granted you have adequate
sample sizes) is to dummy extremes or outliers. Doing so can help
explain their impact and it can help you make a determination to include
or drop the term from your equation.
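
A small Python sketch of the dummy approach just described, on invented data: the same regression is fitted with and without a 0/1 flag for the extreme case. The tiny `ols` helper solves the normal equations directly, which is adequate only for toy examples like this one; the data and all names here are hypothetical.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations (toy-sized problems only)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Invented data: eight ordinary cases plus one extreme case (the last one).
x = [1, 2, 3, 4, 5, 6, 7, 8, 50]
y = [5, 3, 6, 2, 7, 4, 6, 3, 40]
flag = [0] * 8 + [1]  # dummy variable marking the extreme case

plain = ols([[1, xv] for xv in x], y)                       # intercept, slope
dummied = ols([[1, xv, f] for xv, f in zip(x, flag)], y)    # intercept, slope, flag

print(f"slope without the dummy: {plain[1]:.3f}")
print(f"slope with the dummy:    {dummied[1]:.3f}")
print(f"dummy coefficient:       {dummied[2]:.3f}")
```

With a dummy for a single case, the other coefficients are the same as if that case had been left out, and the dummy coefficient measures how far the case sits from the line fitted to everyone else. That makes its impact visible without silently dropping it.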

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Peck, Jon
Sent: Monday, August 28, 2006 8:02 AM
To: [hidden email]
Subject: Re: outliers??

To add to all this good advice,

In many cases whether something is an outlier or not depends on a model.
It may be an extreme value not explained by the model. It may be much
more complicated than a univariate extreme. The new Anomaly Detection
procedure in SPSS can help to find these in a multivariate framework,
although it is still up to you to decide what to do about it.

In a context such as regression, it is good to look at the leverage
statistics to see whether potential outliers actually affect your
results much or not.
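
As a rough illustration of the leverage check mentioned above, here is a Python sketch for the simplest case, a regression with one predictor. It computes the hat values (leverage) and Cook's distance by hand; the data are invented, the function name is hypothetical, and the 2p/n cutoff is only a common rule of thumb, not something taken from this thread.

```python
from statistics import mean

def leverage_and_cooks(x, y):
    """Hat values and Cook's distances for the simple regression y = a + b*x."""
    n, p = len(x), 2  # p = number of fitted coefficients (intercept and slope)
    mx, my = mean(x), mean(y)
    sxx = sum((v - mx) ** 2 for v in x)
    b = sum((xv - mx) * (yv - my) for xv, yv in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yv - (a + b * xv) for xv, yv in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)          # residual variance
    hat = [1 / n + (v - mx) ** 2 / sxx for v in x]    # leverage of each case
    cooks = [e * e / (p * s2) * h / (1 - h) ** 2 for e, h in zip(resid, hat)]
    return hat, cooks

# Invented data: eight ordinary cases plus one case far out on x.
x = [1, 2, 3, 4, 5, 6, 7, 8, 50]
y = [5, 3, 6, 2, 7, 4, 6, 3, 40]
hat, cooks = leverage_and_cooks(x, y)

cutoff = 2 * 2 / len(x)  # rule-of-thumb 2p/n
high_leverage = [i for i, h in enumerate(hat) if h > cutoff]
for i in high_leverage:
    print(f"case {i}: leverage {hat[i]:.3f}, Cook's D {cooks[i]:.1f}")
```

A high-leverage case tends to pull the line toward itself, so its own residual can look unremarkable; that is why leverage and influence are worth checking in addition to residuals.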

Finally, consider the process assumed to be generating the data. It is
commonly observed that stock market prices follow a random walk model,
which means that the variance is not finite. Such a fat-tailed
distribution will intrinsically have more outliers than we are probably
accustomed to seeing, but that is part of the phenomenon to model.

Note that outlier and unusual are not quite the same thing. You might
have a very lonely and suspicious value buried in the middle of your
data in a sparse region. Is that an outlier? It might be equally
suspicious.

Regards,
Jon Peck
SPSS