Fw: SPSS-Stats question regarding outliers


Søren Bech
----- Forwarded by Søren Bech/SBE/Bang & Olufsen/DK on 13-08-2007 21:15
-----

Søren Bech/SBE/Bang & Olufsen/DK
13-08-2007 15:42

To
[hidden email]
cc

Subject
Re: SPSS-Stats question regarding outliers





Hi Tom

This is a very complicated question and one that cannot be answered with
a simple rule. I hope the list can forgive me for introducing a small ad,
but in our book (
http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470869232.html) on
subjective audio evaluation we discuss these issues, including a large
number of references to, e.g., the food industry, where sensory
evaluations are also used and where there are standards for training
subject panels. We are also involved in a project that has developed a
program (PanelCheck, see
http://www.matforsk.no/web/sampro.nsf/webTemaPE/PanelCheck!OpenDocument)
that can be used to identify outlying subjects.
Please do not hesitate to mail me with further questions.

Regards

Soren



Tom Werner <[hidden email]>
Sent by: "SPSSX(r) Discussion" <[hidden email]>
13-08-2007 15:26
Please respond to
[hidden email]


To
[hidden email]
cc

Subject
Re: SPSS-Stats question regarding outliers






This is a most interesting discussion.

I might suggest that one area in which it may be important to identify
(and
perhaps remove) outliers is ratings by judges.

If we were to have panels of judges judging entries by rating them on
numerical scales (such as in an awards program, skating/gymnastics
judging,
or applicant judging), we would need to identify and manage inter-rater
reliability/agreement.

One way to achieve appropriate inter-rater reliability would be to
identify
and remove outlier ratings (overly strict or overly lenient scores
relative
to other judges' scores on the same entry).
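
For instance (purely a sketch, with hypothetical variable names), one could
flag ratings that sit far from the other judges' scores on the same entry
with SPSS syntax along these lines:

* Sketch: flag ratings far from the panel mean for each entry.
* Assumes one row per judge-by-entry rating; all names are hypothetical.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK=entry_id
  /mean_rating=MEAN(rating)
  /sd_rating=SD(rating).
COMPUTE rating_dev = (rating - mean_rating) / sd_rating.
COMPUTE suspect_rating = (ABS(rating_dev) > 2).
EXECUTE.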

It strikes me that if this isn't done, the awards program (or sports
contest
or application process) risks having variation that is due more to the
judges than to the entries.

(Other steps could also be taken to increase inter-rater reliability, such
as training of the judges. But it seems that identifying and removing
outlier scores would always be a worthwhile additional step.)

I'd be grateful for any thoughts on this.


Tom Werner



-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Art Kendall
Sent: Monday, August 13, 2007 7:46 AM
To: [hidden email]
Subject: Re: SPSS-Stats question regarding outliers

<soapbox>

"Outliers" is a very problematic concept.  There are a wide variety of
meanings ascribed to the term.

Extreme values may be valid. They may be the most important values.
Arbitrary treatment of values as outliers should rarely, if ever, be done.
Leverage statistics, etc., only identify potential or suspected outliers.

Based on over 30 years of consulting on statistics and methodology, I
believe the usual explanation for suspicious values is a failure of the
quality assurance procedure. I think of a *potential* outlier as a
surprising or suspicious value for a variable (including residuals).
In my experience, in the vast majority of instances, they indicate data
gathering or data entry errors, i.e., insufficient attention to quality
assurance in data gathering or data entry. Rechecking QA typically
eliminates over 80% of suspicious data values. This is one reason I
advocate thorough exploration of a set of data before doing the analysis.
By thorough exploration I mean things like frequencies, multi-way
crosstabs, scatterplots, box plots, rechecking scale keys and reliability,
etc.
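
In SPSS terms, that first pass might look something like the following
(just a sketch; the variable names are placeholders):

* Illustrative exploration pass; variable and scale names are placeholders.
FREQUENCIES VARIABLES=age weight income
  /STATISTICS=MINIMUM MAXIMUM MEAN STDDEV.
CROSSTABS /TABLES=region BY group BY sex.
GRAPH /SCATTERPLOT(BIVAR)=age WITH weight.
EXAMINE VARIABLES=income /PLOT=BOXPLOT /STATISTICS=EXTREME(5).
RELIABILITY /VARIABLES=item1 TO item10 /SCALE('attitude') ALL /MODEL=ALPHA.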

Derived variables, such as residuals and rates, should be subjected to the
same thorough examination and understanding-seeking as raw variables.
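
Regression residuals, for example, can be saved and inspected the same way
(again just a sketch with placeholder names):

* Save standardized residuals and influence statistics, then inspect them.
REGRESSION
  /DEPENDENT y
  /METHOD=ENTER x1 x2
  /SAVE ZRESID(zres) COOK(cook_d) LEVER(lever).
EXAMINE VARIABLES=zres /PLOT=BOXPLOT /STATISTICS=EXTREME(5).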

In cluster analysis there are sometimes singleton clusters, e.g., Los
Angeles County is distinct from the other counties in the western US
states. Sometimes there are 500 lb persons. There might be a rose growing
in a cornfield.


The first thing to do with outliers is to *_prevent_* them by careful
quality assurance procedures in data gathering and data entry.

A thorough search for suspect data values, and potentially treating them as
outliers in analysis, is an important part of data quality assurance.
Values for a variable are suspect and in need of further review when they
are unusual given the subject matter area, fall outside the legitimate
range of the response scale, show up as isolated points on scatterplots,
have subjectively extreme residuals, produce very high-order interactions
in ANOVA analyses, result in a case being extremely influential in a
regression, etc. Recall that researchers consider Murphy a Pollyanna.
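
A trivial example of the "legitimate range" check (the 1-to-7 scale and the
variable name are hypothetical):

* Flag responses outside the legitimate 1-7 range of the response scale.
COMPUTE q1_out_of_range = (RANGE(q1, 1, 7) = 0).
FREQUENCIES VARIABLES=q1_out_of_range.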


The detection of odd/peculiar/suspicious values late in the data analysis
process is one reason to ensure that you can go all the way back and redo
the process. Keeping all of the data-gathering instruments and preserving
the syntax for every data transformation are important parts of going back
and checking on "outliers". The occurrence of many outliers suggests that
the data entry was sloppy, and there are then likely to be incorrectly
entered values that are not "outliers".
Although it is painful, another round of data entry and verification may
be in order.


*Correcting the data.*

Sometimes you can actually go back and redo the measurements. (Is there
really a 500-pound 9-year-old?) You should always have all the paper from
which the data were transcribed.
On the rare occasions when there are very good reasons, you might modify
the value for a particular case, e.g., percent correct entered as 1000% ==>
100%.
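
A correction like that is best made in syntax rather than by hand-editing
the data file, so the change itself is documented (case ID and values are
hypothetical):

* Correct a documented transcription error for a single case;
* keeping this syntax is the audit trail for the change.
IF (case_id = 1234 AND pct_correct = 1000) pct_correct = 100.
EXECUTE.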


*Modifying the data.*
Values of variables should be trimmed or recoded to "missing" only when
there is a clear rationale, and then only when it is not possible to redo
the measurement process. (Maybe there really is a six-year-old who weighs
400 lbs. Go back and look if possible.)
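
When such a recode really is justified, declaring a user-missing code
(rather than deleting the case) keeps the decision visible (names and the
cutoff are hypothetical):

* Recode implausible weights to a user-missing code instead of deleting cases.
RECODE weight (401 THRU HIGHEST = 999).
MISSING VALUES weight (999).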

If suspected outliers are recoded or trimmed, the analysis should be done
both as-is and as modified, to see what the effect of the modification is.
Changing values suspected to be outliers frequently leads to misleading
results. These procedures should be used very sparingly.
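
One way to run that comparison without losing the original data is a
temporary filter (this assumes a suspect flag has already been computed;
names are hypothetical):

* Analyze as-is, then again with flagged cases filtered out, and compare.
DESCRIPTIVES VARIABLES=income /STATISTICS=MEAN STDDEV MINIMUM MAXIMUM.
COMPUTE keep_case = (suspect_flag = 0).
FILTER BY keep_case.
DESCRIPTIVES VARIABLES=income /STATISTICS=MEAN STDDEV MINIMUM MAXIMUM.
FILTER OFF.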

Mathematical criteria can only identify suspects. There should be a trial
before there is a verdict, and the presumption should be against outlier
status for a value.


I don't recommend practices such as cavalierly trimming at 3 SDs. Having a
value beyond 3 SD can be a reason to examine a case more thoroughly.
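
For example, instead of trimming at 3 SD, one can simply list those cases
for review (DESCRIPTIVES /SAVE creates Z-score variables prefixed with Z;
the names are hypothetical):

* List cases beyond 3 SD for review rather than trimming them automatically.
DESCRIPTIVES VARIABLES=income /SAVE.
TEMPORARY.
SELECT IF (ABS(Zincome) > 3).
LIST VARIABLES=case_id income Zincome.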

It is advisable to consult with a statistician before changing the values
of
suspected outliers.

*Multiple analyses.*

If you have re-entered the data or re-run the experiment, done very
thorough exploration of the data, and suspicious values remain, you are
stuck, as a last resort, with doing multiple analyses: including vs.
excluding the case(s); changing the values for the case(s) to hot-deck
values, to some central tendency value, or to the max or min of the
response scale (e.g., for achievement, personality, or attitude measures);
etc.

In the small minority of occasions where the data cannot be cleaned up, the
analysis should be done in three or more ways (include the outliers as is,
trim the values, treat the values as missing, transform to ranks, include
in the model variables that flag those cases, or ...). The reporting
becomes much more complex. Consider yourself very lucky if the conclusions
do not vary substantially.
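
One of those variants, keeping the cases but flagging them in the model,
might look roughly like this (a sketch only; the flag definition and the
names are hypothetical):

* Keep the suspect cases but add a dummy variable flagging them to the model.
COMPUTE suspect_flag = (ABS(zres) > 3).
REGRESSION
  /DEPENDENT y
  /METHOD=ENTER x1 x2 suspect_flag.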

</soapbox>

Art Kendall
Social Research Consultants

Hector Maletta wrote:

>          About the exclusion of outliers outside +/- 3 SD, a big "if"
> concerns the distribution of the variable. In biological variables,
> mostly with a normal distribution, cases outside the -3 to +3 range
> are rare, and more so if outside +/- 4 or 5. Moreover, they are often
> the result of data entry errors or sample flukes. But in other kinds
> of variables it ain't necessarily so. Income, for instance, is clearly
> skewed, and excluding those cases above +3 SD (even in the case of
> using log income) may imply leaving the rich, and with them a big
> chunk of aggregate income, outside the analysis.
>          As a general rule, do not exclude any valid datum.
>          Hector
>
>
>
>          -----Original Message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Ken Belzer
> Sent: 13 August 2007 01:02
> To: [hidden email]
> Subject: Re: SPSS-Stats question regarding outliers
>
>          Thanks very much to Robert for raising this issue and to
>          those who responded to my follow-up questions. I had been
>          using a simple rule-of-thumb from Tabachnick & Fidell (1996 -
>          probably a bit outdated) of excluding outliers from the
>          model/analysis if their z-scores relative to their
>          distribution were equal to or greater than +/-3.28 --
>          primarily for univariate procedures.
>
>          Clearly, there's quite a bit more to consider, and many more
>          ways to examine the nature and impact of outliers before
>          simply excluding them. I've saved these responses for future
>          reference -- thanks again.
>
>          Kind regards,
>          Ken
>
>
>