outliers detection and exclusion

outliers detection and exclusion

Zhicheng Lin
Dear List members,

I have been doing experiments and analyzing data for two years but still
have some basic questions to ask (my field is vision and cognitive
psychology):

1) A general question: When do you exclude outliers and what standards do
you use? I understand that excluding outliers normally won't change much
about Mean but will make SD smaller. But still there are cases we do exclude
outliers not just to make SD look better.

2) A specific one: Suppose we have two conditions A B in the study. When
excluding outliers according to 3SD, do you do this in each condition (i.e.
data in A and B are separated) or on the whole data set (i.e. combination of
A and B)?
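
For question 2, a small sketch may help show why the choice matters. It is only an illustration under stated assumptions: the reaction times are made up, and the helper name `sd_outlier_flags` is hypothetical, not an SPSS function. It flags values more than 3 sample SDs from the mean, once within each condition and once on the pooled data.

```python
from statistics import mean, stdev

def sd_outlier_flags(values, k=3.0):
    """Return True for each value more than k sample SDs from the mean of `values`."""
    m, s = mean(values), stdev(values)
    return [abs(v - m) > k * s for v in values]

# Hypothetical reaction times in ms; condition B is slower overall.
cond_a = [400 + 5 * i for i in range(20)]            # 400 .. 495
cond_b = [700 + 5 * i for i in range(20)] + [1050]   # 700 .. 795 plus one slow trial

# Option 1: flag within each condition separately.
flags_a = sd_outlier_flags(cond_a)
flags_b = sd_outlier_flags(cond_b)

# Option 2: flag against the mean and SD of the pooled data.
flags_pooled = sd_outlier_flags(cond_a + cond_b)

print("flagged within A:", sum(flags_a))
print("flagged within B:", sum(flags_b))
print("flagged in pooled data:", sum(flags_pooled))
```

With these made-up numbers the slow trial is flagged within B but not in the pooled data, because the difference between the condition means inflates the pooled SD. Which behaviour is appropriate depends on the design; the sketch only shows that the two options can disagree.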

Any ideas? Many thanks!

Cheers,

Zhicheng

--
******************************************
Zhicheng Lin
Department of Psychology
University of Minnesota
75 East River Rd, Elliott Hall
Minneapolis, MN 55455


Email: [hidden email]
Phone: 612-625-2470
Fax: 612-626-2079
http://zhichenglin.googlepages.com/
******************************************

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: outliers detection and exclusion

Richard Ristow
At 09:29 PM 3/2/2008, Zhicheng Lin wrote:

>1) A general question: When do you exclude outliers and what
>standards do you use? I understand that excluding outliers normally
>won't change much about Mean but will make SD smaller. But still
>there are cases we do exclude outliers not just to make SD look better.

There's been a lot of correspondence about 'outliers', and excluding
them, on this list, over the years.

The general, short answer is: don't do it, except for 'outliers' that
can confidently be identified as erroneous data. Otherwise, you're
distorting your data and your analysis.

>I understand that excluding outliers normally won't change much
>about Mean but will make SD smaller.

It won't change Mean *only* if the rarer, larger values have the same
mean as the more common values. There's not the least reason this
need be so. (And remember, the large values have a very heavy weight
in computing the mean.)
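
A minimal sketch of that point, using made-up right-skewed data and a hypothetical helper `trim_by_sd` (a single-pass 3 SD rule, assumed here only for illustration):

```python
from statistics import mean, stdev

def trim_by_sd(values, k=3.0):
    """Keep values within k sample SDs of the mean (one pass, no iteration)."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= k * s]

# Hypothetical data: 30 typical values and one large value.
data = [10, 11, 12, 13, 14] * 6 + [60]
kept = trim_by_sd(data)

print(f"mean with all values:      {mean(data):.2f}")
print(f"mean after 3 SD exclusion: {mean(kept):.2f}")
print(f"values dropped:            {len(data) - len(kept)}")
```

One value out of 31 is dropped, and the mean moves from about 13.5 to 12, because the dropped value does not have the same mean as the retained ones.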

As for reducing the SD, that's cheating. The SD really is what it is.
What if you threw out everything more than 1 SD from the mean? Your
SD would look really good, but your data would be nowhere near as
precise as that would look.
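
The thought experiment can be run as a rough simulation. This sketch assumes normally distributed data with an arbitrary mean of 100 and SD of 15; the helper `keep_within` is hypothetical. For exactly normal data, a 1 SD cut leaves an SD a bit over half the original, even though nothing about the measurement has become more precise.

```python
import random
from statistics import mean, stdev

def keep_within(values, k):
    """Keep values within k sample SDs of the sample mean (single pass)."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= k * s]

rng = random.Random(0)  # fixed seed so the run is repeatable
data = [rng.gauss(100, 15) for _ in range(10_000)]  # simulated, not real data

full_sd = stdev(data)
# SD of what remains after cutting at 3, 2 and 1 SD from the mean.
sd_after = {k: stdev(keep_within(data, k)) for k in (3, 2, 1)}

print(f"no exclusion: SD = {full_sd:.1f}")
for k, sd in sd_after.items():
    print(f"drop beyond {k} SD: SD = {sd:.1f}")
```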

>Any ideas? Many thanks!

Well, there's a harsh one ...

-Onward, and best wishes,
  Richard


--
No virus found in this outgoing message.
Checked by AVG.
Version: 7.5.518 / Virus Database: 269.21.7/1324 - Release Date: 3/10/2008 7:27 PM


Re: outliers detection and exclusion

Katkowski, David
Well, in some areas of specialization (e.g., testing) it is common practice to eliminate outliers. Think about a criterion-related validation study. You are asking incumbents to complete some predictor and their supervisors to make performance ratings on those incumbents. Neither is very excited to spend a couple hours on something they are told has to be done.

As a result, we have many incumbents who don't try very hard. To include those incumbents in the final analysis would distort the picture of how well the predictor does at selecting candidates for the job in question. There are several ways that we look for these folks, but one common standard is to eliminate any data point that is more than 3.29 SDs above or below the mean.

Of course, there are many arguments for not engaging in this practice; however, if we are being practical rather than academic, I think it is a responsible, defensible choice.
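
As a rough sketch of the 3.29 SD standard: the scores below are invented, and `flag_extreme` is a hypothetical helper, not part of any testing package. The first lines compute the share of cases beyond that cutoff if the scores were exactly normal, which comes to about 1 in 1,000 (two-tailed).

```python
from statistics import NormalDist, mean, stdev

CUTOFF = 3.29  # the cutoff mentioned above

# Two-tailed probability of |z| > 3.29 under an exactly normal distribution.
tail_prob = 2 * (1 - NormalDist().cdf(CUTOFF))

def flag_extreme(scores, cutoff=CUTOFF):
    """Return True for each score whose |z| (using the sample SD) exceeds cutoff."""
    m, s = mean(scores), stdev(scores)
    return [abs((x - m) / s) > cutoff for x in scores]

# Hypothetical predictor scores: 44 ordinary scores and one very low score.
scores = [50 + (i % 11) for i in range(44)] + [5]
flags = flag_extreme(scores)

print(f"two-tailed tail probability at {CUTOFF} SD: {tail_prob:.4f}")
print("scores flagged:", [x for x, f in zip(scores, flags) if f])
```

Note that the mean and SD are computed with the extreme score included, so a single very large deviation also inflates the SD it is judged against.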

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Richard Ristow
Sent: Tuesday, March 11, 2008 8:48 PM
To: [hidden email]
Subject: Re: outliers detection and exclusion




Re: outliers detection and exclusion

Richard Ristow
At 09:05 AM 3/12/2008, Katkowski, David wrote:

>Well, in some areas of specialization (e.g., testing) it is common
>practice to eliminate outliers. [For example,] you are asking
>incumbents to complete some predictor and their supervisors to make
>performance ratings on those incumbents. Neither is very excited to
>spend a couple hours on something they are told has to be done.
>
>As a result, we have many incumbents who don't try very hard. To
>include those incumbents in the final analysis would distort the
>picture of how well the predictor does at selecting candidates.

An important point. This is the case where "you may have two
processes, one of which operates occasionally to produce the
[outlier] values, the other of which operates 'normally' but is
swamped when the larger process happens."(*) In your case, the
'larger process' is the respondents' decision to blow off the
questionnaire. ('Large' means a large effect relative to what's
usually seen. It can include values much nearer zero than the usual ones.)

In this case, there is indeed reason to exclude 'outliers', by some
reasonable heuristic if you can't observe the "larger process."

It stands, though: In the absence of a specific argument that your
'outliers' are from a different population than the 'normal' cases,
it's not good methodology to drop them.
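
The two-process situation can be illustrated with a simulation. Everything here is an assumption chosen for the sketch: the 10% share of disengaged respondents, the effect sizes, and the names `ability`, `rating`, `careless` and `score`. In the simulation the second process is observed directly, which is exactly what is usually not possible with real data.

```python
import random
from statistics import mean

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2008)  # fixed seed so the run is repeatable
n = 2000
ability = [rng.gauss(0, 1) for _ in range(n)]
rating = [0.6 * a + rng.gauss(0, 0.8) for a in ability]   # supervisor rating
careless = [rng.random() < 0.10 for _ in range(n)]        # the occasional process

# Normal process: the test score reflects ability.
# Occasional process: the score is low and unrelated to ability.
score = [rng.gauss(-3, 0.5) if c else a for a, c in zip(ability, careless)]

r_all = pearson_r(score, rating)
r_engaged = pearson_r(
    [s for s, c in zip(score, careless) if not c],
    [r for r, c in zip(rating, careless) if not c],
)

print(f"validity, all respondents:     r = {r_all:.2f}")
print(f"validity, engaged respondents: r = {r_engaged:.2f}")
```

Under these assumptions the correlation computed on everyone understates the relationship among engaged respondents. That is an argument for exclusion only when there is a specific reason to believe a second process is at work.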
................
(*) I'm quoting myself:
Date:    Sun, 20 Aug 2006 16:30:50 -0400
From:    Richard Ristow <[hidden email]>
Subject: Re: outliers??
To:      [hidden email]



Re: outliers detection and exclusion

Swank, Paul R
And should outliers be dropped, it is good practice to compare results
with and without outliers in the model to demonstrate the effect the
outliers have on the results.
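
One way to lay out such a comparison, sketched with invented reaction times for two conditions. The helpers `welch_t` and `drop_beyond` are hypothetical, and the per-condition 3 SD rule is only one of several possible exclusion rules.

```python
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    va, vb = stdev(a) ** 2 / len(a), stdev(b) ** 2 / len(b)
    return (mean(a) - mean(b)) / (va + vb) ** 0.5

def drop_beyond(values, k=3.0):
    """Keep values within k sample SDs of the mean (single pass)."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= k * s]

# Hypothetical reaction times in ms; condition B contains one very slow trial.
cond_a = [400 + 5 * i for i in range(20)]
cond_b = [420 + 5 * i for i in range(20)] + [1050]

analyses = {
    "all data": (cond_a, cond_b),
    "3 SD exclusion": (drop_beyond(cond_a), drop_beyond(cond_b)),
}

# Report both analyses side by side rather than only the preferred one.
results = {}
for label, (a, b) in analyses.items():
    results[label] = welch_t(a, b)
    print(f"{label}: n = {len(a)} / {len(b)}, "
          f"mean A = {mean(a):.1f}, mean B = {mean(b):.1f}, "
          f"SD B = {stdev(b):.1f}, Welch t = {results[label]:.2f}")
```

In this made-up example a single excluded trial changes both the mean difference and the test statistic, which is the kind of effect a reader would want to see reported.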

Paul R. Swank, Ph.D.
Professor and Director of Research
Children's Learning Institute
University of Texas Health Science Center - Houston


-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Richard Ristow
Sent: Wednesday, March 12, 2008 3:13 PM
To: [hidden email]
Subject: Re: outliers detection and exclusion


