Outlier Issue


Outlier Issue

Kuan
Hi Everyone,

I used to detect outliers by converting raw scores to z-scores and checking whether the z-score for each survey item is greater than 3.29 or less than -3.29.
My colleague suggested that I instead just calculate the mean for each survey and convert those mean values into z-scores.
Is my colleague right?

Guan
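
For concreteness, here is a minimal sketch (Python/pandas, not SPSS syntax) of the item-level check described above. The data frame, the item names, the 1-to-5 response range, and the injected value of 55 are all made up for illustration.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical survey data: one row per respondent, one column per item,
# 1-5 Likert-style responses plus one deliberately miskeyed value.
n = 200
df = pd.DataFrame({
    "item1": rng.integers(1, 6, size=n),
    "item2": rng.integers(1, 6, size=n),
})
df.loc[0, "item1"] = 55   # e.g. a data-entry error

CUTOFF = 3.29   # |z| > 3.29 is roughly the most extreme 0.1% under normality

z = (df - df.mean()) / df.std(ddof=1)   # column-wise z-scores
flags = z.abs() > CUTOFF                # True where an item value is extreme

print(flags.sum())                  # number of flagged values per item
print(df[flags.any(axis=1)])        # respondents with at least one flagged item

On a bounded 1-to-5 item the spread of possible z-scores is limited, so a flag here often points to a data-entry or coding problem rather than to a substantively unusual respondent.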


Re: Outlier Issue

Maguin, Eugene
Restated: Given a set of raw scores for an item, YOU convert the raw scores to z-scores and look for values outside the interval -3.29 to 3.29. I'm not quite sure I understand what your colleague is advising you to do. Restated, it seems that he/she computes the mean of the score set and then divides that mean by the SD to get a standardized difference from 0.0. If that is true, then that is wrong. Outliers refer to individual values given a set of scores or a mean given a set of means.
Gene Maguin
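
Gene's distinction can be made concrete with a toy example. The colleague's suggestion admits two readings: (a) compute each respondent's mean across the items and z-score those per-respondent means, which is still a case-level check, just on the composite rather than on the individual items; or (b) divide the sample mean by the SD, which yields a single number for the whole sample and cannot flag anyone. The sketch below (Python/pandas) uses simulated data and illustrative names; it is not the original poster's procedure.

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(1, 6, size=(200, 4)),
                  columns=[f"item{i}" for i in range(1, 5)])

# Reading (a): z-score each respondent's scale mean. Still a case-level check,
# now on the composite score instead of on the individual items.
scale_mean = df.mean(axis=1)
z_scale = (scale_mean - scale_mean.mean()) / scale_mean.std(ddof=1)
print(df[z_scale.abs() > 3.29])   # respondents whose composite is extreme (likely none here)

# Reading (b): divide the sample mean by the SD. One number describing the
# whole sample; it says nothing about any individual respondent.
print(scale_mean.mean() / scale_mean.std(ddof=1))

Only reading (a) produces per-respondent flags; reading (b) is, as Gene says, not an outlier check at all.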





Re: Outlier Issue

Art Kendall
In reply to this post by Kuan
Why are you concerned about outliers?

What do you intend to do with values that are out in the tails?

The term "outlier" can have different denotations and implications.

"Outlier" often means that a  value is suspicious, anomalous, unusual, to-be-checked-on, etc.

What constructs are the variables under consideration designed to measure?

Are the items designed to be used in a summative scale?

Please give a more detailed explanation of the context in which the question arises.

Art Kendall
Social Research Consultants

Re: Outlier Issue

David Marso
Administrator
In reply to this post by Kuan
The notion of examining item by item for outliers is quite ludicrous!

Re: Outlier Issue

Richard Ristow
At 10:54 AM 10/7/2014, David Marso wrote:

>The notion of examining item by item for outliers is quite ludicrous!

I'm not sure I agree with that. 'Outliers' may be erroneous data; or
they may indicate a large effect that's present in a minority of your
cases. Either can be illuminating; the first, crucial.

Now, the original poster asked,
>I used to detect outliers by convert  raw scores to z-scores and see
>if the z value of each survey item is greater than 3.29 or less than -3.29.

That's based on the assumption of a normal distribution in the data;
under that assumption, only about one case in 1,000 will have a z-score
outside that range (P(|z| > 3.29) is roughly 0.001).

Unfortunately, real distributions frequently are not normal; and one
of the more common deviations from normality is long 'tails', more
frequent occurrences of large values than would occur for a normal
distribution.
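
To put numbers on both points: under normality the |z| > 3.29 rule flags about one case in a thousand, while a heavy-tailed distribution trips it far more often. A quick check with scipy, using a t distribution with 3 degrees of freedom purely as a stand-in for "long tails":

from scipy import stats

CUTOFF = 3.29

# Two-tailed exceedance probability under the standard normal: about 0.001.
p_normal = 2 * stats.norm.sf(CUTOFF)

# The same cutoff under a heavy-tailed distribution (t with 3 df, rescaled to
# unit variance) is exceeded far more often.
df_t = 3
scale = (df_t / (df_t - 2)) ** -0.5   # a t(3) variate times this factor has unit variance
p_heavy = 2 * stats.t.sf(CUTOFF / scale, df_t)

print(f"normal: {p_normal:.4%}   t(3), unit variance: {p_heavy:.4%}")

In this example the long-tailed distribution exceeds the cutoff roughly ten times as often as the normal, so the one-in-a-thousand intuition badly understates how many legitimate cases the rule would flag.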

Below, something I wrote about outliers quite a while ago:

>*Whether*, and *by what standard*, to identify outliers is at least
>as important as 'how'.
>
>. Outliers that fail consistency checks: For example, an event date
>prior to the beginning of the study, or later than the present, can
>be rejected as wrong. (Of course, this assumes that 'event dates' in
>the past or future aren't valid; in some study designs, they may
>be.) Those should be checked against the primary data source and
>corrected, if that is possible; and made missing, if it isn't.
>
>. Outliers that can't be rejected *a priori*: First, you shouldn't
>even try to look at those until you reject any demonstrable errors.
>
>Second, I would say a good way to look for them is to look at the
>high-percentile cutpoints in the distribution. Depending on the size
>of your dataset, 'high-percentile' could be 99%, 99.9%, and 99.99%.
>(These are not alternatives. If you use, say, 99.9%, you should look
>at 99% as well. Consider also looking at the 90% or 95% cutpoint,
>for a sense of the 'normal' range of the distribution. 5% outliers
>are NOT outliers. And, of course, look at both ends: 1%, 0.1%, 0.01%
>percentile cutpoints, as well.)
>
>Third, I think I'm seeing a trend in the statistics community
>against removing 'outliers' by internal criteria (n standard
>deviations, 1st and 99th percentiles). The rationale, and it's a
>strong one, is that those are observed values of whatever it is that
>you're measuring. If you eliminate them, you'll get a model based on
>their rarity; and that model, itself, can become an argument for
>eliminating them (because they don't fit it), and you can talk
>yourself into a model that's quite unrepresentative of reality.
>
>Fourth, however, the largest values will have a disproportionate,
>possibly dominant, effect on most linear models -- regression,
>ANOVA, even taking the arithmetic mean. Depending on your study, you can
>
>- Go ahead. In this case, the model's fit will be weighted toward
>predicting the largest values, and may show little discrimination
>within the 'cloud' of more-typical values. That, however, may be the
>right insight to be gained from the data.
>
>- If available, use a non-parametric method. That's often favored,
>because it neither rejects the large values nor gives them
>disproportionate weight. By the same token, however, if much of the
>useful information is in the largest values, non-parametric methods
>can unduly DE-emphasize these values.
>
>- There are reasons to reject this as heresy, but if you're doing
>linear modelling, I'd probably try it both with the largest values
>retained and with them eliminated. (I'd only do this if the 'largest
>values' look very far from the 'cloud' of regular values. A scatter
>plot can be an invaluable tool for this.) If the two models are
>closely similar, you have an argument that there's a single process
>going on, with the largest values being part of the same process. If
>they're very different, you may have two processes, one of which
>operates occasionally to produce the largest values, the other of
>which operates 'normally' but is swamped when the larger process
>happens. And if the run without the large values produces a poor
>R^2, you may have an argument that the observable process is
>represented by the largest values, and the variation in the 'normal
>cloud' is mostly noise.
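
Two of the suggestions in the quoted passage, scanning extreme percentile cutpoints at both ends and refitting a simple linear model with and without the most extreme values, are easy to sketch. Everything below is illustrative: the data are simulated, the 0.1%/99.9% trimming rule is arbitrary, and np.polyfit stands in for whatever modelling procedure is actually being used.

import numpy as np

rng = np.random.default_rng(2)

# Hypothetical outcome with a linear signal and heavy-tailed noise.
n = 10_000
x = rng.normal(size=n)
y = 2.0 * x + rng.standard_t(df=3, size=n)

# 1. Percentile cutpoints at both ends, for a sense of where the tails start.
for q in (0.01, 0.1, 1, 5, 95, 99, 99.9, 99.99):
    print(f"{q:6.2f}th percentile of y: {np.percentile(y, q):8.3f}")

# 2. Fit y ~ x with all cases, then again with the most extreme y values set aside.
lo, hi = np.percentile(y, [0.1, 99.9])
keep = (y >= lo) & (y <= hi)

slope_all, intercept_all = np.polyfit(x, y, 1)
slope_trim, intercept_trim = np.polyfit(x[keep], y[keep], 1)

print(f"slope with all cases:       {slope_all:.3f}")
print(f"slope without the extremes: {slope_trim:.3f}")
# Similar slopes suggest one process generating both the bulk and the tails;
# very different slopes suggest the extremes may come from something else.

With a genuine linear signal and symmetric heavy-tailed noise, as simulated here, the two slopes should come out close, which is the "single process" reading above; a large gap would be the cue to ask whether a second process is producing the extreme values.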
