Hi Everyone,
I used to detect outliers by converting raw scores to z-scores and checking whether the z value of each survey item is greater than 3.29 or less than -3.29. My colleague suggested that I instead just calculate the mean for each survey and convert the mean values into z-scores. Is my colleague right?

Guan

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L. For a list of commands to manage subscriptions, send the command INFO REFCARD.
Restated: Given a set of raw scores for an item, YOU convert the raw scores to z-scores and look for values outside the interval -3.29 to 3.29. I'm not quite sure I understand what your colleague is advising you to do. As I read it, he/she computes the mean of the score set and then divides that mean by the SD to get a standardized difference from 0.0. If that is true, then that is wrong. Outliers refer to individual values given a set of scores, or to a mean given a set of means.
Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Guan
Sent: Friday, October 03, 2014 1:10 PM
To: [hidden email]
Subject: Outlier Issue
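To make the distinction concrete, here is a minimal Python sketch. The item scores and the data-entry error (77) are invented for illustration: standardizing each raw score flags the aberrant case, while standardizing the survey mean collapses the item to one number that says nothing about any individual respondent.

```python
import statistics

def zscore_outliers(scores, cutoff=3.29):
    """Flag individual values whose |z| exceeds the cutoff."""
    m = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation
    return [x for x in scores if abs((x - m) / sd) > cutoff]

# Hypothetical item on a 1-7 scale, with one data-entry error (77):
item = [4, 5, 3, 4, 6, 5, 4, 77, 5, 4, 3, 5, 4, 6, 5, 4, 5, 3, 4, 5]
print(zscore_outliers(item))  # only the 77 is flagged

# The colleague's approach yields a single standardized mean; no
# individual case can be examined this way, which is the objection above.
mean_only = statistics.mean(item)
```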
Why are you concerned about outliers?
What do you intend to do with values that are out on the tails? The term "outlier" can have different denotations and implications. "Outlier" often means that a value is suspicious, anomalous, unusual, to-be-checked-on, etc. What constructs are the variables under consideration designed to measure? Are the items designed to be used in a summative scale? Please give a more detailed explanation of the context in which the question arises.
Art Kendall
Social Research Consultants
The notion of examining item by item for outliers is quite ludicrous!
----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
At 10:54 AM 10/7/2014, David Marso wrote:
>The notion of examining item by item for outliers is quite ludicrous!

I'm not sure I agree with that. 'Outliers' may be erroneous data; or they may indicate a large effect that's present in a minority of your cases. Either can be illuminating; the first, crucial.

Now, the original poster asked,

>I used to detect outliers by converting raw scores to z-scores and seeing
>if the z value of each survey item is greater than 3.29 or less than -3.29.

That's based on the assumption of a normal distribution in the data; under that assumption, only about one case in a thousand will have a Z-score outside that range. Unfortunately, real distributions frequently are not normal; and one of the more common deviations from normality is long 'tails': more frequent occurrences of large values than would occur for a normal distribution.

Below, something I wrote about outliers quite a while ago:

>*Whether*, and *by what standard*, to identify outliers, is at least
>as important as 'how'.
>
>. Outliers that fail consistency checks: For example, an event date
>prior to the beginning of the study, or later than the present, can
>be rejected as wrong. (Of course, this assumes that 'event dates' in
>the past or future aren't valid; in some study designs, they may
>be.) Those should be checked against the primary data source and
>corrected, if that is possible; and made missing, if it isn't.
>
>. Outliers that can't be rejected *a priori*: First, you shouldn't
>even try to look at those until you reject any demonstrable errors.
>
>Second, I would say a good way to look for them is to look at the
>high-percentile cutpoints in the distribution. Depending on the size
>of your dataset, 'high-percentile' could be 99%, 99.9%, and 99.99%.
>(These are not alternatives. If you use, say, 99.9%, you should look
>at 99% as well. Consider also looking at the 90% or 95% cutpoint,
>for a sense of the 'normal' range of the distribution. 5% outliers
>are NOT outliers. And, of course, look at both ends: 1%, 0.1%, 0.01%
>percentile cutpoints, as well.)
>
>Third, I think I'm seeing a trend in the statistics community
>against removing 'outliers' by internal criteria (n standard
>deviations, 1st and 99th percentiles). The rationale, and it's a
>strong one, is that those are observed values of whatever it is that
>you're measuring. If you eliminate them, you'll get a model based on
>their rarity; and that model, itself, can become an argument for
>eliminating them (because they don't fit it), and you can talk
>yourself into a model that's quite unrepresentative of reality.
>
>Fourth, however, the largest values will have a disproportionate,
>possibly dominant, effect on most linear models -- regression,
>ANOVA, even taking the arithmetic mean. Depending on your study, you can
>
>- Go ahead. In this case, the model's fit will be weighted toward
>predicting the largest values, and may show little discrimination
>within the 'cloud' of more-typical values. That, however, may be the
>right insight to be gained from the data.
>
>- If available, use a non-parametric method. That's often favored,
>because it neither rejects the large values nor gives them
>disproportionate weight. By the same token, however, if much of the
>useful information is in the largest values, non-parametric methods
>can unduly DE-emphasize those values.
>
>- There are reasons to reject this as heresy, but if you're doing
>linear modelling, I'd probably try it both with the largest values
>retained and with them eliminated. (I'd only do this if the 'largest
>values' look very far from the 'cloud' of regular values. A scatter
>plot can be an invaluable tool for this.) If the two models are
>closely similar, you have an argument that there's a single process
>going on, with the largest values being part of the same process. If
>they're very different, you may have two processes, one of which
>operates occasionally to produce the largest values, the other of
>which operates 'normally' but is swamped when the larger process
>happens. And if the run without the large values produces a poor
>R^2, you may have an argument that the observable process is
>represented by the largest values, and the variation in the 'normal
>cloud' is mostly noise.
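The percentile-cutpoint inspection described above can be sketched in Python. The long-tailed sample here is simulated (lognormal), not from any real survey; the point is that the cutpoints are examined at both tails, and that cases beyond the top cutpoint pull the mean noticeably.

```python
import random
import statistics

def tail_cutpoints(values, percentiles=(1, 5, 95, 99)):
    """Cutpoints at both tails: 95/99 for the upper tail, 1/5 for the lower."""
    # statistics.quantiles with n=100 returns the 1st..99th percentiles
    q = statistics.quantiles(values, n=100, method='inclusive')
    return {p: q[p - 1] for p in percentiles}

random.seed(1)
# A long-tailed sample: more large values than a normal distribution predicts
data = [random.lognormvariate(0, 1) for _ in range(10_000)]

cuts = tail_cutpoints(data)
for p in sorted(cuts):
    print(f"{p:>3}%: {cuts[p]:.3f}")

# The largest values dominate the mean: compare the mean with and
# without the cases above the 99th-percentile cutpoint.
full_mean = statistics.mean(data)
trimmed = [x for x in data if x <= cuts[99]]
print(full_mean, statistics.mean(trimmed))
```

This only inspects the distribution; whether to act on the extreme cases is the separate judgment discussed above.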