----- Forwarded by Søren Bech/SBE/Bang & Olufsen/DK on 13-08-2007 21:15 -----

Søren Bech/SBE/Bang & Olufsen/DK
13-08-2007 15:42
To: [hidden email]
Subject: Re: SPSS-Stats question regarding outliers

Hi Tom,

This is a very complicated question, and one that cannot be answered with a simple rule. I hope the list will forgive me a small ad: in our book on subjective audio evaluation (http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470869232.html) we discuss these issues, with a large number of references to, for example, the food industry, where sensory evaluations are also used and there are standards for training subject panels, etc. We are also involved in a project that has developed a program, PanelCheck (see http://www.matforsk.no/web/sampro.nsf/webTemaPE/PanelCheck!OpenDocument), which can be used to identify outlying subjects. Please do not hesitate to mail me with further questions.

Regards,
Soren


Tom Werner <[hidden email]>
Sent by: "SPSSX(r) Discussion" <[hidden email]>
13-08-2007 15:26
Please respond to [hidden email]
To: [hidden email]
Subject: Re: SPSS-Stats question regarding outliers

This is a most interesting discussion. One area in which it may be important to identify (and perhaps remove) outliers is ratings by judges. If panels of judges rate entries on numerical scales (as in an awards program, skating or gymnastics judging, or applicant screening), we need to assess and manage inter-rater reliability/agreement. One way to achieve appropriate inter-rater reliability would be to identify and remove outlier ratings -- scores that are overly strict or overly lenient relative to other judges' scores on the same entry. It strikes me that if this isn't done, the awards program (or sports contest, or application process) risks having variation that is due more to the judges than to the entries. (Other steps could also be taken to increase inter-rater reliability, such as training the judges, but identifying and removing outlier scores seems a worthwhile additional step.) I'd be grateful for any thoughts on this.

Tom Werner

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Art Kendall
Sent: Monday, August 13, 2007 7:46 AM
To: [hidden email]
Subject: Re: SPSS-Stats question regarding outliers

<soapbox>
"Outliers" is a very problematic concept; a wide variety of meanings are ascribed to the term. Extreme values may be valid -- they may even be the most important values -- and arbitrary treatment of values as outliers should rarely, if ever, be done. Leverage statistics and the like only identify potential or suspected outliers.

Based on over 30 years of consulting on statistics and methodology, I believe the usual explanation for suspicious values is a failure of the quality assurance procedure. I think of a *potential* outlier as a surprising or suspicious value for a variable (including residuals). In my experience, in the vast majority of instances they indicate data-gathering or data-entry errors, i.e., insufficient attention to quality assurance in data gathering or data entry. In my experience, rechecking QA typically eliminates over 80% of suspicious data values. This is one reason I advocate thorough exploration of a data set before doing the analysis. By thorough exploration I mean things like frequencies, multi-way crosstabs, scatterplots, box plots, rechecking scale keys and reliability, etc.
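(As a concrete starting point, a minimal SPSS syntax sketch of that kind of screening might look like the following; the variable names weight, height, region, and agegroup are hypothetical and would need to be adapted to your own data file.)

* Univariate screening: ranges, means, boxplots, and the most extreme cases.
FREQUENCIES VARIABLES=weight height
  /FORMAT=NOTABLE
  /STATISTICS=MINIMUM MAXIMUM MEAN STDDEV.
EXAMINE VARIABLES=weight height
  /PLOT=BOXPLOT
  /STATISTICS=EXTREME(5).

* Multi-way relationships: a crosstab and a scatterplot.
CROSSTABS /TABLES=region BY agegroup.
GRAPH /SCATTERPLOT(BIVAR)=height WITH weight.

None of this changes any data; it only surfaces values that deserve a second look against the original instruments.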
Derived variables, such as residuals and rates, should be subjected to the same thorough examination and understanding-seeking as raw variables.

In cluster analysis there are sometimes singleton clusters -- e.g., Los Angeles county is distinct from the other counties in the western US states. Sometimes there really are 500-lb persons. There might be a rose growing in a cornfield.

The first thing to do with outliers is to *_prevent_* them by careful quality assurance procedures in data gathering and data entry. A thorough search for suspect data values, and potentially treating them as outliers in analysis, is an important part of data quality assurance. Values for a variable are suspect and in need of further review when they are unusual given the subject-matter area, are outside the legitimate range of the response scale, show as isolated on scattergrams, have subjectively extreme residuals, when the data show very-high-order interactions in ANOVA analyses, when they result in a case being extremely influential in a regression, etc. Recall that researchers consider Murphy a Pollyanna.

The detection of odd/peculiar/suspicious values late in the data analysis process is one reason to make sure you can go all the way back and redo the process. Keeping all of the data-gathering instruments and preserving the syntax for all data transformations are important parts of going back and checking on "outliers". The occurrence of many outliers suggests that data entry was sloppy; there are likely to be incorrectly entered values that are not "outliers". Although it is painful, another round of data entry and verification may be in order.

*Correcting the data.* Sometimes you can actually go back and redo the measurements. (Is there really a 500-pound 9-year-old?) You should always have all the paper from which data were transcribed. On the rare occasions when there are very good reasons, you might modify the value for a particular case, e.g., percent correct entered as 1000% ==> 100%.

*Modifying the data.* Values of variables should be trimmed or recoded to "missing" only when there is a clear rationale, and then only when it is not possible to redo the measurement process. (Maybe there really is a six-year-old who weighs 400 lbs. Go back and look if possible.) If suspected outliers are recoded or trimmed, the analysis should be done both as is and as modified, to see what the effect of the modification is. Changing the values of variables suspected to be outliers frequently leads to misleading results, so these procedures should be used very sparingly. Mathematical criteria can only identify suspects: there should be a trial before there is a verdict, and the presumption should be against outlier status for a value. I don't recommend undesirable practices such as cavalierly trimming to 3 SDs; having a value beyond 3 SD can be a reason to examine a case more thoroughly. It is advisable to consult with a statistician before changing the values of suspected outliers.

*Multiple analyses.* If you have re-entered the data, or re-run the experiment, and done very thorough exploration of the data, you are stuck as a last resort with doing multiple analyses: including vs. excluding the case(s); changing the values for the case(s) to hotdeck values, to some central-tendency value, or to the maximum or minimum of the response scale (e.g., for achievement, personality, or attitude measures); etc.
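(To make the "as is versus as modified" comparison concrete, a minimal SPSS syntax sketch might look like the following; the variable names weight, age, and sbp are hypothetical, and the 3-SD cutoff is only a trigger for review, not a rule for exclusion.)

* /SAVE adds standardized scores; here it creates Zweight.
DESCRIPTIVES VARIABLES=weight /SAVE.
* Flag, do not delete.
COMPUTE suspect = (ABS(Zweight) GT 3).
EXECUTE.

* Analysis 1: all cases, as is.
REGRESSION
  /DEPENDENT sbp
  /METHOD=ENTER age weight.

* Analysis 2: the same model with flagged cases temporarily excluded; compare the two sets of results.
TEMPORARY.
SELECT IF (suspect = 0).
REGRESSION
  /DEPENDENT sbp
  /METHOD=ENTER age weight.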
In the small minority of occasions where the data cannot be cleaned up, the analysis should be done in three or more ways (include the outliers as is, trim the values, treat the values as missing, transform to ranks, include in the model variables that flag those cases, or ...). The reporting becomes much more complex. Consider yourself very lucky if the conclusions do not vary substantially.
</soapbox>

Art Kendall
Social Research Consultants

Hector Maletta wrote:
> About the exclusion of outliers outside +/- 3 SD, a big "if" concerns the
> distribution of the variable. In biological variables, which mostly have a
> normal distribution, cases outside the -3 to +3 range are rare, and more so
> outside +/- 4 or 5. Moreover, they are often the result of data entry errors
> or sample flukes. But in other kinds of variables it ain't necessarily so.
> Income, for instance, is clearly skewed, and excluding cases above +3 SD
> (even when using log income) may mean leaving the rich -- and with them a
> big chunk of aggregate income -- out of the analysis.
> As a general rule, do not exclude any valid datum.
> Hector
>
> -----Original Message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Ken Belzer
> Sent: 13 August 2007 01:02
> To: [hidden email]
> Subject: Re: SPSS-Stats question regarding outliers
>
> Thanks very much to Robert for raising this issue, and to those who
> responded to my follow-up questions. I had been using a simple rule of
> thumb from Tabachnick & Fidell (1996 -- probably a bit outdated) of
> excluding outliers from the model/analysis if their z-scores relative to
> their distribution were equal to or greater than +/- 3.28 -- primarily for
> univariate
>
> Clearly, there's quite a bit more to consider, and many more ways to
> examine the nature and impact of outliers before simply excluding them.
> I've saved these responses for future reference -- thanks again.
>
> Kind regards,
> Ken
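(To see Hector's point in practice, here is a minimal SPSS syntax sketch; income is a hypothetical variable, and the |z| > 3 cutoff simply mirrors the rule of thumb discussed above. With a skewed variable, that rule can flag a sizeable and perfectly valid upper tail, so the flags are best treated as a review list rather than an exclusion rule.)

* /SAVE adds z-scores as Zincome and Zlnincome.
DESCRIPTIVES VARIABLES=income /SAVE.
COMPUTE lnincome = LN(income + 1).
DESCRIPTIVES VARIABLES=lnincome /SAVE.

* Count how many cases each version of the rule flags; review those cases, do not drop them.
COMPUTE flag_raw = (ABS(Zincome) GT 3).
COMPUTE flag_log = (ABS(Zlnincome) GT 3).
FREQUENCIES VARIABLES=flag_raw flag_log.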
