|
Hi list,
I have an SPSS-STATS question. SPSS shows outliers using the BOX PLOT feature. These outliers look to me to be based on the data's frequency distribution. SPSS also shows outliers in the form of residuals. In my work, I removed any data points with residuals greater than 3 SD. There were only three outliers according to an analysis of residuals. When I look at the box plot, there are about twice as many outliers. I don't really know the practical difference between the two kinds of outliers. Which way should I be examining outliers, by frequency distribution or by residuals? Any help greatly appreciated. Robert |
|
Robert,
Nothing weird is happening. It is just that the new outliers are computed relative to the new standard deviation. They are still at more than 3 SD from the mean; only the new std dev is smaller than before. If your data have a near normal distribution, for instance, with more and more cases as you approach the mean, the more outliers you exclude the more new outliers will appear, because the >3SD tail of the distribution will be more densely populated as the SD diminishes in size. Hector |
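To make Hector's point concrete, here is a minimal Python sketch (not SPSS syntax). The data are made up for illustration, and the function names `flag_beyond_k_sd` and `trim_passes` are hypothetical. It flags values more than 3 sample SDs from the mean, removes them, recomputes the mean and SD, and repeats.

```python
from statistics import mean, stdev

def flag_beyond_k_sd(values, k=3.0):
    """Return the values lying more than k sample SDs from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > k * s]

def trim_passes(values, k=3.0):
    """Repeatedly remove flagged values; return (flags per pass, remaining values)."""
    remaining = list(values)
    passes = []
    while len(remaining) > 2:
        flagged = flag_beyond_k_sd(remaining, k)
        if not flagged:
            break
        passes.append(flagged)
        remaining = [v for v in remaining if v not in flagged]
    return passes, remaining

# Made-up data: 20 values near 50, plus two larger values.
data = [48, 49, 50, 51, 52] * 4 + [62, 100]
passes, remaining = trim_passes(data)
print(passes)  # [[100], [62]]
```

With these made-up values the first pass flags only 100. Once it is removed, the SD drops from about 10.9 to about 3.0, and 62, which was well inside 3 SD before, is flagged on the second pass.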
|
Thank you so much Hector. Now I get it. :-) |
|
Hi,
I just had two brief follow-up questions to Robert's concerning outliers in SPSS. First, Robert mentioned "SPSS also shows outliers in the form of residuals." Where is this found in SPSS? Is it derived from DESCRIPTIVES, EXPLORE, or a REGRESSION procedure? Second, is > 3 standard deviations the commonly accepted -- or default definition for outliers -- that is used in SPSS, particularly with the boxplots that are provided in the EXPLORE procedure? Thanks very much in advance. Regards, Ken |
|
Large regression residuals can be tabulated with the /CASEWISE OUTLIERS(n) syntax in REGRESSION, and, of course, residuals can be saved and analyzed with EXAMINE/EXPLORE and other procedures.
I don't think as a general matter, though, that automatically removing residuals larger than 3 sd is a good idea. They are evidence against your model and ought to be carefully considered. If they have large leverage in the regression, this is especially important. I would look at leverage and residual size, but also plots like residuals vs fitted values to see if there is a pattern that can be discerned. If you do remove them, you might want to require a higher significance level or, at least, to document this when reporting results. Ultimately this is a judgment call that often is not easy. Regards, Jon Peck |
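As a rough illustration of looking at leverage and residual size together, here is a minimal Python sketch for a one-predictor regression. It is not SPSS output. The data are invented, and the function name `simple_regression_diagnostics` is hypothetical. Leverage is computed with the usual simple-regression formula h_i = 1/n + (x_i - mean(x))^2 / Sxx.

```python
def simple_regression_diagnostics(x, y):
    """Fit y = a + b*x by least squares; return (residuals, leverage)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Leverage depends only on how far each x is from the mean of x.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return residuals, leverage

# Invented data: the last case is extreme on x and y but lies close to the line.
x = [1, 2, 3, 4, 5, 20]
y = [2.1, 3.9, 6.0, 8.1, 9.9, 40.0]
residuals, leverage = simple_regression_diagnostics(x, y)
for xi, r, h in zip(x, residuals, leverage):
    print(f"x={xi:>2}  residual={r:+.3f}  leverage={h:.3f}")
```

In this invented example the last case has by far the largest leverage yet a very small residual, so a rule based on residual size alone would never flag it. That is one reason to inspect both quantities rather than just one.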
|
I agree with Jon's idea that removing outliers is not generally a sound idea. The only case in which I would go for it is when the outlier is manifestly a wrong datum (say, age=542), in which case the outlier should be recoded to some missing value (or in some particular instances replaced by some imputed value as the case might be). Hector |
|
Hello List,
I agree that detecting outliers as values beyond +/- 3 SD from the mean is not a good idea. I also agree that outliers deserve a closer look and that automatic suppression isn't a good idea. However, an outlier is likely to alter both the Type I and Type II error rates. According to McClelland (2000), linear models are quite robust against many assumption violations. However, he thinks the most "dangerous" violation is "thick tails" in the distribution caused by outliers.
There are many ways to detect outliers, according to the dimension on which they differ from the expected distribution. For instance, an observation can be extreme on a continuous criterion, but also on continuous predictors, or on both of them. Suppose you expect a positive relation between the criterion and the predictor: an observation can be very high on both of them and still fit the model, and so would not be a real outlier. However, using the M+/-3SD detection rule, it is likely to be detected as an outlier. Conversely, an observation can be around the mean on the criterion, but very high (or very low) on the predictor, so it wouldn't be detected as an outlier with the M+/-3SD rule. However, this observation truly is an outlier.
One problem with using z-scores is that an outlier may distort the estimated mean and standard deviation so that the outlier no longer looks extreme, as Hector highlighted. A solution is then to leave out an observation, recalculate the mean and standard deviation of the remaining observations, and then calculate the z-score. McClelland recommends using the Studentized Deleted Residual (SDR) as an indicator for outliers. SDR compares the distribution with all the observations to the distribution with all the observations minus one (the observation for which SPSS gives the SDR value). SDR can be compared with a t-distribution with (n-2) degrees of freedom. However, as SDR implies making n analyses, you must adjust your alpha level using a Bonferroni adjustment.
Concerning the closer look to give to outliers, McClelland makes no recommendations about it. In my opinion, you can look at outliers as you would at missing values. I would recommend coding outliers as a new variable with values 0 (not outlier) and 1 (outlier), and then regressing this new variable on your model in order to assess whether your outliers are randomly distributed or not. If they are, I think there is no problem with deleting them. If they are not, as pointed out by Jon, I think you should be more careful with the interpretation of your results. My opinion is also that you can replace outlier values as you would missing values (as Hector proposed; see also Tabachnick and Fidell, 2004, or Cohen, Cohen, West, & Aiken, 2003), or use this 0/1 variable as a predictor in your model. This will prevent a loss of power due to deleting observations.
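The leave-one-out idea described above can be sketched in a few lines of Python. This is the plain univariate version (drop one observation, recompute mean and SD from the rest, then score the dropped observation), not the regression-based Studentized Deleted Residual itself. The data are made up and the function names are hypothetical.

```python
from statistics import mean, stdev

def ordinary_z(values):
    """z-scores using the mean and sample SD of all observations."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def leave_one_out_z(values):
    """z-score of each observation against the mean and SD of the others."""
    scores = []
    for i, v in enumerate(values):
        others = values[:i] + values[i + 1:]
        scores.append((v - mean(others)) / stdev(others))
    return scores

# Made-up data: nine unremarkable values and one large one.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
z_all = ordinary_z(data)
z_loo = leave_one_out_z(data)
print(round(z_all[-1], 2), round(z_loo[-1], 2))  # 2.71 9.13
```

Here the value 30 inflates the SD enough that its ordinary z-score stays below 3, while its leave-one-out score is above 9. If many cases are screened this way, the cutoff would need the kind of Bonferroni adjustment mentioned above.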
|
|
My understanding of SPSS's Boxplot feature is that it produces a boxplot by calculating the Interquartile Range (the middle half of the sample around the median). The Interquartile Range is the "box" in an SPSS boxplot. SPSS then identifies outliers as data points that are more than one-and-a-half box-lengths from each end of the box. (See the bottom of the page at http://www.maths.murdoch.edu.au/units/statsnotes/samplestats/boxplot.html.) So, in drawing a boxplot, SPSS is using median and interquartile range (rather than mean and standard deviation). Am I right about this? Do others have the same understanding? Tom Werner |
|
You can find the exact details in the Help/Algorithms/Examine/Plots/Boxplots topic.
Outliers are indeed based on 1.5 IQR.
IQR = Q3 - Q1
STEP = 1.5 IQR
outlier if Q3 + STEP <= y(i) < Q3 + 2 STEP (high case)
extreme if further out than that.
Regression outliers use a moment-based calculation and will generally give different results, but both are useful. Of course, the boxplot does not know that the values are residuals, so it does not make adjustments for that. -Jon Peck |
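For readers who want to try the rule Jon quotes, here is a minimal Python sketch. Several things in it are assumptions rather than SPSS behavior: the quartiles come from Python's `statistics.quantiles` with `method="inclusive"`, which may not match the quartile (hinge) convention SPSS uses, the low side is taken to mirror the high case, and the function name `classify_boxplot` and the data are made up.

```python
from statistics import quantiles

def classify_boxplot(values):
    """Label values as 'outlier' or 'extreme' using STEP = 1.5 * IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    step = 1.5 * (q3 - q1)
    labels = []
    for v in values:
        if v >= q3 + 2 * step or v <= q1 - 2 * step:
            labels.append((v, "extreme"))   # beyond two steps from the box
        elif v >= q3 + step or v <= q1 - step:
            labels.append((v, "outlier"))   # between one and two steps
    return labels

# Made-up data: Q1 = 3.5, Q3 = 8.5, so STEP = 7.5.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 17, 30]
print(classify_boxplot(data))  # [(17, 'outlier'), (30, 'extreme')]
```

Because the fences come from quartiles rather than from the mean and SD, one very large value does not widen them much. That is part of why a boxplot and a 3 SD rule can disagree about how many outliers there are.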
|
Thanks very much to Robert for raising this issue and to those who responded to my follow-up questions. I had been using a simple rule-of-thumb from Tabachnick & Fidell (1996 - probably a bit outdated) of excluding outliers from the model/analysis if their z-scores relative to their distribution were equal to or greater than +/-3.28 -- primarily for univariate procedures. Clearly, there's quite a bit more to consider, and many more ways to examine the nature and impact of outliers before simply excluding them. I've saved these responses for future reference -- thanks again. Kind regards, Ken |
|
About the exclusion of outliers outside +/- 3 SD, a big "if" concerns the distribution of the variable. In biological variables, mostly with a normal distribution, cases outside the -3 to +3 range are rare, and more so if outside +/- 4 or 5. Moreover, they are often the result of data entry errors or sample flukes. But in other kinds of variables it ain't necessarily so. Income, for instance, is clearly skewed, and excluding those cases above +3 SD (even in the case of using log income) may imply leaving the rich, and with them a big chunk of aggregate income, outside the analysis. As a general rule, do not exclude any valid datum. Hector |
|
<soapbox>
"Outliers" is a very problematic concept. There is a wide variety of meanings ascribed to the term. Extreme values may be valid. They may be the most important values. Arbitrary treatment of values as outliers should rarely if ever be done. Leverage stats, etc. only identify potential or suspected outliers. Based on consulting on stat and methodology for over 30 years, I believe the usual explanation when there are suspicious values is failure of the quality assurance procedure.
I think of a *potential* outlier as a surprising or suspicious value for a variable (including residuals). In my experience, in the vast majority of instances, they indicate data gathering or data entry errors, i.e., insufficient attention in quality assurance in data gathering or data entry. In my experience, rechecking qa typically eliminates over 80% of suspicious data values. This is one reason I advocate thorough exploration of a set of data before doing the analysis. By thorough exploration I mean things like frequencies, multi-way crosstabs, scatterplots, box plots, rechecking scale keys and reliability, etc. Derived variables such as residuals and rates should be subjected to the same thorough examination and understanding-seeking as raw variables.
In cluster analysis, sometimes there are singleton clusters, e.g., Los Angeles county is distinct from other counties in the western US states. Sometimes there are 500 lb persons. There might be a rose growing in a cornfield.
The first thing to do with outliers is to *_prevent_* them by careful quality assurance procedures in data gathering and data entry. A thorough search for suspect data values and potentially treating them as outliers in analysis is an important part of data quality assurance.
Values for a variable are suspect and in need of further review when they are unusual given the subject matter area, outside the legitimate range of the response scale, show as isolated on scattergrams, have subjectively extreme residuals, when the data shows very high order interaction on ANOVA analyses, when they result in a case being extremely influential in a regression, etc. Recall that researchers consider Murphy a Pollyanna. The detection of odd/peculiar/suspicious values late in the data analysis process is one reason to assure that you can go all the way back and redo the process. Keeping all of the data gathering instruments, and preserving the syntax for all data transformation, are important parts of going back and checking on "outliers". The occurrence of many outliers suggests the data entry was sloppy. There are likely to be incorrectly entered values that are not "outliers". Although it is painful, another round of data entry and verification may be in order.
*Correcting the data.* Sometimes you can actually go back to redo the measurements. (Is there really a 500 pound 9 year old?) You should always have all the paper from which data were transcribed. On the rare occasions when there are very good reasons, you might modify the value for a particular case, e.g., percent correct entered as 1000% ==> 100%.
*Modifying the data.* Values of variables should be trimmed or recoded to "missing" only when there is a clear rationale, and then only when it is not possible to redo the measurement process. (Maybe there really is a six year old who weighs 400 lbs. Go back and look if possible.) If suspected outliers are recoded or trimmed, the analysis should be done as is and as modified to see what the effect of the modification is. Changing the values of variables suspected to be outliers frequently leads to misleading results. These procedures should be used very sparingly. Math criteria can identify suspects.
There should be a trial before there is a verdict and the presumption should be against outlier status for a value. I don't recommend undesirable practices such as cavalierly trimming to 3 SDs. Having a value beyond 3 SD can be reason to examine a case more thoroughly. It is advisable to consult with a statistician before changing the values of suspected outliers.
*Multiple analyses.* If you have re-entered the data, or re-run the experiment, and done very thorough exploration of the data, you are stuck as a last resort with doing multiple analyses: including vs excluding the case(s); changing the values for the case(s) to hotdeck values, to some central tendency value, or to max or min on the response scale (e.g., for achievement, personality, or attitude measures), etc. In the small minority of occasions where the data cannot be cleaned up, the analysis should be done in three or more ways (include the outliers as is, trim the values, treat the values as missing, transform to ranks, include in the model variables that flag those cases, or ...). The reporting becomes much more complex. Consider yourself very lucky if the conclusions do not vary substantially.
</soapbox>
Art Kendall Social Research Consultants |
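A small Python sketch of the "multiple analyses" idea follows, under stated assumptions: the statistic of interest is just a mean, suspects are values outside the legitimate range of a 1 to 7 response scale, and the data and the function name `sensitivity` are invented for illustration. A real analysis would rerun the full model under each treatment rather than a single summary.

```python
from statistics import mean, median

def sensitivity(values, scale_min, scale_max):
    """Summarize the data as is, with suspects excluded, and with suspects clamped."""
    in_range = [v for v in values if scale_min <= v <= scale_max]
    suspects = [v for v in values if not scale_min <= v <= scale_max]
    # Clamp out-of-range values to the nearest end of the response scale.
    clamped = [min(max(v, scale_min), scale_max) for v in values]
    return {
        "suspects": suspects,
        "mean_as_is": mean(values),
        "mean_excluded": mean(in_range),
        "mean_clamped": mean(clamped),
        "median_as_is": median(values),
    }

# Invented ratings on a 1-7 scale; 70 looks like a keying error for 7.
ratings = [3, 4, 5, 4, 3, 5, 4, 70]
report = sensitivity(ratings, 1, 7)
for name, value in report.items():
    print(name, value)
```

In this invented case the mean as is (12.25) is far from the other two versions (4 and 4.375), which is the kind of divergence that would have to be reported. Going back to the source paper to see whether 70 was meant to be 7 would still be the first step.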
|
This is a most interesting discussion.
I might suggest that one area in which it may be important to identify (and perhaps remove) outliers is ratings by judges. If we were to have panels of judges judging entries by rating them on numerical scales (such as in an awards program, skating/gymnastics judging, or applicant judging), we would need to identify and manage inter-rater reliability/agreement. One way to achieve appropriate inter-rater reliability would be to identify and remove outlier ratings (overly strict or overly lenient scores relative to other judges' scores on the same entry). It strikes me that if this isn't done, the awards program (or sports contest or application process) risks having variation that is due more to the judges than to the entries. (Other steps could also be taken to increase inter-rater reliability, such as training of the judges. But it seems that identifying and removing outlier scores would always be a worthwhile additional step.) I'd be grateful for any thoughts on this. Tom Werner -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Art Kendall Sent: Monday, August 13, 2007 7:46 AM To: [hidden email] Subject: Re: SPSS-Stats question regarding outliers <soapbox> "Outliers" is a very problematic concept. There are a wide variety of meanings ascribed to the term. Extreme values may be valid. They may be the most important values. Arbitrary treatment of values as outliers should rarely if ever be done. Leverage stats, etc. only identify potential or suspected outliers. Based on consulting on stat and methodology for over 30 years, I believe the usual explanation when there are suspicious values is failure of the quality assurance procedure. . I think of a *potential* outlier as a surprising or suspicious value for a variable (including residuals). In my experience, in the vast majority of instances, they indicate data gathering or data entry errors, i.e., insufficient attention in quality assurance in data gathering or data entry. 
In my experience, rechecking qa typically eliminates over 80% of suspicious data values. This is one reason I advocate thorough exploration of a set of data before doing the analysis. By thorough exploration I mean things like frequencies, multi-way crosstabs, scatterplots, box plots, rechecking scale keys and reliability, etc. Derived variables such as residuals and rates, should be subjected to the same thorough examination and understanding-seeking as raw variables. In cluster analysis, sometimes there are singleton clusters, e.g., Los Angeles county is distinct from other counties in the western US states. Some times there are 500 lb persons. There might be a rose growing in a cornfield. The first thing to do with outliers is to *_prevent_ * them by careful quality assurance procedures in data gathering and data entry. A thorough search for suspect data values and potentially treating them as outliers in analysis is an important part of data quality assurance. Values for a variable are suspect and in need of further review when they are unusual given the subject matter area, outside the legitimate range of the response scale, show as isolated on scattergrams, have subjectively extreme residuals, when the data shows very high order interaction on ANOVA analyses, when they result in a case being extremely influential in a regression, etc. Recall that researchers consider Murphy a Pollyanna. The detection of odd/peculiar/suspicious values late in the data analysis process is one one reason to assure that you can go all the way back and redo the process. Keeping all of the data gathering instruments, and preserving the syntax for all data transformation are important parts of going back and checking on "outliers". The occurrence of many outliers suggests the data entry was sloppy. There are likely to be incorrectly entered values that are not "outliers". Although it is painful, another round of data entry and verification may be in order. 
*Correcting the data.* Sometimes you can actually go back to redo the measurements. (Is there really a 500 pound 9 year old?). You should always have all the paper from which data were transcribed. On the rare occasions when there are very good reasons, you might modify the value for a particular case. e.g., percent correct entered as 1000% ==> 100%. *Modifying the data.* Values of variables should be trimmed or recoded to "missing" only when there is a clear rationale. And then only when it is not possible to redo the measurement process. (Maybe there really is a six year old who weighs 400 lbs. Go back and look if possible.) If suspected outliers are recoded or trimmed, the analysis should be done as is and as modified to see what the effect of the modification is. Changing the values of variables suspected to be outliers frequently leads to misleading results. These procedures should be used very sparingly. Math criteria can identify suspects. There should be a trial before there is a verdict and the presumption should be against outlier status for a value. I don't recommend undesirable practices such as cavalierly trimming to 3 SDs. Having a value beyond 3 SD can be reason to examine a case more thoroughly. It is advisable to consult with a statistician before changing the values of suspected outliers. *Multiple analyses.* If you have re-entered the data, or re-run the experiment, and done very thorough exploration of the data, you are stuck as a last resort with doing multiple analyses: including vs excluding the case(s); changing the values for the case(s) to hotdeck values, to some central tendency value, or to max or min on the response scale (e.g., for achievement, personality, or attitude measures), etc. 
In the small minority of occasions where the data cannot be cleaned up, the analysis should be done in three or more ways (include the outliers as is, trim the values, treat the values as missing, transform to ranks, include in the model variables that flag those cases, or ...). The reporting becomes much more complex. Consider yourself very lucky if the conclusions do not vary substantially.
</soapbox>

Art Kendall
Social Research Consultants

Hector Maletta wrote:
> About the exclusion of outliers outside +/- 3 SD, a big "if"
> concerns the distribution of the variable. In biological variables,
> mostly with a normal distribution, cases outside the -3 to +3 range
> are rare, and more so if outside +/- 4 or 5. Moreover, they are often
> the result of data entry errors or sample flukes. But in other kinds
> of variables it ain't necessarily so. Income, for instance, is clearly
> skewed, and excluding those cases above +3 SD (even in the case of
> using log income) may imply leaving the rich, and with them a big
> chunk of aggregate income, outside the analysis.
> As a general rule, do not exclude any valid datum.
> Hector
>
> -----Original Message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Ken Belzer
> Sent: 13 August 2007 01:02
> To: [hidden email]
> Subject: Re: SPSS-Stats question regarding outliers
>
> Thanks very much to Robert for raising this issue and to those who responded
> to my follow-up questions. I had been using a simple rule-of-thumb from
> Tabachnick & Fidell (1996 - probably a bit outdated) of excluding outliers from
> the model/analysis if their z-scores relative to their distribution were equal
> to or greater than +/-3.28 -- primarily for univariate
>
> Clearly, there's quite a bit more to consider, and many more ways to examine
> the nature and impact of outliers before simply excluding them. I've saved
> these responses for future reference -- thanks again.
>
> Kind regards,
> Ken |
|
In reply to this post by Art Kendall-2
When you can check the data and eliminate outliers that way, it's a pure win. Many times you can't, though. And sometimes the outliers are what really tell the story. If what you observe is actually a mixture of two different processes, it may be outliers that allow you to separate them.
At the same time, extreme values tend to be the high leverage values, so their treatment is most important. If your model isn't too sensitive to the outliers, then keep them. In the presence of unrejectable and influential outliers, you may want to use robust methods instead of typical least squares methods (starting with medians instead of means), but looking at the outlier pattern may reveal a model misspecification that can be fixed and will eliminate them.

When we benchmark our software, we often see some extreme time differences for a few runs that obscure the effect of some change we have made. In those cases, we can generally assume that other things happening in the computer and not measured or controlled can be blamed and the case eliminated without much worry. But I always advocate nonparametrics as the right way to summarize benchmark runs.

-Jon Peck

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Art Kendall
Sent: Monday, August 13, 2007 6:46 AM
To: [hidden email]
Subject: Re: [SPSSX-L] SPSS-Stats question regarding outliers

<soapbox>
"Outliers" is a very problematic concept. There are a wide variety of meanings ascribed to the term. Extreme values may be valid. They may be the most important values. Arbitrary treatment of values as outliers should rarely if ever be done. Leverage stats, etc. only identify potential or suspected outliers.

Based on consulting on stat and methodology for over 30 years, I believe the usual explanation when there are suspicious values is failure of the quality assurance procedure. I think of a *potential* outlier as a surprising or suspicious value for a variable (including residuals). In my experience, in the vast majority of instances, they indicate data gathering or data entry errors, i.e., insufficient attention in quality assurance in data gathering or data entry. In my experience, rechecking QA typically eliminates over 80% of suspicious data values.

This is one reason I advocate thorough exploration of a set of data before doing the analysis. By thorough exploration I mean things like frequencies, multi-way crosstabs, scatterplots, box plots, rechecking scale keys and reliability, etc.
[>>>Peck, Jon] [snip] |
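Jon Peck's point about summarizing benchmark runs with medians instead of means can be sketched as follows. This is an illustrative Python sketch, not the benchmarking procedure he describes; the `summarize` helper and all timings are hypothetical.

```python
from statistics import mean, median

def summarize(times):
    """Mean and median of a list of run times in seconds."""
    return {"mean": mean(times), "median": median(times)}

# Hypothetical timings; one run in each set was disturbed by something
# else happening on the machine.
before = [10.1, 10.3, 9.9, 10.2, 10.0, 25.0, 10.1]
after = [9.6, 9.4, 9.5, 9.7, 31.0, 9.5, 9.6]

for label, runs in (("before", before), ("after", after)):
    s = summarize(runs)
    print(f"{label}: mean = {s['mean']:.2f}, median = {s['median']:.2f}")
```

In this made-up example the means suggest the change made things slower, because one disturbed run dominates each average, while the medians show the typical run getting faster. No run had to be deleted to see that.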
