Hi List,
I am clueless on how to handle outliers (especially when it comes to prices) through the help of SPSS. Is there any surefire way? I have heard that the HB method does the job well, but I am clueless about that too. Would anyone enlighten me?
Thanks,
Samuel
At 04:19 AM 8/14/2006, Samuel Solomon wrote:
>I am clueless on how to handle outliers (especially when it comes to
>prices) through the help of SPSS. Is there any surefire way?

OK, there are better statisticians than I on this list, but to start with:

There is not, and never will be, a surefire way, in SPSS or anywhere else. 'Outlier' values may be the most important data, and you may distort your analysis very badly by dropping them. I take the liberty of re-posting an essay on outliers I wrote some time ago (*); I hope it is at least partly germane to your needs.

*Whether*, and *by what standard*, to identify outliers, is at least as important as 'how'.

. Outliers that fail consistency checks: For example, an event date prior to the beginning of the study, or later than the present, can be rejected as wrong. (I've got 'event dates' on the brain from a project I'm working on. And, of course, I'm assuming that 'event dates' in the past or future aren't valid; in some study designs, they may be.) Those should be made missing; or they should be checked against the primary data source and corrected, if that is feasible.

. Outliers that can't be rejected *a priori*: First, you shouldn't even try to look at those until you reject any demonstrable errors.

Second, I would say a good way to look for them is to look at the high-percentile cutpoints in the distribution. Depending on the size of your dataset, 'high-percentile' could be 99%, 99.9%, and 99.99%. (These are not alternatives. If you use, say, 99.9%, you should look at 99% as well. Consider also looking at the 90% or 95% cutpoint, for a sense of the 'normal' range of the distribution. 5% outliers are NOT outliers. And, of course, look at both ends: 1%, 0.1%, 0.01% percentile cutpoints, as well.)

Third, I think I'm seeing a trend in the statistics community against removing 'outliers' by internal criteria (n standard deviations, 1st and 99th percentiles). The rationale, and it's a strong one, is that those are observed values of whatever it is that you're measuring. If you eliminate them, you'll get a model based on their rarity; and that model, itself, can become an argument for eliminating them (because they don't fit it), and you can talk yourself into a model that's quite unrepresentative of reality.

Fourth, however, the largest values will have a disproportionate, possibly dominant, effect on most linear models -- regression, ANOVA, even taking the arithmetic mean. Depending on your study, you can

- Go ahead. In this case, the model's fit will be weighted toward predicting the largest values, and may show little discrimination within the 'cloud' of more-typical values. That, however, may be the right insight to be gained from the data.

- If available, use a non-parametric method. That's often favored, because it neither rejects the large values nor gives them disproportionate weight. By the same token, however, if much of the useful information is in the largest values, non-parametric methods can unduly DE-emphasize these values.

- There are reasons to reject this as heresy, but if you're doing linear modelling, I'd probably try it both with the largest values retained and with them eliminated. (I'd only do this if the 'largest values' look very far from the 'cloud' of regular values. A scatter plot can be an invaluable tool for this.) If the two models are closely similar, you have an argument that there's a single process going on, with the largest values being part of the same process. If they're very different, you may have two processes, one of which operates occasionally to produce the largest values, the other of which operates 'normally' but is swamped when the larger process happens. And if the run without the large values produces a poor R^2, you may have an argument that the observable process is represented by the largest values, and the variation in the 'normal cloud' is mostly noise.

- [ADDED] Investigate carefully using a 'bootstrap' method - sampling from your sample. Very large data values occurring in very small proportion can give you a huge variance in your estimates, one that won't be detected by standard analytic methods. With a very small proportion of large values, the expected number in a sample may be very small, with a large variance. (Let's see - Poisson distributed for all practical purposes, I think.) Particularly, because of the 'leverage' of the very large values, the estimates in a sample that includes one or more may be drastically different from those in a sample that happens not to include any. You'll have to see whether that's a problem in your data. If it is, it may help to do a stratified sample in which the large values are over-represented, and then assigned lower weights in proportion. If, that is, you can identify a subgroup of 'large values,' and have access to enough of them to get a significant sample.

Onward, and good luck,
Richard

.........................................
(*) Ristow, Richard, "Re: Outliers", on list "SAS(r) Discussion" <[hidden email]>, Mon, 15 Nov 2004 14:26:51; reposted as "Re: Question on excluding extremes", SPSSX-L Fri, 17 Feb 2006 11:21:29
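
As a rough illustration of the bootstrap point above, here is a minimal Python sketch using only the standard library. It is not SPSS syntax, and everything specific in it is an assumption made up for the example: the price data, the cutoff of 1000 used to call a value 'large', and the function name `bootstrap_means`. It resamples the data with replacement and separates the resamples that happen to contain the large value from those that do not.

```python
import random
import statistics

def bootstrap_means(values, n_resamples=2000, large_cutoff=1000.0, seed=12345):
    """Resample `values` with replacement and record each resample's mean,
    split by whether the resample happened to include a 'large' value."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    with_large, without_large = [], []
    for _ in range(n_resamples):
        resample = [rng.choice(values) for _ in values]
        mean = statistics.fmean(resample)
        if any(v >= large_cutoff for v in resample):
            with_large.append(mean)
        else:
            without_large.append(mean)
    return with_large, without_large

# Hypothetical prices: 49 ordinary values (10 to 16) and one very large one.
prices = [10.0 + (i % 7) for i in range(49)] + [5000.0]

with_large, without_large = bootstrap_means(prices)
all_means = with_large + without_large

print("resamples containing the large value:", len(with_large))
print("resamples without it:", len(without_large))
print("mean of resample means, with:", round(statistics.fmean(with_large), 1))
print("mean of resample means, without:", round(statistics.fmean(without_large), 1))
print("bootstrap SD of the mean:", round(statistics.stdev(all_means), 1))
```

With one large value among 50 cases, each resample misses it with probability (49/50)^50, roughly 36%, so a sizeable share of resamples gives a mean in the ordinary range while the rest give a mean many times larger. That gap between the two groups is the instability described above.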
"Outliers" have never been defined satisfactorily, and the concept is seldom
used in a consistent way. Outliers are not "impossible" values, such as a widower who is 4 years old, or a mother who is younger than her daughter. Those are most likely data-entry or data-taking errors.

Outliers are, most properly, extreme values. In a sample about heights they are individuals measuring over seven feet, or dwarfs. In an income sample they are people like Bill Gates. They are not impossible, they are simply rare.

Of course, an extreme value may also be a simple mistake: a person 112 years old may be just 12, and someone who is 7'11" tall may be a more common 5'11" just wrongly written or typed. But they just might exist, extremely old, extremely tall, extremely wealthy.

Now, what is wrong with finding rare cases? If they exist, they should be dutifully recorded in your data, not hidden under the carpet. The problem is that they may distort your sample results if you are not careful in their treatment. If you have a 1/10,000 sample of a certain area, in order to estimate the distribution of heights, and stumble on the one and only dwarf in the neighborhood, you may end up estimating that the area is populated by 10,000 small people, or (in another example) by 10,000 people with the income of Bill Gates. They may alter the shape of your curve, or disfigure your mean or standard deviation.

From another point of view, if you start again and draw your sample anew, chances are you won't stumble again on the only giant or the only dwarf in town. Of all possible random samples of the same size, just very few will include them, precisely because such subjects are rare, perhaps unique. If you have some grounds to know that they are extremely rare in the general population from which your sample comes, you may decide to exclude them from the sample, though this is seldom advisable without careful statistical analysis.

One interesting exercise is considering the impact of their removal on the mean and standard deviation of important variables, and on key regression coefficients in your research. Suppose you are investigating the relationship between capital and technology, and discover a very strong relationship: more money, more high tech. But then you discover that the whole thing crumbles when you withdraw Bill Gates from the sample: he was that solitary point to the Northeast of your scatterplot, while all other capitalists in your sample made their money in old-fashioned low-tech businesses. Out goes Bill, and your money-tech beta falls into non-significance (just a fictional example, of course). SPSS Regression, for instance, lets you see the impact of removing each case on the overall fit of a regression model. A high-impact case is probably an outlier worth considering for closer inspection (and possible removal if suspect).

Even though SPSS can identify high-impact cases, all this requires human intelligence. No surefire statistical device can do it for you.

Hope this helps.
Hector

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Richard Ristow
Sent: Sunday, August 20, 2006 5:31 PM
To: [hidden email]
Subject: Re: outliers??
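
The case-deletion exercise described above (refit the regression without each case and see how far the slope moves) can be sketched in a few lines of Python. This is a hand-rolled illustration with invented data, not the diagnostic that SPSS computes; the function names `ols_slope` and `slope_changes_on_deletion` and the `capital`/`tech` numbers are hypothetical.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x for a one-predictor regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def slope_changes_on_deletion(xs, ys):
    """Refit without each case in turn; return the full-sample slope and,
    for every case, (slope without that case) minus (full-sample slope)."""
    full = ols_slope(xs, ys)
    changes = []
    for i in range(len(xs)):
        xs_without = xs[:i] + xs[i + 1:]
        ys_without = ys[:i] + ys[i + 1:]
        changes.append(ols_slope(xs_without, ys_without) - full)
    return full, changes

# Invented data: six ordinary cases plus one solitary point far to the
# upper right of the scatterplot (the last case).
capital = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0]
tech = [3.0, 1.0, 4.0, 1.0, 5.0, 2.0, 90.0]

full_slope, changes = slope_changes_on_deletion(capital, tech)
print("slope with all cases:", round(full_slope, 3))
for case, change in enumerate(changes):
    print(f"case {case}: slope changes by {change:+.3f} when removed")
```

In data like these, removing any of the ordinary cases barely moves the slope, while removing the solitary point changes it drastically. Which of the two fits is the 'right' one is still a judgment about the subject matter, not something the arithmetic decides.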
I left the previous two excellent responses in this message.
I have pasted one of my soapbox statements below, which gives another perspective.

Art Kendall
Social Research Consultants

<soapbox>
"Outliers" is a very problematic concept. There are a wide variety of meanings ascribed to the term. Based on consulting on stat and methodology for over 30 years, I believe the usual explanation when there are suspicious values is failure of the quality assurance procedure.

I think of a potential outlier as a surprising or suspicious value for a variable (including residuals). In my experience, in the vast majority of instances, they indicate data gathering or data entry errors, i.e., insufficient attention to quality assurance in data gathering or data entry. In my experience, rechecking QA typically eliminates over 80% of suspicious data values. This is one reason I advocate thorough exploration of a set of data before doing the analysis. By thorough exploration I mean things like frequencies, multi-way crosstabs, scatterplots, box plots, rechecking scale keys and reliability, etc. Derived variables, such as residuals and rates, should be subjected to the same thorough examination and understanding-seeking as raw variables. This identifies suspicious values.

Unusual values may be "real". They should not be simply tossed. In cluster analysis, sometimes there are singleton clusters, e.g., Los Angeles county is distinct from other counties in the western states. Sometimes there are 500 lb persons. There might be a rose growing in a cornfield. There may be strong interaction (synergy) effects.

The first thing to do about outliers is to prevent them by careful quality assurance procedures in data gathering and handling.

A thorough search for suspect data values, and potentially treating them as outliers in analysis, is an important part of data quality assurance. Values for a variable are suspect and in need of further review when they are unusual given the subject matter area, outside the legitimate range of the response scale, show as isolated on scattergrams, have subjectively extreme residuals, when the data shows very high order interaction on ANOVA analyses, when they result in a case being extremely influential in a regression, etc. Recall that researchers consider Murphy a Pollyanna.

The detection of odd/peculiar/suspicious values late in the data analysis process is one reason to assure that you can go all the way back and redo the process. Keeping all of the data gathering instruments, and preserving the syntax for all data transformation, are important parts of going back and checking on "outliers". The occurrence of many outliers suggests the data entry was sloppy. There are likely to be incorrectly entered values that are not "outliers". Although it is painful, another round of data entry and verification may be in order.

Correcting the data. Sometimes you can actually go back to redo the measurements. (Is there really a 500 pound 9 year old?) You should always have all the paper from which data were transcribed. On the rare occasions when there are very good reasons, you might modify the value for a particular case, e.g., percent correct entered as 1000% ==> 100%.

Modifying the data. Values of variables should be trimmed or recoded to "missing" only when there is a clear rationale. And then only when it is not possible to redo the measurement process. (Maybe there really is a six year old who weighs 400 lbs. Go back and look if possible.) If suspected outliers are recoded or trimmed, the analysis should be done as is and as modified to see what the effect of the modification is. Changing the values of variables suspected to be outliers frequently leads to misleading results. These procedures should be used very sparingly.

Math criteria can identify suspects. There should be a trial before there is a verdict, and the presumption should be against outlier status for a value. I don't recommend undesirable practices such as cavalierly trimming to 3 SDs. Having a value beyond 3 SD can be reason to examine a case more thoroughly. It is advisable to consult with a statistician before changing the values of suspected outliers.

Multiple analyses. If you have re-entered the data, or re-run the experiment, and done very thorough exploration of the data, you are stuck as a last resort with doing multiple analyses: including vs excluding the case(s); changing the values for the case(s) to hotdeck values, to some central tendency value, or to max or min on the response scale (e.g., for achievement, personality, or attitude measures); modeling the specialness of the particular value; etc. In the small minority of occasions where the data can not be cleaned up, the analysis should be done in three or more ways (include the outliers as is, trim the values, treat the values as missing, transform to ranks, include in the model variables that flag those cases, or ...). The reporting becomes much more complex. Consider yourself very lucky if the conclusions do not vary substantially.
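
The "multiple analyses" idea can be sketched as follows: summarize the same variable as is, with the suspect value treated as missing, and with it moved to the maximum of the response scale, then compare. This is a minimal Python illustration under stated assumptions; the scores, the 0 to 100 scale, and the function name `compare_treatments` are invented for the example, and only a mean and a median stand in for a real analysis.

```python
import statistics

def compare_treatments(values, scale_min, scale_max):
    """Summarize one variable under three treatments of out-of-range values."""
    in_range = [v for v in values if scale_min <= v <= scale_max]
    capped = [min(max(v, scale_min), scale_max) for v in values]
    treatments = {
        "as_is": values,                 # analyze the data exactly as entered
        "treated_as_missing": in_range,  # drop values outside the legitimate range
        "capped_to_scale": capped,       # move them to the scale minimum or maximum
    }
    return {
        name: {
            "n": len(data),
            "mean": statistics.fmean(data),
            "median": statistics.median(data),
        }
        for name, data in treatments.items()
    }

# Hypothetical percent-correct scores; the last one was entered as 1000.
scores = [55, 60, 62, 65, 70, 71, 75, 80, 88, 1000]

results = compare_treatments(scores, scale_min=0, scale_max=100)
for name, summary in results.items():
    print(f"{name}: n={summary['n']}, mean={summary['mean']:.1f}, "
          f"median={summary['median']:.1f}")
```

The mean depends heavily on which treatment is chosen, while the median hardly moves, which is one reason rank-based summaries appear among the options listed above. None of this replaces going back to the source documents to find out what the value should have been.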
To add to all this good advice,
In many cases whether something is an outlier or not depends on a model. It may be an extreme value not explained by the model. It may be much more complicated than a univariate extreme. The new Anomaly Detection procedure in SPSS can help to find these in a multivariate framework, although it is still up to you to decide what to do about it. In a context such as regression, it is good to look at the leverage statistics to see whether potential outliers actually affect your results much or not. Finally, consider the process assumed to be generating the data. It is commonly observed that stock market prices follow a random walk model, which means that the variance is not finite. Such a fat-tailed distribution will intrinsically have more outliers than we are probably accustomed to seeing, but that is part of the phenomenon to model. Note that outlier and unusual are not quite the same thing. You might have a very lonely and suspicious value buried in the middle of your data in a sparse region. Is that an outlier? It might be equally suspicious. Regards, Jon Peck SPSS -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Art Kendall Sent: Sunday, August 27, 2006 8:06 AM To: [hidden email] Subject: Re: [SPSSX-L] outliers?? I left the previous two excellent responses in this message. I have pasted one of my soapbox statements below which gives another perspective. Art Kendall Social Research Consultants <soapbox> "Outliers" is a very problematic concept. There are a wide variety of meanings ascribed to the term. Based on consulting on stat and methodology for over 30 years, I believe the usual explanation when there are suspicious values is failure of the quality assurance procedure. . I think of a potential outlier as a surprising or suspicious value for a variable (including residuals). 
In my experience, in the vast majority of instances, they indicate data gathering or data entry errors, i.e., insufficient attention in quality assurance in data gathering or data entry. In my experience, rechecking qa typically eliminates over 80% of suspicious data values. This is one reason I advocate thorough exploration of a set of data before doing the analysis. By thorough exploration I mean things like frequencies, multi-way crosstabs, scatterplots, box plots, rechecking scale keys and reliability, etc. Derived variables such as residuals and rates, should be subjected to the same thorough examination and understanding-seeking as raw variables. This identifies suspicious values. Unusual values may be "real". They should not be simply tossed. In cluster analysis, sometimes there are singleton clusters, e.g., Los Angeles county is distinct from other counties in the western states. Some times there are 500 lb persons. There might be a rose growing in a cornfield. There may be strong interaction (synergy) effects. The first thing to do about outliers is to prevent them by careful quality assurance procedures in data gathering and handling. A thorough search for suspect data values and potentially treating them as outliers in analysis is an important part of data quality assurance. Values for a variable are suspect and in need of further review when they are unusual given the subject matter area, outside the legitimate range of the response scale, show as isolated on scattergrams, have subjectively extreme residuals, when the data shows very high order interaction on ANOVA analyses, when they result in a case being extremely influential in a regression, etc. Recall that researchers consider Murphy a Pollyanna. The detection of odd/peculiar/suspicious values late in the data analysis process is one one reason to assure that you can go all the way back and redo the process. 
Keeping all of the data gathering instruments, and preserving the syntax for all data transformation are important parts of going back and checking on "outliers". The occurrence of many outliers suggests the data entry was sloppy. There are likely to be incorrectly entered values that are not "outliers". Although it is painful, another round of data entry and verification may be in order. Correcting the data. Sometimes you can actually go back to redo the measurements. (Is there really a 500 pound 9 year old?). You should always have all the paper from which data were transcribed. On the rare occasions when there are very good reasons, you might modify the value for a particular case. e.g., percent correct entered as 1000% ==> 100%. Modifying the data. Values of variables should be trimmed or recoded to "missing" only when there is a clear rationale. And then only when it is not possible to redo the measurement process. (Maybe there really is a six year old who weighs 400 lbs. Go back and look if possible.) If suspected outliers are recoded or trimmed, the analysis should be done as is and as modified to see what the effect of the modification is. Changing the values of variables suspected to be outliers frequently leads to misleading results. These procedures should be used very sparingly. Math criteria can identify suspects. There should be a trial before there is a verdict and the presumption should be against outlier status for a value. I don't recommend undesirable practices such as cavalierly trimming to 3 SDs. Having a value beyond 3 SD can be reason to examine a case more thoroughly. It is advisable to consult with a statistician before changing the values of suspected outliers. Multiple analyses. 
If you have re-entered the data, or re-run the experiment, and done very thorough exploration of the data, you are stuck as a last resort with doing multiple analyses: including vs. excluding the case(s); changing the values for the case(s) to hotdeck values, to some central tendency value, or to the max or min on the response scale (e.g., for achievement, personality, or attitude measures); modeling the specialness of the particular value; etc.

In the small minority of occasions where the data cannot be cleaned up, the analysis should be done in three or more ways (include the outliers as is, trim the values, treat the values as missing, transform to ranks, include in the model variables that flag those cases, or ...). The reporting becomes much more complex. Consider yourself very lucky if the conclusions do not vary substantially.

Art Kendall
Social Research Consultants

Hector Maletta wrote:
>"Outliers" have never been defined satisfactorily, and the concept is seldom used in a consistent way. Outliers are not "impossible" values, such as a widower who is 4 years old, or a mother who is younger than her daughter. Those are most likely data-entry or data-taking errors.
>Outliers are, most properly, extreme values. In a sample about heights they are individuals measuring over seven feet, or dwarfs. In an income sample they are people like Bill Gates. They are not impossible, they are simply rare.
>Of course, an extreme value may also be a simple mistake: a person 112 years old may be just 12, and someone who is 7'11" tall may be a more common 5'11" just wrongly written or typed. But they just might exist, extremely old, extremely tall, extremely wealthy.
>Now, what is wrong with finding rare cases? If they exist, they should be dutifully recorded in your data, not hidden under the carpet. The problem is that they may distort your sample results if you are not careful in their treatment. If you have a 1/10,000 sample of a certain area, in order to estimate the distribution of heights, and stumble on the one and only dwarf in the neighborhood, you may end up estimating that the area is populated by 10,000 small people, or (in another example) by 10,000 people with the income of Bill Gates. They may alter the shape of your curve, or disfigure your mean or standard deviation.
>From another point of view, if you start again and draw your sample anew, chances are you won't stumble again on the only giant or the only dwarf in town. Of all possible random samples of the same size, just very few will include them, precisely because such subjects are rare, perhaps unique. If you have some grounds to know that they are extremely rare in the general population from which your sample comes, you may decide to exclude them from the sample, though this is seldom advisable without careful statistical analysis.
>One interesting exercise is considering the impact of their removal on the mean and standard deviation of important variables, and on the slope of key regression coefficients in your research. Suppose you are investigating the relationship between capital and technology, and discover a very strong relationship: more money, more high tech, but then you discover that the whole thing crumbles down when you withdraw Bill Gates from the sample: he was that solitary point to the Northeast of your scatterplot, while all other capitalists in your sample made their money in old fashioned low-tech businesses. Out goes Bill, and your money-tech beta falls into non-significance (just a fictional example, of course). SPSS Regression, for instance, lets you see the impact of removing each case on the overall fit of a regression model. A high-impact case is probably an outlier worth considering for closer inspection (and possible removal if suspect).
>Even if SPSS may identify high-impact cases, all this requires human intelligence. No surefire statistical device can do it for you.
>Hope this helps.
>Hector
|
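The "multiple analyses" advice, and Hector's exercise of checking what happens to the mean, standard deviation, and regression slope when a case is removed, could be sketched roughly as follows. This is only an illustration in plain Python rather than SPSS syntax; the function names and the capital/technology numbers are made up for the example and do not come from anyone's data.

```python
from statistics import mean, stdev

def ols_slope(xs, ys):
    """Least-squares slope of y on x for two paired lists."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def compare_with_and_without(xs, ys, suspect_index):
    """Summarise y and the y-on-x slope with the suspect case kept and dropped."""
    keep = [i for i in range(len(xs)) if i != suspect_index]
    xs_wo = [xs[i] for i in keep]
    ys_wo = [ys[i] for i in keep]
    return {
        "with": {"mean": mean(ys), "sd": stdev(ys), "slope": ols_slope(xs, ys)},
        "without": {"mean": mean(ys_wo), "sd": stdev(ys_wo),
                    "slope": ols_slope(xs_wo, ys_wo)},
    }

# Hypothetical data: capital (x) and a technology score (y).
# The last case plays the role of the solitary point in the Northeast corner.
capital = [1, 2, 3, 4, 5, 100]
tech = [2, 1, 2, 1, 2, 50]

report = compare_with_and_without(capital, tech, suspect_index=5)
for label, stats in report.items():
    print(label, {name: round(value, 3) for name, value in stats.items()})
```

If the two rows of output tell very different stories, that is a reason to report both analyses, not a licence to drop the case.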
Very well put. On a case selection model, a measurement operationalization model, and an analytic model.

The concept of an "inlier" is also critical to understanding.

SPSS is to be commended for including the anomaly detection (AD) procedure and find duplicate cases (FDC). Inclusion of these helps to reinforce the idea that the data need to be cleaned, checked, and explored. AD and FDC facilitate quality assurance.

Although it is possible to work around this to compare files that are supposed to be double keyed, it is not as straightforward as it should be. I strongly urge SPSS to implement a single syntax command procedure that compares 2 files and reports differences 1) in the dictionary and 2) in the data. Double keying is a venerable QA procedure. When SPSS was run on card images in 1972, it was routine to compare the input data cards and the output from WRITE FILEINFO using routines from the operating system. Whereas FDC looks for situations where there is duplication and should not be, the procedure I am urging looks for situations where there is NOT duplication and should be.

Peck, Jon wrote:
>To add to all this good advice,
>
>In many cases whether something is an outlier or not depends on a model. It may be an extreme value not explained by the model. It may be much more complicated than a univariate extreme. The new Anomaly Detection procedure in SPSS can help to find these in a multivariate framework, although it is still up to you to decide what to do about it.
>
>In a context such as regression, it is good to look at the leverage statistics to see whether potential outliers actually affect your results much or not.
>
>Finally, consider the process assumed to be generating the data. It is commonly observed that stock market prices follow a random walk model, which means that the variance is not finite. Such a fat-tailed distribution will intrinsically have more outliers than we are probably accustomed to seeing, but that is part of the phenomenon to model.
>
>Note that outlier and unusual are not quite the same thing. You might have a very lonely and suspicious value buried in the middle of your data in a sparse region. Is that an outlier? It might be equally suspicious.
>
>Regards,
>Jon Peck
>SPSS
Art Kendall
Social Research Consultants |
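For what it is worth, the kind of double-keying check Art describes (differences in the dictionary, then differences in the data) might look something like the sketch below outside SPSS. It is plain Python working on CSV text, not an SPSS procedure; the function names, the `id` key variable, and the two tiny keyings are all hypothetical, and it assumes every case carries a unique id that is present in both files.

```python
import csv
import io

def read_rows(text):
    """Parse CSV text into (variable names, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(text))
    return reader.fieldnames or [], list(reader)

def compare_keyings(text_a, text_b, id_field):
    """Compare two keyings of the same cases.

    Returns (dictionary_diffs, unmatched_ids, data_diffs):
    variables found in only one file, case ids found in only one file,
    and (id, variable, value_a, value_b) for every disagreement.
    """
    vars_a, rows_a = read_rows(text_a)
    vars_b, rows_b = read_rows(text_b)
    dictionary_diffs = sorted(set(vars_a) ^ set(vars_b))
    common = [v for v in vars_a if v in vars_b and v != id_field]
    by_id_a = {row[id_field]: row for row in rows_a}
    by_id_b = {row[id_field]: row for row in rows_b}
    unmatched_ids = sorted(set(by_id_a) ^ set(by_id_b))
    data_diffs = []
    for case_id in sorted(set(by_id_a) & set(by_id_b)):
        for var in common:
            a, b = by_id_a[case_id][var], by_id_b[case_id][var]
            if a != b:
                data_diffs.append((case_id, var, a, b))
    return dictionary_diffs, unmatched_ids, data_diffs

# Two hypothetical keyings of the same three paper forms.
first_keying = "id,age,weight\n1,9,70\n2,6,45\n3,34,160\n"
second_keying = "id,age,weight\n1,9,70\n2,6,400\n3,34,160\n"

dict_diffs, unmatched, data_diffs = compare_keyings(first_keying, second_keying, "id")
print("dictionary differences:", dict_diffs)
print("unmatched ids:", unmatched)
print("data differences:", data_diffs)
```

Values are compared as keyed strings, so "45" and "45.0" would be reported as different; whether that is desirable depends on how strict the QA procedure is meant to be.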
In reply to this post by Peck, Jon
Jon Peck wrote:
"Note that outlier and unusual are not quite the same thing. You might have a very lonely and suspicious value buried in the middle of your data in a sparse region. Is that an outlier? It might be equally suspicious."

Indeed. I have stumbled on some recently. And the case leads me to one kind of unusual data not always specifically recognized: those only revealed in ratios and relationships. In my case it was a seed rate (amount of seed sown per hectare). Farmers tend to use very definite and separate amounts depending on the technology they apply (type of seed, use of irrigation, type of sowing device, etc.), resulting in discrete seed rates like, say, 50, 100 or 200 kg/Ha. Farmers were not asked to report the seed rate, only the area planted and the total amount of seed used, and nothing abnormal was initially detected there. But when the ratio was computed, and most cases fell on the expected values (such as 50, 100 or 200) or within rounding error of them, a few cases fell in between, usually at fractional values (say at 72.8333, 87.42 or 176.5 kg/Ha). On closer inspection most of them were revealed as data entry mistakes.

They were unusual, worthy of inspection, worth correcting, but not outliers in the usual sense of "out of range", though outliers in the more subtle sense of "a legitimate value, not necessarily out of range but not in the list of usual or expected values".

Hector |
In reply to this post by Samuel Solomon
Yes, but "outliers" only by this definition of mine, not by the old-fashioned definition of the outlier as a value that is extreme or out-of-range, i.e. lower than the acceptable minimum or higher than the acceptable maximum. The seed rates were neither, but they were "outliers", sort of, by my newfangled definition including unusual intermediate values.

Since the "usual" or "acceptable" values may have a margin of tolerance, it is always a matter of discretion to define where outlier territory begins, both in the case of out-of-range or extreme outliers and in the case of intermediate unusual values. Cases near the cutoff point will always be doubtful, but their frequency is also a criterion: when you have a judicious cutoff point of, say, 55, but you have a bunch of, say, two dozen cases at 60, with only solitary cases at higher values, perhaps your inclusive cutoff point should be shifted to 60, and only those above 60 be regarded as outliers (on the assumption that it is unlikely that two dozen people make the same mistake). But then again, you may be wrong in doing so if there is some mechanism at play producing the mistake systematically (I will send a separate message telling the story of some tribulations of mine with numerals in the English transcription of surveys using Arabic script, as a cautionary tale in this regard).

Hector

-----Original Message-----
From: Peck, Jon [mailto:[hidden email]]
Sent: Monday, August 28, 2006 10:33 AM
To: Hector Maletta
Subject: RE: outliers??

So applying your model, simple though it was, exposed these as outliers with regard to the model, in a sense.

-Jon |
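Hector's seed-rate check (compute the ratio, then look for cases that land between the expected values rather than beyond them) can be sketched in a few lines. This is a rough illustration in plain Python, not SPSS syntax; the function name, the farm records, and the tolerance of 1 kg/Ha are assumptions chosen for the example, and the list of expected rates simply reuses the 50, 100 and 200 mentioned above.

```python
def flag_unexpected_rates(cases, expected_rates, tolerance=1.0):
    """Flag cases whose seed rate is not within `tolerance` of any expected rate.

    `cases` holds (case_id, seed_kg, area_ha) tuples; area_ha must be non-zero.
    Returns (case_id, rounded rate, nearest expected rate) for each suspect case.
    """
    flagged = []
    for case_id, seed_kg, area_ha in cases:
        rate = seed_kg / area_ha
        nearest = min(expected_rates, key=lambda r: abs(r - rate))
        if abs(rate - nearest) > tolerance:
            flagged.append((case_id, round(rate, 2), nearest))
    return flagged

# Hypothetical farm records: total seed used (kg) and area planted (Ha).
farms = [
    ("A", 100.0, 2.0),   # 50 kg/Ha, an expected rate
    ("B", 300.0, 3.0),   # 100 kg/Ha, an expected rate
    ("C", 437.0, 6.0),   # about 72.83 kg/Ha, in between
    ("D", 401.0, 2.0),   # 200.5 kg/Ha, within tolerance of 200
]

suspects = flag_unexpected_rates(farms, expected_rates=[50, 100, 200])
print(suspects)
```

Neither the seed amount nor the area for case C is out of range on its own; only the ratio gives it away. Where the tolerance should sit is the same matter of discretion as any other cutoff, and a flagged case is a candidate for checking against the questionnaire, not a confirmed error.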
In reply to this post by Samuel Solomon
My $.02 ...
It also depends on what your modeling objective is. If, for example, you're a direct marketer and you're trying to predict responsiveness, your objective may be to predict the MOST responses. So given that extremes or outliers can skew predictiveness, suppressing extremes or outliers may be in order.

Another method I've used over the years (granted you have adequate sample sizes) is to dummy-code extremes or outliers. Doing so can help explain their impact, and it can help you make a determination to include or drop the term from your equation.

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Peck, Jon
Sent: Monday, August 28, 2006 8:02 AM
To: [hidden email]
Subject: Re: outliers??

To add to all this good advice,

In many cases whether something is an outlier or not depends on a model. It may be an extreme value not explained by the model. It may be much more complicated than a univariate extreme. The new Anomaly Detection procedure in SPSS can help to find these in a multivariate framework, although it is still up to you to decide what to do about it.

In a context such as regression, it is good to look at the leverage statistics to see whether potential outliers actually affect your results much or not.

Finally, consider the process assumed to be generating the data. It is commonly observed that stock market prices follow a random walk model, which means that the variance is not finite. Such a fat-tailed distribution will intrinsically have more outliers than we are probably accustomed to seeing, but that is part of the phenomenon to model.

Note that outlier and unusual are not quite the same thing. You might have a very lonely and suspicious value buried in the middle of your data in a sparse region. Is that an outlier? It might be equally suspicious.

Regards,
Jon Peck
SPSS |
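To make the leverage and dummy-coding suggestions a little more concrete, here is a small sketch for the simplest case of one predictor plus an intercept. It is plain Python, not SPSS output; the function names and the data are hypothetical, and the "twice the average leverage" threshold is a common rule of thumb brought in for the example, not something proposed in this thread.

```python
from statistics import mean

def leverages(xs):
    """Hat values for a simple regression (intercept plus one predictor)."""
    n = len(xs)
    mx = mean(xs)
    sxx = sum((x - mx) ** 2 for x in xs)
    return [1 / n + (x - mx) ** 2 / sxx for x in xs]

def flag_high_leverage(xs, multiple=2.0):
    """Return a 0/1 dummy marking cases above `multiple` times the average leverage."""
    hats = leverages(xs)
    # Average leverage is p/n, with p = 2 parameters (intercept and slope) here.
    threshold = multiple * 2 / len(xs)
    return [1 if h > threshold else 0 for h in hats]

# Hypothetical predictor values; the last case sits far from the rest.
spend = [1, 2, 3, 4, 5, 100]

print([round(h, 3) for h in leverages(spend)])
print(flag_high_leverage(spend))
```

The resulting 0/1 variable can then be entered in the model as a term of its own, so that its coefficient shows how much the flagged cases matter before anyone decides to keep or drop them.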