|
I'd appreciate it if somebody could answer a novice's question.

Box plots mark a series of observations as outliers. That is clear in a normal distribution: cases that lie more than 1.5 interquartile ranges above P75 or below P25 are considered outliers (some authors say 2.2 IQR instead of 1.5 IQR). That makes sense to me.

But I don't know how to think about outliers in skewed distributions. The term "outlier" comes from "lying outside": we are trying to judge whether observations belong to a distribution. But in a skewed distribution many observations lie more than 1.5, 2.2 or more interquartile ranges beyond the quartile and still belong to the distribution... I feel confused. Does it make any sense to talk about outliers in skewed distributions? How can they be identified?

I'd appreciate any help. Thanks in advance.
Florentino Menéndez. |
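As a rough illustration of the fence rule described above, here is a minimal Python sketch. The function names, the simulated lognormal sample and the seed are assumptions made for illustration only; the sketch simply counts how many values of a skewed sample fall outside the 1.5 IQR fences, before and after taking logs.

```python
import math
import random
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: P25 - k*IQR and P75 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def share_flagged(values, k=1.5):
    """Fraction of values lying outside the fences."""
    lower, upper = tukey_fences(values, k)
    flagged = [v for v in values if v < lower or v > upper]
    return len(flagged) / len(values)


# Hypothetical data: a positively skewed (lognormal) sample with no errors in it.
rng = random.Random(0)
skewed = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]
logged = [math.log(v) for v in skewed]

raw_share = share_flagged(skewed)
log_share = share_flagged(logged)
print(f"flagged on the raw scale: {raw_share:.1%}")
print(f"flagged on the log scale: {log_share:.1%}")
```

Every value in this sample belongs to the distribution by construction, yet the raw scale flags far more of them than the log scale does. Passing `k=2.2` applies the alternative multiplier mentioned in the question.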
|
The question of "outliers" has come up many times over the last few decades on this discussion list. Take a look at those discussions in the archives: on the first page of this list, on the right side, see "search" and then "advanced search".

A better term might be "suspicious values" or "values that should be checked". Remember that distributional assumptions are about the residuals, not the raw data. The "anomalous values" tool can help identify mis-keyed or unreasonable values. Most rules of thumb are highly questionable when blindly applied.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L. For a list of commands to manage subscriptions, send the command INFO REFCARD.
Art Kendall
Social Research Consultants |
|
In reply to this post by fjmenendez
There is testing, and there is model-fitting. We usually do want tests on the models, so we almost always want to meet the conditions for testing.

Measures with extreme skewness need to be transformed for least-squares statistics (ANOVA) or to be fitted with a non-linear maximum likelihood model.

Remember that the assumption for least-squares testing is that equal intervals of the scale should be equal in their influence (or in being influenced) regardless of where they fall on the scale, be it the middle or an extreme. ("Equal interval" describes a /relationship/, not the character of a single measure.)

Tukey gave a rule of thumb: if the largest value of a natural (non-negative) measurement is 20 times the smallest, you almost always should use a transformation. IIRC, "10 times" the smallest suggests that you should consider one. What you want to look at first in choosing a transformation, however, is not the skewness but the mechanism generating the numbers. As first choices, counts imply square roots; intensities imply logs (or logistic transforms); distances imply reciprocals.
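The ratio rule of thumb and the first-choice transformations above could be sketched in Python roughly as follows. The names `ratio_advice` and `FIRST_CHOICE` are hypothetical, and the sketch assumes strictly positive values so that the largest-to-smallest ratio is defined.

```python
import math

# First-choice transformations by the mechanism that generates the numbers.
FIRST_CHOICE = {
    "count": math.sqrt,              # counts imply square roots
    "intensity": math.log,           # intensities imply logs
    "distance": lambda x: 1.0 / x,   # distances imply reciprocals
}


def ratio_advice(values):
    """Apply the largest/smallest rule of thumb to a positive measurement."""
    smallest, largest = min(values), max(values)
    if smallest <= 0:
        # Assumption of this sketch: the ratio is only computed for positive data.
        raise ValueError("this sketch assumes strictly positive values")
    ratio = largest / smallest
    if ratio >= 20:
        return "transform"   # almost always use a transformation
    if ratio >= 10:
        return "consider"    # consider a transformation
    return "no signal"


print(ratio_advice([1, 5, 25]))
print([FIRST_CHOICE["count"](c) for c in (1, 4, 9)])
```

The rule only says whether to think about a transformation; which one to try first still comes from the mechanism, not from the ratio.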
--
Rich Ulrich
From: SPSSX(r) Discussion <[hidden email]> on behalf of Florentino Jorge Menendez <[hidden email]>
Sent: Monday, May 21, 2018 4:01:51 PM
To: [hidden email]
Subject: Fwd: A basic question about outliers
|
|
In reply to this post by fjmenendez
Florentino,
When a data distribution is skewed, you might summarize it through percentiles such as 75, 90, 95, 99, 99.5, etc.
Tony Babinec |
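A percentile summary like the one suggested above might look like this in Python. This is a sketch only: `upper_percentiles` is a hypothetical helper, the exponential sample is made up, and the standard library's "inclusive" interpolation is one of several percentile definitions, so other software may give slightly different values.

```python
import random
import statistics


def upper_percentiles(values, levels=(75, 90, 95, 99, 99.5)):
    """Return {percentile: value} for the requested upper percentiles."""
    # n=1000 gives 999 cut points in steps of 0.1 percent;
    # percentile p sits at index p*10 - 1.
    cuts = statistics.quantiles(values, n=1000, method="inclusive")
    return {p: cuts[round(p * 10) - 1] for p in levels}


# Hypothetical positively skewed sample.
rng = random.Random(1)
sample = [rng.expovariate(1.0) for _ in range(5_000)]

summary = upper_percentiles(sample)
for level, value in summary.items():
    print(f"P{level}: {value:.2f}")
```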
|
In reply to this post by Rich Ulrich
Thanks Art, thanks Rich, for your kindness and your knowledge :)

I read the posts about outliers on the list, and I have benefited from them. The idea of thinking about them as suspicious values that need additional checking before a decision makes a lot of sense to me. Perhaps I should think about this topic using different words.

I know the anomalous values tool only superficially. Perhaps it is a good idea to reread about it.

Also, transformations deserve attention. I feel a little shy about them because of problems of interpretation.

Again, thanks Art, thanks Rich :)

On Mon, May 21, 2018 at 11:34 PM, Rich Ulrich <[hidden email]> wrote:
...
|
|
You are right: when using transformations, "interpretation" is the main snag.
Sometimes you can report the medians for groups, or percentiles (as someone suggested).
Sometimes the original means are still meaningful, and you can use those. By the way, when the means do NOT seem like appropriate measures for a group, that is a sure sign that ANOVA is not appropriate.
- You can back-transform to get the so-called "geometric mean" after a log transformation.
- Some versions of the reciprocal make sense when you invert the descriptive units. For instance, in the USA we talk about MPG, miles per gallon, whereas analyses are often better scaled by the European convention of liters per 100 kilometers. Likewise, you can get a unified presentation across distances by using meters per second instead of the very different times for different distances, such as "9.80 seconds for the 100 meter dash."
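The two back-transformation ideas in the list above can be sketched in a few lines of Python. The function names are hypothetical; the unit conversion constants are the standard definitions of the international mile and the US gallon.

```python
import math
import statistics


def geometric_mean_via_logs(values):
    """Average on the log scale, then back-transform with exp."""
    return math.exp(statistics.fmean(math.log(v) for v in values))


MILE_IN_KM = 1.609344
US_GALLON_IN_L = 3.785411784


def mpg_to_l_per_100km(mpg):
    """Reciprocal rescaling: distance per fuel becomes fuel per distance."""
    return 100.0 * US_GALLON_IN_L / (mpg * MILE_IN_KM)


def meters_per_second(distance_m, seconds):
    """Speed, comparable across races of different lengths."""
    return distance_m / seconds


print(geometric_mean_via_logs([1.0, 10.0, 100.0]))
print(round(mpg_to_l_per_100km(30.0), 2))
print(round(meters_per_second(100.0, 9.80), 2))
```

The back-transformed log mean is the same quantity that `statistics.geometric_mean` returns, which is one way to sanity-check a hand-rolled transformation.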
There are a couple of other situations with skewed data where transformation is the second consideration.
- When there are a large number of zeros, it is sometimes /logically/ appropriate to split the
measure into two variables, e.g., AnyIncome (yes/no), and then, perhaps analyzing only the subset with
income, AveIncome. The non-zero data might or might not have notable skew.
- When the measures, as collected, represent counts or amounts, it is proper to ask if there
should be a denominator to make rates (ratios). So we analyze crime rates, birth rates, etc.,
instead of "total crimes" or "total births" (highly skewed data) across cities or countries of
different sizes.
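Both ideas in the list above are easy to sketch in Python. The helper names and the example figures are made up for illustration; AnyIncome and AveIncome follow the variable names used above.

```python
import statistics


def two_part_summary(incomes):
    """Split a zero-heavy measure into AnyIncome (share > 0) and AveIncome."""
    any_income = [x > 0 for x in incomes]          # yes/no indicator
    positive = [x for x in incomes if x > 0]       # subset with income
    share_with_income = sum(any_income) / len(incomes)
    ave_income = statistics.fmean(positive) if positive else None
    return share_with_income, ave_income


def rate_per_100k(count, population):
    """Turn a raw count into a rate so places of different size compare."""
    return 100_000 * count / population


# Hypothetical data: many zeros plus a few positive incomes.
print(two_part_summary([0, 0, 0, 1200, 800, 0, 4000]))

# Hypothetical cities: very different totals, identical rates.
print(rate_per_100k(50, 1_000_000), rate_per_100k(5, 100_000))
```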
--
Rich Ulrich
From: Florentino Jorge Menendez <[hidden email]>
Sent: Tuesday, May 22, 2018 11:35 AM
To: Rich Ulrich
Cc: [hidden email]
Subject: Re: A basic question about outliers
...
Also transformations deserve attention. I feel a little shy about them because of problems of interpretation.
...
|
|
Administrator
|
In reply to this post by fjmenendez
I think that generalized linear models with appropriate error distributions & link functions can often yield results that are more interpretable. (I think this is what Rich was getting at when he mentioned "non-linear maximum likelihood models".) Here's an example for the case where the outcome variable is positive and positively skewed:

http://rstudio-pubs-static.s3.amazonaws.com/5691_192685385fc445c9b3fb1619960a20e2.html

Notice especially the Differences and Similarities section, where the author says this:

"Thus, if the outcome is log transformed before entering the linear regression model, the inference about the geometric mean. In contrast, the generalized linear model approach allows inference about the arithmetic mean on the original scale."

Finally, the models estimated on that page using R can also be estimated using GENLIN, as it allows one to select a Gamma error distribution.

https://www.ibm.com/support/knowledgecenter/en/SSLVMB_25.0.0/statistics_reference_project_ddita/spss/advanced/syn_genlin_model.html

HTH.
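To make the geometric-versus-arithmetic distinction concrete, here is a toy Python sketch. It is not the code from the linked page and not GENLIN: it fits only an intercept, the function name and the data are hypothetical, and the fit uses a hand-written iteratively reweighted least squares (IRLS) loop. For a Gamma error distribution with a log link the IRLS weights are constant, so each step reduces to averaging the working response.

```python
import math
import statistics


def gamma_log_link_intercept(y, iterations=50):
    """Intercept-only Gamma GLM with log link, fitted by IRLS."""
    # Start at the log of the geometric mean (the log-transform answer).
    b = statistics.fmean(math.log(v) for v in y)
    for _ in range(iterations):
        mu = math.exp(b)
        # Working response z = eta + (y - mu) / mu. The weights are constant
        # for Gamma + log link, so the update is the plain mean of z.
        b = statistics.fmean(b + (v - mu) / mu for v in y)
    return b


# Hypothetical positive, positively skewed outcome.
y = [1.0, 2.0, 3.0, 4.0, 50.0]

log_ols_mean = math.exp(statistics.fmean(math.log(v) for v in y))
glm_mean = math.exp(gamma_log_link_intercept(y))

print(f"back-transformed mean of logs: {log_ols_mean:.3f}")
print(f"Gamma GLM fitted mean:         {glm_mean:.3f}")
print(f"arithmetic mean:               {statistics.fmean(y):.3f}")
```

In this intercept-only case the log-transform route recovers the geometric mean, while the Gamma model's fitted mean matches the arithmetic mean on the original scale. Real analyses with predictors would use proper GLM software rather than this loop.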
--
Bruce Weaver bweaver@lakeheadu.ca http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." PLEASE NOTE THE FOLLOWING: 1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. 2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
