So I have detected some outliers in my data and removed them. However, once I had removed them all, new ones appeared in the boxplot. Should I keep removing outliers until there are none, or is one pass enough?
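A small simulation can show why new points keep getting flagged. This is only a sketch: it assumes the boxplot uses the conventional 1.5 x IQR fences and it uses made-up right-skewed data, not the poster's data. The function names are hypothetical. Each time the flagged points are deleted, the quartiles are recomputed from what is left, the fences move inward, and previously ordinary points fall outside them.

```python
import random
import statistics


def tukey_fences(values):
    """Return the (lower, upper) boxplot fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def strip_flagged(values):
    """Drop every value lying outside the current fences."""
    lower, upper = tukey_fences(values)
    return [v for v in values if lower <= v <= upper]


def removal_rounds(values, max_rounds=20):
    """Repeatedly delete flagged points; return the sample size after each round."""
    sizes = [len(values)]
    current = list(values)
    for _ in range(max_rounds):
        kept = strip_flagged(current)
        if len(kept) == len(current):
            break  # nothing flagged any more
        sizes.append(len(kept))
        current = kept
    return sizes


# Simulated right-skewed sample (an assumption for illustration only).
random.seed(1)
sample = [random.lognormvariate(0, 1) for _ in range(500)]
sizes = removal_rounds(sample)
print(sizes)  # sample size after each successive round of deletion
```

With skewed data the sample keeps shrinking for several rounds, which is a sign that the tail belongs to the distribution rather than a handful of bad values.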
Administrator
What's the variable (or variables)? Have you ruled out data entry errors? What kind of analysis are you doing, some kind of regression model? If so, I'd be more concerned about multivariate outliers and influential points than univariate outliers. Cook's distance is one well-known measure of influence you could look at. HTH.
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."
PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
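For readers who want to see what Cook's distance measures, here is a minimal sketch for a regression with a single predictor, written in plain Python rather than SPSS syntax. The data and the function name are made up for illustration. It uses the standard formula D_i = e_i^2 / (p * MSE) * h_ii / (1 - h_ii)^2, where e_i is the residual, h_ii the leverage, and p the number of estimated coefficients (2 here: intercept and slope).

```python
import statistics


def cooks_distances(x, y):
    """Cook's distance for each case in a simple linear regression of y on x."""
    n = len(x)
    p = 2  # estimated coefficients: intercept and slope
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    distances = []
    for xi, e in zip(x, residuals):
        leverage = 1 / n + (xi - x_bar) ** 2 / sxx  # h_ii for one predictor
        distances.append(e ** 2 / (p * mse) * leverage / (1 - leverage) ** 2)
    return distances


# Hypothetical data: nine cases on a clear trend plus one case far out on x
# that does not follow the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 7.2, 7.9, 9.1, 5.0]
d = cooks_distances(x, y)
most_influential = max(range(len(d)), key=d.__getitem__)
print(most_influential, [round(v, 2) for v in d])
```

The last case is not extreme on y at all, so a univariate boxplot of y would not flag it, yet it dominates the fitted line. That is the distinction between univariate outliers and influential points.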
I did not see the original post so cannot CC the OP.
The usual answer is _before removing the first one_. In my experience, having a value flagged as an "outlier" only means that you should look at it more closely. In addition to Cook's distance for influential values, doing <data> <identify unusual cases> is a much better approach to locating data to check more carefully. Many times users worry unnecessarily about extreme values.

I concur with Bruce that we need more info to give better feedback. Why are extreme values a problem for you? What kind of analysis do you have in mind? Are the extreme values in independent variables, dependent variables, or covariates? How many IVs, DVs, and covariates are there in your data? Often, clearly stating the substantive questions you are trying to answer enables list members to give better feedback.

On 4/16/2012 10:34 AM, Bruce Weaver wrote:
> noxeon wrote
>> So I have detected some outliers in my data and removed them, however when
>> I removed them all, new ones appeared in boxplot, should I keep removing
>> until there is none or just once is enough?
>
> What's the variable (or variables)? Have you ruled out data entry errors?
>
> What kind of analysis are you doing, some kind of regression model? If so,
> I'd be more concerned about multivariate outliers and influential points
> than univariate outliers. Cook's distance is one well-known measure of
> influence you could look at.
>
> HTH.
>
> --
> View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Detecting-outliers-when-to-stop-tp5642233p5643995.html
> Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Art Kendall
Social Research Consultants
Administrator
The original post was made via Nabble, where it is still marked as "not accepted by the mailing list". So it's visible in the Nabble archive, but has not gone out to the mailing list. I'm not sure why. Perhaps the OP has not actually joined the list?
--
Bruce Weaver
In reply to this post by Art Kendall
You might want to make sure that the case is part of the population to which you are trying to generalize the results.
I work with hospital length of stay (LOS) data. When we have cases with an LOS of, say, 100 days, each case has to be examined to determine whether it meets the original inclusion criteria intended for the study or should have been excluded to begin with. Using this criterion, we usually still have some outliers in the data, and I end up categorizing the data rather than leaving it continuous. If the study is on a hospital practice and we toss a case just because it is an outlier, we might be throwing out legitimate data. So we go back to the "what is the population we are sampling from" question.

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Art Kendall
Sent: Monday, April 16, 2012 8:36 AM
To: [hidden email]
Subject: Re: Detecting outliers; when to stop?
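As an illustration of categorizing length of stay instead of deleting long stays, here is a minimal sketch. The cut points, labels, and sample values are hypothetical and are not taken from the post above; in practice they would come from the study protocol or clinical convention.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical cut points in days; each value is the lower bound of the next band.
CUT_POINTS = [3, 7, 14, 30]
LABELS = ["0-2 days", "3-6 days", "7-13 days", "14-29 days", "30+ days"]


def los_category(days):
    """Map a length of stay in days to an ordered category label."""
    return LABELS[bisect_right(CUT_POINTS, days)]


# Made-up LOS values, including one very long stay that is kept, not dropped.
los_days = [1, 2, 2, 3, 4, 5, 6, 8, 9, 12, 15, 21, 100]
counts = Counter(los_category(d) for d in los_days)
for label in LABELS:
    print(label, counts[label])
```

The 100-day case simply lands in the open-ended top band, so it stays in the analysis without dominating a mean or a regression slope.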
It's also not uncommon in such situations to fold extreme values back to a value that reflects the top of the distribution, say 2 SDs above the mean. So if the 2 SD point of the distribution is 50 days, then anyone over 50 days is recoded to 50 days, and the value's new meaning is 50+. This allows the variable to be used as continuous without too much loss of information or interpretability. Often when doing this (or even when categorizing/binning the values) it makes sense to consider dummy coding the outlier cases. That way, if you don't want cases that fell 3 or 4 SDs out from the mean to have the same model weight as cases that were only 2 SDs out, you can dummy code just those cases and model the effect (or even use it for weighting).
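The recoding described above could look roughly like this. It is a sketch under assumptions: the cap is taken as the sample mean plus 2 sample SDs, the data are invented, and the function name is hypothetical.

```python
import statistics


def cap_at_two_sd(values):
    """Recode values above mean + 2 SD to that cap and flag which ones were recoded.

    Returns (cap, capped_values, dummy), where dummy is 1 for recoded cases.
    """
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
    cap = mean + 2 * sd
    capped = [min(v, cap) for v in values]
    dummy = [1 if v > cap else 0 for v in values]
    return cap, capped, dummy


# Made-up length-of-stay values in days, with one very long stay.
los_days = [2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 9, 10, 12, 14, 100]
cap, capped, was_capped = cap_at_two_sd(los_days)
print(round(cap, 1), capped[-1] == cap, sum(was_capped))
```

The dummy variable can then enter a model alongside the capped variable, so the recoded cases are not treated as identical to cases that sit just under the cap. Note that the extreme value itself inflates the mean and SD used to set the cap, which is one reason some analysts prefer a cap chosen on substantive grounds.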
I'll also note that when dealing with counts, it's not uncommon for people to erroneously begin looking at and cutting/recoding "outliers" when in fact the distribution is simply non-normal, with enough cases in the tail to necessitate an alternative approach. Income and spending are both distributed in what is called a "log-normal" fashion, meaning that a log transformation will make them approximately normally distributed. In most research I've done or read using these factors, you do a log transformation and then discuss them in log-income or log-spending terms. I've also had situations where days had a similar issue. Often this is a better approach than binning or recoding.

Another approach is to replace extreme values with more appropriate values based on the sample, but I would warn that this should be based on some smart approach, taking into account the data source. Days in a hospital or money are both things that we might say are unlikely to be wrong, so even if extreme, they are good values. Extreme values in a survey, however, could simply be human error, or other factors not of interest may have led to an extreme value, so under more normal circumstances we might expect a more reasonable response from the person (I could even see hospital stays that way, if, say, someone was of such a personality that they could somehow increase their length of stay well beyond normal). In this case you treat the value as missing and impute, with hot decking being a common acceptable approach, but multiple imputation also being good.

Finally, I'd say that methods that don't assume normality but are still strong, such as resampling (bootstrapping), could be used, and the values left as is.

Matthew J Poes
Research Data Specialist
Center for Prevention Research and Development
University of Illinois
510 Devonshire Dr.
Champaign, IL 61820
Phone: 217-265-4576
email: [hidden email]

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Parise, Carol A.
Sent: Tuesday, April 17, 2012 10:57 AM
To: [hidden email]
Subject: Re: Detecting outliers; when to stop?
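Two of the alternatives mentioned in Matthew Poes's reply, the log transformation and bootstrapping, can be sketched in a few lines. Everything here is illustrative: the "income" values are simulated from a log-normal distribution, the skewness measure is the simple mean cubed z-score, and the interval is a basic percentile bootstrap. The function names are hypothetical.

```python
import math
import random
import statistics


def skewness(values):
    """Mean cubed z-score (population SD); close to 0 for symmetric data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])


def bootstrap_ci(values, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, recompute the statistic, and sort the results.
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(n_boot))
    low = estimates[int(n_boot * alpha / 2)]
    high = estimates[int(n_boot * (1 - alpha / 2)) - 1]
    return low, high


# Simulated right-skewed "income" data (an assumption, not real data).
random.seed(42)
income = [random.lognormvariate(10, 1) for _ in range(400)]

# Option 1: analyse on the log scale, where the long tail is pulled in.
log_income = [math.log(v) for v in income]
raw_skew, log_skew = skewness(income), skewness(log_income)

# Option 2: leave the values as they are and bootstrap a robust statistic.
ci_low, ci_high = bootstrap_ci(income, statistics.median)
print(round(raw_skew, 2), round(log_skew, 2), round(ci_low), round(ci_high))
```

In both cases no observation is deleted: the transformation changes the scale of the analysis, and the bootstrap avoids leaning on a normality assumption for the interval.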