I am currently cleaning my data in SPSS to prepare for a later logistic regression analysis. I first identified univariate outliers as cases with z scores > 3 and winsorized them using the 1.5*IQR rule. Z scores of the winsorized variables were then double-checked using casewise diagnostics (in linear regression) to ensure none exceeded 3 standard deviations. After that, only one multivariate outlier was identified using Mahalanobis distance, and I deleted it.
When I tried to run logistic regression on this set of "no outliers" data, the output showed, "The casewise plot is not produced because no outliers were found". But the computed standardized residuals variable (ZRE_1) showed that there were still 29 cases with z scores > 3! How can that be? I am still unable to figure it out. Don't the casewise listing of residuals and the standardized residuals use the same measurement (z scores)?

In fact, the model fit increased from 10.3% to 38.9% when I deleted all 29 cases! So, how should I deal with these?
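For concreteness, the cleaning steps described above (1.5*IQR winsorizing, then a z-score check) can be sketched outside SPSS. This is a minimal Python sketch with made-up data, not the original SPSS procedure:

```python
import statistics

def iqr_fences(values):
    """Tukey fences: 1.5 * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def winsorize(values):
    """Clamp every value to the 1.5*IQR fences instead of deleting it."""
    lo, hi = iqr_fences(values)
    return [min(max(v, lo), hi) for v in values]

def z_scores(values):
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

data = [2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 40]   # one extreme value
clean = winsorize(data)                      # 40 is pulled in to the upper fence
print(max(abs(z) for z in z_scores(clean)))  # now well under 3
```

Note that this only guarantees the *predictors* have no |z| > 3; as the replies below discuss, it says nothing about the residuals of a later model.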
The raw variables have certain standard deviations, and those provided your first z-scores. Given outliers, winsorizing is what I may retreat to if I cannot improve the interval-level of measurement (and homogeneity of variances) by taking a transformation. But if the extreme scores have equally extreme results, it is simply *wrong* to tamper with them.

If you are trying to achieve "normality" so that your test statistics are most robust, you need to remember that the "condition" of normality in Least Squares applies to residuals, not to the predictors. A similar condition is slightly less important in logistic regression.

Why are there larger residuals? The *residuals* from the prediction have a different S.D. The extremes are not different from the values as a whole, but are apparently different from cases with similar outcomes. What it says when you have outliers here is that you have a number of badly-fitted cases, so your "model fit" naturally looks better if you discard them.

Is the 10% vs. 39% a comparison of a couple of values of a pseudo-R^2?

--
Rich Ulrich

> Date: Fri, 18 Mar 2011 08:27:37 -0700
> From: [hidden email]
> Subject: outliers - casewise listing of residuals and standardized residuals
> To: [hidden email]
>
> [snip, previous]

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L. For a list of commands to manage subscriptions, send the command INFO REFCARD.
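Rich's point that residuals have their own S.D. can be made concrete with the standardized (Pearson) residual for a binary outcome, (y - p) / sqrt(p * (1 - p)). A short Python sketch (the fitted probability 0.98 is invented for illustration):

```python
import math

def pearson_residual(y, p):
    """Standardized (Pearson) residual for a binary outcome y in {0, 1}
    with fitted probability p: (y - p) / sqrt(p * (1 - p))."""
    return (y - p) / math.sqrt(p * (1 - p))

# A case the model confidently assigns to group 1 (fitted p = .98):
print(pearson_residual(1, 0.98))  # tiny residual if it really is group 1
print(pearson_residual(0, 0.98))  # |z| far beyond 3 if it is actually group 0
```

So a case can be perfectly ordinary on every predictor (no univariate or Mahalanobis flag) and still produce a standardized residual beyond 3, simply because its observed outcome disagrees with a confident prediction.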
In reply to this post by lcl23
I am leery of blindly transforming or deleting data. <soapbox off>

Identifying suspicious values should result in going back and reviewing the data. Deleting and/or transforming should only be done after one is very confident that there was not a data-entry error or a substantive cause. Since the early 70s, it has been my experience that in excess of a subjective 80% of outliers are data-entry errors. Of course, a lot depends on how you obtained the data: whether it was proofread or double-keyed, whether cases are actually outside the population of interest, etc.

Usually one worries about the distributions of residuals (y - yhat) rather than the distributions of predictors (x's), so that was a good idea. Is it possible that some variable that could influence the result was not in the model?

In my experience, looking at extreme values of variables is useful in detecting unusual values. Are all of the values within the legitimate domain of measurement of the construct? Looking for multivariate suspicious values is a good idea; Data > Identify Unusual Cases can be useful to find values to look into.

Specific ways to check on data anomalies depend on the nature of the data and what phenomena you are looking at. E.g., if you have self-reporting, you might have "pattern responses". Did you go back and look at cases with suspicious values to see if there might be reasons for their extremeness?

Art Kendall
Social Research Consultants

On 3/18/2011 11:27 AM, lcl23 wrote: [snip, previous]
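Art's advice about checking the legitimate domain of measurement can be partly automated with simple range checks before any modeling. A Python sketch; the variable names and legal ranges here are hypothetical, and flagged cases should be reviewed against source documents rather than deleted:

```python
# Hypothetical legal ranges for each variable (not from the original data).
LEGAL_RANGE = {"age": (18, 99), "meetings_per_year": (0, 52)}

def flag_suspect_cases(rows):
    """Return (case_id, variable, value) for every out-of-range entry,
    so a human can check for keying errors or out-of-population cases."""
    flags = []
    for case_id, row in enumerate(rows, start=1):
        for var, value in row.items():
            lo, hi = LEGAL_RANGE[var]
            if not lo <= value <= hi:
                flags.append((case_id, var, value))
    return flags

rows = [{"age": 34, "meetings_per_year": 6},
        {"age": 340, "meetings_per_year": 6}]  # 340 looks like a keying error
print(flag_suspect_cases(rows))
```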
In reply to this post by lcl23
> Date: Fri, 18 Mar 2011 22:26:50 -0700
> From: [hidden email]
> Subject: RE: outliers - casewise listing of residuals and standardized residuals
> To: [hidden email]
>
> Most of my predictors have severe positive skew & transformation
> doesn't help.

I don't know why you mention "positive skew" since a long tail to the right, giving positive skew, is what is most common. Those are the outliers that are corrected in one step by the most common transformations (log, square root, reciprocal) for data scored starting at or above zero.

For the simplest case of negative skew, the problem is that one needs to take the MAXIMUM score as the new zero: instead of giving someone 100 on a test, you may re-score them as "0 errors" for the sake of analysis. And if the scores approach both the natural (or scaled) maximum and minimum, then you would want to consider a symmetrical transformation such as the logit.

But transformations should be "natural" to the data, whenever feasible. The best information for deciding what is natural is to consider "What is measured?" and "How is it measured?" And you should consider, at the same time, what you are trying to predict: a natural transformation for one criterion is not necessarily the same as for another. My main conclusion, here, is that you need to say what your variables actually are, so you can get appropriate advice for transformations.

> So, I winsorized it. Although normality for predictors is
> not required, it will make for a stronger assessment if it exists. By
> the way, could you explain more on "if the extreme scores have equally
> extreme results"?

I do agree that, as a habit, you get a better prediction equation when the scores are normal. But that is predicated on implicit assumptions that "normal" is going to be equal-interval ... especially in its relation to the outcome.
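The one-step fixes mentioned above (log, square root, reciprocal, and reflecting about the maximum for negative skew) can be sketched in Python. The data are invented, and the moment-based skewness used here is only one of several common definitions:

```python
import math

def skewness(xs):
    """Moment-based sample skewness: g1 = m3 / m2**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# A positively skewed variable (long right tail):
raw = [1, 1, 2, 2, 3, 3, 4, 5, 8, 20, 60]
logged = [math.log(x) for x in raw]   # log is usable because every x > 0

# Negative skew: reflect about the maximum so the long tail points right,
# then an ordinary right-tail transformation (e.g. log) can be applied.
neg = [100 - x for x in raw]                  # long LEFT tail
reflected = [max(neg) - x + 1 for x in neg]   # +1 keeps log defined at the max

print(skewness(raw), skewness(logged), skewness(neg))
```

The skewness of the logged series is much closer to zero than that of the raw series, which is the whole point of the transformation.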
So, as a counter-example: for a continuous predictor, an "outlier" actually gives strength to the R^2 whenever it predicts an outcome that is also an outlier (in the right direction).

Similarly, as a counter-example to requiring normality for logistic regression: a case can be "over-predicted" as group-1 because it is extreme and outlying on the predictors; this will not hurt the maximum-likelihood fit *if* the group assignment is correct. However, this same case is apt to show up as an outlier when it looks like an extreme instance of group-1 but is actually a member of group-2. Logistic regression is said to be more robust against bad distributions than the corresponding 2-group discriminant function *because* over-prediction need not hurt.

> After identifying univariate & multivariate outliers, I still need to
> face the z scores pertinent to residuals. So, there are 2 different z
> scores here, measuring outliers at different stages?
>
> The 29 cases I mentioned previously have Cook's distance < 1, which
> means I need not discard them all? Yes, the 10% (retain) vs 39%
> (discard) is the Nagelkerke R^2.

Let me ask this: What is your N? And: Is the original result even statistically significant? What do the 29 have in common? What excuse do you have to drop them? I don't think you can publish, or impress any boss, if the best argument you have is something that comes down to, "We can achieve a pretty good fit only if we throw away a bunch of cases that do not fit."

> Sorry for so many questions, still in the process of learning stats...
> --- On Sat, 19/3/11, Rich Ulrich wrote: [snip, previous]

Your reply came to me in private mail, though I did not notice that at first. I hope that it is okay that I have replied both privately and to the List.

--
Rich Ulrich
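Since the 10% vs. 39% figures are Nagelkerke R^2 values, it may help to see how that statistic is built from the null and fitted log-likelihoods: the Cox & Snell R^2 cannot reach 1 for a binary outcome, so Nagelkerke rescales it by its maximum. The log-likelihood values below are purely illustrative, not from the poster's data:

```python
import math

def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke R^2 from log-likelihoods of the null and fitted models.
    Cox & Snell: R2_CS = 1 - exp(2 * (ll_null - ll_model) / n);
    Nagelkerke rescales by its maximum, 1 - exp(2 * ll_null / n)."""
    cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)
    max_cs = 1 - math.exp(2 * ll_null / n)
    return cox_snell / max_cs

# Hypothetical log-likelihoods for n = 800 cases:
print(nagelkerke_r2(-550.0, -520.0, 800))  # roughly 0.10, i.e. about 10%
```

Because the statistic depends on the fitted log-likelihood, discarding the worst-fitted cases mechanically raises it, which is exactly Rich's objection.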
In reply to this post by lcl23
> Date: Sun, 20 Mar 2011 07:39:19 -0700
> From: [hidden email]
> Subject: RE: outliers - casewise listing of residuals and standardized residuals
> To: [hidden email]; [hidden email]
>
> Three problematic predictors:
> (1) No. of board meetings - skewness (3.4)
> (2) Firm's assets - in dollars, skewness (17.3)
> (3) No. of subsidiaries - skewness (7.7)
>
> N is 800. In general, the model is significant but R^2 is quite
> low (10%). I have yet to decide what to do about those extreme cases.
> Different textbooks give different suggestions.

It might be because I groom my variables before I look at skewness, but I've usually been concerned with skewness between 0.35 and 1.5. I'm not even convinced that it is possible to have skewness of 17.3 or 7.7 - would you describe one set of those values?

"Board meetings" - In a model, I think I would expect to see groupings of, say, "daily/ weekly/ monthly/ annually/ less". That grouping is not going to come out very weird for skewness, but I *might* suspect that it is non-linear and *needs* categories, depending on the outcome variable.

"Firm's assets" - Are you trying to model across multinationals, down to the corner grocery? Assets seems a natural for taking a log, but it also seems natural to create a model without the world's full range of possibilities.

"No. of subsidiaries" - This also seems like something that deserves to be put into manageable groups, either to analyze as categories or to create a reasonable "scale" on which I might expect a linear relation to outcome.

--
Rich Ulrich
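Rich's two suggestions, grouping a long-tailed count and taking the log of assets, might look like the sketch below. The cut points and the dollar figure are invented for illustration, not recommendations for these data:

```python
import math

# Hypothetical cut points for number of subsidiaries: a few ordered,
# interpretable groups instead of a raw count with a long right tail.
def subsidiary_group(n):
    if n == 0:
        return "none"
    if n <= 2:
        return "1-2"
    if n <= 10:
        return "3-10"
    return "over 10"

# Firm assets: a log makes the multiplicative dollar scale additive, so a
# corner grocery and a multinational no longer dominate the variance.
def log_assets(dollars):
    return math.log10(dollars)

print(subsidiary_group(7), round(log_assets(2_500_000), 2))
```

Whether the groups enter the model as ordered categories or as a scale is then a substantive choice about linearity with respect to the outcome.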