I am currently cleaning my data in SPSS to prepare for a later logistic regression analysis. I first identified univariate outliers as cases with z scores > 3 and winsorized them using the 1.5*IQR rule. Z scores of the winsorized variables were then double-checked using casewise diagnostics (in linear regression) to ensure none exceeded 3 standard deviations. After that, only one multivariate outlier was identified using Mahalanobis distance, and I deleted it.
When I tried to run logistic regression on this set of "no outliers" data, the output showed, "The casewise plot is not produced because no outliers were found". But the computed standardized residuals variable (ZRE_1) showed that there were still 29 cases with z scores > 3! How can that be? I am still unable to figure it out. Don't the casewise listing of residuals and the standardized residuals use the same measurement (z scores)?

In fact, the model fit increased from 10.3% to 38.9% when I deleted all 29 cases! So, how should I deal with these?
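For concreteness, the cleaning steps described above (1.5*IQR winsorizing, then a z-score check) can be sketched outside SPSS. This is a minimal Python sketch with made-up data, not the original SPSS procedure:

```python
import statistics

def iqr_fences(values):
    """Tukey fences: 1.5 * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def winsorize(values):
    """Clamp every value to the 1.5*IQR fences instead of deleting it."""
    lo, hi = iqr_fences(values)
    return [min(max(v, lo), hi) for v in values]

def z_scores(values):
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

data = [2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 40]   # one extreme value
clean = winsorize(data)                      # 40 is pulled in to the upper fence
print(max(abs(z) for z in z_scores(clean)))  # now well under 3
```

Note that this only guarantees the *predictors* have no |z| > 3; as the replies below discuss, it says nothing about the residuals of a later model.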
The raw variables have certain standard deviations, and those provided your first z-scores. Given outliers, winsorizing is what I may retreat to if I cannot improve the interval-level of measurement (and homogeneity of variances) by taking a transformation. But if the extreme scores have equally extreme results, it is simply *wrong* to tamper with them.

If you are trying to achieve "normality" so that your test statistics are most robust, you need to remember that the "condition" of normality in Least Squares applies to residuals, not to the predictors. A similar condition is slightly less important in logistic regression.

Why are there larger residuals? The *residuals* from the prediction have a different S.D. The extremes are not different from the values as a whole, but are apparently different from cases with similar outcomes. What it says when you have outliers here is that you have a number of badly-fitted cases, so your "model fit" naturally looks better if you discard them.

Is the 10% vs. 39% a comparison of a couple of values of a pseudo-R^2?

--
Rich Ulrich

> Date: Fri, 18 Mar 2011 08:27:37 -0700
> From: [hidden email]
> Subject: outliers - casewise listing of residuals and standardized residuals
> To: [hidden email]
>
> [snip, previous]

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L. For a list of commands to manage subscriptions, send the command INFO REFCARD.
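Rich's point that residuals have their own S.D. can be made concrete with the standardized (Pearson) residual for a binary outcome, (y - p) / sqrt(p * (1 - p)). A short Python sketch (the fitted probability 0.98 is invented for illustration):

```python
import math

def pearson_residual(y, p):
    """Standardized (Pearson) residual for a binary outcome y in {0, 1}
    with fitted probability p: (y - p) / sqrt(p * (1 - p))."""
    return (y - p) / math.sqrt(p * (1 - p))

# A case the model confidently assigns to group 1 (fitted p = .98):
print(pearson_residual(1, 0.98))  # tiny residual if it really is group 1
print(pearson_residual(0, 0.98))  # |z| far beyond 3 if it is actually group 0
```

So a case can be perfectly ordinary on every predictor (no univariate or Mahalanobis flag) and still produce a standardized residual beyond 3, simply because its observed outcome disagrees with a confident prediction.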
In reply to this post by lcl23
I am leery of blindly transforming or deleting data. <soapbox off>

Identifying suspicious values should result in going back and reviewing the data. Deleting and/or transforming should only be done after one is very confident that there was not a data-entry error or a substantive cause. Since the early 70s, it has been my experience that in excess of a subjective 80% of outliers are data-entry errors. Of course, a lot depends on how you obtained the data: whether it was proofread or double-keyed, whether cases are actually outside the population of interest, etc.

Usually one worries about the distributions of residuals (y - yhat) rather than the distributions of predictors (x's), so that was a good idea. Is it possible that some variable that could influence the result was not in the model?

In my experience, looking at extreme values of variables is useful in detecting unusual values. Are all of the values within the legitimate domain of measurement of the construct? Looking for multivariate suspicious values is a good idea; Data > Identify Unusual Cases can be useful to find values to look into.

Specific ways to check on data anomalies depend on the nature of the data and what phenomena you are looking at. E.g., if you have self-reporting, you might have "pattern responses". Did you go back and look at cases with suspicious values to see if there might be reasons for their extremeness?

Art Kendall
Social Research Consultants

On 3/18/2011 11:27 AM, lcl23 wrote: [snip, previous]
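Art's advice about checking the legitimate domain of measurement can be partly automated with simple range checks before any modeling. A Python sketch; the variable names and legal ranges here are hypothetical, and flagged cases should be reviewed against source documents rather than deleted:

```python
# Hypothetical legal ranges for each variable (not from the original data).
LEGAL_RANGE = {"age": (18, 99), "meetings_per_year": (0, 52)}

def flag_suspect_cases(rows):
    """Return (case_id, variable, value) for every out-of-range entry,
    so a human can check for keying errors or out-of-population cases."""
    flags = []
    for case_id, row in enumerate(rows, start=1):
        for var, value in row.items():
            lo, hi = LEGAL_RANGE[var]
            if not lo <= value <= hi:
                flags.append((case_id, var, value))
    return flags

rows = [{"age": 34, "meetings_per_year": 6},
        {"age": 340, "meetings_per_year": 6}]  # 340 looks like a keying error
print(flag_suspect_cases(rows))
```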
In reply to this post by lcl23
> Date: Fri, 18 Mar 2011 22:26:50 -0700
> From: [hidden email]
> Subject: RE: outliers - casewise listing of residuals and standardized residuals
> To: [hidden email]
>
> Most of my predictors have severe positive skew & transformation
> doesn't help.

I don't know why you mention "positive skew" since a long tail to the right, giving positive skew, is what is most common. Those are the outliers that are corrected in one step by the most common transformations (log, square root, reciprocal) for data scored starting at or above zero.

For the simplest case of negative skew, the problem is that one needs to take the MAXIMUM score as the new zero: instead of giving someone 100 on a test, you may re-score them as "0 errors" for the sake of analysis. And if the scores approach both the natural (or scaled) maximum and minimum, then you would want to consider a symmetrical transformation such as the logit.

But transformations should be "natural" to the data, whenever feasible. The best information for deciding what is natural is to consider "What is measured?" and "How is it measured?" And you should consider, at the same time, what you are trying to predict: a natural transformation for one criterion is not necessarily the same as for another. My main conclusion, here, is that you need to say what your variables actually are, so you can get appropriate advice for transformations.

> So, I winsorized it. Although normality for predictors is
> not required, it will make for a stronger assessment if it exists. By
> the way, could you explain more on "if the extreme scores have equally
> extreme results"?

I do agree that, as a habit, you get a better prediction equation when the scores are normal. But that is predicated on implicit assumptions that "normal" is going to be equal-interval ... especially in its relation to the outcome.
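The one-step fixes mentioned above (log, square root, reciprocal, and reflecting about the maximum for negative skew) can be sketched in Python. The data are invented, and the moment-based skewness used here is only one of several common definitions:

```python
import math

def skewness(xs):
    """Moment-based sample skewness: g1 = m3 / m2**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# A positively skewed variable (long right tail):
raw = [1, 1, 2, 2, 3, 3, 4, 5, 8, 20, 60]
logged = [math.log(x) for x in raw]   # log is usable because every x > 0

# Negative skew: reflect about the maximum so the long tail points right,
# then an ordinary right-tail transformation (e.g. log) can be applied.
neg = [100 - x for x in raw]                  # long LEFT tail
reflected = [max(neg) - x + 1 for x in neg]   # +1 keeps log defined at the max

print(skewness(raw), skewness(logged), skewness(neg))
```

The skewness of the logged series is much closer to zero than that of the raw series, which is the whole point of the transformation.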
So, as a counter-example: for a continuous predictor, an "outlier" actually gives strength to the R^2 whenever it predicts an outcome that is also an outlier (in the right direction).

Similarly, as a counter-example to requiring normality for logistic regression: a case can be "over-predicted" as group-1 because it is extreme and outlying on the predictors; this will not hurt the maximum-likelihood fit *if* the group assignment is correct. However, this same case is apt to show up as an outlier when it looks like an extreme instance of group-1 but is actually a member of group-2. Logistic regression is said to be more robust against bad distributions than the corresponding 2-group discriminant function *because* over-prediction need not hurt.

> After identifying univariate & multivariate outliers, I still need to
> face the z scores pertinent to residuals. So, there are 2 different z
> scores here, measuring outliers at different stages?
>
> The 29 cases I mentioned previously have Cook's distance < 1, which
> means I need not discard them all? Yes, the 10% (retain) vs 39%
> (discard) is the Nagelkerke R^2.

Let me ask this: What is your N? And: Is the original result even statistically significant? What do the 29 have in common? What excuse do you have to drop them? I don't think you can publish, or impress any boss, if the best argument you have is something that comes down to, "We can achieve a pretty good fit only if we throw away a bunch of cases that do not fit."

> Sorry for so many questions, still in the process of learning stats...
> --- On Sat, 19/3/11, Rich Ulrich wrote: [snip, previous]

Your reply came to me in private mail, though I did not notice that at first. I hope that it is okay that I have replied both privately and to the List.

--
Rich Ulrich
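Since the 10% vs. 39% figures are Nagelkerke R^2 values, it may help to see how that statistic is built from the null and fitted log-likelihoods: the Cox & Snell R^2 cannot reach 1 for a binary outcome, so Nagelkerke rescales it by its maximum. The log-likelihood values below are purely illustrative, not from the poster's data:

```python
import math

def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke R^2 from log-likelihoods of the null and fitted models.
    Cox & Snell: R2_CS = 1 - exp(2 * (ll_null - ll_model) / n);
    Nagelkerke rescales by its maximum, 1 - exp(2 * ll_null / n)."""
    cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)
    max_cs = 1 - math.exp(2 * ll_null / n)
    return cox_snell / max_cs

# Hypothetical log-likelihoods for n = 800 cases:
print(nagelkerke_r2(-550.0, -520.0, 800))  # roughly 0.10, i.e. about 10%
```

Because the statistic depends on the fitted log-likelihood, discarding the worst-fitted cases mechanically raises it, which is exactly Rich's objection.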
In reply to this post by lcl23
> Date: Sun, 20 Mar 2011 07:39:19 -0700
> From: [hidden email]
> Subject: RE: outliers - casewise listing of residuals and standardized residuals
> To: [hidden email]; [hidden email]
>
> Three problematic predictors:
> (1) No. of board meetings - skewness (3.4)
> (2) Firm's assets - in dollars, skewness (17.3)
> (3) No. of subsidiaries - skewness (7.7)
>
> N is 800. In general, the model is significant but R^2 is quite
> low (10%). I have yet to decide what to do about those extreme cases.
> Different textbooks give different suggestions.

It might be because I groom my variables before I look at skewness, but I've usually been concerned with skewness between 0.35 and 1.5. I'm not even convinced that it is possible to have skewness of 17.3 or 7.7 - would you describe one set of those values?

"Board meetings" - In a model, I think I would expect to see groupings of, say, "daily/ weekly/ monthly/ annually/ less". That grouping is not going to come out very weird for skewness, but I *might* suspect that it is non-linear and *needs* categories, depending on the outcome variable.

"Firm's assets" - Are you trying to model across multinationals, down to the corner grocery? Assets seems a natural for taking a log, but it also seems natural to create a model without the world's full range of possibilities.

"No. of subsidiaries" - This also seems like something that deserves to be put into manageable groups, either to analyze as categories or to create a reasonable "scale" on which I might expect a linear relation to outcome.

--
Rich Ulrich
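Rich's two suggestions, grouping a long-tailed count and taking the log of assets, might look like the sketch below. The cut points and the dollar figure are invented for illustration, not recommendations for these data:

```python
import math

# Hypothetical cut points for number of subsidiaries: a few ordered,
# interpretable groups instead of a raw count with a long right tail.
def subsidiary_group(n):
    if n == 0:
        return "none"
    if n <= 2:
        return "1-2"
    if n <= 10:
        return "3-10"
    return "over 10"

# Firm assets: a log makes the multiplicative dollar scale additive, so a
# corner grocery and a multinational no longer dominate the variance.
def log_assets(dollars):
    return math.log10(dollars)

print(subsidiary_group(7), round(log_assets(2_500_000), 2))
```

Whether the groups enter the model as ordered categories or as a scale is then a substantive choice about linearity with respect to the outcome.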