Mahalanobis Distance

14 messages

Mahalanobis Distance

Salbod
Your help is needed on a simple Mahalanobis Distance (MD) problem.

I have three moderation models that I’m testing. They all have the same outcome variable (Y) and moderator variable (M), but each predictor variable is one of three subscales from a scale (X1,X2, & X3).

I planned to use MD to eliminate multivariate outliers, but I'm not sure which elimination procedure is best. Should I compute MD within each model and eliminate only the cases flagged in that model, or should I eliminate any flagged case from all three models (per-model vs. across-models)?
 
Any suggestions or references will be greatly appreciated.

Thank you.

Re: Mahalanobis Distance

Art Kendall
"one of three subscales from a scale (X1,X2, & X3)".

It sounds like you may mean "one of three items in a summative scale".
Or might you mean 3 scales in a battery?

Is this correct?

What is the response scale of the items/subscales? Likert (strongly disagree ... strongly agree)?

Often including the data definition of the variables makes it easier to understand the question.
Art Kendall
Social Research Consultants

Re: Mahalanobis Distance

jkpeck
In reply to this post by Salbod
Try using DETECTANOMALY (Data > Identify Unusual Cases).  It clusters the data using TWOSTEP CLUSTER with noise handling, identifies multivariate outliers, and reports reasons and other useful information.

Re: Mahalanobis Distance

Rich-Ulrich
In reply to this post by Salbod
I don't know what you have in mind when you say 'eliminate multivariate outliers.'

Are you thinking of outliers in the space of {X1, X2, X3, M and Y}?  
That would be looking (somewhat) at prediction.
It seems more reasonable to look, first, at outliers before you look at prediction,
so you would not include Y.  

But if your concern is about the scaling of X1, X2, and X3, perhaps you should
look, first, at those alone.  I would start with checking the univariate scaling:
Are there outliers?  Is there notable skew?  (Does taking log or sqrt fix it?)
(Check M and Y, too ... assuming Y is continuous.)  In my own data sets, fixing
the scaling (and repairing obvious data errors) often fixed most outliers.
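
Rich's univariate check (is there notable skew, and does a log transform fix it?) can be sketched in a few lines of Python. The data below are hypothetical right-skewed scores, purely for illustration:

```python
import math

def skewness(xs):
    """Sample skewness: third central moment over sd cubed."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed scores (e.g., raw counts with a long tail).
raw = [1, 1, 2, 2, 3, 3, 4, 5, 8, 20]
logged = [math.log(x) for x in raw]

print(skewness(raw))     # strongly right-skewed
print(skewness(logged))  # much closer to symmetric
```

If the logged values show little skew and no stray points, the "outliers" may just have been an artifact of the raw scaling.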

If this is a serious case of data-cleaning, I suppose it does make sense to check
each {Xi, M} separately, dropping the worst, before checking them together.
On the other hand, if Case N_K  has an M that is off, perhaps the case should
be dropped because of M.  You haven't given us much to go on.

Of course, your final report should explain all the cases dropped and Why, so
you do have incentive to keep cases IN, rather than drop them.  (Do not be
bound to an arbitrary, predetermined cutoff for What is an outlier.)

--
Rich Ulrich


Re: Mahalanobis Distance

Bruce Weaver
Administrator
I am in general agreement with what Rich said.  I would add that I always think measures of influence are far more telling than measures of distance or leverage on their own.  In other words, I would be more interested in things like Cook's D and DFBETAS.  And bearing in mind Rich's warning about absolute cut-offs, I would look at plots (e.g., index plots) to identify observations that merit further inspection.
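
As a rough illustration of what Cook's D measures, here is a hand-rolled Python sketch for a simple (one-predictor) regression with made-up data; the closed-form leverage used below holds only in the single-regressor case, and this is not SPSS output:

```python
def cooks_d(x, y):
    """Cook's D for each case in a simple linear regression y = a + b*x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - p)
    # Hat (leverage) values, closed form for simple regression.
    h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    return [e ** 2 / (p * mse) * hi / (1 - hi) ** 2
            for e, hi in zip(resid, h)]

# Hypothetical data: roughly y = 2x, with one influential case at the end.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 15]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.2, 15.9, 18.1, 10.0]
d = cooks_d(x, y)
print(max(range(len(d)), key=d.__getitem__))  # index of the most influential case
```

An index plot of `d` would show the last case towering over the rest: it combines high leverage (extreme x) with a large residual, which is exactly what distance or leverage alone can miss.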



--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: Mahalanobis Distance

Art Kendall
Jon, Rich, and Bruce make good points.  However, before these kinds of approaches are used, it is basic to consider what the legitimate/meaningful range of values is.  For example, ages should not have negative values. "Ontario" is not a valid value for a US state.  When tax return data have separate variables for appreciation and depreciation, there should be no negative values for appreciation. Prison population data should have no individuals who were born in the USA and are "alien felons". If a variable that is an item in a summative scale has an actual Likert response scale of 1 to 5, values of 7 are peculiar.
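
Art's point about legitimate ranges can be written down as explicit validation rules before any distance-based screening. A minimal Python sketch, with hypothetical variable names (`age`, `item7`, `state`) and a deliberately truncated state list:

```python
def check_case(case):
    """Return a list of rule violations for one record (a dict).
    Rules are illustrative, in the spirit of Art's examples."""
    problems = []
    if case["age"] < 0:
        problems.append("negative age")
    if not 1 <= case["item7"] <= 5:
        problems.append("item7 outside 1-5 Likert range")
    if case["state"] not in {"AL", "AK", "AZ"}:  # ...full list of US states
        problems.append("invalid US state")
    return problems

print(check_case({"age": 34, "item7": 7, "state": "Ontario"}))
```

Running every record through such checks first means the later multivariate screening is not dominated by plain data errors.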

<soap box>
The best way to reduce the occurrence of anomalous/peculiar/outlier/suspicious values is careful design of the data gathering process and careful completion of the data definition.
</soap box>
Art Kendall
Social Research Consultants

Re: Mahalanobis Distance

jkpeck
That leads me to point out again the VALIDATEDATA procedure in Statistics, which lets you define and apply rules that check data for simple or compound substantive issues.

Re: Mahalanobis Distance

Kirill Orlov
In reply to this post by Salbod
Mahalanobis distance from a data point to the data centroid is a generalization of abs(zscore) from the univariate to the multivariate situation. It can serve as one of the simplest measures of multivariate outlierness. If you have determined that you need to compute Mahalanobis distance to the centroid, you may use my function !KO_smahalc, found in the MATRIX-END MATRIX collection at https://www.spsstools.net/en/KO-spssmacros.
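
For readers who want to see the arithmetic, here is a minimal pure-Python sketch of Mahalanobis distance to the centroid for two variables (an illustration only, not Kirill's !KO_smahalc macro). With a single variable the same formula reduces to abs(zscore), as described:

```python
import math

def md_to_centroid_2d(data):
    """Mahalanobis distance of each 2-variable case (x, y) to the
    centroid, using the sample covariance (n - 1 denominator)."""
    n = len(data)
    mx = sum(r[0] for r in data) / n
    my = sum(r[1] for r in data) / n
    sxx = sum((r[0] - mx) ** 2 for r in data) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in data) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    out = []
    for x, y in data:
        dx, dy = x - mx, y - my
        # Quadratic form d' S^{-1} d, written out for the 2x2 inverse.
        q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        out.append(math.sqrt(q))
    return out

# Five points on the line y = 2x, plus one case that breaks the pattern.
data = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10), (6, 3)]
print(md_to_centroid_2d(data))
```

Note that (6, 3) gets the largest distance even though neither of its coordinates is extreme on its own; that is exactly what MD adds over univariate z-scores.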

Re: Mahalanobis Distance

jkpeck
Mahalanobis is also available for the regressors via the REGRESSION procedure.
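
If I recall the algorithms correctly, the Mahalanobis distance REGRESSION saves is (n − 1) times the centered leverage, so for a single regressor MD² = (n − 1)(h − 1/n), where h is the hat value. A quick pure-Python check of that identity with made-up x values:

```python
# Hypothetical single regressor, last case far from the rest.
x = [2.0, 4.0, 5.0, 7.0, 9.0, 20.0]
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
var = sxx / (n - 1)  # sample variance

# Squared Mahalanobis distance of each case to the mean of the regressor.
md2 = [(xi - xbar) ** 2 / var for xi in x]
# Hat values from regressing any y on x (intercept + slope).
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

# The identity MD^2 = (n - 1) * (h - 1/n) for a single regressor.
check = all(abs(m - (n - 1) * (hi - 1 / n)) < 1e-12
            for m, hi in zip(md2, h))
print(check)  # True
```

So high-MD cases among the regressors are exactly the high-leverage cases, which ties back to Bruce's point that influence (leverage plus residual) is more telling than either piece alone.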

Re: Mahalanobis Distance

Salbod
In reply to this post by Art Kendall
Art,

Thank you for your questions: they make me smarter.

The independent variables represent scales from the Early Childhood Behavior Questionnaire (ECBQ). Each of the three scales (i.e., Surgency, Negativity, and Effortful Control) is made up of 12 items.

--Steve


Re: Mahalanobis Distance

Salbod
In reply to this post by jkpeck
Thank you. Is there any documentation about interpreting DETECTANOMALY's output?

--Steve

Re: Mahalanobis Distance

jkpeck
There is no case study for this procedure, but the Command Syntax Reference (CSR) has some descriptive information, and the tables the procedure produces have tool tips on the headings.

If you really want to dive deep, the Algorithms manual has very detailed information for the procedure.

Re: Mahalanobis Distance

Art Kendall
In reply to this post by Salbod
If I recall correctly, those are second-order factors from a PAF.  If it was done well, discriminant validity should be high.  If it is a short form but the scoring key was still devised using PAF, discriminant validity should also be high.

I agree with Rich that the DV should be explored separately.

An additional perspective can be gained by coarsening M and using it as a nominal variable in a 3-D scatterplot, i.e., colors or symbols for M, and X1 to X3 as dimensions.  In the output window, rotate the graph in different ways.
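
Coarsening M into, say, tertiles can be done by rank; a small Python sketch (the group labels and the equal-count cut rule are illustrative choices, not the only way to do it):

```python
def coarsen(values, labels=("low", "mid", "high")):
    """Coarsen a continuous moderator into roughly equal-sized groups
    by rank (tertiles here), for use as a nominal plotting variable."""
    order = sorted(range(len(values)), key=values.__getitem__)
    k = len(labels)
    out = [None] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = labels[rank * k // len(values)]
    return out

# Hypothetical moderator scores for six cases.
m = [0.2, 1.5, 0.9, 2.8, 0.1, 2.2]
print(coarsen(m))
```

The resulting labels can then drive the colors or symbols in the rotated 3-D plot Art describes.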
Art Kendall
Social Research Consultants

Re: Mahalanobis Distance

jkpeck
If you want a graphical view, there is a bagplot extension on the Extension Hub that produces a matrix scatter of 3-D plots.  It can handle up to 10 dimensions, although that gets rather hard to read.