Mahalanobis Distance

14 messages

Mahalanobis Distance

Salbod
Your help is needed on a simple Mahalanobis Distance (MD) problem.

I have three moderation models that I’m testing. They all have the same outcome variable (Y) and moderator variable (M), but each predictor variable is one of three subscales from a scale (X1,X2, & X3).

I planned to use MD to eliminate multivariate outliers, but I'm not sure which elimination procedure is best. Should I compute MD within each model and eliminate only the cases flagged in that model, or should I eliminate any flagged case from all three models (per-model vs. across-models)?
 
Any suggestions or references will be greatly appreciated.

Thank you.

Re: Mahalanobis Distance

Art Kendall
"one of three subscales from a scale (X1,X2, & X3)".

It sounds like you may mean "one of three items in a summative scale".
Or might you mean 3 scales in a battery?

Is this correct?

What is the response scale of the items/subscales? Likert (strongly disagree ... strongly agree)?

Often including the data definition of the variables makes it easier to understand the question.
Art Kendall
Social Research Consultants

Re: Mahalanobis Distance

jkpeck
In reply to this post by Salbod
Try using DETECTANOMALY (Data > Identify Unusual Cases).  It clusters the data using TWOSTEP CLUSTER with noise handling, identifies multivariate outliers, and reports reasons and other useful information.

Re: Mahalanobis Distance

Rich-Ulrich
In reply to this post by Salbod
I don't know what you have in mind when you say 'eliminate multivariate outliers.'

Are you thinking of outliers in the space of {X1, X2, X3, M and Y}?  
That would be looking (somewhat) at prediction.
It seems more reasonable to look, first, at outliers before you look at prediction,
so you would not include Y.  

But if your concern is about the scaling of X1, X2, and X3, perhaps you should
look, first, at those alone.  I would start with checking the univariate scaling:
Are there outliers?  Is there notable skew?  (Does taking log or sqrt fix it?)
(Check M and Y, too ... assuming Y is continuous.)  In my own data sets, fixing
the scaling (and repairing obvious data errors) often fixed most outliers.
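
Rich's univariate check (is there notable skew, and does a log transform fix it?) can be sketched in a few lines of Python. The data below are hypothetical right-skewed scores, purely for illustration:

```python
import math

def skewness(xs):
    """Sample skewness: third central moment over sd cubed."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed scores (e.g., raw counts with a long tail).
raw = [1, 1, 2, 2, 3, 3, 4, 5, 8, 20]
logged = [math.log(x) for x in raw]

print(skewness(raw))     # strongly right-skewed
print(skewness(logged))  # much closer to symmetric
```

If the logged values show little skew and no stray points, the "outliers" may just have been an artifact of the raw scaling.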

If this is a serious case of data-cleaning, I suppose it does make sense to check
each {Xi, M} separately, dropping the worst, before checking them together.
On the other hand, if Case N_K  has an M that is off, perhaps the case should
be dropped because of M.  You haven't given us much to go on.

Of course, your final report should explain all the cases dropped and Why, so
you do have incentive to keep cases IN, rather than drop them.  (Do not be
bound to an arbitrary, predetermined cutoff for What is an outlier.)

--
Rich Ulrich


Re: Mahalanobis Distance

Bruce Weaver
Administrator
I am in general agreement with what Rich said.  I would add that I always think measures of influence are far more telling than measures of distance or leverage on their own.  In other words, I would be more interested in things like Cook's D and DFBETAS.  And bearing in mind Rich's warning about absolute cut-offs, I would look at plots (e.g., index plots) to identify observations that merit further inspection.
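
As a rough illustration of what Cook's D measures, here is a hand-rolled Python sketch for a simple (one-predictor) regression with made-up data; the closed-form leverage used below holds only in the single-regressor case, and this is not SPSS output:

```python
def cooks_d(x, y):
    """Cook's D for each case in a simple linear regression y = a + b*x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - p)
    # Hat (leverage) values, closed form for simple regression.
    h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    return [e ** 2 / (p * mse) * hi / (1 - hi) ** 2
            for e, hi in zip(resid, h)]

# Hypothetical data: roughly y = 2x, with one influential case at the end.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 15]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.2, 15.9, 18.1, 10.0]
d = cooks_d(x, y)
print(max(range(len(d)), key=d.__getitem__))  # index of the most influential case
```

An index plot of `d` would show the last case towering over the rest: it combines high leverage (extreme x) with a large residual, which is exactly what distance or leverage alone can miss.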



--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: Mahalanobis Distance

Art Kendall
Jon, Rich, and Bruce make good points.  However, before these kinds of approaches are used, it is basic to consider what the legitimate/meaningful range of values is.  For example, ages should not have negative values. "Ontario" is not a valid value for a US state.  When tax return data have separate variables for appreciation and depreciation, there should be no negative values for appreciation. Prison population data should have no individuals who were born in the USA and are "alien felons". If a variable that is an item in a summative scale has an actual Likert response scale of 1 to 5, values of 7 are peculiar.
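
Art's point about legitimate ranges can be written down as explicit validation rules before any distance-based screening. A minimal Python sketch, with hypothetical variable names (`age`, `item7`, `state`) and a deliberately truncated state list:

```python
def check_case(case):
    """Return a list of rule violations for one record (a dict).
    Rules are illustrative, in the spirit of Art's examples."""
    problems = []
    if case["age"] < 0:
        problems.append("negative age")
    if not 1 <= case["item7"] <= 5:
        problems.append("item7 outside 1-5 Likert range")
    if case["state"] not in {"AL", "AK", "AZ"}:  # ...full list of US states
        problems.append("invalid US state")
    return problems

print(check_case({"age": 34, "item7": 7, "state": "Ontario"}))
```

Running every record through such checks first means the later multivariate screening is not dominated by plain data errors.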

<soap box>
The best way to reduce the occurrence of anomalous/peculiar/outlier/suspicious values is careful design of the data gathering process and careful completion of the data definition.
</soap box>
Art Kendall
Social Research Consultants

Re: Mahalanobis Distance

jkpeck
That leads me to point out again the VALIDATEDATA procedure in Statistics, which lets you define and apply rules that check data for simple or compound substantive issues.

Re: Mahalanobis Distance

Kirill Orlov
In reply to this post by Salbod
Mahalanobis distance from a data point to the data centroid is a generalization of abs(zscore) from the univariate to the multivariate situation. It can serve as one of the simplest measures of multivariate outlierness. If you have determined that you need to compute Mahalanobis distance to the centroid, you may use my function !KO_smahalc, found in the MATRIX-END MATRIX collection at https://www.spsstools.net/en/KO-spssmacros.
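
For readers who want to see the arithmetic, here is a minimal pure-Python sketch of Mahalanobis distance to the centroid for two variables (an illustration only, not Kirill's !KO_smahalc macro). With a single variable the same formula reduces to abs(zscore), as described:

```python
import math

def md_to_centroid_2d(data):
    """Mahalanobis distance of each 2-variable case (x, y) to the
    centroid, using the sample covariance (n - 1 denominator)."""
    n = len(data)
    mx = sum(r[0] for r in data) / n
    my = sum(r[1] for r in data) / n
    sxx = sum((r[0] - mx) ** 2 for r in data) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in data) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    out = []
    for x, y in data:
        dx, dy = x - mx, y - my
        # Quadratic form d' S^{-1} d, written out for the 2x2 inverse.
        q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        out.append(math.sqrt(q))
    return out

# Five points on the line y = 2x, plus one case that breaks the pattern.
data = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10), (6, 3)]
print(md_to_centroid_2d(data))
```

Note that (6, 3) gets the largest distance even though neither of its coordinates is extreme on its own; that is exactly what MD adds over univariate z-scores.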

Re: Mahalanobis Distance

jkpeck
Mahalanobis is also available for the regressors via the REGRESSION procedure.
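
If I recall the algorithms correctly, the Mahalanobis distance REGRESSION saves is (n − 1) times the centered leverage, so for a single regressor MD² = (n − 1)(h − 1/n), where h is the hat value. A quick pure-Python check of that identity with made-up x values:

```python
# Hypothetical single regressor, last case far from the rest.
x = [2.0, 4.0, 5.0, 7.0, 9.0, 20.0]
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
var = sxx / (n - 1)  # sample variance

# Squared Mahalanobis distance of each case to the mean of the regressor.
md2 = [(xi - xbar) ** 2 / var for xi in x]
# Hat values from regressing any y on x (intercept + slope).
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

# The identity MD^2 = (n - 1) * (h - 1/n) for a single regressor.
check = all(abs(m - (n - 1) * (hi - 1 / n)) < 1e-12
            for m, hi in zip(md2, h))
print(check)  # True
```

So high-MD cases among the regressors are exactly the high-leverage cases, which ties back to Bruce's point that influence (leverage plus residual) is more telling than either piece alone.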

Re: Mahalanobis Distance

Salbod
In reply to this post by Art Kendall
Art,

Thank you for your questions: they make me smarter.

The independent variables represent scales from the Early Childhood Behavior Questionnaire (ECBQ). Each of the three scales (i.e., Surgency, Negativity, and Effortful Control) is made up of 12 items.

--Steve


Re: Mahalanobis Distance

Salbod
In reply to this post by jkpeck
Thank you. Is there any documentation about interpreting DETECTANOMALY's output?

--Steve

Re: Mahalanobis Distance

jkpeck
There is no case study for this procedure, but the Command Syntax Reference (CSR) has some descriptive information, and the tables the procedure produces have tool tips on the headings.

If you really want to dive deep, the Algorithms manual has very detailed information for the procedure.

Re: Mahalanobis Distance

Art Kendall
In reply to this post by Salbod
If I recall correctly, those are second-order factors from a PAF.  If it was done well, discriminant validity should be high.  If it is a short form but the scoring key was still devised using PAF, discriminant validity should also be high.

I agree with Rich that the DV should be explored separately.

An additional perspective can be gained by coarsening M and using it as a nominal variable in a 3-D scatterplot, i.e., colors or symbols for M, and X1 to X3 as dimensions.  In the output window, rotate the graph in different ways.
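
Coarsening M into, say, tertiles can be done by rank; a small Python sketch (the group labels and the equal-count cut rule are illustrative choices, not the only way to do it):

```python
def coarsen(values, labels=("low", "mid", "high")):
    """Coarsen a continuous moderator into roughly equal-sized groups
    by rank (tertiles here), for use as a nominal plotting variable."""
    order = sorted(range(len(values)), key=values.__getitem__)
    k = len(labels)
    out = [None] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = labels[rank * k // len(values)]
    return out

# Hypothetical moderator scores for six cases.
m = [0.2, 1.5, 0.9, 2.8, 0.1, 2.2]
print(coarsen(m))
```

The resulting labels can then drive the colors or symbols in the rotated 3-D plot Art describes.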
Art Kendall
Social Research Consultants

Re: Mahalanobis Distance

jkpeck
If you want a graphical view, there is a bagplot extension on the Extension Hub that produces a matrix scatter of 3-D plots.  It can handle up to 10 dimensions, although that gets rather hard to read.