IQR and outliers


Chris Vaughan-3
Hello Everyone,

I need some help with syntax, please. I have a dataset with a number of variables representing different test scores. I am trying to exclude outliers based on a 1.5 times IQR cut-off, for each variable separately. I don't want to exclude the entire case, just that case's value on the one variable (e.g., one test score). Unfortunately, this has to be done separately for each of the 88 different variables (11 variables by 8 age groups). Obviously, I could use the Explore procedure to get the 1st and 3rd quartiles, calculate the IQR and 1.5 * IQR, and then select the cases that fall between the cut-offs. The problem is that I don't want to do this 88 separate times, manually entering each quartile into the syntax. I would also like to be able to re-run the syntax after additional data are added.

Any help on how to get this done would be greatly appreciated. I have attempted to adapt some similar syntax from the list archive that uses the mean and SD as the basis for exclusion; however, I don't know what the correct terms are for SPSS to look up the 1st and 3rd quartiles, so I haven't been able to translate it.

Thanks in advance,

Chris Vaughan

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: IQR and outliers

Marta Garcia-Granero
Chris Vaughan wrote:

> I need some help with syntax, please. I have a dataset with a number of variables representing different test scores. I am trying to exclude outliers based on a 1.5 times IQR cut-off, for each variable separately. I don't want to exclude the entire case, just that case's value on the one variable (e.g., one test score). Unfortunately, this has to be done separately for each of the 88 different variables (11 variables by 8 age groups). Obviously, I could use the Explore procedure to get the 1st and 3rd quartiles, calculate the IQR and 1.5 * IQR, and then select the cases that fall between the cut-offs. The problem is that I don't want to do this 88 separate times, manually entering each quartile into the syntax. I would also like to be able to re-run the syntax after additional data are added.
>
> Any help on how to get this done would be greatly appreciated. I have attempted to adapt some similar syntax from the list archive that uses the mean and SD as the basis for exclusion; however, I don't know what the correct terms are for SPSS to look up the 1st and 3rd quartiles, so I haven't been able to translate it.
>
First of all, the wisdom of eliminating outliers has been discussed
several times, and the general idea is that it shouldn't be done lightly.

Now, concerning your question: I'm sure you can modify this macro to work
with the whole list of variables (I'm rather busy right now, so I've written
something quickly; it is not the best answer). It works on a single
quantitative variable and a single grouping variable.

WARNING: outliers will be replaced by missing values, so you should keep an
untouched copy of the dataset!
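
A minimal sketch of one way to keep that untouched copy within the session before running the macro, assuming a version of SPSS that supports multiple open datasets (the macro already relies on the DATASET commands). The dataset name Backup is hypothetical and not part of the macro.

```spss
* Hypothetical dataset name: Backup is an assumption and can be anything unused.
DATASET COPY Backup.
* The copy does not become active, so the macro still works on the current dataset.
```

If the cleaning turns out to be unwanted, DATASET ACTIVATE Backup. switches back to the untouched data. A copy made this way lives only in the session, so it should still be saved to disk if it needs to survive closing SPSS.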

DEFINE CleanData(!POS !TOKENS(1) / !POS !TOKENS(1)).
* This is needed for matching percentile data later to dataset *.
PRESERVE.
SET TNumbers=Values.
SET OLANG=ENGLISH.
DATASET NAME OriginalData.
DATASET DECLARE Percentiles.
OMS
 /SELECT TABLES
 /IF COMMANDS = ["Explore"]
     SUBTYPES = ["Percentiles"]
 /DESTINATION FORMAT = SAV
  OUTFILE = Percentiles.
EXAMINE
  VARIABLES=!1 BY !2
  /PLOT NONE
  /PERCENTILES(25,50,75)
  /STATISTICS NONE
  /MISSING PAIRWISE
  /NOTOTAL.
OMSEND.
DATASET ACTIVATE Percentiles.
SELECT IF (Var1 = "Tukey's Hinges").
EXE./* Needed before any "DELETE VARIABLES" *.
DELETE VARIABLES Command_ TO Var2 @50.
RENAME VARIABLES Var3=!2.
DATASET ACTIVATE OriginalData.
SORT CASES BY !2(A).
MATCH FILES /FILE=*
 /TABLE='Percentiles'
 /BY !2.
DATASET CLOSE Percentiles.
COMPUTE IQR1.5=1.5*(@75-@25).
COMPUTE Lower=@25-IQR1.5.
COMPUTE Upper=@75+IQR1.5.
IF (!1 LT Lower) OR (!1 GT Upper) !1=$SYSMIS.
EXE.
DELETE VARIABLES @25 TO Upper.
RESTORE.
!ENDDEFINE.

* I have used "1991 U.S. General Social Survey.sav", variables age (as
quantitative) & race (as grouping),
  and since there were no outliers in the three race groups, I added
some false data at the end of the file (clear outliers)
  to be sure that they were detected and cleaned correctly * .

CleanData age race.
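
To see how many values the macro actually blanked, one option is to wrap the call in a per-group summary, sketched below with the same age and race variables. This is only an illustration of a check, not part of the macro, and it is meant to be run instead of the bare call above, because running the macro a second time would apply new cut-offs to the already cleaned data.

```spss
* Per-group count, minimum and maximum before cleaning.
MEANS TABLES=age BY race
  /CELLS=COUNT MIN MAX.
CleanData age race.
* Same table after cleaning: a lower count means values were set to system-missing.
MEANS TABLES=age BY race
  /CELLS=COUNT MIN MAX.
```

Comparing the two tables shows, for each group, how many cases fell outside the cut-offs and what the remaining range is.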

HTH,
Marta García-Granero

--
For miscellaneous statistical stuff, visit:
http://gjyp.nl/marta/

Re: IQR and outliers

Marta Garcia-Granero
I finally found the time to modify the macro to work with a list of
dependent variables plus one grouping variable (might be useful for
people I work with). Also, it now modifies a COPY of the dataset,
leaving the original data unchanged.

* Cleaning outliers (1.5·IQR cut-off criteria) in a set of quantitative
variables with one grouping variable *.

DEFINE CleanAllData(!POS !TOKENS(1) / !POS !ENCLOSED('(',')')).
* See boxplot of all variables to check if there are data that should be
"cleaned" *.
EXAMINE
  VARIABLES=!2 BY !1
  /PLOT BOXPLOT
  /COMPARE GROUP
  /STATISTICS NONE
  /MISSING PAIRWISE
  /NOTOTAL.
PRESERVE.
* Show numbers instead of labels in pivot tables (needed to match
percentiles to file later) *.
SET TNumbers=Values.
* Set output language to English (needed to select "Tukey's Hinges"
later) *.
SET OLANG=ENGLISH.
* Make a working copy of the dataset, to leave the original unmodified *.
DATASET COPY CleanedData.
DATASET ACTIVATE CleanedData.
* To restore original order later *.
COMPUTE id=$casenum.
SORT CASES BY !1(A).
*  Cycle thru the list of quantitative variables *.
!DO !var !IN (!2).
. DATASET DECLARE Percentiles WINDOW=HIDDEN.
. OMS
 /SELECT TABLES
 /IF COMMANDS = ["Explore"]
     SUBTYPES = ["Percentiles"]
 /DESTINATION FORMAT = SAV
  OUTFILE = Percentiles.
. EXAMINE
  VARIABLES=!var BY !1
  /PLOT NONE
  /PERCENTILES(25,50,75)
  /STATISTICS NONE
  /NOTOTAL.
. OMSEND.
. DATASET ACTIVATE Percentiles.
* Get rid of useless data *.
. SELECT IF (Var1 = "Tukey's Hinges").
. EXE./* Needed before next command *.
. DELETE VARIABLES Command_ TO Var2 @50.
. RENAME VARIABLES Var3=!1.
* Adding P25&P75 to file *.
. DATASET ACTIVATE CleanedData WINDOW=ASIS.
. MATCH FILES
 /FILE=*
 /TABLE='Percentiles'
 /BY !1.
. DATASET CLOSE Percentiles.
* Computing cut-off values *.
. COMPUTE IQR1.5=1.5*(@75-@25).
. COMPUTE Lower=@25-IQR1.5.
. COMPUTE Upper=@75+IQR1.5.
* Cleaning outliers *.
. IF (!var LT Lower) OR (!var GT Upper) !var=$SYSMIS.
. EXE. /* Needed too *.
* Eliminating useless variables *.
. DELETE VARIABLES @25 TO Upper.
!DOEND.
* Restoring original order *.
SORT CASES BY id(A).
DELETE VARIABLES id.
RESTORE. /* Restoring original TNumbers and Olang settings *.
!ENDDEFINE.

* I have used "1991 U.S. General Social Survey.sav", with race as grouping.
* Grouping variable goes first, then the list of quantitative variables (don't
use the keyword TO,
  even if the variables are consecutive) enclosed in parentheses *.

CleanAllData race (age educ paeduc maeduc speduc).

* "CleanedData" will contain the outliers-cleared copy of the original
dataset *.
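
For the situation in the original question (11 test scores within 8 age groups), the call might look like the sketch below. It assumes the age group is stored in a single grouping variable; the names agegrp and test1 to test11 are hypothetical stand-ins for the real variable names. If the 88 scores are stored as 88 separate columns instead, this call does not apply as written.

```spss
* Hypothetical names: agegrp is the grouping variable and test1 to test11 are the scores.
CleanAllData agegrp (test1 test2 test3 test4 test5 test6 test7 test8 test9 test10 test11).
* Inspect the cleaned copy; the original dataset is left as it was.
DATASET ACTIVATE CleanedData.
MEANS TABLES=test1 test2 test3 BY agegrp
  /CELLS=COUNT MIN MAX.
```

Because the macro recomputes the hinges with EXAMINE on every run, no quartile values have to be typed into the syntax by hand when more data are added.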


Best regards,
Marta


--
For miscellaneous statistical stuff, visit:
http://gjyp.nl/marta/
