Syntax to Clean Data - Needs Help

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Syntax to Clean Data - Needs Help

Fahim Jafary
Hi Folks.  New to SPSS and would be grateful for some help.

I'm reviewing a dataset which I need to "clean" before analysis.  I
need a syntax method to do the following:

1.  Do a frequencies and descriptives command on ALL categorical and
continuous variables respectively that are present in the dataset..  If I
do Frequencies All or
Descriptives ALL, SPSS doesn't differentiate between categorical and
continuous variables.

2.  Identify outliers for continuous variables and then display them
with their case ID numbers so that I may go back and check.  One syntax
suggested to me is as follows for variable age.  However, doing this
repeatedly for dozens of variables is very tedious and suggestions for
an effecient way would be a great help.  Furthermore this syntax does
NOT list which cases are the outliers

DESCRIPTIVES
   VARIABLES=age  /SAVE
   /STATISTICS=MEAN STDDEV RANGE MIN MAX .

compute outlier1=abs(zage).
execute.

recode
outlier1 (3.29 thru highest=4)(2.58 thru highest=3)(1.96 thru
highest=2)(lowest thru 2=1).
exe.

value labels outlier1
1'Absolute z score < 2' 2 'Absolute z zcore > 1.96' 3 'Absolute z score

 > 2.58' 4 'Absolute z score > 3.29'.

Frequencies
Variables=outlier1

Once again, many thanks for the help

Fahim H. Jafary, MD
Aga Khan University Hospital
Karachi, Pakistan.
Reply | Threaded
Open this post in threaded view
|

Re: Syntax to Clean Data - Needs Help

Maguin, Eugene
Fahim,

I think there is no easy way to do what you ask.

With respect to item 1: I would do a Display dictionary command and copy or
export the output to word, edit the table to delete columns, then sort on
measurement level column and finally, copy the variable lists back into
spss. It's tedious but easy. All that said, I do think that if you want to
invest in learning either the Sax basic scripting language or, assuming you
have version 14, the Python language, then you can do what you want directly
using a combination of Sax basic or python language structures and spss
syntax. I think others on this list have done and can advise you directly.

With respect to item 2: I think you have to do two passes. In pass one, do
your frequencies and then inspect the distribution for each variable and
identify outliers. There was a discussion of outliers on the list a few days
ago, which will be in the archives. However, my opinion is that outliers are
distribution dependent. I don't think you can arbitrarily set a greater than
3 standard deviation rule. Anyway. Then, in pass two use a Do repeat
structure to create an new variable that marks a case as having an outlier
for that variable. Like this.

Do repeat var=x1 to x10/
   low=1.2 2.4 4.5 78 ..../
   hi=5 5.6 12.3 345 .../
   out=ol1 to ol10.
+  if (var le low or var ge hi) out=1.
End repeat.

The low and hi values define the beginning of the outlier region and are
included in it. Once that is done, you can do a series of list commands, as
in

Temporary.
Select if (ol1 eq 1).
List case x1.

You'll need one for every variable if you want to see just the outlier cases
for that variable.

However, what would be more compact would be to run a list or print command
inside of a do repeat structure, which unfortunately won't work. What I
think will work is a macro. I don't use macros; so I can't advise you. They
are documented in the syntax manual and you may be able to modify a example.
Otherwise, I know there are others on the list who are skilled with macros.

Gene Maguin
Reply | Threaded
Open this post in threaded view
|

Re: Syntax to Clean Data - Needs Help

Peck, Jon
In reply to this post by Fahim Jafary
-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Fahim H. Jafary
Sent: Thursday, August 31, 2006 2:58 PM
To: [hidden email]
Subject: [SPSSX-L] Syntax to Clean Data - Needs Help

Hi Folks.  New to SPSS and would be grateful for some help.

I'm reviewing a dataset which I need to "clean" before analysis.  I
need a syntax method to do the following:

1.  Do a frequencies and descriptives command on ALL categorical and
continuous variables respectively that are present in the dataset..  If I
do Frequencies All or
Descriptives ALL, SPSS doesn't differentiate between categorical and
continuous variables.
[>>>Peck, Jon] With SPSS 14 and the Python Plug-In, you can do this easily.  Look at example3a in the PowerPoint presentation on the SPSS Developer Central website (www.spss.com/devcentral).  You can find this under the link "Directions Conference Presentation on Programmability" on the first page.  This example indeed runs the FREQUENCIES procedure on all the categorical variables and DESCRIPTIVES on all the scale variables.

2.  Identify outliers for continuous variables and then display them
with their case ID numbers so that I may go back and check.
[>>>Peck, Jon]
The Identify Unusual Cases dialog (ANOMALYDETECTION procedure), which is in the Data Validation option introduced in SPSS 14, gives you a wealth of information like this.
That options also contains the VALIDATE DATA procedure, which checks data values against rules you define to detect illegal values.

HTH,
Jon Peck
SPSS

One syntax
suggested to me is as follows for variable age.  However, doing this
repeatedly for dozens of variables is very tedious and suggestions for
an effecient way would be a great help.  Furthermore this syntax does
NOT list which cases are the outliers

DESCRIPTIVES
   VARIABLES=age  /SAVE
   /STATISTICS=MEAN STDDEV RANGE MIN MAX .

compute outlier1=abs(zage).
execute.

recode
outlier1 (3.29 thru highest=4)(2.58 thru highest=3)(1.96 thru
highest=2)(lowest thru 2=1).
exe.

value labels outlier1
1'Absolute z score < 2' 2 'Absolute z zcore > 1.96' 3 'Absolute z score

 > 2.58' 4 'Absolute z score > 3.29'.

Frequencies
Variables=outlier1

Once again, many thanks for the help

Fahim H. Jafary, MD
Aga Khan University Hospital
Karachi, Pakistan.
Reply | Threaded
Open this post in threaded view
|

Re: Syntax to Clean Data - Needs Help

Richard Ristow
In reply to this post by Fahim Jafary
At 03:58 PM 8/31/2006, Fahim H. Jafary wrote:

>I'm reviewing a dataset which I need to "clean" before analysis.  I
>need a syntax method to do the following:
[...]

>2.  Identify outliers for continuous variables and then display them
>with their case ID numbers so that I may go back and check.

To start with, there's been a lot of recent discussion about the
meaning of 'outliers'. Review it if possible; in any case, think
carefully before classifying a case as an 'outlier' by any automatic
criterion.

On the other hand, checking end-of-range cases for errors is a useful
technique.

>   One syntax suggested to me is as follows for variable
> age.  However, doing this repeatedly for dozens of variables is very
> tedious and suggestions for an effecient way would be a great
> help.  Furthermore this syntax does NOT list which cases are the
> outliers
>
>DESCRIPTIVES
>   VARIABLES=age  /SAVE
>   /STATISTICS=MEAN STDDEV RANGE MIN MAX .
>
>compute outlier1=abs(zage).
>
>recode
>  outlier1 (3.29 thru highest=4)(2.58 thru highest=3)
>           (1.96 thru highest=2)(lowest thru 2=1).
>
>value labels outlier1
>           1 'Absolute z score < 2'
>           2 'Absolute z zcore > 1.96'
>           3 'Absolute z score > 2.58'
>           4 'Absolute z score > 3.29'.

(I've taken out the "execute." and "exe." statements, which aren't
necessary and can slow processing if your file is large.)

You can list the cases with z scores above 3.29 by replacing

>Frequencies
>Variables=outlier1

by

TEMPORARY.
SELECT IF ABS(ZAGE) > 3.29.
LIST CASE_ID AGE ZAGE.

Now, the problem is the dozens of variables. First, you need a list of
your continuous variables. Gene Maguin and Jon Peck have both made
suggestions; but you may find it's as easy to make the list by hand.

Then, you want to select the extreme values on all of those dozens of
variables, and list. It's trickier to do that, nicely. I suggest
separating your variables, so each variable is in a separate SPSS case.
That is, instead of one record for each CASE_ID, create one record for
each variable for each CASE_ID; many more records.

Something like this, but it's not tested. Where I write "<continuous
variables>", replace it all (including angle brackets) by your list of
continuous variables.

*  Write all the variable values to separate records  .
*  -------------------------------------------------  .

VARSTOCASES
    MAKE VALUE 'Value of the continuous variable'
    FROM <continuous variables>
   /INDEX = VBL_NAME 'Name of source variable' (VALUE).


*  Group the file by the original variables           .
*  -------------------------------------------------  .

SORT CASES BY VBL_NAME.
SPLIT FILE BY VBL_NAME.


*  Calculate statistics to judge 'outliers'           .
*  -------------------------------------------------  .

*  Save the Z-scores; note use of "Z_SCORE" in        .
*  parentheses, instead of "SAVE" subcommand.         .

DESCRIPTIVES
   VARIABLES=VALUE (Z_SCORE)
   /STATISTICS=MEAN STDDEV MIN MAX .

*  A supplement or alternative to Z-scores:           .
*  Save percentiles scores                            .

RANK
    VARIABLES=VALUE
   /PERCENT  =PCNTILE.

*  Re-group by original case, select and list         .
*  -------------------------------------------------  .

SPLIT FILE OFF.
SORT  CASES BY CASE_ID VBL_NAME.

*  List values with Z-scores >3.29                    .
TEMPORARY.
SELECT IF ABS(Z_SCORE) > 3.29.
LIST CASE_ID VBL_NAME VALUE Z_SCORE PCNTILE.

*  List values in the top or bottom 5%                .
TEMPORARY.
SELECT IF PCTILE LE 5 OR PCTILE GE 95.
LIST CASE_ID VBL_NAME VALUE Z_SCORE PCNTILE.