Hi Folks. New to SPSS and would be grateful for some help.
I'm reviewing a dataset which I need to "clean" before analysis. I need a syntax method to do the following: 1. Do a frequencies and descriptives command on ALL categorical and continuous variables respectively that are present in the dataset.. If I do Frequencies All or Descriptives ALL, SPSS doesn't differentiate between categorical and continuous variables. 2. Identify outliers for continuous variables and then display them with their case ID numbers so that I may go back and check. One syntax suggested to me is as follows for variable age. However, doing this repeatedly for dozens of variables is very tedious and suggestions for an effecient way would be a great help. Furthermore this syntax does NOT list which cases are the outliers DESCRIPTIVES VARIABLES=age /SAVE /STATISTICS=MEAN STDDEV RANGE MIN MAX . compute outlier1=abs(zage). execute. recode outlier1 (3.29 thru highest=4)(2.58 thru highest=3)(1.96 thru highest=2)(lowest thru 2=1). exe. value labels outlier1 1'Absolute z score < 2' 2 'Absolute z zcore > 1.96' 3 'Absolute z score > 2.58' 4 'Absolute z score > 3.29'. Frequencies Variables=outlier1 Once again, many thanks for the help Fahim H. Jafary, MD Aga Khan University Hospital Karachi, Pakistan. |
Fahim,
I think there is no easy way to do what you ask. With respect to item 1: I would do a Display dictionary command and copy or export the output to word, edit the table to delete columns, then sort on measurement level column and finally, copy the variable lists back into spss. It's tedious but easy. All that said, I do think that if you want to invest in learning either the Sax basic scripting language or, assuming you have version 14, the Python language, then you can do what you want directly using a combination of Sax basic or python language structures and spss syntax. I think others on this list have done and can advise you directly. With respect to item 2: I think you have to do two passes. In pass one, do your frequencies and then inspect the distribution for each variable and identify outliers. There was a discussion of outliers on the list a few days ago, which will be in the archives. However, my opinion is that outliers are distribution dependent. I don't think you can arbitrarily set a greater than 3 standard deviation rule. Anyway. Then, in pass two use a Do repeat structure to create an new variable that marks a case as having an outlier for that variable. Like this. Do repeat var=x1 to x10/ low=1.2 2.4 4.5 78 ..../ hi=5 5.6 12.3 345 .../ out=ol1 to ol10. + if (var le low or var ge hi) out=1. End repeat. The low and hi values define the beginning of the outlier region and are included in it. Once that is done, you can do a series of list commands, as in Temporary. Select if (ol1 eq 1). List case x1. You'll need one for every variable if you want to see just the outlier cases for that variable. However, what would be more compact would be to run a list or print command inside of a do repeat structure, which unfortunately won't work. What I think will work is a macro. I don't use macros; so I can't advise you. They are documented in the syntax manual and you may be able to modify a example. Otherwise, I know there are others on the list who are skilled with macros. Gene Maguin |
In reply to this post by Fahim Jafary
-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Fahim H. Jafary Sent: Thursday, August 31, 2006 2:58 PM To: [hidden email] Subject: [SPSSX-L] Syntax to Clean Data - Needs Help Hi Folks. New to SPSS and would be grateful for some help. I'm reviewing a dataset which I need to "clean" before analysis. I need a syntax method to do the following: 1. Do a frequencies and descriptives command on ALL categorical and continuous variables respectively that are present in the dataset.. If I do Frequencies All or Descriptives ALL, SPSS doesn't differentiate between categorical and continuous variables. [>>>Peck, Jon] With SPSS 14 and the Python Plug-In, you can do this easily. Look at example3a in the PowerPoint presentation on the SPSS Developer Central website (www.spss.com/devcentral). You can find this under the link "Directions Conference Presentation on Programmability" on the first page. This example indeed runs the FREQUENCIES procedure on all the categorical variables and DESCRIPTIVES on all the scale variables. 2. Identify outliers for continuous variables and then display them with their case ID numbers so that I may go back and check. [>>>Peck, Jon] The Identify Unusual Cases dialog (ANOMALYDETECTION procedure), which is in the Data Validation option introduced in SPSS 14, gives you a wealth of information like this. That options also contains the VALIDATE DATA procedure, which checks data values against rules you define to detect illegal values. HTH, Jon Peck SPSS One syntax suggested to me is as follows for variable age. However, doing this repeatedly for dozens of variables is very tedious and suggestions for an effecient way would be a great help. Furthermore this syntax does NOT list which cases are the outliers DESCRIPTIVES VARIABLES=age /SAVE /STATISTICS=MEAN STDDEV RANGE MIN MAX . compute outlier1=abs(zage). execute. recode outlier1 (3.29 thru highest=4)(2.58 thru highest=3)(1.96 thru highest=2)(lowest thru 2=1). exe. value labels outlier1 1'Absolute z score < 2' 2 'Absolute z zcore > 1.96' 3 'Absolute z score > 2.58' 4 'Absolute z score > 3.29'. Frequencies Variables=outlier1 Once again, many thanks for the help Fahim H. Jafary, MD Aga Khan University Hospital Karachi, Pakistan. |
In reply to this post by Fahim Jafary
At 03:58 PM 8/31/2006, Fahim H. Jafary wrote:
>I'm reviewing a dataset which I need to "clean" before analysis. I >need a syntax method to do the following: [...] >2. Identify outliers for continuous variables and then display them >with their case ID numbers so that I may go back and check. To start with, there's been a lot of recent discussion about the meaning of 'outliers'. Review it if possible; in any case, think carefully before classifying a case as an 'outlier' by any automatic criterion. On the other hand, checking end-of-range cases for errors is a useful technique. > One syntax suggested to me is as follows for variable > age. However, doing this repeatedly for dozens of variables is very > tedious and suggestions for an effecient way would be a great > help. Furthermore this syntax does NOT list which cases are the > outliers > >DESCRIPTIVES > VARIABLES=age /SAVE > /STATISTICS=MEAN STDDEV RANGE MIN MAX . > >compute outlier1=abs(zage). > >recode > outlier1 (3.29 thru highest=4)(2.58 thru highest=3) > (1.96 thru highest=2)(lowest thru 2=1). > >value labels outlier1 > 1 'Absolute z score < 2' > 2 'Absolute z zcore > 1.96' > 3 'Absolute z score > 2.58' > 4 'Absolute z score > 3.29'. (I've taken out the "execute." and "exe." statements, which aren't necessary and can slow processing if your file is large.) You can list the cases with z scores above 3.29 by replacing >Frequencies >Variables=outlier1 by TEMPORARY. SELECT IF ABS(ZAGE) > 3.29. LIST CASE_ID AGE ZAGE. Now, the problem is the dozens of variables. First, you need a list of your continuous variables. Gene Maguin and Jon Peck have both made suggestions; but you may find it's as easy to make the list by hand. Then, you want to select the extreme values on all of those dozens of variables, and list. It's trickier to do that, nicely. I suggest separating your variables, so each variable is in a separate SPSS case. That is, instead of one record for each CASE_ID, create one record for each variable for each CASE_ID; many more records. Something like this, but it's not tested. Where I write "<continuous variables>", replace it all (including angle brackets) by your list of continuous variables. * Write all the variable values to separate records . * ------------------------------------------------- . VARSTOCASES MAKE VALUE 'Value of the continuous variable' FROM <continuous variables> /INDEX = VBL_NAME 'Name of source variable' (VALUE). * Group the file by the original variables . * ------------------------------------------------- . SORT CASES BY VBL_NAME. SPLIT FILE BY VBL_NAME. * Calculate statistics to judge 'outliers' . * ------------------------------------------------- . * Save the Z-scores; note use of "Z_SCORE" in . * parentheses, instead of "SAVE" subcommand. . DESCRIPTIVES VARIABLES=VALUE (Z_SCORE) /STATISTICS=MEAN STDDEV MIN MAX . * A supplement or alternative to Z-scores: . * Save percentiles scores . RANK VARIABLES=VALUE /PERCENT =PCNTILE. * Re-group by original case, select and list . * ------------------------------------------------- . SPLIT FILE OFF. SORT CASES BY CASE_ID VBL_NAME. * List values with Z-scores >3.29 . TEMPORARY. SELECT IF ABS(Z_SCORE) > 3.29. LIST CASE_ID VBL_NAME VALUE Z_SCORE PCNTILE. * List values in the top or bottom 5% . TEMPORARY. SELECT IF PCTILE LE 5 OR PCTILE GE 95. LIST CASE_ID VBL_NAME VALUE Z_SCORE PCNTILE. |
Free forum by Nabble | Edit this page |