Draft of hints about data cleaning for distribution to others

Art Kendall
Below are some notes about data cleaning.
Any reactions, clarifications, enhancements?


When you have all of the data in the Data View, make sure that the Variable View is communicative to your audience. Use meaningful variable names.  Provide meaningful labels for each variable.  Define display formats that improve readability, e.g., no excess decimal places, date formats for dates, currency formats for money, etc.  Define which values indicate missing data.  Provide labels for the different missing value codes.  For variables where they apply, provide labels for values.  Fill in the level of measurement.

Have a cold reader look over your dictionary.

Use the menus to write syntax to find duplicate cases.  Even if the data is to be anonymized, be sure that each case has a unique ID so that users can report problems with the data.
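
For readers who want to see the logic of the duplicate-ID check outside SPSS, here is a minimal sketch in Python. The records and the variable name "id" are made up for illustration; this is not the syntax the menus generate.

```python
from collections import Counter

# Hypothetical records; "id" is an assumed name for the case identifier.
cases = [
    {"id": 101, "age": 34},
    {"id": 102, "age": 51},
    {"id": 101, "age": 34},
    {"id": 103, "age": 29},
]

def duplicate_ids(rows, key="id"):
    """Return the sorted ID values that appear on more than one case."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

dupes = duplicate_ids(cases)
print(dupes)
```

Any ID returned here would need to be resolved before the file is distributed.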

Use menus to write syntax to find unusual cases.

If you have string variables that should have a limited number of values, but the data was entered by people, use AUTORECODE to find variations in capitalization, misspellings, and spacing.
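
The idea behind this check can be sketched in Python: list the distinct raw values and group the ones that differ only in case or spacing. The values below are invented, and the normalization rule is an assumption. True misspellings will not collapse this way; they still have to be spotted by eye in the sorted list of distinct values.

```python
from collections import defaultdict

# Hypothetical hand-entered values for one string variable.
raw = ["Boston", "boston ", "BOSTON", "New  York", "new york", "Bostn"]

def normalize(text):
    # Lowercase and collapse runs of whitespace (also trims the ends).
    return " ".join(text.lower().split())

def variants_by_key(values):
    """Map each normalized value to its distinct raw spellings, keeping only
    those with more than one spelling."""
    groups = defaultdict(set)
    for v in values:
        groups[normalize(v)].add(v)
    return {k: sorted(vs) for k, vs in groups.items() if len(vs) > 1}

variants = variants_by_key(raw)
for key, spellings in variants.items():
    print(key, spellings)
```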

Use menus to write syntax to create data validation rules.


Write syntax for validation checks, e.g., to flag cases where the attitude items are all given the same answer.
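
A minimal Python sketch of the "same answer to every item" check, assuming missing answers are coded as None and that at least two answered items are needed before a case is flagged. Both choices are mine, not a rule from SPSS.

```python
# Hypothetical responses to five attitude items; None marks a missing answer.
responses = {
    "case1": [3, 3, 3, 3, 3],
    "case2": [1, 4, 2, 5, 3],
    "case3": [5, None, 5, 5, 5],
    "case4": [2, None, None, None, None],
}

def is_straightlined(answers):
    """True if all non-missing answers are identical and at least two exist."""
    given = [a for a in answers if a is not None]
    return len(given) >= 2 and len(set(given)) == 1

flagged = sorted(cid for cid, ans in responses.items() if is_straightlined(ans))
print(flagged)
```

A flagged case is only a candidate for review; identical answers can be legitimate.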


Run frequencies and descriptives to make sure that values are legitimate, e.g., that there are no unlabelled or illegitimate codes; that heights, weights, IQs, etc. are reasonable; that there are no unusual gaps or heaps in the distributions, etc.
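
As a rough illustration of what to look for in the frequency table, this Python sketch counts each observed code and reports the ones that have no value label. The codes and labels are hypothetical.

```python
from collections import Counter

# Hypothetical value labels and observed codes for one categorical variable.
value_labels = {1: "Male", 2: "Female", 9: "Missing"}
observed = [1, 2, 2, 1, 3, 9, 2, 22]

def unlabelled_codes(values, labels):
    """Return {code: count} for observed codes that have no value label."""
    freq = Counter(values)
    return {code: n for code, n in sorted(freq.items()) if code not in labels}

problems = unlabelled_codes(observed, value_labels)
print(problems)
```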


Run crosstabs, correlations, and visualizations such as parallel coordinate plots, rotatable 3D scatterplots, etc., to look for cases with values that are suspicious given the values of other variables, e.g., 72-inch-tall 9-year-olds, pregnant males, etc.
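
Cross-variable checks like these can also be written as explicit rules. The Python sketch below uses made-up variable names and cutoffs (the 72-inch and under-10 thresholds just mirror the examples above); real rules would come from subject-matter knowledge.

```python
# Hypothetical cases; variable names are assumptions for illustration.
cases = [
    {"id": 1, "sex": "M", "pregnant": False, "age": 35, "height_in": 70},
    {"id": 2, "sex": "M", "pregnant": True, "age": 28, "height_in": 68},
    {"id": 3, "sex": "F", "pregnant": False, "age": 9, "height_in": 72},
    {"id": 4, "sex": "F", "pregnant": True, "age": 31, "height_in": 64},
]

# Each rule is a (description, test) pair; the test returns True when suspicious.
rules = [
    ("pregnant male", lambda c: c["sex"] == "M" and c["pregnant"]),
    ("very tall child", lambda c: c["age"] < 10 and c["height_in"] >= 72),
]

def suspicious(rows, checks):
    """Return (id, rule description) for every rule that a case trips."""
    return [(c["id"], name) for c in rows for name, test in checks if test(c)]

hits = suspicious(cases, rules)
print(hits)
```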
 
If you have items for summative scales, use the RELIABILITY procedure to check on the scoring keys.
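
One sign of a scoring-key problem is an item that correlates negatively with the sum of the other items. The sketch below computes that corrected item-total correlation by hand in Python, on invented 1 to 5 responses where the last item is negatively worded and has not been reversed. It only illustrates the idea and does not reproduce the RELIABILITY output.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def corrected_item_total(rows):
    """Correlate each item with the sum of the remaining items."""
    k = len(rows[0])
    result = []
    for j in range(k):
        item = [r[j] for r in rows]
        rest = [sum(r) - r[j] for r in rows]
        result.append(pearson(item, rest))
    return result

# Hypothetical responses: one row per case, one column per item.
responses = [
    [5, 4, 5, 1],
    [4, 5, 4, 2],
    [2, 1, 2, 5],
    [1, 2, 1, 4],
    [3, 3, 3, 3],
]

r = corrected_item_total(responses)
suspect_items = [j for j, value in enumerate(r) if value < 0]
print(suspect_items)
```

An item flagged this way is a candidate for reverse scoring, to be confirmed against the item wording.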

 

YMMV, but in my experience, the vast majority of suspicious values are due to data entry errors.
-- 
Art Kendall
Social Research Consultants