Draft of hints about data cleaning for distribution to others
Posted by Art Kendall on Mar 28, 2013; 9:47pm
URL: http://spssx-discussion.165.s1.nabble.com/Draft-of-hints-about-data-cleaning-for-distribution-to-others-tp5719156.html

Below are some notes about data cleaning. Any reactions, clarifications, or enhancements?

When you have all of the data in the Data View, make sure that the Variable View communicates clearly to your audience. Use meaningful variable names. Provide meaningful labels for each variable. Define display formats that improve readability, e.g., no excess decimal places, date formats for dates, currency formats for money, etc. Define which values indicate missing data. Provide labels for the different missing value codes. For variables where they apply, provide labels for values. Fill in the level of measurement. Have a cold reader look over your dictionary.
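
As a rough illustration of the dictionary steps, here is a minimal syntax sketch. The variable names, codes, labels, and data are all made up for the example; substitute your own.

```spss
* Hypothetical variables and data for illustration only.
DATA LIST LIST / id (F4.0) sex (F1.0) height_in (F5.2) income (F8.2) visit (ADATE10).
BEGIN DATA
1 1 70.00 52000.00 03/28/2013
2 2 64.50 -9 03/29/2013
3 9 61.00 38500.00 04/01/2013
END DATA.

* Meaningful labels for every variable.
VARIABLE LABELS
  id 'Respondent ID'
  /sex 'Sex of respondent'
  /height_in 'Height in inches'
  /income 'Annual household income'
  /visit 'Date of interview'.

* Label substantive values and the missing value codes, then declare the missing codes.
VALUE LABELS sex 1 'Male' 2 'Female' 9 'Not reported'
  /income -9 'Refused'.
MISSING VALUES sex (9) income (-9).

* Display formats: no excess decimals, currency for money.
FORMATS height_in (F4.1) income (DOLLAR10.2).

* Level of measurement.
VARIABLE LEVEL id sex (NOMINAL) /height_in income visit (SCALE).

* Print the dictionary so a cold reader can review it.
DISPLAY DICTIONARY.
```

The DISPLAY DICTIONARY output is a convenient thing to hand to the cold reader.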

Use the menus to write syntax to find duplicate cases. Even if the data is to be anonymized, be sure that each case has a unique ID so that users can report problems with the data.
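
The syntax the menus paste for duplicate cases is longer than this, but a stripped-down sketch of the same idea, assuming a hypothetical ID variable named id, could be:

```spss
* Hypothetical data: ID 102 appears twice.
DATA LIST LIST / id (F4.0) score (F3.0).
BEGIN DATA
101 12
102 15
102 15
103 9
END DATA.

* Sort by the ID, then let MATCH FILES mark the first and last case within each ID.
SORT CASES BY id (A).
MATCH FILES /FILE=* /BY id /FIRST=first_of_id /LAST=last_of_id.

* A case that is both first and last for its ID is unique; anything else is a duplicate.
COMPUTE dup_id = (first_of_id = 0 OR last_of_id = 0).
VARIABLE LABELS dup_id 'ID occurs more than once'.
VALUE LABELS dup_id 0 'Unique' 1 'Duplicated'.
EXECUTE.

FREQUENCIES VARIABLES=dup_id.

* List the offending cases for follow-up.
TEMPORARY.
SELECT IF dup_id = 1.
LIST.
```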

Use the menus to write syntax to find unusual cases.

If you have string variables that should have a limited number of values, but the data was entered by people, use AUTORECODE to find variations in capitalization, misspellings, and spacing.
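
A small sketch of the AUTORECODE approach, with a made-up string variable called city. Each distinct spelling gets its own code, so the table requested by /PRINT lists every variant that was typed in.

```spss
* Hypothetical data with inconsistent capitalization and a misspelling.
DATA LIST LIST / id (F4.0) city (A20).
BEGIN DATA
1 'Washington'
2 'washington'
3 'Washingtn'
4 'WASHINGTON'
5 'Baltimore'
END DATA.

* Every distinct string becomes a numeric code; PRINT shows old and new values.
AUTORECODE VARIABLES=city /INTO city_code /PRINT.
FREQUENCIES VARIABLES=city_code.

* Optional: collapse case and outer blanks first, so only real misspellings remain.
STRING city_clean (A20).
COMPUTE city_clean = UPCASE(LTRIM(RTRIM(city))).
AUTORECODE VARIABLES=city_clean /INTO city_clean_code /PRINT.
```

The second AUTORECODE is only a convenience; misspellings such as 'Washingtn' still have to be resolved by hand.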

Use the menus to write syntax to create data validation rules.

Write syntax for your own validation checks, e.g., whether attitude items are all given the same answer.
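
One possible way to write the same-answer check, assuming five hypothetical items att1 to att5 scored 1 to 5:

```spss
* Hypothetical data: cases 2 and 4 give the same answer to every item.
DATA LIST LIST / id att1 TO att5.
BEGIN DATA
1 4 2 5 3 4
2 3 3 3 3 3
3 1 2 1 2 2
4 5 5 5 5 5
END DATA.
FORMATS id (F4.0) att1 TO att5 (F1.0).

* Count how many items have a valid answer.
COUNT n_answered = att1 TO att5 (1 THRU 5).

* Flag cases that answered every item and whose lowest and highest answers are equal.
COMPUTE same_answer = (n_answered = 5 AND MIN(att1 TO att5) = MAX(att1 TO att5)).
VALUE LABELS same_answer 0 'Answers vary' 1 'All items identical'.
EXECUTE.

FREQUENCIES VARIABLES=same_answer.
TEMPORARY.
SELECT IF same_answer = 1.
LIST VARIABLES=id att1 TO att5.
```

A flagged case is not necessarily wrong, just worth a second look.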

Run frequencies and descriptives to make sure that values are legitimate, e.g., that there are no unlabelled or illegitimate codes; that heights, weights, IQs, etc. are reasonable; that there are no unusual gaps or heaps in the distributions, etc.
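
For instance, with made-up variables and deliberately bad values planted in the data:

```spss
* Hypothetical data: sex code 3 is unlabelled; height 7, weight 1400, and IQ 250 are implausible.
DATA LIST LIST / id sex height_in weight_lb iq.
BEGIN DATA
1 1 70 180 102
2 2 64 135 98
3 3 66 150 110
4 1 7 175 95
5 2 65 1400 250
END DATA.
FORMATS id sex height_in weight_lb iq (F5.0).
VALUE LABELS sex 1 'Male' 2 'Female'.

* Categorical variable: any code that shows up without a label deserves a look.
FREQUENCIES VARIABLES=sex.

* Scale variables: check the minimum and maximum against what is plausible.
DESCRIPTIVES VARIABLES=height_in weight_lb iq
  /STATISTICS=MEAN STDDEV MIN MAX.

* Histograms without the frequency tables, to look for gaps and heaps.
FREQUENCIES VARIABLES=height_in weight_lb iq
  /FORMAT=NOTABLE
  /HISTOGRAM.
```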

Run crosstabs, correlations, and visualizations such as parallel coordinate plots, rotatable 3D scatterplots, etc. to look for cases with values that are suspicious given the values of other variables, e.g., 72-inch-tall 9-year-olds, pregnant males, etc.
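
A sketch of this kind of cross-variable check, again with invented variable names, codes, and cutoffs:

```spss
* Hypothetical data: case 3 is a pregnant male, case 4 is a 72-inch 9-year-old.
DATA LIST LIST / id sex pregnant age height_in.
BEGIN DATA
1 1 0 34 70
2 2 1 29 64
3 1 1 41 69
4 2 0 9 72
5 1 0 9 52
END DATA.
FORMATS id sex pregnant age height_in (F4.0).
VALUE LABELS sex 1 'Male' 2 'Female' /pregnant 0 'No' 1 'Yes'.

* Impossible combinations show up as non-empty cells.
CROSSTABS /TABLES=sex BY pregnant.

* A simple scatterplot makes out-of-pattern cases stand out.
GRAPH /SCATTERPLOT(BIVAR)=age WITH height_in.

* Flag and list the specific combinations you consider suspicious.
COMPUTE suspect = ((sex = 1 AND pregnant = 1) OR (age <= 9 AND height_in >= 72)).
EXECUTE.
TEMPORARY.
SELECT IF suspect = 1.
LIST VARIABLES=id sex pregnant age height_in.
```

The cutoffs in the COMPUTE are assumptions for the example; choose ones that fit your population.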

If you have items for summative scales, use the RELIABILITY procedure to check on the scoring keys.
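
As a minimal sketch, assuming four hypothetical items on a 1 to 5 scale where item3 is worded in the opposite direction:

```spss
* Hypothetical data: item3 runs opposite to the other items.
DATA LIST LIST / id item1 TO item4.
BEGIN DATA
1 5 4 1 5
2 4 4 2 4
3 2 1 5 2
4 1 2 4 1
5 3 3 3 3
6 5 5 1 4
END DATA.
FORMATS id (F4.0) item1 TO item4 (F1.0).

* The item-total statistics are the part to read when checking the key.
RELIABILITY
  /VARIABLES=item1 TO item4
  /SCALE('Attitude scale') ALL
  /MODEL=ALPHA
  /SUMMARY=TOTAL.

* If an item turns out to be reverse keyed on a 1 to 5 scale, reflect it and rerun.
COMPUTE item3_r = 6 - item3.
RELIABILITY
  /VARIABLES=item1 item2 item3_r item4
  /SCALE('Attitude scale with item3 reflected') ALL
  /MODEL=ALPHA
  /SUMMARY=TOTAL.
```

In the first run, a negative corrected item-total correlation (as item3 should show in this made-up data) usually means the item was not reflected before scoring, or was keyed in the wrong direction.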

YMMV, but in my experience, the vast majority of suspicious values are due to data entry.
--
Art Kendall
Social Research Consultants