SPSSX Discussion

RECODE question

Tricia Seifert

RECODE question

Hello,
I was wondering if someone would be able to help me through a recode
question. I have a question that asks respondents to report their
parents' income. The response categories are the following:
1 = Less than $14,999, 4.2%
2 = $15,000 - $24,999, 4.0%
3 = $25,000 - $34,999, 6.0%
4 = $35,000 - $49,999, 8.5%
5 = $50,000 - $74,999, 17.3%
6 = $75,000 - $99,999, 15.2%
7 = $100,000 - $199,999, 22.6%
8 = $200,000 - $299,999, 5.5%
9 = $300,000 or more, 5.1%
0 = Not applicable, 9.5%
MISSING = 2.0%

The overall goal is to be able to run a regression in which I don't
lose any cases. Does anyone have any suggestions for how to do this? I
greatly appreciate any thoughts or suggestions. I've thought about
recoding into several dummy variables, but I can't seem to eliminate
the missing data from the parents' income variable, which is the
variable with the most variability.

I was also wondering if anyone knows of any standard in how to recode
the largest response category in which no upper limit is specified.
For example, response option 9 is for those whose parents earn
$300,000 or more. Any suggestions as to how I should set this to a
midpoint?

Again, thanks for any help that you provide in this matter.
Best,
Tricia Seifert
Postdoctoral Research Scholar
Center for Research on Undergraduate Education
University of Iowa
N438 Lindquist Center (physical address)
N491 Lindquist Center (mailing address)
Iowa City, IA 52242
319-335-5377

Dan Zetu

Re: RECODE question

You could create a dummy called "Missing", which is 1 if income is missing
and 0 otherwise. This way, you wouldn't lose any cases, although I don't
know what value that may add to your model (unless there is a pattern for
missing Income and losing these cases may affect other variables).
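
A minimal sketch of the idea, written in Python rather than SPSS syntax. The function name, the `inc_*` variable names, and the choice of category 5 as the reference group are all assumptions made for illustration; a missing response is represented as `None`.

```python
def income_dummies(code, reference=5):
    """Dummy-code one income response.

    code: 0-9 as in the survey question, or None when the response is missing.
    reference: category left out of the dummy set (an arbitrary choice here).
    Returns a dict of 0/1 indicators, including one for 'missing'.
    """
    # One indicator per observed category except the reference category.
    dummies = {f"inc_{k}": int(code == k) for k in range(10) if k != reference}
    # Extra indicator: 1 if income is missing, 0 otherwise.
    dummies["inc_missing"] = int(code is None)
    return dummies


# Hypothetical responses; every case gets a complete row of indicators.
rows = [income_dummies(c) for c in [5, None, 9, 0]]
```

Because a missing case gets zeros on every category dummy and a 1 on `inc_missing`, no case has to be dropped for lack of an income value.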

Dan

>From: Tricia Seifert <[hidden email]>
>Reply-To: Tricia Seifert <[hidden email]>
>To: [hidden email]
>Subject: RECODE question
>Date: Tue, 13 Mar 2007 10:37:05 -0500

Ornelas, Fermin

Re: RECODE question

I was just thinking that if you have other characteristics on the
respondents, you may be able to set the missing income to a given
value. That is, you could profile their other attributes to see whether
they resemble those of respondents in other income categories and, based
on that, assign the missing cases to one or more of the categories you
have already established.

My guess is also that if you choose dummies, then it may not matter
whether you have an upper bound on the $300K income category.

Fermin Ornelas, Ph.D.
Management Analyst III, AZ DES
Tel: (602) 542-5639
E-mail: [hidden email]
-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Dan Zetu
Sent: Tuesday, March 13, 2007 9:22 AM
To: [hidden email]
Subject: Re: RECODE question

Maguin, Eugene

Re: RECODE question

In reply to this post by Tricia Seifert

Tricia,

I'd like to give a somewhat different perspective on your two questions. It
sounds like you are thinking of recoding the income categories to, maybe, the
range midpoints. This leads to the question of what dollar value to give
to $300K plus. (Let me point out that you have the same problem for the less
than $15K category.) I'd suggest that you evaluate whether income has a linear
relationship with your DV(s). You could use Means if you have a continuous
DV or crosstabs/logistic, nominal, or ordinal regression if you have a
categorical DV. (Someone more skillful than I can advise you about more
sophisticated log-linear models.) You may (or may not) find that the
10 income categories can be collapsed to 3 or 4. Another thing you can do is
to recode to category midpoints (letting the < $15K midpoint be $7.5K and the
$300K plus midpoint be, for example, $500K) and then test for linearity by
including nonlinear terms (i.e., quadratic and/or cubic terms, centered
to minimize collinearity). If income is really important, then you should
be concerned with nonlinearity. Where to put the range midpoints for the
upper and lower categories is a question of sensitivity. That is, if the
relationship is sensitive to the range midpoint choice, then what you choose
as the upper or lower recode value is very important. If, on the other hand,
the relationship is not sensitive, then the choice has little significance.
To evaluate sensitivity, I'd run my analyses with different values for the
upper and lower recode points to see how much the relationship changes, for
example with a choice of $5K or $10K for the lower recode point.
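
The midpoint recode and the centering step might be sketched as follows, in Python rather than SPSS syntax. The midpoints for the bounded categories are rounded from the ranges in the original question; the values for the two open-ended categories are adjustable assumptions, and the function names are made up for this illustration. The regression itself is not shown.

```python
# Rounded midpoints for the bounded categories (codes 2-8).
BOUNDED_MIDPOINTS = {
    2: 20_000,    # $15,000 - $24,999
    3: 30_000,    # $25,000 - $34,999
    4: 42_500,    # $35,000 - $49,999
    5: 62_500,    # $50,000 - $74,999
    6: 87_500,    # $75,000 - $99,999
    7: 150_000,   # $100,000 - $199,999
    8: 250_000,   # $200,000 - $299,999
}


def recode_midpoint(code, low=7_500, high=500_000):
    """Map a category code to a dollar value.

    low and high are the assumed values for the open-ended bottom and top
    categories. Code 0 (not applicable) and missing responses map to None.
    """
    if code == 1:
        return low
    if code == 9:
        return high
    return BOUNDED_MIDPOINTS.get(code)


def centered_terms(values):
    """Center values at their mean; return (linear, quadratic) pairs."""
    mean = sum(values) / len(values)
    return [(v - mean, (v - mean) ** 2) for v in values]


# Sensitivity check: rebuild the predictor under different recode choices,
# then rerun the same model on each version and compare the estimates.
codes = [1, 4, 7, 9, 5]  # hypothetical responses
for low in (5_000, 10_000):
    income = [recode_midpoint(c, low=low) for c in codes]
    terms = centered_terms(income)
```

If the estimates barely move across the versions, the exact values chosen for the open-ended categories matter little.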

More importantly, I think you have two missing data problems and not one.
How will you treat NA? What does an NA response mean? How is an NA response
different from a missing response (does missing=refused)?

I disagree with Dan Zetu's suggestion, which Cohen (and probably others)
have suggested, because it has been shown to lead to biased estimates.
Secondly, I understand that income data is regarded as a 'nasty' kind of
variable because it is believed that people refuse to report their income
because of their income. This is known as nonignorable missingness (or
nonignorable nonresponse), and there is a literature on this topic.

If you want to or need to do something with the NA and missing=refused
cases, then you could, as Fermin Ornelas suggested, identify the correlates
of income, e.g., education. You could use the correlates of income level to
'impute' a value for income. There are a number of ways to do imputation,
including the hot deck procedure, which I understand to be used by the
census. There is a good-sized literature on imputation.
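
As a rough illustration only, here is one simple variant of hot deck imputation in Python: a missing income is replaced by the income of a randomly chosen respondent (a "donor") with the same value on a correlate. The field names `income` and `education`, the function name, and the sample records are hypothetical, and this is not meant to reproduce any particular agency's procedure.

```python
import random
from collections import defaultdict


def hot_deck_impute(records, target="income", by="education", seed=0):
    """Random within-class hot deck.

    records: list of dicts; a missing target value is None.
    Returns new records; a case with no donor in its class stays None.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible

    # Collect observed target values (donors) within each class.
    donors = defaultdict(list)
    for rec in records:
        if rec[target] is not None:
            donors[rec[by]].append(rec[target])

    imputed = []
    for rec in records:
        new = dict(rec)  # copy so the input records are left unchanged
        if new[target] is None and donors[new[by]]:
            new[target] = rng.choice(donors[new[by]])
        imputed.append(new)
    return imputed


sample = [
    {"education": "BA", "income": 6},
    {"education": "BA", "income": 7},
    {"education": "BA", "income": None},
    {"education": "HS", "income": 3},
    {"education": "PhD", "income": None},  # no donor in this class
]
result = hot_deck_impute(sample)
```

Classes defined by a single correlate are the simplest case; classes could equally be formed from several correlates combined.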

Gene Maguin