Hello,
I was wondering if someone would be able to help me through a recode question. I have a question that asks respondents to report their parents income. The response categories are the following:

1 = Less than $14,999, 4.2%
2 = $15,000 - $24,999, 4.0%
3 = $25,000 - $34,999, 6.0%
4 = $35,000 - $49,999, 8.5%
5 = $50,000 - $74,999, 17.3%
6 = $75,000 - $99,999, 15.2%
7 = $100,000 - $199,999, 22.6%
8 = $200,000 - $299,999, 5.5%
9 = $300,000 or more, 5.1%
0 = Not applicable, 9.5%
MISSING = 2.0%

The overall goal is to be able to run a regression in which I don't lose any cases. Does anyone have any suggestions to do this? I greatly appreciate any thoughts or suggestions. I've thought about recoding into several dummy variables but I can't seem to eliminate the missing data from the parents income variable, which is the variable with the most variability.

I was also wondering if anyone knows of any standard in how to recode the largest response category in which no upper limit is specified. For example, response option 9 is for those whose parents earn $300,000 or more. Any suggestions as to how I should set this to a midpoint?

Again, thanks for any help that you provide in this matter.
Best,
Tricia Seifert
Postdoctoral Research Scholar
Center for Research on Undergraduate Education
University of Iowa
N438 Lindquist Center (physical address)
N491 Lindquist Center (mailing address)
Iowa City, IA 52242
319-335-5377
|
You could create a dummy called "Missing", which is 1 if income is missing and 0 otherwise. This way, you wouldn't lose any cases, although I don't know what value that may add to your model (unless there is a pattern for missing Income and losing these cases may affect other variables).

Dan

>From: Tricia Seifert <[hidden email]>
>Reply-To: Tricia Seifert <[hidden email]>
>To: [hidden email]
>Subject: RECODE question
>Date: Tue, 13 Mar 2007 10:37:05 -0500
|
I was just thinking that if you have other characteristics on the respondents, you may be able to set the missing income to a given value. That is, you could profile their other attributes to see if they resemble similar attributes in other income categories and, based on that, assign the missing values to one or more of the categories you have already established. My guess is also that if you choose dummies, then it may not matter whether you have an upper bound on the $300k income.

Fermin Ornelas, Ph.D.
Management Analyst III, AZ DES
Tel: (602) 542-5639
E-mail: [hidden email]

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Dan Zetu
Sent: Tuesday, March 13, 2007 9:22 AM
To: [hidden email]
Subject: Re: RECODE question
|
In reply to this post by Tricia Seifert
Tricia,
I'd like to give a somewhat different perspective on your two questions. It sounds like you are thinking of recoding the income categories to, maybe, the range midpoints. This leads to the question of what dollar value to give to $300K plus. (Let me point out that you have the same problem for the less than $15K category.)

I'd suggest that you evaluate whether income has a linear relationship with your DV(s). You could use Means if you have a continuous DV, or crosstabs or logistic, nominal, or ordinal regression if you have a categorical DV. (Someone more skillful than I can advise you about more sophisticated log-linear models.) It may be (or not) that you find that the 10 income categories can be collapsed to 3 or 4. Another thing you can do is to recode to category midpoints (and let the < $15K midpoint be $7.5K and the $300K plus midpoint be, for example, $500K), then add nonlinear terms (i.e., quadratic and/or cubic terms, centered to minimize collinearity) and test for linearity. If income is really important, then you should be concerned with nonlinearity.

Where to put the range midpoints for the upper and lower categories is a question of sensitivity. That is, if the relationship is sensitive to the range midpoint choice, then what you choose as the upper or lower recode value is very important. If, on the other hand, the relationship is not sensitive, then the choice has little significance. To evaluate sensitivity, I'd run my analyses with different values for the upper and lower recode points to see how much the relationship changes, for example with a choice of $5K or $10K for the lower recode point.

More importantly, I think you have two missing data problems and not one. How will you treat NA? What does an NA response mean? How is an NA response different from a missing response (does missing=refused)? I disagree with Dan Zetu's suggestion, which Cohen (and probably others) have also suggested, because it has been shown to lead to biased estimates.
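As an aside on the midpoint recode and the sensitivity check described above, here is a minimal sketch in Python rather than SPSS syntax. Values are in $1,000s. The closed-bracket midpoints come from the category bounds in the original question; the function names are hypothetical, and the top-category values of 400 and 750 are illustrative assumptions alongside the $500K mentioned above. NA (code 0) and missing (None) are left as None here, since how to treat them is a separate decision.

```python
# Midpoints (in $1,000s) for the closed brackets, taken from the category bounds.
CLOSED_MIDPOINTS = {2: 20.0, 3: 30.0, 4: 42.5, 5: 62.5, 6: 87.5, 7: 150.0, 8: 250.0}

def recode_to_midpoints(codes, low_value, high_value):
    """Map bracket codes 1-9 to dollar values; 0 (NA) and None stay None."""
    table = dict(CLOSED_MIDPOINTS)
    table[1] = low_value    # open-ended bottom category: analyst's choice
    table[9] = high_value   # open-ended top category: analyst's choice
    return [table.get(c) for c in codes]

def centered_terms(values):
    """Return (centered, centered squared) pairs, skipping None values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [(v - mean, (v - mean) ** 2) for v in observed]

# Recode the same responses under several endpoint choices.
codes = [1, 5, 9, 0, None, 7]
scenarios = {
    (low, high): recode_to_midpoints(codes, low, high)
    for low, high in [(5.0, 400.0), (7.5, 500.0), (10.0, 750.0)]
}
```

The model would then be rerun once per scenario and the income coefficients compared. If they barely move, the endpoint choice matters little; if they shift noticeably, it deserves more care.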
Secondly, I understand that income data is regarded as a 'nasty' kind of variable because it is believed that people refuse to say their income because of their income. This is known as nonignorable missingness (or nonignorable nonresponse), and there is literature on this topic.

If you want to/need to do something with the NA and missing=refused cases, then you could do as Fermin Ornelas suggested and identify the correlates of income, e.g., education. You could use the correlates of income level to 'impute' a value for income. There are a number of ways to do imputation, including the hot deck procedure, which I understand to be used by the census. There is a good sized literature on imputation.

Gene Maguin
|
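To make the imputation idea concrete, here is a rough sketch of one simple variant, a random within-class hot deck, in Python rather than SPSS syntax. It is not a description of the census procedure. The record layout, the field names `educ` and `income`, and the function name are all hypothetical. A missing income code is filled with the code of a randomly chosen donor who has the same education level, and a record with no available donor stays missing.

```python
import random

def hot_deck(records, key="educ", target="income", seed=0):
    """Fill missing `target` values with a value drawn from a random donor
    in the same `key` class. Records with no available donor stay missing."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    donors = {}
    for r in records:
        if r[target] is not None:
            donors.setdefault(r[key], []).append(r[target])
    imputed = []
    for r in records:
        new = dict(r)  # copy, so the input records are not modified
        if new[target] is None and donors.get(new[key]):
            new[target] = rng.choice(donors[new[key]])
        imputed.append(new)
    return imputed

records = [
    {"educ": "BA", "income": 6},
    {"educ": "BA", "income": 7},
    {"educ": "BA", "income": None},
    {"educ": "HS", "income": 3},
    {"educ": "HS", "income": None},
    {"educ": "PhD", "income": None},  # no donor in this class
]
filled = hot_deck(records)
```

A sketch like this assumes that, within an education class, the people who did not report income resemble those who did. If refusal depends on income itself, which is the nonignorable case described above, that assumption fails and the imputed values inherit the bias.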