I have some questions about selecting appropriate variables to include in the Multiple Imputation model (SPSS 19). I have a large dataset with more than 1000 cases and around 3000 variables. I now want to impute missing values for 8 variables (5-40% missing values). I couldn't find much literature about which and how many variables to select, but what I found was:
a) you should include as many variables as possible in the model, b) include variables that are correlated with the imputed variable, c) include variables that are associated with the missingness of the imputed variable, and d) include variables that will be used in the analysis later. If I follow this advice I will have to include almost all variables, which is not possible. And theoretically it doesn't make sense to me to include all variables to predict different variables in the dataset. Thanks for any helpful advice. |
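Criteria b) and c) can be checked empirically before deciding what goes into the model. The syntax below is a minimal sketch, not a recommended procedure: support_t1_3 stands in for one of the 8 items, and age, grief_t1_mean and depression_t1_mean are hypothetical candidate predictors that do not come from the post.

```spss
* Hypothetical names throughout: support_t1_3 is an item to be imputed.
* miss_support is 1 where the item is system- or user-missing, 0 otherwise.
COMPUTE miss_support = MISSING(support_t1_3).
EXECUTE.

* Criterion b: correlation of the candidates with the item itself.
* Criterion c: correlation of the same candidates with the missingness indicator.
CORRELATIONS
  /VARIABLES=support_t1_3 miss_support WITH age grief_t1_mean depression_t1_mean
  /MISSING=PAIRWISE.
```

Candidates that correlate with neither the item nor its missingness indicator add little to the imputation model for that item, which is one way to keep the predictor list far below 3000 variables.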
What to do has a lot to do with the substantive
nature of your research.
Why are those values missing? Do they have distinct missing value codes? What role were those 8 variables intended to have in the analysis? What questions are you using the data to answer?
Are the variables with missing values items in scales? Are they some form of repeated measure, such as scale items, measures taken at different times, or points along a spectrum?
Which variables are plausibly related to missingness? Is the missingness correlated across the 8?
Why do you have so many variables? How did you get the set of cases? How did you choose the variables that you measured?
In reply to Pia's message of 5/13/2011 6:55 AM.
-- View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Multiple-Imputation-tp4392805p4392805.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.
=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L. For a list of commands to manage subscriptions, send the command INFO REFCARD.
Art Kendall
Social Research Consultants |
It is a longitudinal (5 time points) quasi-experimental research project testing and evaluating an intervention. The questionnaire was quite large, containing many different standardized psychological scales (e.g. Depression, Grief, Social Support) but also non-standardized questions. Our questions: do the interventions show any effect on these measures? We also want to do more in-depth modeling of risk and protective factors for our specific sample. We used different missing codes (e.g. not applicable, interviewer omission). The values we want to impute were mainly omitted by the interviewer, for different reasons. All of them are scale items at different time points (not all of them have more than 5% missing values at all time points). The analyses we are going to do are MANCOVA, repeated measures, and SEM. For the 8 different variables there are, of course, different correlated variables/predictors in the dataset. I don't think the missingness is correlated across the 8 variables (there are different reasons for missingness depending on the variable).
Hope that helps! |
The variables that are most likely to be well
correlated with the variable that has missing values are the other
items in that scale at that time.
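As a rough illustration of that point, the imputation model for a Grief item at T1 could be restricted to the other Grief items from T1. This is a sketch only: grief_t1_01 TO grief_t1_23 are hypothetical names that assume the 23 items sit next to each other in the file, and 5 imputations is simply the procedure's default, not a recommendation.

```spss
* Hypothetical names: grief_t1_01 TO grief_t1_23 are the 23 Grief items at T1.
* Every listed variable is both imputed (if it has missing values) and used as a predictor.
DATASET DECLARE grief_imputed.
MULTIPLE IMPUTATION grief_t1_01 TO grief_t1_23
  /IMPUTE METHOD=FCS NIMPUTATIONS=5
  /IMPUTATIONSUMMARIES MODELS
  /OUTFILE IMPUTATIONS=grief_imputed.
```

If the same cases are missing every item of a scale at a given time point, the other items of that scale carry no information for those cases, and predictors from other scales or time points would be needed instead.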
I don't understand "not all of them show more than 5% missings at all time points". Are the scale norms given to you as means or sums?
If you do something like this
    count k_missing = grieft2_17, depressionT3_22 ....(missing).
    frequencies vars = k_missing.
what do you get?
Would you please post a tiny table with 8 lines, one for each item, and 4 columns:
    scale_name  time  #items_asked  #cases_missing
In reply to Pia's message of 5/13/2011 8:29 AM.
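For the per-item counts requested in that table, one option is FREQUENCIES with the frequency tables suppressed, which leaves a Statistics table showing the number of valid and missing cases for each variable. The two names are the placeholders from the COUNT syntax; the remaining items are deliberately not filled in here.

```spss
* Placeholder item names taken from the COUNT example; add the other items as needed.
* FORMAT=NOTABLE suppresses the frequency tables and keeps the Statistics table.
FREQUENCIES VARIABLES=grieft2_17 depressionT3_22
  /FORMAT=NOTABLE.
```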
-- View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Multiple-Imputation-tp4392805p4392971.html
Art Kendall
Social Research Consultants |
scale_name      #items  % missing values
Prosocial       5       17% in only 1 item of T1
Grief           23      20% in all items of T1
Coping          32      5% in only 1 item of T1
PTSD            9       7-8% in all items of T1
Daily Problems  14      20-30% in only 1 item of T3, T4 and T5
Support         5       7-20% missing in T1, T2, T3, T4, T5
Assets          9       6, 10, 13 and 54% missing in 4 items of T1

You see, I have missing items at all 5 time points. "Not all of them show more than 5% missings at all time points" means that some items have too many missing values at only one time point, not at all 5 time points (see above). We compute scale sums and scale means, but will most likely analyze scale means. |