Hi,
I've been tasked with creating a model to predict responses from cases over an 84-month period based on historical data (this data only dates back around 30 months). The responses are scale (normally 0-1000, but occasionally larger), and I have roughly 100 variables to use to try to predict the responses of newly added cases. I have a historical sample size of around 1 million cases.

I'm not sure where to start with the methodology for this project. Am I better served by creating a regression model for each month's response (1-30) and then performing a time series analysis for the remaining months? Or should I segment the cases first and then begin the analysis?

Any help would be appreciated before I'm swamped with it all!

Regards,
JC
Cardiff,
>>I've been tasked with creating a model to predict responses from cases over an 84 month period based on historical data (this data only dates back around 30 months). The responses are scale (normally 0-1000, but occasionally larger) and I have roughly 100 variables to use to try and predict the responses of new cases added. I have a historical sample size of around 1 million cases.

Perhaps others will understand exactly what you have in mind, but I don't. More information about the project would be useful, in addition to answers to some specific questions.

Maybe the first question concerns the specific design of the historical and ongoing datasets. Do the historical and ongoing datasets have the same design structure? If not, how do they differ? Were the 1 million cases assessed once (when?) or multiple times (if so, how many times?), at regular or irregular intervals? Do all 1 million have 30 months of follow-up?

What is it that you are trying to predict? What does the distribution of the response look like? Normal? J-shaped? Continuous or categorical?

That might do for starters. Please post replies back to the list for all to see.

Gene Maguin
In reply to this post by Cardiff Tyke
(1) To begin with, since you have one million cases, take random samples to build two data sets: one for development and a second for validation.
(2) Do some exploratory analysis to verify data integrity and accuracy (means, standard deviations, maximum and minimum values).
(3) Do graphical analysis to get an idea of how each of the predictors correlates with the response variable, and compute the correlations among the predictors.
(4) Select a reasonable number of predictors based on (1), (2) and (3), and attempt some initial model building. A modeling procedure such as stepwise variable selection would apply here. Other useful guides are PRESS, Mallows' C_p criterion, and MSE.

For the potentially useful models, you have to verify the basic modeling assumptions: normality, absence of outliers, no time effect, constant variance, and independence. This requires normal probability plots to check for normality, plots of residuals versus the predictors and versus the fitted response, tests for constant variance, and plots of residuals versus time to check for a time-sequence effect.

It is not clear what the purpose of the project is. It is from the purpose that one can get an idea of how to define or modify the response variable to suit the needs of the project. Moreover, depending on the problems found in the initial data analysis, transformations of the response variable or the predictors may be required to improve model estimation.

This list is intended only as a brief outline; you may want to check a textbook for a more formal description of the steps required for model building and variable selection.

Fermin Ornelas, Ph.D.
Management Analyst III, AZ DES
Tel: (602) 542-5639
E-mail: [hidden email]

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Cardiff Tyke
Sent: Monday, April 16, 2007 9:29 AM
To: [hidden email]
Subject: Help with model methodology
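
For readers who want something concrete, the first steps of the outline above (random development/validation split, descriptive checks, predictor-response correlation, and PRESS as a model-selection guide) might look roughly like the Python sketch below. It is only an illustration under loud assumptions: the data are synthetic stand-ins with a single predictor, every function name is hypothetical, and PRESS is computed with the leverage shortcut for simple (one-predictor) linear regression rather than for a full 100-variable model. It is not the method anyone in this thread actually used.

```python
import random
import statistics

def split_cases(cases, dev_fraction=0.5, seed=0):
    """Randomly split cases into a development set and a validation set."""
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]

def describe(values):
    """Mean, standard deviation, min and max, as a basic integrity check."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

def pearson(xs, ys):
    """Pearson correlation between a predictor and the response."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def fit_line(xs, ys):
    """Least-squares (intercept, slope) for a single predictor."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope

def press(xs, ys):
    """PRESS: sum of squared leave-one-out residuals, obtained from the
    ordinary residuals e_i and leverages h_i as (e_i / (1 - h_i))^2."""
    n = len(xs)
    mx = statistics.mean(xs)
    sxx = sum((x - mx) ** 2 for x in xs)
    intercept, slope = fit_line(xs, ys)
    total = 0.0
    for x, y in zip(xs, ys):
        leverage = 1.0 / n + (x - mx) ** 2 / sxx
        total += ((y - (intercept + slope * x)) / (1.0 - leverage)) ** 2
    return total

def mse(xs, ys, intercept, slope):
    """Mean squared prediction error of a fitted line on the given data."""
    return statistics.mean(
        (y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)
    )

# Synthetic stand-in data (assumption): one predictor, response roughly 0-1000.
data_rng = random.Random(42)
cases = []
for _ in range(2000):
    x = data_rng.uniform(0, 100)
    cases.append((x, 50 + 8 * x + data_rng.gauss(0, 60)))

dev, val = split_cases(cases, dev_fraction=0.5, seed=1)
dev_x, dev_y = [c[0] for c in dev], [c[1] for c in dev]
val_x, val_y = [c[0] for c in val], [c[1] for c in val]

print("development response summary:", describe(dev_y))
print("predictor-response correlation:", pearson(dev_x, dev_y))
intercept, slope = fit_line(dev_x, dev_y)
print("PRESS per development case:", press(dev_x, dev_y) / len(dev))
print("validation MSE:", mse(val_x, val_y, intercept, slope))
```

If PRESS per case on the development set and the MSE on the validation set are of similar size, the model is at least not grossly overfitted; a validation MSE that is much larger would be a warning sign. With many predictors, the same idea carries over, but the leverages come from the full hat matrix rather than the one-predictor formula used here.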