Thanks to all that helped with my last problem!
I am currently using SPSS Classification Trees to try to segment a data file into Good and Bad risk (ideally I'd like to segment into 4 categories, but two is proving difficult enough). I have a list of accounts plus various fields and demographic variables. However, due to the customer base, the file is heavily weighted towards the Bad risk accounts (the split is around 85%:15%). When I run classification trees, the "best" node (I've selected "Good" accounts as my primary interest) is still heavily weighted towards bad payers (around 70%:30%), and overall SPSS does not predict any "Good" account holders. I'm tasked with finding the attributes of "Good" risk account holders.

When faced with this situation, what should a data analyst do?

1) Try to use a more equal sample? I'm not sure how to select an equal number of cases in SPSS without manually deleting records.
2) Try to find more variables? (I think I have all the ones that are available.)
3) Try a different approach? (I chose Classification Trees as I assumed this would be the easiest starting point and would identify key variables the quickest.)

As ever, any help is greatly appreciated.

JC.
Two comments:
1. If the 85/15 bad/good split is representative of the population of interest, then it's probably not a good idea to try to create a sample that is "more equal". If, however, you have reason to believe that the "bad" group is over-represented, you could weight the data to produce a more representative distribution.

2. Try reducing the minimum parent and child node sizes (the defaults are 100 and 50 respectively), and/or increasing the maximum tree depth (the default is 3 for CHAID, 5 for CRT and QUEST).

________________________________
From: SPSSX(r) Discussion on behalf of Cardiff Tyke
Sent: Sun 8/20/2006 11:18 AM
To: [hidden email]
Subject: Help With Classification Trees
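To make the weighting suggestion in the first comment concrete, here is a minimal sketch of how such case weights could be computed: each class gets a weight equal to its assumed population share divided by its share in the sample. The 85:15 sample split comes from the question; the 10,000-case file size, the 75:25 population split, and the function name `rebalancing_weights` are assumptions made up purely for illustration. This is Python arithmetic, not SPSS syntax.

```python
def rebalancing_weights(sample_counts, population_shares):
    """Return one case weight per class: population share / sample share."""
    total = sum(sample_counts.values())
    weights = {}
    for label, count in sample_counts.items():
        sample_share = count / total
        weights[label] = population_shares[label] / sample_share
    return weights


# 85:15 bad/good split as described in the question (file size is hypothetical).
sample_counts = {"bad": 8500, "good": 1500}

# ASSUMPTION: the true population is 75:25. This figure is illustrative only;
# it does not come from the thread.
population_shares = {"bad": 0.75, "good": 0.25}

weights = rebalancing_weights(sample_counts, population_shares)

# Weighted counts reproduce the assumed population split while keeping
# the total number of (weighted) cases unchanged.
weighted_counts = {
    label: sample_counts[label] * weights[label] for label in sample_counts
}

for label in sample_counts:
    print(label, round(weights[label], 4), round(weighted_counts[label], 1))
```

Under these assumed figures, "good" cases are weighted up and "bad" cases are weighted down. In SPSS, the resulting values would typically be stored in a variable and applied with `WEIGHT BY`. Whether weighting is appropriate at all depends, as noted above, on whether the 85/15 split really misrepresents the population.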