Hello Group,

We have had AnswerTree in the office for a while now, and I am trying to add it to our analytical toolbox. I read the book that I believe came with the software, and I have a few questions. I have a decent background in statistics, but I typically resort to working through examples I see online to figure out the nitty-gritty.

First, the opening example is the Iris flower classification. The author talks about the "improvement" of .3333 at the first level. In plain English, what does this mean?

Second, I have seen that the gains/lift chart is relative to a target variable. What does this mean? If I were running this analysis for a Yes/No outcome, is it safe to assume that I would use Yes as the target if I were interested in people who bought a product? How would I go about analyzing these charts?

Third, why would I grow a tree myself if the automatic method optimizes classification?

Last, has anyone seen any examples online that go through the process step by step, explaining how to analyze a tree and determine which method best fits my data? I have seen some great examples for logistic regression, cluster analysis, etc., where the website walks you through the analysis and shows you the output. I was wondering if there were any resources discussing how to conduct this analysis.

As always, many many thanks in advance.

~ Brock
1) You are describing the use of the C&RT approach on the Fisher iris data.

At the start, there are 50 flowers from each of 3 species. That is, the beginning node at the top has counts 50:50:50. Since the groups are the same size, with no knowledge of any other variable you can do no better in classification than to assign all 150 flowers to one group, say, Group 1, in which case you make 100 errors in assignment. As a fraction, that is 100/150 or .667. C&RT arrives at the same number through the Gini index, a measure of impurity: one minus the sum of the squared class proportions, here 1 - 3(1/3)^2 = .667. C&RT searches all split points for all predictors and enters the predictor that leads to the biggest drop in the Gini index after the split.

Splitting at a particular point on the variable petal length leads to the two "child" nodes 50:0:0 and 0:50:50. Notice that going to the "left," C&RT has isolated all 50 observations from species 1, while going to the "right" still leaves species 2 and 3 mixed. There are other split variables that can be applied to the right node to more or less unmix species 2 and 3.

Considering the above child nodes, if we arbitrarily assign all of species 2 and 3 to group 2, we will make 50 errors in assignment. As a fraction, that is 50/150 or .333. The Gini index agrees: the left node is pure (impurity 0), the right node has impurity 1 - 2(1/2)^2 = .5, and weighting each node by its share of the 150 flowers gives (100/150)(.5) = .333.

In sum, before the split, the Gini index is .667. After the split, it is .333. The improvement is .667 - .333 = .333. It is this latter number that is displayed in the tree.

2) The gains and lift charts are ways to place model results into a common evaluation framework. In your application, you are interested in the "1" (Yes) values, so that is the target category. Imagine a model that predicts the chance of being a "1." These predicted values can be sorted high to low. A model that is working well will sort the actual 1s to the top. Imagine a "pointer" moving down the sorted predicted values. At any point, there are some number of predicted and actual 1s above the pointer. You can plot the cumulative number of predicted and actual 1s at all cutpoints to obtain a gains chart.
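The two calculations described so far can be checked in a few lines. What follows is only a rough sketch in Python, not AnswerTree syntax: the function names are made up for illustration, and the scores and outcomes in the gains example are invented data. The first part reproduces the Gini arithmetic for the first iris split; the second moves the "pointer" down a list of cases sorted by predicted score and records the share of all actual 1s captured so far.

```python
def gini(counts):
    """Gini impurity of a node, given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def gini_improvement(parent, children):
    """Drop in Gini impurity when `parent` is split into `children`.

    Each child's impurity is weighted by its share of the parent's cases.
    """
    n = sum(parent)
    after = sum(sum(child) / n * gini(child) for child in children)
    return gini(parent) - after


def cumulative_gains(scores, actual):
    """Sort cases by score, high to low, and return the running share of
    all actual 1s captured at each position of the 'pointer'."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total = sum(actual)
    captured, gains = 0, []
    for i in order:
        captured += actual[i]
        gains.append(captured / total)
    return gains


# First iris split: root 50:50:50 into children 50:0:0 and 0:50:50.
root = [50, 50, 50]
split = [[50, 0, 0], [0, 50, 50]]
print(round(gini(root), 4))                     # impurity before the split
print(round(gini_improvement(root, split), 4))  # the "improvement"

# Invented scores and outcomes (1 = bought), purely for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
actual = [1, 1, 0, 1, 0, 0, 1, 0]
print(cumulative_gains(scores, actual))
```

With the invented data, half of the buyers are captured after the top two of eight cases, which is the kind of early concentration a useful model shows on a gains chart.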
For a lift chart, you would instead consider the fraction of 1s in some top-scored part of the dataset relative to the overall fraction of 1s in the file.

3) The automatic tree uses a heuristic. The so-called cross-validated minimum risk tree is one tree to obtain automatically if you are interested in classification. In addition, using trees in exploratory fashion can be very informative. If you allow a single variable to enter a tree, how well does it classify relative to the automatic tree? If you allow a pair of variables in, how well does that tree do? At any given point in generating a tree, when a particular variable enters, there are potentially other "competitor" variables that might have entered at that step with almost the same classification success. At a given point in a tree, a variable that enters can also "mask" another variable. These are some reasons to use AnswerTree in interactive fashion.

_____
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of [hidden email]
Sent: Tuesday, December 05, 2006 6:36 PM
To: [hidden email]
Subject: AnswerTree