SPSSX Discussion

AnswerTree

Brock-15

AnswerTree

Hello Group,

We have had AnswerTree in the office for a while now, and I am trying to add it to our analytical toolbox. I read the book that I believe came with the software, and I have a few questions.

I have a decent background in statistics, but I typically resort to working through examples I see online to figure out the nitty-gritty. Anyway, the first example is the Iris Flower classification. The author talks about the "improvement" of .3333 at the first level. In plain English, what does this mean?

Second, I have seen that the gains/lift chart is relative to a target variable. What does this mean? If I were running this analysis for a Yes/No outcome, is it safe to assume that I would use Yes as the target if I was interested in people who bought a product? How would I go about analyzing these charts?

Third, why would I grow a tree myself if the automatic method optimizes classification?

Last, has anyone seen any examples online going through the process step-by-step and explaining how to analyze a tree and to determine which method best fits my data? I have seen some great examples for logistic regression, cluster, etc. where the website walks you through the analysis and shows you output. I was wondering if there were any resources discussing how to conduct this analysis.

As always, many many thanks in advance.

~ Brock

Anthony Babinec

Re: AnswerTree

1) You are describing the use of the C&RT approach on the Fisher iris data.

At the start, there are 50 flowers from each of 3 species. That is, the
beginning node at the top has counts 50:50:50. Since the groups are the same
size, with no knowledge of any other variable you can do no better in
classification than to assign all 150 flowers to one group, say, Group 1, in
which case you make 100 errors in assignment. As a fraction, that is 100/150
or .667. C&RT itself measures impurity with the Gini index, which for this
node is 1 - 3*(1/3)^2 = .667; here it happens to equal the error fraction.

C&RT searches all split points for all predictors and enters the predictor
that leads to the biggest drop in the Gini index after the split.
Splitting on a particular split point on the variable petal length leads to
the two "child" nodes

50:0:0 and 0:50:50

Notice that going to the "left," C&RT has isolated all 50 observations from
species 1, while going to the "right" still leaves species 2 and 3 mixed.
There are other split variables that can be applied to the right node to
more or less unmix species 2 and 3. Considering the above child nodes, if we
arbitrarily assign all of species 2 and 3 to group 2, we will make 50 errors
in assignment. As a fraction, that is 50/150, or .333.

In sum, before the split, the Gini index is 2/3, or about .667. After the
split, the case-weighted Gini of the two child nodes is
(50/150)*0 + (100/150)*.5 = .333. The improvement is 2/3 - 1/3 = .333. It is
this latter number that is displayed in the tree.
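
The arithmetic above can be checked with a few lines of Python. This is only
a sketch: the function names gini and improvement are made up for
illustration, and it assumes the displayed improvement is the parent node's
Gini index minus the case-weighted Gini index of the child nodes.

```python
def gini(counts):
    """Gini impurity of a node, given the count of cases in each class."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)


def improvement(parent, children):
    """Drop in Gini impurity when `parent` is split into `children`.

    Each child's impurity is weighted by its share of the parent's cases.
    """
    n = sum(parent)
    weighted = sum(sum(child) / n * gini(child) for child in children)
    return gini(parent) - weighted


root = [50, 50, 50]                     # 50 flowers of each species
left, right = [50, 0, 0], [0, 50, 50]   # child nodes after the first split

root_gini = gini(root)
split_improvement = improvement(root, [left, right])

print(round(root_gini, 4))                      # 0.6667
print(round(root_gini - split_improvement, 4))  # 0.3333, impurity after split
print(round(split_improvement, 4))              # 0.3333
```

The left node is pure (Gini 0) and the right node is an even two-way mix
(Gini .5), so all of the remaining impurity comes from the right node.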

2) The gains and lift charts are ways to place model results into a common
evaluation framework. In your application, you are interested in the "1"
values, that is, the Yes category. Imagine a model that predicts the chance
of being a "1." These predicted values can be sorted high to low. A model
that is working well will sort the actual 1s to the top. Imagine a "pointer"
moving down the sorted predicted values. At any point, there are some number
of predicted and actual 1s. You can plot the cumulative number of predicted
and actual 1s at all cutpoints to obtain a gains chart. Or, you might
consider the fraction of 1s in some top part of the sorted file relative to
the overall fraction of 1s in the file; that ratio is the lift.
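
As a rough illustration of the "pointer" idea, the sketch below sorts ten
cases by predicted score and reports cumulative gain and lift at each
cutpoint. The scores and outcomes are made-up toy values, and gains_table is
a hypothetical helper, not anything AnswerTree provides.

```python
def gains_table(scores, actual, n_bins=5):
    """Cumulative gain and lift at n_bins evenly spaced cutpoints.

    scores: predicted chance of being a 1 for each case
    actual: observed 0/1 outcome for each case
    Returns rows of (fraction of file covered, gain, lift).
    """
    # Sort the actual outcomes by predicted score, highest score first.
    ranked = [a for _, a in sorted(zip(scores, actual),
                                   key=lambda pair: pair[0], reverse=True)]
    total = len(ranked)
    total_ones = sum(ranked)
    base_rate = total_ones / total          # overall fraction of 1s
    rows = []
    for b in range(1, n_bins + 1):
        cut = round(total * b / n_bins)     # the "pointer" position
        hits = sum(ranked[:cut])            # actual 1s above the pointer
        gain = hits / total_ones            # share of all 1s captured so far
        lift = (hits / cut) / base_rate     # rate of 1s so far vs. overall
        rows.append((cut / total, gain, lift))
    return rows


# Toy data, invented for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

rows = gains_table(scores, actual)
for covered, gain, lift in rows:
    print(f"top {covered:.0%}: gain {gain:.2f}, lift {lift:.2f}")
```

With these toy values, the top 20% of the sorted file holds half of all the
1s, for a lift of 2.5; at 100% of the file the lift is 1 by construction.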

3) The automatic tree uses a heuristic. The so-called cross-validated
minimum risk tree is one tree to obtain automatically if you are interested
in classification. In addition, using trees in exploratory fashion can be
very informative. If you allow a single variable to enter a tree, how well
does it classify relative to the automatic tree? If you allow a pair of
variables in, how well does that tree do? At any given point in generating a
tree, when a particular variable enters, there are potentially other
"competitor" variables that might have entered at that step with almost the
same classification success. At a given point in a tree, a variable that
enters can "mask" another variable. These are some reasons to use AnswerTree
in interactive fashion.
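
The idea of "competitor" variables can be sketched in Python by finding the
best single split on each predictor and ranking the predictors by Gini
improvement. Everything here is an assumption made for illustration: the
predictor names, the toy values, and the helper functions. It is not how
AnswerTree is implemented, only a small instance of an exhaustive split
search.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(k) / n) ** 2 for k in set(labels))


def best_split(values, labels):
    """Best threshold on one predictor, as (improvement, threshold).

    Candidate thresholds are midpoints between adjacent distinct values.
    Returns (0.0, None) when no split reduces impurity.
    """
    n = len(labels)
    parent = gini(labels)
    best = (0.0, None)
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        gain = parent - weighted
        if gain > best[0]:
            best = (gain, t)
    return best


def rank_competitors(predictors, labels):
    """Rank predictors by the improvement of their best single split."""
    results = [(name, *best_split(col, labels))
               for name, col in predictors.items()]
    return sorted(results, key=lambda r: r[1], reverse=True)


# Toy data, invented for illustration only (1 = bought, 0 = did not buy).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
predictors = {
    "income": [1, 2, 3, 4, 5, 6, 7, 8],   # separates the classes cleanly
    "age":    [1, 2, 3, 5, 4, 6, 7, 8],   # nearly as good: the runner-up
    "noise":  [1, 2, 1, 2, 1, 2, 1, 2],   # unrelated to the outcome
}

ranking = rank_competitors(predictors, labels)
for name, gain, threshold in ranking:
    print(name, round(gain, 3), threshold)
```

An automatic tree would report only the winner at this node. Growing the
tree interactively lets you try the runner-up instead and see whether the
resulting tree classifies about as well.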

_____

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
[hidden email]
Sent: Tuesday, December 05, 2006 6:36 PM
To: [hidden email]
Subject: AnswerTree
