I'm trying to figure out how the improvement is calculated for a C&RT tree
in SPSS. For the following data set, I'm trying to predict ownership from
lot size and income.
Lot   Income  Ownership
18.8  33      N
20.4  43.2    N
16.4  47.4    N
17.6  49.2    N
14    51      N
22    51      O
20.8  52.8    N
16    59.4    N
18.4  60      O
20.8  61.5    O
14.8  63      N
17.2  64.8    N
21.6  64.8    O
18.4  66      N
20    69      O
19.6  75      N
20    81      O
22.4  82.8    O
17.6  84      N
16.8  85.5    O
23.6  87      O
20.8  93      O
17.6  108     O
19.2  110.1   O
Based on the first split (Income <= 59.7), I first calculated the Gini
impurity, I, for each side of the split: I(left) = 1 - (1/8)^2 - (7/8)^2 =
14/64 and I(right) = 1 - (11/16)^2 - (5/16)^2 = 110/256. The weighted
average of these is (8/24)I(left) + (16/24)I(right) = 23/64. Subtracting
that from the root node impurity, 1 - (12/24)^2 - (12/24)^2 = 1/2, gives an
improvement of 9/64 = 0.140625. This value is consistent with the one
listed by SPSS.
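The arithmetic for the first split can be checked with a short Python sketch. The class counts are tallied from the table above. The helper names gini and weighted_child_impurity are made up for this illustration and are not SPSS functions. Exact fractions are used so the result can be compared with 9/64 directly.

```python
from fractions import Fraction

def gini(counts):
    """Gini impurity of a node given its class counts, e.g. (n_N, n_O)."""
    total = sum(counts)
    return 1 - sum(Fraction(c, total) ** 2 for c in counts)

def weighted_child_impurity(children):
    """Average of the child impurities, weighted by child size."""
    n = sum(sum(c) for c in children)
    return sum(Fraction(sum(c), n) * gini(c) for c in children)

# Class counts as (N, O), tallied from the data table.
root = (12, 12)   # all 24 cases
left = (7, 1)     # Income <= 59.7
right = (5, 11)   # Income > 59.7

weighted = weighted_child_impurity([left, right])
improvement = gini(root) - weighted

print(weighted, improvement, float(improvement))  # 23/64 9/64 0.140625
```

The improvement here is the drop from the root impurity of 1/2 to the weighted child impurity of 23/64.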
However, at the next split (Lot <= 21.4), I tried to calculate the
improvement in the same fashion. Both child nodes are 100% pure, so I get a
Gini impurity of 0 for each, and I don't come up with SPSS's 0.073. How is
the 0.073 figure determined?
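One reading that would reproduce a figure near 0.073 is that the improvement is the impurity decrease inside the node being split, scaled by that node's share of all 24 cases. This is offered as an assumption, not as documented SPSS behavior. The Python sketch below checks only the arithmetic. The helper name gini and the variable names are made up for this illustration.

```python
from fractions import Fraction

def gini(counts):
    """Gini impurity of a node given its class counts, e.g. (n_N, n_O)."""
    total = sum(counts)
    return 1 - sum(Fraction(c, total) ** 2 for c in counts)

n_total = 24          # cases in the whole data set

# Class counts as (N, O) for the node being split and its two children.
parent = (7, 1)       # Income <= 59.7
low_lot = (7, 0)      # Income <= 59.7 and Lot <= 21.4 (pure N)
high_lot = (0, 1)     # Income <= 59.7 and Lot > 21.4 (pure O)

n_parent = sum(parent)
child_impurity = sum(
    Fraction(sum(c), n_parent) * gini(c) for c in (low_lot, high_lot)
)

# Decrease measured inside the parent node only.
within_node = gini(parent) - child_impurity

# ASSUMPTION: scale by the parent's share of the full sample.
scaled = Fraction(n_parent, n_total) * within_node

print(within_node, scaled, round(float(scaled), 3))  # 7/32 7/96 0.073
```

Under that assumption the children contribute 0, and the whole value comes from the parent node's own impurity: (8/24)(14/64) = 7/96.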
Can anyone help?