Importance of Indep vars in Classification trees

Importance of Indep vars in Classification trees

Mark Webb-3
In the CRT growing method it's possible to rank the independent predictors by importance to the model.
But the top-ranked predictor is NOT necessarily the same as the first split predictor.
I'm new to trees and would appreciate an explanation for this. Is importance determined on the whole model rather than by the order of the splits?
If this reasoning is correct, then one can't comment on the importance of predictors when other growing methods are used. So if CHAID is used, it's not correct to say that the first splitting variable is the most important/influential.
Can someone clear this up for me?

Regards
Mark

Re: Importance of Indep vars in Classification trees

Antro, Mark
Hi Mark,

You can request an "Importance to Model" table and chart for independent variables (only for CRT) from the Output sub-dialog box.

The Measure of Importance M(X) of an independent variable X in relation to the final tree T is defined as the (weighted) sum, across all nodes in the tree, of the improvements that X has when it is used as a primary or surrogate splitter. The independent variable's weight at each split depends on whether it was the primary splitter (the independent variable on which the parent node was split, where the weight is 1) or a surrogate (where the weight depends on the independent variable's ranking as a surrogate). If 5 surrogates are allowed, the first surrogate for a split has the highest weight among the surrogates and the fifth surrogate has the lowest weight. This weighted sum is reported in the Importance column of the "Independent Variable Importance" table in the Tree output. The "Normalized Importance" for a variable is defined as

Normalised M(X) = 100 * M(X) / Maximum Importance

Thus the most important predictor has a normalised importance of 100. The measure is taken from Breiman et al. (1984).
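To make the definition concrete, here is a minimal sketch in Python of the weighted sum and the normalisation described above. The tree, the improvement values, and the surrogate weights are all made-up illustrative numbers (the thread does not say which surrogate weights SPSS actually uses), and the names `splits`, `importance`, and `normalised` are hypothetical.

```python
from collections import defaultdict

# Hypothetical fitted tree: one entry per split node.
# Each entry lists (variable, improvement, weight) for the primary
# splitter (weight 1) and any surrogates (illustrative weights below 1).
splits = [
    # Root node: primary split on "age", with "income" as first surrogate.
    [("age", 0.10, 1.0), ("income", 0.08, 0.9)],
    # Second-level node: primary split on "income", "region" as surrogate.
    [("income", 0.07, 1.0), ("region", 0.02, 0.5)],
    # Another second-level node: primary split on "income" again.
    [("income", 0.05, 1.0)],
]

def importance(splits):
    """M(X): weighted sum of improvements over every node where X appears."""
    m = defaultdict(float)
    for node in splits:
        for var, improvement, weight in node:
            m[var] += weight * improvement
    return dict(m)

def normalised(m):
    """Normalised M(X) = 100 * M(X) / maximum importance."""
    top = max(m.values())
    return {var: 100 * value / top for var, value in m.items()}

m = importance(splits)
norm = normalised(m)
ranking = sorted(norm, key=norm.get, reverse=True)

# "income" ranks first although the root split is on "age".
print(ranking)
```

In this toy tree the root split is on `age`, yet `income` comes out on top because it collects improvement at several nodes, both as a primary splitter and as a surrogate. That is one way the top-ranked predictor can differ from the first split predictor.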

You can't compare this to CHAID, since CHAID doesn't use surrogate variables (variables used when you have a missing value for the actual split variable). CHAID simply treats missing values as an extra category.

Hth

Antro.

Re: Importance of Indep vars in Classification trees

Mark Webb-3
Mark, thanks for the reply. Can I ask a few more questions?

I work with market research data that doesn't have many missing values, so
this issue is low on my priorities, but thanks for the pointers.

What I'm after is what some call Key Driver Analysis: which vars are
drivers/predictors?

In the Notes section of the output, the input independent vars are listed as
well as those used in the model.
Can I read anything into this? I.e., I'm assuming they are the significant
ones, but does the order imply anything? It's not necessarily the same as
the input order.

Thanks



Re: Importance of Indep vars in Classification trees

Antro, Mark
Hi Mark,

As far as I know, there is no order to the list of variables in the Model Summary table. It merely lists those which are significant and which have been used to build the tree. Whilst the CRT method gives you the Importance to Model table, there is no equivalent for CHAID (or Exhaustive CHAID).

You could pick the variables used in the top levels of the tree and use regression to assess their importance using the standardised beta values. You could use the Enter or Stepwise methods.
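A rough sketch of the standardised-beta idea in plain Python, for anyone who wants to see the arithmetic outside SPSS. It assumes numeric predictors and a numeric target, enters all predictors at once (no stepwise selection), and uses invented data; `quality`, `price`, and the helper names are hypothetical. Standardised betas are obtained here by z-scoring every variable and solving the normal equations.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardise a list to mean 0 and standard deviation 1."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def standardised_betas(predictors, target):
    """Regress the z-scored target on the z-scored predictors."""
    names = list(predictors)
    z = {name: zscores(predictors[name]) for name in names}
    zy = zscores(target)
    n = len(target)

    def corr(u, v):
        # Mean cross-product of two z-scored lists, i.e. their correlation.
        return sum(p * q for p, q in zip(u, v)) / n

    xtx = [[corr(z[a], z[b]) for b in names] for a in names]
    xty = [corr(z[a], zy) for a in names]
    return dict(zip(names, solve(xtx, xty)))

# Invented example data: the target depends more strongly on "quality".
predictors = {
    "quality": [1, 2, 3, 4, 5, 6],
    "price": [2, 1, 4, 3, 6, 5],
}
target = [2 * q + 0.5 * p
          for q, p in zip(predictors["quality"], predictors["price"])]

betas = standardised_betas(predictors, target)
print(betas)
```

Because both invented predictors have the same spread, the ratio of their standardised betas simply reflects the ratio of the raw coefficients used to build the target. With real survey data, correlated predictors can make such betas unstable, so treat the ranking as indicative only.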

Rgds,
Antro.

Re: Importance of Indep vars in Classification trees

Oliver, Richard
In reply to this post by Mark Webb-3
Yes, you can read something into the Model Summary (not Notes) table, because the list of variables used in the model indicates the variables that provided a significant contribution. I think you can interpret the order in which they're listed as a general indication of relative importance, although I now see that this does not always exactly match the "normalized" importance.


Re: Importance of Indep vars in Classification trees

Beadle, ViAnn
In reply to this post by Mark Webb-3
I'm no expert at this but have been playing around with Trees for a project I'm working on. I think that the order of variables in the summary table (not the Notes table, which just echoes the syntax) is set by the first time the variables appear in the tree.

Note also that a variable can appear as a split at multiple levels of the tree. That is, it might be the primary split and then, down at level 3, be a "sub-split" of the original split.

Based upon a quick perusal of articles googled on Key Driver Analysis, I'd say that there is not a one-to-one mapping between variable importance in CRT and what the articles call key drivers. It seems to me that key drivers are variables which differentiate outcomes on a single-variable basis and NOT within a multivariate model. I suppose you could do this by producing charts of the variables identified as important in the tree against the observed target values.
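As a sketch of the numbers that would sit behind such charts, the Python below computes the observed target rate per category for each predictor, one variable at a time. The survey records and the names `region`, `age_band`, and `recommends` are invented for illustration, and the "spread" between the highest and lowest category rate is just one simple, assumed way to summarise how strongly a single variable differentiates outcomes.

```python
from collections import defaultdict

# Invented survey records: two categorical predictors and a 0/1 target.
records = [
    {"region": "north", "age_band": "18-34", "recommends": 1},
    {"region": "north", "age_band": "35+", "recommends": 1},
    {"region": "north", "age_band": "18-34", "recommends": 1},
    {"region": "north", "age_band": "35+", "recommends": 0},
    {"region": "south", "age_band": "18-34", "recommends": 0},
    {"region": "south", "age_band": "35+", "recommends": 0},
    {"region": "south", "age_band": "18-34", "recommends": 0},
    {"region": "south", "age_band": "35+", "recommends": 1},
]

def target_rate_by_category(records, predictor, target):
    """Mean of the target within each category of one predictor."""
    totals = defaultdict(lambda: [0, 0])  # category -> [target sum, count]
    for row in records:
        cell = totals[row[predictor]]
        cell[0] += row[target]
        cell[1] += 1
    return {cat: total / count for cat, (total, count) in totals.items()}

def spread(rates):
    """Gap between the best and worst category: a crude driver score."""
    return max(rates.values()) - min(rates.values())

region_rates = target_rate_by_category(records, "region", "recommends")
age_rates = target_rate_by_category(records, "age_band", "recommends")

for name, rates in [("region", region_rates), ("age_band", age_rates)]:
    print(name, rates, "spread:", spread(rates))
```

Each dictionary of rates could be plotted as a bar chart per predictor. Note that this looks at one variable at a time, so it can disagree with the tree's importance table, which reflects the whole multivariate model.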


Re: Importance of Indep vars in Classification trees

Oliver, Richard
In reply to this post by Mark Webb-3
Oops. Never mind what I said about the order of the variables in that table. You should, however, be able to get some idea of relative importance by looking at the order in which predictors appear in the tree (unless you use the FORCE keyword to force the first predictor specified to the top).
