In the CRT growing method it's possible to rank the independent predictors by importance to the model.
But the top-ranked predictor is NOT necessarily the same as the first split predictor. I'm new to trees and would appreciate an explanation for this. Is importance determined on the whole model rather than by the order of the splits? If this reasoning is correct, then one can't comment on the importance of predictors when other growing methods are used. So if CHAID is used, it's not correct to say that the first splitting variable is the most important/influential. Can someone clear this up for me?
Regards
Mark
Hi Mark,
You can request an "Importance to Model" table and chart for independent variables (only for CRT) from the Output sub-dialog box.

The Measure of Importance M(X) of an independent variable X in relation to the final tree T is defined as the weighted sum, across all nodes in the tree, of the improvements that X has when it is used as a primary or surrogate splitter. The weight of an independent variable at each split depends on whether it was the primary splitter (the independent variable on which the parent node was split, where the weight is 1) or a surrogate (where the weight depends on the independent variable's ranking as a surrogate). If 5 surrogates are allowed, the first surrogate for a split has the highest weight among the surrogates and the fifth surrogate has the lowest weight. This weighted sum is reported in the Importance column of the "Independent Variable Importance" table in the Tree output.

The "Normalized Importance" for a variable is defined as

Normalised M(X) = 100 * M(X) / Maximum Importance

Thus the most important predictor has a normalised importance of 100. The measure is taken from Breiman et al. (1984).

You can't compare this to CHAID, since CHAID doesn't use surrogate variables (variables used when you have a missing value for the actual split variable). CHAID simply treats missing values as an extra category.

Hth
Antro.
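As a rough illustration of the definition above, here is a minimal Python sketch of the weighted sum and its normalisation. It is not SPSS code or SPSS output: the variable names, improvement values, and surrogate weights are all invented for the example, and the way SPSS actually derives surrogate weights is not shown here. The point is only that a variable can top the importance table without being the first split, because its contributions are summed over every node where it acts as a primary or surrogate splitter.

```python
from collections import defaultdict

# Hypothetical split records: (node, variable, improvement, weight).
# Weight is 1.0 for the primary splitter at a node; the surrogate
# weights (0.9, 0.8) are made-up values used only for illustration.
splits = [
    (0, "age",    0.040, 1.0),  # primary splitter at the root (first split)
    (0, "income", 0.036, 0.9),  # first surrogate at the root
    (1, "income", 0.030, 1.0),  # primary splitter lower in the tree
    (2, "income", 0.020, 1.0),  # primary splitter lower in the tree
    (2, "region", 0.010, 0.8),  # surrogate lower in the tree
]

def importance(records):
    """M(X): weighted sum of improvements over all nodes where X is used."""
    totals = defaultdict(float)
    for _node, var, improvement, weight in records:
        totals[var] += weight * improvement
    return dict(totals)

def normalised(imp):
    """Normalised M(X) = 100 * M(X) / maximum importance."""
    top = max(imp.values())
    return {var: 100.0 * value / top for var, value in imp.items()}

m = importance(splits)
norm = normalised(m)

# Print the variables from most to least important.
for var in sorted(norm, key=norm.get, reverse=True):
    print(f"{var:8s} M(X)={m[var]:.4f}  normalised={norm[var]:.1f}")
```

In this toy tree "age" is the first split, but "income" ends up with the normalised importance of 100 because it contributes at three nodes.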
Mark, thanks for the reply. Can I ask a few more questions?
I work with market research data that doesn't have many missing values, so this issue is low on my priorities, but thanks for the pointers. What I'm after is what some call Key Driver Analysis: which vars are drivers/predictors? In the notes section of the output, the input independent vars are listed as well as those used in the model. Can I read anything into this? I'm assuming they are the significant ones, but does the order imply anything? It's not necessarily the same as the input order.
Thanks

----- Original Message -----
From: "Antro, Mark" <[hidden email]>
To: "Mark Webb" <[hidden email]>; <[hidden email]>
Sent: Friday, September 15, 2006 9:56 AM
Subject: RE: Importance of Indep vars in Classification trees
Hi Mark,
As far as I know, there is no order to the list of variables in the Model Summary table. It merely lists those that are significant and that have been used to build the tree. Whilst the CRT method gives you the Importance to Model table, there is no equivalent for CHAID (or Exhaustive CHAID). You could pick the variables used in the top levels of the tree and use regression to assess their importance using the standardised beta values. You could use enter or stepwise methods.
Rgds,
Antro.
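For anyone who wants to see what the standardised-beta suggestion amounts to, here is a minimal Python sketch. It assumes a numeric (scale) target and enters all predictors at once, roughly like the enter method; it is not SPSS syntax, and the variable names and simulated data are hypothetical. It uses the standard identity that the standardised betas are the solution of R b = r, where R is the correlation matrix of the predictors and r holds their correlations with the target.

```python
import random
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def standardised_betas(predictors, target):
    """Standardised regression coefficients for a dict of predictor columns."""
    names = list(predictors)
    R = [[pearson(predictors[a], predictors[b]) for b in names] for a in names]
    r = [pearson(predictors[a], target) for a in names]
    return dict(zip(names, solve(R, r)))

# Simulated data (hypothetical): "price" is built to matter more than "service".
random.seed(0)
n = 500
price = [random.gauss(0, 1) for _ in range(n)]
service = [random.gauss(0, 1) for _ in range(n)]
satisfaction = [2.0 * p + 0.5 * s + random.gauss(0, 1)
                for p, s in zip(price, service)]

betas = standardised_betas({"price": price, "service": service}, satisfaction)
for name, beta in sorted(betas.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {beta:.2f}")
```

Ranking predictors by the absolute size of these coefficients gives one view of relative influence, but it is a linear, whole-sample view and need not agree with what the tree shows in individual branches.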
In reply to this post by Mark Webb-3
Yes, you can read something into the Model Summary (not Notes) table, because the list of variables used in the model indicates the variables that provided a significant contribution. I think you can interpret the order in which they're listed as a general indication of relative importance, although I now see that this does not always exactly match the "normalized" importance.
-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Mark Webb
Sent: Friday, September 15, 2006 4:43 AM
To: [hidden email]
Subject: Re: Importance of Indep vars in Classification trees
In reply to this post by Mark Webb-3
I'm no expert at this but have been playing around with Trees for a project I'm working on.
I think that the order of variables in the summary table (not the Notes table, which just echoes the syntax) is set by the first time the variables appear in the tree. Note also that a variable can appear as a split at multiple levels of the tree. That is, it might be the primary split and then, down at level 3, be a "sub-split" of the original split.

Based upon a quick perusal of articles googled on Key Driver Analysis, I'd say that there is not a one-to-one mapping between variable importance in CRT and what the articles call key drivers. It seems to me that key drivers are variables which differentiate outcomes on a single-variable basis and NOT within a multivariate model. I suppose you could do this by producing charts of the variables identified as important in the tree against the observed target values.
In reply to this post by Mark Webb-3
Oops. Never mind what I said about the order of the variables in that table. You should, however, be able to get some idea of relative importance by looking at the order in which predictors appear in the tree (unless you use the FORCE keyword to force the first predictor specified to the top).
-----Original Message-----
From: Oliver, Richard
Sent: Friday, September 15, 2006 8:40 AM
To: 'Mark Webb'; [hidden email]
Subject: RE: Re: Importance of Indep vars in Classification trees