SPSSX Discussion

Using a predictor only once in a Tree

Kirill Orlov

Using a predictor only once in a Tree

This question (not mine) appeared on the Cross Validated forum (http://stats.stackexchange.com/questions/100287/spss-chaid-crt-query), and I found it interesting enough to repost here.

"...I want to know how do I set configuration so that a particular classification variable is used at only one level of the tree. E.g say at the first level the records were classified using Location [variable] then at no subsequent level of the tree, Location should be used. Currently, at the first level sub-trees are getting formed based on location, then on the next level sub-trees are getting formed using Business_Size [variable] but on the next level again Location variable is getting used for further classification..."

Indeed, why does TREE in SPSS lack an option to constrain a predictor to be used no more than once? Is there a reason, or is it a developer oversight?

Jon K Peck

Re: Using a predictor only once in a Tree

I can't think of any situation where this constraint would make sense. It seems inconsistent with the very idea of how a tree works, where the structure lower down (farther from the root) can vary from one subtree to another. And for C&RT trees, which tend to be deep and narrow, it would be an especially severe restriction. Do you have any examples where one would need to impose such a constraint?

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621

From: Kirill Orlov <[hidden email]>
To: [hidden email],
Date: 05/28/2014 12:43 AM
Subject: [SPSSX-L] Using a predictor only once in a Tree
Sent by: "SPSSX(r) Discussion" <[hidden email]>

Kirill Orlov

Re: Using a predictor only once in a Tree

On 28.05.2014 16:15, Jon K Peck wrote:

I can't think of any situation where this constraint would make sense. It seems inconsistent with the very idea of how a tree works, where the structure lower down (farther from the root) can vary from one subtree to another. And for C&RT trees, which tend to be deep and narrow, it would be an especially severe restriction. Do you have any examples where one would need to impose such a constraint?

Andy W

Re: Using a predictor only once in a Tree

In reply to this post by Jon K Peck

The original question on Cross Validated gave an example that I have seen a few times: clustering of geographic data. The way it is described in this example is in effect very restrictive, though, because it naively clusters the geography once and then grows the tree from those sub-geographies.

The way I've seen some work do it is to restrict the growing of the trees to observations within geographical proximity, so basically clustering multivariate data with the constraint that observations need to be nearby in space. In the geographical literature this is sometimes called "regionalization" (when working with aggregate units), and one of the original motivations was an empirical way to draw election districts. This lab at the University of South Carolina uses the logic extensively in a variety of their work: http://www.spatialdatamining.org/.
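To make the idea of clustering under a spatial constraint concrete, here is a minimal toy sketch in Python. It is not the method used by the lab mentioned above; it is just a greedy agglomerative merge that is only allowed to join clusters sharing at least one pair of adjacent units. The function name `regionalize`, the single-attribute distance (difference of cluster means), and the example data are all assumptions made for illustration.

```python
def regionalize(values, neighbors, k):
    """Greedy contiguity-constrained clustering (illustrative sketch only).

    values:    {unit_id: attribute value}
    neighbors: {unit_id: set of adjacent unit_ids}
    k:         number of regions to stop at
    """
    clusters = {u: {u} for u in values}  # cluster id -> member units

    def mean(cid):
        members = clusters[cid]
        return sum(values[u] for u in members) / len(members)

    while len(clusters) > k:
        best = None  # (distance, cluster a, cluster b)
        ids = sorted(clusters)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                # Contiguity constraint: some unit in a must border a unit in b.
                if not any(neighbors[u] & clusters[b] for u in clusters[a]):
                    continue
                d = abs(mean(a) - mean(b))
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best is None:
            break  # no adjacent clusters left to merge
        _, a, b = best
        clusters[a] |= clusters.pop(b)
    return sorted(sorted(c) for c in clusters.values())


# Six units on a line; units 0, 1, 4, 5 have similar values but 0-1 and 4-5
# are separated by the high-valued units 2 and 3.
values = {0: 1.0, 1: 1.1, 2: 5.0, 3: 5.2, 4: 1.05, 5: 0.9}
neighbors = {i: {j for j in (i - 1, i + 1) if j in values} for i in values}
print(regionalize(values, neighbors, 3))
```

In this toy data an unconstrained clustering on the attribute alone would be free to put units 0, 1, 4 and 5 together; the adjacency check prevents that because the two low-valued pairs never touch.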

My (very limited) experience is that geographic data tend to be so similar in space (Tobler's first law of geography!) that clustering without the geographical coordinates still tends to produce regions that are very close to geographically consistent. So the constraint is often a non-problem in practice (and the geographic outliers in a cluster tend to be informative anyway).

Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

Jon K Peck

Re: Using a predictor only once in a Tree

In reply to this post by Kirill Orlov

Please don't make assertions about what is easy to implement. TREE provides four different algorithms (CHAID, Exhaustive CHAID, CRT, and QUEST). The constraint would require bookkeeping in the algorithms that might or might not be easy to implement given the existing data structures, especially considering surrogates. And if subtrees were being built in parallel, this could introduce other complications. There is also a definitional question. Perhaps what is wanted is that no single subtree contains the same variable more than once, but it could also mean only one use of a variable across subtrees.

Without a use case, there is little chance that such a feature would be implemented. If there is a compelling use case, submitting it to [hidden email] would be the best way to get it considered.
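The two readings of the constraint distinguished above, and the bookkeeping each would need, can be sketched with a toy tree grower in Python. This is not how TREE works internally and it ignores surrogates entirely; it is a minimal greedy binary splitter on Gini impurity, and the names `grow`, `scope`, and `count_uses` are hypothetical. With `scope="path"` each branch carries its own copy of the used-predictor set, so a variable appears at most once on any root-to-leaf path; with `scope="global"` one shared set is mutated, so a variable appears at most once in the whole tree.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, target, allowed):
    """Best impurity-reducing binary threshold split, or None."""
    parent = gini([r[target] for r in rows])
    best = None  # (weighted impurity, predictor, threshold)
    for p in sorted(allowed):
        vals = sorted({r[p] for r in rows})
        for lo, hi in zip(vals, vals[1:]):
            t = (lo + hi) / 2
            left = [r[target] for r in rows if r[p] <= t]
            right = [r[target] for r in rows if r[p] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < parent - 1e-12 and (best is None or score < best[0]):
                best = (score, p, t)
    return best


def grow(rows, target, predictors, scope="none", used=None, max_depth=3):
    """scope: 'none' (unconstrained), 'path' or 'global' (see lead-in)."""
    used = set() if used is None else used
    labels = [r[target] for r in rows]
    allowed = [p for p in predictors if p not in used]
    split = None
    if max_depth > 0 and gini(labels) > 0:
        split = best_split(rows, target, allowed)
    if split is None:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, p, t = split
    if scope == "global":
        used.add(p)              # shared set: sibling subtrees see this too
        child_used = used
    elif scope == "path":
        child_used = used | {p}  # fresh copy: only descendants see this
    else:
        child_used = used        # never modified when unconstrained
    left = [r for r in rows if r[p] <= t]
    right = [r for r in rows if r[p] > t]
    return {"split": p, "threshold": t,
            "left": grow(left, target, predictors, scope, child_used, max_depth - 1),
            "right": grow(right, target, predictors, scope, child_used, max_depth - 1)}


def count_uses(tree, counts=None):
    """Count how many nodes split on each predictor."""
    counts = Counter() if counts is None else counts
    if "split" in tree:
        counts[tree["split"]] += 1
        count_uses(tree["left"], counts)
        count_uses(tree["right"], counts)
    return counts


# Toy data: the class depends on z within each half of the x range.
rows = [{"x": x, "z": z, "y": ("A", "B")[z] if x < 2 else ("C", "D")[z]}
        for x in range(4) for z in range(2)]
for scope in ("none", "path", "global"):
    print(scope, dict(count_uses(grow(rows, "y", ["x", "z"], scope=scope))))
```

On this toy data the path reading still lets z split both subtrees below the root, while the global reading lets only the subtree that is grown first claim z. That order dependence is one reason the global reading would interact badly with subtrees built in parallel.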

From: Kirill Orlov <[hidden email]>
To: [hidden email],
Date: 05/28/2014 06:40 AM
Subject: Re: [SPSSX-L] Using a predictor only once in a Tree
Sent by: "SPSSX(r) Discussion" <[hidden email]>

I agree that the constraint is rather odd for the classification tree (CT) idea. But I have myself seen people building or requesting such constrained ("may use a predictor only once") trees. Moreover, it seems that some other software has this feature. Since it is very easy to implement, wouldn't you consider logging it as a feature request?

On 28.05.2014 16:15, Jon K Peck wrote:
I can't think of any situation where this constraint would make sense. It seems inconsistent with the very idea of how a tree works, where the structure lower down (farther from the root) can vary from one subtree to another. And for C&RT trees, which tend to be deep and narrow, it would be an especially severe restriction. Do you have any examples where one would need to impose such a constraint?