random forest V best subset regression


random forest V best subset regression

Kornbrot, Diana
Apologies for cross-posting.
Dear All,
Please help with this conundrum.

I am attempting to predict p(overall satisfied) from p(feature i); there are 21 features, i.e. 21 values of i.
The following analyses were conducted for 53 data sets, 5 times each using different definitions of 'satisfied'.
Analysis 1: best subset regression
using SPSS Automatic Linear Modeling with the best-subsets option
output: accuracy (adjusted r-squared), coefficients, p-values, and importance, from which relative importance is calculated
coefficients are not in the same order as importance [which measures the effect of leaving a variable in or out]
only the most important coefficients are included, according to some algorithm; typically 4-8 are included
Analysis 2: random forest
using the SPSS R extension bundle RanFor Estimation
output: percentage of explained variance and variable importance for all variables, from which relative importance is calculated
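
For concreteness, here is a minimal R sketch of the two analyses outside SPSS (not the exact SPSS/RanFor code; the data frame dat, the outcome name overall, the feature names f1..f21, and the sample size are hypothetical stand-ins, and leaps::regsubsets plays the role of the best-subsets option):

# Hypothetical example data: 21 features f1..f21, outcome 'overall'
set.seed(1)
dat <- as.data.frame(matrix(runif(200 * 21), ncol = 21,
                            dimnames = list(NULL, paste0("f", 1:21))))
dat$overall <- 0.6 * dat$f1 + 0.3 * dat$f2 + rnorm(200, sd = 0.1)

# Analysis 1: best subset regression (subsets of up to 8 predictors)
library(leaps)
bs <- regsubsets(overall ~ ., data = dat, nvmax = 8)
summary(bs)$adjr2                  # adjusted R-squared by subset size

# Analysis 2: random forest with permutation importance
library(randomForest)
rf <- randomForest(overall ~ ., data = dat, ntree = 500, importance = TRUE)
rf                                 # prints % variance explained (out of bag)
importance(rf, type = 1)           # permutation importance for all 21 predictors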

Comparison of analyses
variance accounted for is similar for both analyses, varying from 50% to 94%, mostly around 89%
importance of predictors is NOT the same for both analyses
i.e. typically there are 2-3 predictors of high importance in both analyses, 2 of high importance in analysis 1 only, and 2 of high importance in analysis 2 only
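
One quick way to quantify that agreement, given the two importance vectors over the same 21 features (the values below are random placeholders, to be replaced with real ones):

# Hypothetical importance vectors from the two analyses
set.seed(3)
imp_bs <- setNames(runif(21), paste0("f", 1:21))   # best subset
imp_rf <- setNames(runif(21), paste0("f", 1:21))   # random forest

cor(imp_bs, imp_rf, method = "spearman")           # rank correlation
top5 <- function(x) names(sort(x, decreasing = TRUE))[1:5]
intersect(top5(imp_bs), top5(imp_rf))              # overlap of the top-5 features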

Question 1: is this degree of discrepancy to be expected?
Question 2: which analysis should I 'believe'?
Question 3: any useful references on comparing such methods would be much appreciated.

thanks for your help
best
Diana


_______________
Professor Diana Kornbrot
University of Hertfordshire
College Lane, Hatfield, Hertfordshire AL10 9AB, UK
+44 (0) 208 444 2081
+44 (0) 7403 18 16 12
+44 (0) 170 728 4626 
skype: kornbrotme






Re: random forest V best subset regression

Jon K Peck
Best subset and random forests are fundamentally different techniques, both in how they select variables and in what you get out of them. I am not at all surprised that the results could be quite different.

Best subset single-mindedly looks for the best-fitting model on a set of data (note that it is not exhaustive if there are a lot of variables), but it gives you a model result including (possibly overstated) significance.

randomForest builds 500 or so models in a way designed to minimize the risk of overfitting, but the individual models are, by design, fairly weak. You don't get the comfortable type of regression results you are used to, but you may get a better model for predictions (out-of-bag error) and some idea of which variables are important, though without quantifying the effects.

You might find it useful to take the variables found important by RF and run an ordinary regression with those as predictors.
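
A minimal sketch of that step, assuming the rf and dat objects from the R sketch in the original post (the top-5 cutoff is arbitrary):

# Keep the top predictors by permutation importance, then refit OLS
imp  <- importance(rf, type = 1)[, 1]
keep <- names(sort(imp, decreasing = TRUE))[1:5]
fit  <- lm(reformulate(keep, response = "overall"), data = dat)
summary(fit)     # note: p-values after selection are optimistic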


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621





Re: random forest V best subset regression

Kornbrot, Diana
Hi Jon and other SPSSers,

Jon's suggestion is very helpful; I will try running a regression with only the 'important' variables from RF.
Are there any guidelines or rules of thumb for the criterion of 'important': 80%? 20%?
The actual, as opposed to relative, importances vary enormously, depending on the variable metric and who knows what else, but NOT the variance accounted for:
the raw importance maximum is .005 in some data sets and 2800 in others, with very similar ~80% variance accounted for.

I notice that the importance diagrams almost always have a sharp cut-off somewhere, a big change in relative importance at some apparently arbitrary point. Maybe this is a feature of RF.
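
One way to put raw importances on a comparable scale, and to see that cut-off numerically, again assuming the rf object from the earlier sketch:

# Rescale permutation importance so the maximum is 1
imp <- sort(importance(rf, type = 1)[, 1], decreasing = TRUE)
rel <- imp / max(imp)              # relative importance in (0, 1]
diff(rel)                          # large negative jumps mark the cut-off
# importance() scales permutation importance by its standard error by
# default; pass scale = FALSE for the raw mean decrease in accuracy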

These are indeed very different techniques, but in some sense they address the same question:
if one wants to improve overall satisfaction, on which features should one concentrate or invest?

best

diana




_______________
Professor Diana Kornbrot
University of Hertfordshire
College Lane, Hatfield, Hertfordshire AL10 9AB, UK
+44 (0) 208 444 2081
+44 (0) 7403 18 16 12
+44 (0) 170 728 4626 
skype: kornbrotme






Re: random forest V best subset regression

Rich Ulrich
In reply to this post by Kornbrot, Diana
Perspective 1.
I wonder about "adjusted r-squared" for the regression. If the d.f. for the samples are small enough that the adjusted term is much different from the raw term, then the adjusted term is still much too large when stepwise uses only 4-8 out of 21 variables. Be aware that the program only "adjusts" for the number that are "stepped" into the equation. Conservative advice suggests that the formula should adjust for the total number of items that were available, since the first few will capitalize on chance by most of that total.

The formula says that the R^2-by-chance is p/(N-1); there are a couple of different formulas, but the simple one I used for adjusted R^2 subtracts that computed value from the observed and maps the residual between 0 and 1.
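
That adjustment, transcribed directly into R (this is the simple formula described above, not SPSS's own; the R^2, p, and N inputs below are made up):

# Chance-level R^2 is p/(N-1); subtract it and rescale to [0, 1]
adj_r2_chance <- function(r2, p, n) {
  chance <- p / (n - 1)
  (r2 - chance) / (1 - chance)
}
adj_r2_chance(0.89, p = 21, n = 100)   # adjust for all 21 candidates; hypothetical N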

Perspective 2.
The main validation for stepwise procedures is extensive cross-validation.
I have not used Random Forests; it appears to incorporate cross-validation as
part of the method.  To get a feel for it, I would want to split a sample in half
and see how two applications of the RF compare.
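
A minimal split-half check along those lines, assuming the dat from the sketch in the original post:

# Fit RF on each random half and compare the two importance rankings
set.seed(2)
idx <- sample(nrow(dat), nrow(dat) %/% 2)
rf1 <- randomForest(overall ~ ., data = dat[idx, ],  ntree = 500, importance = TRUE)
rf2 <- randomForest(overall ~ ., data = dat[-idx, ], ntree = 500, importance = TRUE)
cor(importance(rf1, type = 1)[, 1],
    importance(rf2, type = 1)[, 1], method = "spearman")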

Perspective 3.
If results on 53 data sets vary from 50% to 94%, that's crap for comparability. Does that reflect a failure to "adjust" the R^2 usefully? Does it reflect real differences between data sets? For instance, does low prediction occur where there is low variability in the outcome or low variability in the predictors? Whenever you report on these data, surely you should be ready to comment on that range.

The estimated R^2 of 89% (first note) or even 80% (second note) is high enough by itself to show over-fitting in the areas where I have fitted regressions, so it makes me want to know whether that overall achievement is reasonable.

--
Rich Ulrich




