Apologies for cross-posting.

Dear All,

Please help with this conundrum. I am attempting to predict p(overall satisfied) from p(feature i), where there are 21 features, i.e. values of i. The following analyses were conducted for 53 data sets, five times each using different definitions of 'satisfied'.

Analysis 1: best subset regression using SPSS automatic linear modelling, with the best-subsets option.
Output: accuracy (adjusted R-squared); coefficients, p-values, and importance, from which relative importance is calculated. The coefficients are not in the same order as importance (which measures the effect of leaving each variable in or out). Only the most important coefficients are included, according to some algorithm; typically 4-8 are included.

Analysis 2: random forest in SPSS using the R extension bundle RanFor estimation.
Output: percentage of variance explained and variable importance for all variables, from which relative importance is calculated.

Comparison of analyses: the variance accounted for is similar for both analyses, varying from 50% to 94%, mostly around 89%. The importance of predictors is NOT the same for both analyses: typically there are 2-3 predictors of high importance in both analyses, 2 of high importance in analysis 1 only, and 2 of high importance in analysis 2 only.

Question 1: is this degree of discrepancy to be expected?
Question 2: which analysis should I 'believe'?
Question 3: any useful references on comparing such methods would be much appreciated.

Thanks for your help,
best
Diana

_______________
Professor Diana Kornbrot
University of Hertfordshire
College Lane, Hatfield, Hertfordshire AL10 9AB, UK
+44 (0) 208 444 2081
+44 (0) 7403 18 16 12
+44 (0) 170 728 4626
skype: kornbrotme
_______________
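For concreteness, analysis 1's best-subset search can be sketched as follows. This is only an illustrative Python sketch on invented data (the actual analysis used SPSS automatic linear modelling, and real implementations are not exhaustive when there are 21 candidate predictors): enumerate subsets, fit each by least squares, and keep the subset with the best adjusted R-squared.

```python
# Illustrative exhaustive best-subset selection scored by adjusted R^2.
# Data and variable names are invented; 6 features keep the search small.
from itertools import combinations
import numpy as np

def adjusted_r2(y, y_hat, p):
    """Textbook adjusted R^2 for p predictors (intercept not counted)."""
    n = len(y)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def best_subset(X, y, max_size=4):
    """Fit every subset up to max_size; return (best score, best columns)."""
    best = (-np.inf, None)
    n, k = X.shape
    for size in range(1, max_size + 1):
        for cols in combinations(range(k), size):
            Xc = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
            score = adjusted_r2(y, Xc @ beta, size)
            if score > best[0]:
                best = (score, cols)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 2 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=100)
score, cols = best_subset(X, y)
print(cols)  # columns 0 and 3 (the truly predictive ones) should be selected
```

Note the combinatorial cost: with 21 features there are over two million subsets, which is why, as discussed below, commercial implementations use a search heuristic rather than full enumeration.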
Best subset and random forests are fundamentally different techniques, both in how they select variables and in what you get out of them. I am not at all surprised that the results could be quite different.

Best subset single-mindedly looks for the best-fitting model on a set of data (note that it is not exhaustive if there are a lot of variables), and it gives you a model result including (possibly overstated) significance. randomForest builds 500 or so models in a way designed to minimize the risk of overfitting, but the individual models are, by design, fairly weak. You don't get the comfortable type of regression results you are used to, but you may get a better estimate of predictive performance (the out-of-bag error) and some idea of which variables are important, though without quantified effects.

You might find it useful to take the variables found important by RF and run an ordinary regression with those as predictors.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621

From: "Kornbrot, Diana" <[hidden email]>
Date: 07/22/2014 08:13 AM
Subject: [SPSSX-L] random forest V best subset regression
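Jon's two-step suggestion (rank with RF, then refit an ordinary regression on the top-ranked variables to recover interpretable coefficients) can be sketched as below. The thread's analyses run in SPSS's R extension, so this is only an illustrative Python sketch on invented data; the `importance` vector here is a crude correlation-based stand-in for whatever measure the RanFor bundle actually reports.

```python
# Sketch: keep the predictors a random forest ranks as important, then refit
# ordinary least squares on just those to get interpretable coefficients.
import numpy as np

def refit_on_important(X, y, importance, keep=4):
    """Refit OLS on the `keep` highest-importance columns of X."""
    top = np.argsort(importance)[::-1][:keep]        # indices of top predictors
    Xt = np.column_stack([np.ones(len(y)), X[:, top]])
    beta, *_ = np.linalg.lstsq(Xt, y, rcond=None)
    return top, beta                                 # beta[0] is the intercept

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 21))                       # 21 features, as in the thread
y = 3 * X[:, 2] + X[:, 7] + rng.normal(scale=0.3, size=200)
importance = np.abs(np.corrcoef(X.T, y)[:-1, -1])    # stand-in for RF importance
top, beta = refit_on_important(X, y, importance, keep=2)
print(sorted(top))  # [2, 7] -- the truly predictive features
```

One caveat worth keeping in mind: the p-values from such a refit are still optimistic, because the same data were used to choose the variables and to fit them.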
Hi Jon and other SPSSers,

Jon's suggestion is very helpful; I will try running a regression with only the 'important' variables from RF. Are there any guidelines or rules of thumb for the criterion of 'important': 80%? 20%?

The actual, as opposed to relative, importances vary enormously, depending on the variable metric and who knows what else, but the variance accounted for does NOT: the raw importance maximum is .005 in some data sets and 2800 in others, with very similar ~80% variance accounted for. I notice that the importance diagrams almost always have a sharp cut-off somewhere, i.e. a big change in relative importance at some apparently arbitrary point; maybe that is a feature of RF.

These are indeed very different techniques, but in some sense they address the same question: if one wants to improve overall satisfaction, on which features should one concentrate or invest?

best
diana

On 23 Jul 2014, at 13:18, Jon K Peck <[hidden email]> wrote:
> Best subset and random forests are fundamentally different techniques both in how they select variables and in what you get out of them.
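One scale-free heuristic for the sharp cut-off Diana describes (an illustrative assumption, not a documented RF rule): rescale raw importances to relative importance with the maximum at 100, then cut at the largest gap in the sorted values. Plain Python, with invented numbers on the ".005-ish" scale she mentions:

```python
# Raw importances are on wildly different scales across data sets, so work
# with relative importance (max = 100) and cut at the biggest drop.
def relative_importance(raw):
    top = max(raw)
    return [100.0 * v / top for v in raw]

def cutoff_by_largest_gap(raw):
    rel = sorted(relative_importance(raw), reverse=True)
    gaps = [rel[i] - rel[i + 1] for i in range(len(rel) - 1)]
    k = gaps.index(max(gaps)) + 1    # number of variables to keep
    return k, rel[k - 1]             # keep-count and relative importance at the cut

raw = [0.0041, 0.0038, 0.0035, 0.0009, 0.0007, 0.0004, 0.0003]
k, threshold = cutoff_by_largest_gap(raw)
print(k)  # 3 -- the drop from 0.0035 to 0.0009 is the biggest gap
```

Because everything is relative, the same rule gives the same answer whether the raw maximum is .005 or 2800.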
Perspective 1.
I wonder about "adjusted R-squared" for the regression. If the d.f. for the samples are small enough that the adjusted value is much different from the raw value, then the adjusted value is still much too large when stepwise uses only 4-8 out of 21 variables. Be aware that the program only "adjusts" for the number of variables that are "stepped" into the equation. Conservative advice suggests that the formula should adjust for the total number of variables that were available, since the first few entered will capitalize on chance across most of that total. The R^2 expected by chance is p/(N-1); there are a couple of different formulas, but the simple one I use for adjusted R^2 subtracts that computed value from the observed R^2 and rescales the residual to lie between 0 and 1.

Perspective 2. The main validation for stepwise procedures is extensive cross-validation. I have not used Random Forests; it appears to incorporate cross-validation as part of the method. To get a feel for it, I would want to split a sample in half and see how two applications of RF compare.

Perspective 3. If results on 53 data sets vary from 50% to 94%, that's crap for comparability. Does that reflect a failure to "adjust" the R^2 usefully? Does it reflect real differences between data sets? For instance, does low prediction occur where there is low variability in the outcome or low variability in the predictors? Whenever you report on these data, surely you should be ready to comment on that range. An estimated R^2 of 89% (first note) or even 80% (second note) is high enough by itself to suggest over-fitting in the areas where I have fitted regressions, so it makes me want to know whether that overall achievement is reasonable.

--
Rich Ulrich

Date: Tue, 22 Jul 2014 15:12:25 +0100
From: [hidden email]
Subject: random forest V best subset regression
To: [hidden email]
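Rich's Perspective 1 can be made concrete. His simple formula (subtract the chance baseline p/(N-1) from the observed R^2 and rescale) is algebraically identical to the textbook adjustment 1 - (1-R^2)(N-1)/(N-p-1); the substantive point is what to use for p. A sketch with illustrative numbers (N = 60 is an assumed sample size, not a figure from the thread):

```python
# Rich's point: stepwise/best-subset output adjusts R^2 only for the p
# variables that entered the model, not for the full pool of candidates
# that were screened. Compare the two adjustments.
def adjusted_r2(r2, n, p):
    chance = p / (n - 1)                 # expected R^2 from p pure-noise predictors
    return (r2 - chance) / (1 - chance)  # subtract the baseline, rescale to [0, 1]

r2, n = 0.89, 60                          # observed R^2; assumed sample size
print(round(adjusted_r2(r2, n, 6), 3))    # 0.878 -- adjusting for 6 entered variables
print(round(adjusted_r2(r2, n, 21), 3))   # 0.829 -- adjusting for all 21 candidates
```

The gap between the two numbers grows as N shrinks, which is exactly the small-d.f. situation Rich flags.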