Stepwise vs Full model Discriminant Function Analysis and classification results


Stepwise vs Full model Discriminant Function Analysis and classification results

DP_Sydney
Hi All,

 I have run a series of discriminant function analyses (DFA) in SPSS to compare the utility of different variables in classification (in my case whether a bird is male or female) and derive a user-friendly function. I had four variables (measurements of bird size) so I ran 1) four analyses with a single variable, 2) a full model with all four, and 3) a stepwise model (which produced a model with two variables). 

I used leave-one-out cross-validation to 'test' model performance.

- Single variables had classification results of 66.2%, 66.9%, 83.1% and 83.1% (the last two coincidentally had the same value; they did not agree on the classification of every case)
- The full model classified 87.7% of cases correctly
- The stepwise model classified 88.3% of cases correctly (it included a variable, call it A, with 83.1% classification on its own, and a variable, call it B, with 66.9% classification)
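For concreteness, here is a minimal Python sketch of the same comparison (illustrative only: the analyses above were run through the SPSS Discriminant Analysis dialog, and the variable names wing, tail, tarsus and mass are hypothetical stand-ins for the four measurements; the data are simulated, not the real birds):

# Leave-one-out accuracy for single variables, the full model, and a two-variable
# subset, using linear discriminant analysis. Simulated data for illustration.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(1)
n = 150                                   # illustrative sample size; the real n is not stated
sex = rng.integers(0, 2, n)               # 0 = female, 1 = male
wing   = 100 + 4.0 * sex + rng.normal(0, 3.0, n)
tail   =  60 + 1.0 * sex + rng.normal(0, 3.0, n)
tarsus =  25 + 1.5 * sex + rng.normal(0, 1.5, n)
mass   =  30 + 0.5 * sex + rng.normal(0, 2.0, n)
X = np.column_stack([wing, tail, tarsus, mass])
names = ["wing", "tail", "tarsus", "mass"]

def loocv_accuracy(cols):
    """Leave-one-out classification accuracy for a given set of predictors."""
    scores = cross_val_score(LinearDiscriminantAnalysis(), X[:, cols], sex,
                             cv=LeaveOneOut())
    return scores.mean()

for i, name in enumerate(names):                       # 1) single-variable models
    print(f"{name:>6}: {loocv_accuracy([i]):.1%}")
print(f"  full: {loocv_accuracy([0, 1, 2, 3]):.1%}")   # 2) all four variables together
print(f"subset: {loocv_accuracy([0, 2]):.1%}")         # 3) a two-variable subset

Because every percentage here is estimated on held-out cases, a two-variable subset beating the four-variable model is not a logical impossibility; it just means the extra variables contribute noise rather than signal for the left-out cases.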

My main question is: how can a model with a subset of the variables perform better than a model with all variables?
I've been told "It is impossible, in any sensible world, for a model based on only two parameters to perform better [than a model with more parameters].  By analogy, if a multiple regression – a technique closely related to LDA - came up with a higher r-squared using a subset of variables than obtained by using all of them, you would immediately go looking for the error. We do have an error here". (note this reviewer refers to DFA as Linear Discriminant Analysis for some reason - as if there aren't enough terms for the same statistical tests!) 

Second to that: do I understand correctly that variable B (which correctly classified only 66.9% of cases) was included in the stepwise model with variable A, instead of the other better-performing variable (call it C), because variable A correctly classified more of the cases misclassified by B than expected by chance, whereas variable C and the other variables did not? Also, why was variable A selected rather than variable C when they had the same classification accuracy alone? Is it because variables C and B share more correct classifications in common than B does with A?

Any enlightenment will be greatly appreciated so I can prepare a response to the reviewer's comment.

Thanks,
Dean

PS: I have re-run the analyses to confirm the results and I get the same classification percentages. (I'm using a single SPSS worksheet and simply replacing variables in the 'Independents' list in the 'Discriminant Analysis' window, or adding all four to the 'Independents' list for the full model and leaving 'Enter Independents Together' checked, or adding them all and selecting 'Use Stepwise Method'. I double-checked the stepwise result by running an analysis with the two variables the model had selected and choosing 'Enter Independents Together' - same result, as expected. All cases have values for all variables, i.e. the sample size remains unchanged.)
****************************************
Dean Portelli
PhD candidate
School of Biological, Earth and Environmental Sciences
University of New South Wales
Sydney AUSTRALIA 2052


Re: Stepwise vs Full model Discriminant Function Analysis and classification results

Rich Ulrich
I will mention first that leave-one-out validation is better
than no validation; but stepwise methods can call for much
more extensive validation than that, if you want robust results
that extend reliably to other samples.
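A sketch of what "more extensive validation" could look like in practice (an illustrative reading, not something specified in the thread; the data and the crude selection rule are hypothetical): if the variables are chosen by a stepwise-like rule, the selection itself can be repeated inside every leave-one-out fold, so the reported accuracy also pays for the instability of the selection step.

# Repeat variable selection inside each leave-one-out fold. The "best two-variable
# subset by training-fold accuracy" rule below is a crude stand-in for SPSS's
# stepwise entry criterion; the data are simulated.
import numpy as np
from itertools import combinations
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(2)
n = 150
sex = rng.integers(0, 2, n)
X = rng.normal(0, 1, (n, 4)) + sex[:, None] * np.array([1.2, 0.3, 1.1, 0.4])

def best_two(Xtr, ytr):
    """Pick the 2-variable subset with the best training-fold accuracy."""
    subsets = [list(c) for c in combinations(range(Xtr.shape[1]), 2)]
    return max(subsets, key=lambda cols: LinearDiscriminantAnalysis()
               .fit(Xtr[:, cols], ytr).score(Xtr[:, cols], ytr))

hits = []
for train, test in LeaveOneOut().split(X):
    cols = best_two(X[train], sex[train])            # selection redone on each fold
    lda = LinearDiscriminantAnalysis().fit(X[train][:, cols], sex[train])
    hits.append(lda.predict(X[test][:, cols])[0] == sex[test][0])
print(f"LOOCV accuracy with selection inside the loop: {np.mean(hits):.1%}")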

As to your "main question" -- DFA is mathematically a version of
regression on a 0/1 criterion, where R^2  is the criterion.  This
criterion is always "improved" by including more variables, so
the assumption in your question is wrong.  What is improved is
the sum of the squared deviations from predicting 0 or 1. (And,
relevant to a comment below, an instance that is predicted as
"less than 0" or "greater than 1.0"  starts adding more to the
error term, instead of being treated as fine success in prediction.)

The rate of "correct classifications" is an ancillary statistic, which
can be varied (for instance) by changing the cut-point used for
classes.  It is not a criterion for the equation.

In fact, when the two groups are massively different in Ns, it
is likely that you get more "correct classifications" by labeling
all cases as <big group> than by doing any analysis at all.
"90 versus 10"  yields "90% right" for the no-analysis option.
On the other hand, a "balanced" solution with 20% errors in
each group will only have "80% right". 
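Both points are easy to see numerically. In this sketch (synthetic data, 90:10 group sizes, nothing to do with the bird data), the no-model rule already gets 90% "correct", and sliding the cut-point on the same discriminant scores changes the percent correct without changing the fitted equation at all:

# Percent-correct depends on the cut-point, and with very unequal Ns the rule
# "call everyone <big group>" can beat a real analysis on that raw percentage.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(7)
y = np.r_[np.zeros(180, int), np.ones(20, int)]              # 90:10 group sizes
x = np.r_[rng.normal(0, 1, 180), rng.normal(1.7, 1, 20)].reshape(-1, 1)

print(f"label everyone 'big group': {np.mean(y == 0):.0%} correct")

score = LinearDiscriminantAnalysis().fit(x, y).decision_function(x)
for cut in (0.0, 1.0, 2.0):                                  # move the cut-point
    pred = (score > cut).astype(int)
    per_group = [np.mean(pred[y == g] == g) for g in (0, 1)]
    print(f"cut {cut:+.1f}: overall {np.mean(pred == y):.0%}, "
          f"per group {per_group[0]:.0%} / {per_group[1]:.0%}")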


Many people prefer Logistic Regression over DFA for all their
modeling.  The weaknesses of LR are for small Ns, and prediction
that is "too perfect".  I also like the statistics and presentation
of means that I get with DFA.  However, the DFA is a model
that is less precisely appropriate, especially for instances with
very good discrimination, as you seem to have.

As to coefficients:  Both DFA and LR provide "partial regression
coefficients"; those show a unique contribution beyond what
is contributed by other variables in the equation.  When predictors
are highly correlated, it is possible that their *difference* may
also be predictive, in which case their two coefficients will be
unusually large and have opposite signs (see: suppressor
variables).
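A small synthetic illustration of that suppressor situation (hypothetical data): two predictors that are strongly positively correlated, but whose difference carries the group separation, end up with large coefficients of opposite sign.

# Two correlated predictors whose *difference* separates the groups: the fitted
# discriminant weights are large and of opposite sign. Simulated data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
n = 400
group = rng.integers(0, 2, n)
common = rng.normal(0, 1, n)                     # shared nuisance variation
a = common + 0.5 * group + rng.normal(0, 0.2, n)
b = common - 0.5 * group + rng.normal(0, 0.2, n)
X = np.column_stack([a, b])

print("corr(a, b) =", round(np.corrcoef(a, b)[0, 1], 2))        # strongly positive
lda = LinearDiscriminantAnalysis().fit(X, group)
print("discriminant coefficients:", np.round(lda.coef_[0], 1))  # large, opposite signs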

--
Rich Ulrich





Re: Stepwise vs Full model Discriminant Function Analysis and classification results

DP_Sydney
Hi Rich,

 Thanks for the prompt response! I have tried to digest what you have said, but I think I'm a little lost (which is quite normal).

RICH: "As to your "main question" -- DFA is mathematically a version of
regression on a 0/1 criterion, where R^2  is the criterion.  This
criterion is always "improved" by including more variables, so
the assumption in your question is wrong.  What is improved is
the sum of the squared deviations from predicting 0 or 1." 

"0/1 criterion" = binomial dependent variable? How then does DFA differ from logistic regression? Does the difference lie in the R^2 being used somehow?

I understand that any increase in explanatory variables/parameters in a model would increase the coefficient of determination, R^2 (i.e. greater proportion of variance 'explained' collectively by the variables), but do I understand you correctly in that R^2 isn't predictably related to classification accuracy? Therefore, the assumption that increased variables will increase classification accuracy is incorrect.

On a related matter, I have reported r (the canonical correlation coefficient) for each of the discriminant analyses, but I'm not sure how to interpret this value. Is it analogous to r from a simple OLS regression? Should I report R^2 instead? 

RICH: "The rate of "correct classifications" is an ancillary statistic, which 
can be varied (for instance) by changing the cut-point used for 
classes.  It is not a criterion for the equation".

I'm afraid I don't understand this at all, but it seems important to understanding how the classification accuracy is influenced.


I think I'll gradually have an adequate understanding to defend my analysis.

Many thanks,
Dean




Re: Stepwise vs Full model Discriminant Function Analysis and classification results

DP_Sydney
Hi Rich,

Firstly, I'm not sure what has happened with the threads of this post - there are duplicate copies of my first email and your first reply in the thread on SPSSX Discussion archives. I only realised after I sent the recent email to you about posting that your reply was tacked on the bottom of your email.

The reply to my most recent email, which I understand has not gone to the group, appears to have been snipped off, so it may not be there in full?

Thanks very much for the confirmation and clarification. I'm confident I understand the salient points to interpreting R^2:
1) R^2 in DFA is a pseudo-R^2 that is calculated differently to least squares regression, as predictions from DFA are either 1 or 0.
2) R^2 and classification are not directly related.
3) Model 'performance' is assessed by comparing predicted group memberships with actual membership, rather than through interpretation of the R^2 (which is what I thought initially). In which case, does the R^2 value provide any additional pertinent information about the analysis?
I have inserted my queries amongst your blue text:

> Look at the Predicted Values:  DF is predicting to 0/1 scores which are *not* probabilities.

I get this bit, i.e. the 'fitted value' from the linear model (DFA) is either a 0 or 1 (group membership). Is R^2 calculated the same way in MANOVA?

> It upsets non-statisticians to see predicted values outside the range, because they were comfortable thinking of them as p.

Not sure what 'range' refers to here.

> Logistic Regression predicts to log(p/(1-p)) -- which does include "probability" on an infinite scale.

I'm lost here; I don't understand how a regression 'predicts to' probabilities. Going back to basics, my understanding is that a fitted model predicts a value of the dependent variable for each case based on the independent variables. The discrepancy between these predicted and actual values is the basis upon which model performance is assessed (i.e. sums of squares).

Thanks again for your time

Cheers,
Dean

From: [hidden email]
To: [hidden email]
Subject: RE: Stepwise vs Full model Discriminant Function Analysis and classification results
Date: Wed, 4 Jul 2012 13:46:01 -0400

 - I did reply to the group+private address; so you may
yet receive another copy of mine.  Your reply that I see
right now is addressed only to me.  So this reply is not
going to the group


From: [hidden email]
To: [hidden email]
Subject: RE: Stepwise vs Full model Discriminant Function Analysis and classification results
Date: Thu, 5 Jul 2012 02:55:50 +1000

>Hi Rich,

> Thanks for the prompt response! I have tried to digest what you have said, but I think I'm a little lost (which is quite normal).

As to your "main question" -- DFA is mathematically a version of
regression on a 0/1 criterion, where R^2  is the criterion.  This
criterion is always "improved" by including more variables, so
the assumption in your question is wrong.  What is improved is
the sum of the squared deviations from predicting 0 or 1. 

>"0/1 criterion" = binomial dependent variable? How then does DFA differ from logistic regression? Does the difference lie in the R^2 being used somehow?

Look at the Predicted Values:  DF is predicting to 0/1 scores
which are *not* probabilities.  It upsets non-statisticians
to see predicted values outside the range, because they were
comfortable thinking of them as p.  Logistic Regression predicts
to log(p/(1-p)) -- which does include "probability" on an
infinite scale. 

Yes, R^2 is used for "least squares" statistics like DFA.
"Maximum Likelihood" Logistic Regression does not have
an immediate counterpart.  There are at least three versions
of pseudo-R^2  used sometimes, but none of them can
convert "best" prediction at plus or minus infinite to a
Sum-of-squares for deviations, to be translated to R^2.


>I understand that any increase in explanatory variables/parameters in a model would increase the coefficient of determination, R^2 (i.e. greater proportion of variance 'explained' collectively by the variables), but do I understand you correctly in that R^2 isn't predictably related to classification accuracy? Therefore, the assumption that increased variables will increase classification accuracy is incorrect.

Right.

>On a related matter, I have reported r (the canonical correlation coefficient) for each of the discriminant analyses, but I'm not sure how to interpret this value. Is it analogous to r from a simple OLS regression? Should I report R^2 instead? 

Well, yeah, I think so, but the statistic reported with
the test is Wilks's lambda.  I think that it equals
(1-R^2)  for the two group case.  If I recall correctly.



The rate of "correct classifications" is an ancillary statistic, which 
can be varied (for instance) by changing the cut-point used for 
classes.  It is not a criterion for the equation.

>I'm afraid I don't understand this at all, but it seems important to understanding how the classification accuracy is influenced.

This is a version of regression.  There is a prediction equation.
There is a default cut-off for dividing groups on a predicted
score.  You can change the cutoff used by the program by
means of the "prior probability" option (I think that is what
it is called).  If you change the priors to 2:1, you will put 90%+
of the cases into one group if they started out 50-50.

The easier way to look at the effect of cutoffs is to sort the
cases in order of Predicted score, and look at the cumulative
count of Correct and Incorrect Predictions. 
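A sketch of that suggestion (simulated data; sklearn's priors argument is used here as a stand-in for the prior-probability option mentioned above): sort the cases by their discriminant score, walk a cut-point down the sorted list to see how the count of correct classifications changes, and note that changing the priors shifts the default cut.

# Sort cases by discriminant score and tally correct classifications for every
# possible cut-point; then show that changing the priors moves the default cut.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
y = np.repeat([0, 1], 75)
x = np.c_[rng.normal(0, 1, 150) + 1.4 * y]

lda = LinearDiscriminantAnalysis().fit(x, y)
score = lda.decision_function(x)
print(f"default cut-point: {np.mean(lda.predict(x) == y):.1%} correct")

# If the cut falls after sorted position k, cases 0..k are called group 0 and the
# rest group 1; count how many are correct at each k.
sorted_y = y[np.argsort(score)]
correct = np.cumsum(sorted_y == 0) + (np.sum(sorted_y == 1) - np.cumsum(sorted_y == 1))
print(f"best cut-point found by scanning: {correct.max() / len(y):.1%} correct")

# Changing the priors shifts the default cut and hence the classification table:
for priors in ([0.5, 0.5], [2 / 3, 1 / 3]):
    lda_p = LinearDiscriminantAnalysis(priors=priors).fit(x, y)
    print(priors, f"-> {np.mean(lda_p.predict(x) == y):.1%} correct, "
          f"{np.mean(lda_p.predict(x) == 1):.0%} of cases put in group 1")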


>I think I'll gradually have an adequate understanding to defend my analysis.
<snip, previous...>

DFA and Logistic. Was: Stepwise vs Full model Discriminant Function Analysis ...

Rich Ulrich
Dean, and the List --

I gave a private reply to Dean which (a) he possibly did not
see all of, and (b) he misinterpreted much of what he summarized.
My response had included a few statements about Logistic Regression,
since that is often a preferred alternative to DFA, especially when
prediction is strong (as his is).

I will correct his "salient points"; and then try to contrast DFA and LR.

Dean's salient points -
1)  R^2 in DFA is exactly the same as it would be for OLS regression with
a 0/1 outcome, since the two are showing exactly the same model. 
DFA presents correlations and coefficients using a "within group"
basis for standardization, so it is mainly the tests that show up as
exactly the same.  R^2 is the same for all least-squares procedures,
including DFA, ANOVA, MANCOVA.
2)  Well, a better R^2 correlates well with better classification when
group sizes are equal.  But they are certainly not the same thing.
3)  Predicted versus actual group membership (using default cutoff
scores to classify; ignoring group Ns) is *not* a very good measure
of "model performance".
  For DFA (where R^2 is available), the Wilks's Lambda (1-R^2)
is the primary measure.  And you can look at the p-value.
  For LR, various "pseudo-R^2"s have been suggested, but none
work well.  You are stuck with the overall chi-squared, or its p-value.
(A numerical check of points 1 and 3 is sketched below.)
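An illustrative numerical check of points 1 and 3 (synthetic data, nothing from the thread): OLS regression of a 0/1 group code on the predictors and the DFA score agree up to scaling, and for two groups Wilks's lambda comes out equal to 1 - R^2 of that regression.

# Check: OLS on a 0/1 group code and LDA give the same linear score (correlation 1),
# and Wilks' lambda = det(within SSCP)/det(total SSCP) equals 1 - R^2. Simulated data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(11)
n = 200
y = rng.integers(0, 2, n)
X = rng.normal(0, 1, (n, 4)) + y[:, None] * np.array([1.0, 0.4, 0.8, 0.2])

ols = LinearRegression().fit(X, y)
r2 = ols.score(X, y)                                   # R^2 of the 0/1 regression
lda_score = LinearDiscriminantAnalysis().fit(X, y).decision_function(X)
print("corr(OLS fit, DFA score):", round(np.corrcoef(ols.predict(X), lda_score)[0, 1], 4))

Xt = X - X.mean(axis=0)                                # overall-centred scores
Xw = np.vstack([X[y == g] - X[y == g].mean(axis=0) for g in (0, 1)])   # within-group
wilks = np.linalg.det(Xw.T @ Xw) / np.linalg.det(Xt.T @ Xt)
print("Wilks lambda:", round(wilks, 4), "  1 - R^2:", round(1 - r2, 4))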

The "percent classified correctly" lets you compare some
alternate models informally, but it is not a criterion being optimized.
And it often could be "improved" as a number by merely adjusting
the cut-off score slightly, since there is always a cut-off to separate
two groups.  Both DFA (or regression) and LR are methods to
derive a linear prediction equation.  They extract weights to apply
to the predictors, and the result is a SCORE for each case.  Then a
cut-off line is applied, which places individuals into one of the two
groups.  This line is, in some arbitrary sense, "in the middle".  It is
not adjusted to pick up one or two more correct classifications,
by edging higher or lower.  - This could be the main reason that
Dean saw changes in "percent" which were not consistent with the
improvement in the overall R^2 when another variable was added.

Also, as a standard, Percent Classified Correctly (PCC) tends to fail
horribly when group sizes are drastically different.  For instance,
when 90% of cases are Group 1, you can achieve 90% PCC by
calling every case "Group 1".  But this is not effective prediction,
whether you do it by hand or do it by a computer program. 

A less arbitrary, more general standard of performance is obtained
when you require that the line be drawn to produce equal error
rates for each group.  (This can be thought of as Sensitivity and
Specificity, if you know those terms.)  Thus, for the "90%" example,
if you achieve 80% accuracy for *each* group, you have done some
effective prediction -- However, clearly, you now have reduced your
PCC from 90% to 80%.
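The same thing in a small sketch (synthetic 90:10 data): the equal-error-rate cut gives informative prediction in both groups, even though the raw percent-correct falls below the 90% no-model baseline.

# Find the cut on the discriminant score where the two groups have roughly equal
# accuracy, and compare the resulting percent-correct with the 90% baseline.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(9)
y = np.r_[np.zeros(900, int), np.ones(100, int)]
x = (rng.normal(0, 1, 1000) + 1.8 * y).reshape(-1, 1)
score = LinearDiscriminantAnalysis().fit(x, y).decision_function(x)

def rates(cut):
    pred = (score > cut).astype(int)
    return np.mean(pred[y == 0] == 0), np.mean(pred[y == 1] == 1)   # spec., sens.

cuts = np.linspace(score.min(), score.max(), 500)
eer_cut = min(cuts, key=lambda c: abs(rates(c)[0] - rates(c)[1]))   # equal error rates
spec, sens = rates(eer_cut)
overall = np.mean((score > eer_cut).astype(int) == y)
print(f"per-group accuracy {spec:.0%} / {sens:.0%}; overall PCC {overall:.0%}")
print(f"no-model baseline (call everyone group 0): {np.mean(y == 0):.0%}")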


Predicted Scores.
For this topic, it is easier to refer to the 0/1 regression expression
of DFA, than to some other DFA formulation.  What you get as
predicted scores for each case usually range between the extremes
of 0 and 1.  This leads to the complacent (or naive) interpretation
of those scores as being a "probability of group membership" (though
they are never, really, that).  So long as adding another variable
moves the "prediction" closer to 0 or 1, you get an improvement of
R^2 -- since that is measured as the sum of the squared deviations.
This will be the case so long as prediction is pretty mediocre.

Unfortunately for our convenience, a set of several "good" predictors
can result in predicted scores that are greater than 1 or less than 0.

First, this makes it obvious that the scores are not really "probabilities
of group membership."  Second, the "well-predicted" cases, the ones
that are most extreme, start *adding* to the squared deviations, as
they get further from 0 or 1.  That is not desirable, since it decreases
the R^2.  (Example of how this works: Consider two rare predictors,
uncorrelated, each of which "practically ensures" membership in Group
1; so they both should have large regression coefficients.  Now, for
the rare*rare combination case, we are sure that it is Group 1.  But
the added regression score is far *beyond* 1.0, and therefore is
penalized... or, rather, results in biasing the coefficients towards
smaller size.)
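A toy version of that example (simulated; the two "rare indicators" are hypothetical): regressing a 0/1 group code on two rare, nearly uncorrelated indicators that each almost guarantee Group 1 gives a fitted score above 1.0 for the rare-and-rare combination, and that is exactly the case the least-squares criterion then penalizes.

# OLS on a 0/1 outcome with two rare indicators that each strongly signal Group 1:
# the both-present combination is predicted beyond 1.0 and so adds to the squared
# error even though it is the surest Group-1 case. Simulated data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(13)
n = 2000
a = (rng.random(n) < 0.05).astype(float)       # rare indicator 1
b = (rng.random(n) < 0.05).astype(float)       # rare indicator 2
p = np.where(a + b >= 1, 0.95, 0.30)           # either indicator almost ensures Group 1
y = (rng.random(n) < p).astype(float)

ols = LinearRegression().fit(np.column_stack([a, b]), y)
pred_both = float(ols.predict(np.array([[1.0, 1.0]]))[0])
print("fitted score for a case with both indicators:", round(pred_both, 2))
print("its squared error if it really is Group 1:", round((pred_both - 1.0) ** 2, 2))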

Logistic Regression does not have the problem from "over-prediction."
Thus, LR is a correct theoretical construal of the prediction problem,
in this aspect where DFA fails (for high R^2).  

In place of "least squares", LR draws on another method of statistical
estimation.  It uses the "likelihood function" and "maximum likelihood
estimates" (MLE) to obtain testing on the difference between log-
likelihoods for fitted models.  MLE provides great flexibility in models,
as illustrated in LR by setting up the explicit scale of prediction as
the "logit", which is log(p(1-p)).  This has no problem for over-
prediction since it is infinite at both extremes.  Also, the predicted
logit can be translated back to a predicted "probability of group
membership", which is what a lot of people want. 

What LR does not have is a convenient way of comparing models
on the absolute scale that R^2 seems to provide.  There are
several suggestions for pseudo-R^2 for LR, but none has won out.

There are other shortfalls of the LR procedures of today as a total
replacement for DFA.  DFA has ancillary statistics that are nice,
including diagnostics that are standard or available.  LR can blow
up or give unstable coefficient values without warning you,
especially when its predicted separation nears 100%.  (People are
working on these problems.)






