SPSSX Discussion

SIMCA: When number of Variables far exceeds number of cases

FORCHEH, N. (DR.)

SIMCA: When number of Variables far exceeds number of cases

I have not had the opportunity to follow this discussion on the
Relationship between Sets of Dependent and Independent Variables, so
apologies if this has already been raised. I pick up from the last
postings:

In classical inference, the sample size must always exceed the number
of variables by a factor of several. This is to avoid singularity and
the resulting spurious correlations that give a false sense of a good
fit.
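
The singularity and the false sense of good fit can be seen in a small
simulation. This is only a sketch under assumed conditions: the sizes
(20 cases, 25 variables), the seed and the use of NumPy are illustrative
choices, and both predictors and response are pure random noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_vars = 20, 25                      # assumed sizes: more variables than cases
X = rng.standard_normal((n_cases, n_vars))    # predictors: pure noise
y = rng.standard_normal(n_cases)              # response: unrelated noise

# X'X is n_vars x n_vars, but its rank cannot exceed n_cases, so it is singular.
rank_xtx = np.linalg.matrix_rank(X.T @ X)

# For an underdetermined system, lstsq returns the minimum-norm exact solution.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ beta
total = y - y.mean()
r_squared = 1.0 - (residual @ residual) / (total @ total)

print("rank of X'X:", rank_xtx, "out of", n_vars)
print("R squared on pure noise:", round(float(r_squared), 6))
```

The fit is essentially perfect although there is no relationship at all
in the data, which is the spurious result described above.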

However, in pattern recognition, chemometrics and related studies, the
number of variables usually far exceeds the number of cases.
Furthermore, you can't discard some of these variables because of what
they represent. Methods such as SIMCA and 4 stage partial least squares
have been developed in chemometrics to deal with such situations. Very
early papers on these include Jain and Dubes (1978), "Feature
Definition in pattern recognition with small sample sizes", Pattern
Recognition Vol 10, p85-97, and Albano and Blomqvist (1981), "Pattern
Recognition by means of disjoint principal components models" (in a
conference proceeding). I do not have recent refs just now, but there
are many.

There is a joke about statisticians and sample sizes:

A fire broke out and several academics were consulted for a way to deal
with the fire. The Statistician replied that one fire was just too few
to enable him to derive a solution. His recommendation was that more
fires should be lit and only then could a solution be modeled.

N Forcheh
University of Botswana

(Disclaimer: There are many variants of this joke and I do not know the
originator)

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Hector Maletta
Sent: 22 August 2006 19:10
To: [hidden email]
Subject: Re: Relationship between Sets of Dependent and Independent Variables

It is all the same. You may use, for instance, factor analysis to derive
a scale representing the 18 variables of one of the sets, one scale for
women and another similar scale for men, but the scale for men will be
built with data from a sample of about 60 men, i.e. about 3.3 men per
variable, and that is hardly statistically significant. Take any other
sample of 60 men from the same population and the results are likely to
be completely different. The margin of error in the factor loadings and
the regression coefficients will be very wide, especially if some of
the variables are not very highly correlated (r>0.90 or r>0.95) with
some of the others.
Data, as the old saying goes, can always be tortured till they confess.
But you had better not. There are better ways to get to the truth.
Hector

-----Original Message-----
From: Susan M. Sereika [mailto:[hidden email]]
Sent: Tuesday, August 22, 2006 1:45 PM
To: 'Hector Maletta'
Subject: RE: Relationship between Sets of Dependent and Independent Variables

Dear Hector:

I agree that sample size is problematic. Is there anything that can be
salvaged from this? Would it be reasonable to present the work as
exploratory? Or perhaps to apply principal components analysis to derive
a smaller number of derived variables and conduct the analyses with
these derived variables using regression analysis? Thank you very much
for your thoughts on this.

Sincerely,
Susan
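
The approach asked about above (principal components first, then
regression on the component scores) can be sketched as follows. This is
an illustration only, not the analysis anyone in the thread ran: the
data are simulated noise, the sizes (60 cases, 18 variables) are taken
from the thread, and the choice of 3 components is an arbitrary
assumption. Reducing the number of regressors does not by itself remove
the small-sample concern raised in the reply.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cases, n_vars, n_comp = 60, 18, 3           # n_comp = 3 is an arbitrary assumption
X = rng.standard_normal((n_cases, n_vars))    # stand-in for 18 subscale scores
y = rng.standard_normal(n_cases)              # stand-in for one outcome variable

# Principal components via SVD of the column-centred data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:n_comp].T                   # component scores, n_cases x n_comp

# Share of total variance carried by the retained components.
explained = float((s[:n_comp] ** 2).sum() / (s ** 2).sum())

# Ordinary least squares of y on an intercept plus the component scores.
design = np.column_stack([np.ones(n_cases), scores])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

print("variance explained by retained components:", round(explained, 3))
print("intercept and slopes:", np.round(coef, 3))
```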

-----Original Message-----
From: Hector Maletta [mailto:[hidden email]]
Sent: Tuesday, August 22, 2006 12:09 PM
To: 'Susan M. Sereika'
Subject: RE: Relationship between Sets of Dependent and Independent Variables

Now the situation is clearer, and the answer more definitely negative.
The 36 variables are far too many for just 200 cases (below 6 per
variable), let alone for 30% of them, i.e. for about 60 men (which is
about 1.7 cases per variable, when the old rule of thumb, now
discredited as insufficient, was at least 10; nowadays far more than 10
is usually required, depending on the variance of the variables and the
strength of the relationship). Hector
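
The cases-per-variable arithmetic in this message can be checked with a
few lines. The function name is a hypothetical helper; the figures (200
cases, 36 variables, a 30% subsample, and the old rule of thumb of 10)
come from the thread.

```python
def cases_per_variable(n_cases: int, n_variables: int) -> float:
    """Ratio of sample size to number of variables."""
    return n_cases / n_variables


total_cases = 200
n_variables = 36                                   # 18 + 18 subscale scores
smaller_subsample = round(0.30 * total_cases)      # the 30% subsample
old_rule_of_thumb = 10                             # minimum cases per variable

full_ratio = cases_per_variable(total_cases, n_variables)
sub_ratio = cases_per_variable(smaller_subsample, n_variables)

print(f"full sample: {full_ratio:.2f} cases per variable")
print(f"subsample of {smaller_subsample}: {sub_ratio:.2f} cases per variable")
print("cases needed under the old rule:", old_rule_of_thumb * n_variables)
```

Both ratios fall well short of the old rule of thumb, which would call
for at least 360 cases with 36 variables.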

-----Original Message-----
From: Susan M. Sereika [mailto:[hidden email]]
Sent: Tuesday, August 22, 2006 12:54 PM
To: 'Hector Maletta'
Subject: RE: Relationship between Sets of Dependent and Independent Variables

Dear Hector:

Thank you for your very quick and thoughtful reply. The investigation is
theoretically driven for the most part with respect to the relationship
between the two sets of variables. The idea that the relationship may
vary by gender/sex is a little more exploratory, although there is some
literature to support some relationships. Each set of dependent and
independent variables consists of 18 variables, and the variables are
subscale scores believed to measure two concepts: beliefs about
depression (18 variables) and coping (18 variables). The initial
investigation focused on just examining the relationships between the
two variable sets and, given the complexity of the data, canonical
correlation analysis (CCA) was used. Then the investigation was expanded
to consider differences between men and women, and a CCA was conducted
within each gender subsample. The smaller subsample sizes are
problematic, especially for the male subsample.

Sincerely,
Susan

-----Original Message-----
From: Hector Maletta [mailto:[hidden email]]
Sent: Monday, August 21, 2006 12:05 PM
To: 'Susan M. Sereika'; [hidden email]
Subject: RE: Relationship between Sets of Dependent and Independent Variables

You state your friend's goals in a very sketchy way, so it is very
difficult to give an opinion. For instance, how many variables are
involved? 200 cases may be way too few if the variables happen to be
(even moderately) numerous. Is he/she interested in bivariate or
multivariate relations between these variables? For instance, one may
be interested in crossing pairs of variables such as X BY Z BY sex, and
seeing whether the association/correlation of X and Z varies with sex.
This may be feasible with 200 cases (140 women, 60 men), but if one has,
say, K variables there would be K*(K-1)/2 pairs of variables, a number
which rapidly goes into the hundreds or the thousands as K grows. For
K=50, there are 1225 pairs of variables to consider. If one is
interested in models involving many variables, such as regression, the
number of possible models grows exponentially and, besides, the small
number of cases in the sample rapidly becomes a limitation. Another
consideration is whether your friend has any theory or conceptual
approach or problem-oriented goal when facing these data, or is just
exploring blindly around. What is he/she looking for? Just mining around
for any kind of non-random-looking patterns, like an astronomer
searching for signs of extra-terrestrial intelligence among random
electromagnetic cosmic noise, or like John Nash, he of the beautiful
mind, parsing newspapers in the worst of his madness? In a sample of 200
she/he may find many promising patterns, but they may be nothing but
sample flukes.

Hector
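
The pair count K*(K-1)/2 quoted in this message is easy to tabulate. The
values of K below are illustrative, except 50, which is the one used in
the message; the function name is a hypothetical helper.

```python
from math import comb


def n_pairs(k: int) -> int:
    """Number of distinct pairs among k variables: k*(k-1)/2."""
    return k * (k - 1) // 2


for k in (10, 18, 36, 50, 100):
    # comb(k, 2) is the same quantity, computed by the standard library.
    assert n_pairs(k) == comb(k, 2)
    print(f"K = {k:3d}: {n_pairs(k):5d} pairs")
```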

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Susan M. Sereika
Sent: Monday, August 21, 2006 10:44 AM
To: [hidden email]
Subject: Relationship between Sets of Dependent and Independent Variables

Dear Listserv members:

A colleague is interested in examining the relationship between two sets
of variables (dependent variables and independent variables).
Additionally, she would like to investigate whether the relationship
varies between men and women. The total sample size is moderate (about
200 participants), with 70% being women. What might be a good approach
to use when analyzing these data in light of the objectives? Any
suggestions are most appreciated.

Sincerely,
Susan Sereika

Richard Ristow

Re: SIMCA: When number of Variables far exceeds number of cases

At 12:47 PM 8/23/2006, FORCHEH, N. (DR.) wrote:

>In classical inference, the sample size must always outnumber the
>number of variables by several factors. This is to avoid singularity
>and the resulting spurious correlations that give a false sense of a
>good fit.
>
>However, in Pattern recognition, Chemometrics and related studies, the
>number of variables usually far exceeds number of cases. Methods such
>as SIMCA and 4 stage partial least squares have been developed in
>Chemometrics to deal with such situations.

To display ignorance and curiosity: can you give us an example? I
bounced around the Web about SIMCA a little. Nothing seemed to make the
claim that it could handle many more variables than there are cases, or
even any more. From the descriptions I saw, it's based on PCA, which
certainly doesn't allow extracting more components than there are
cases.

Classical statistics or not, it's hard for me to see how you can
estimate from any data, more degrees of freedom than it has.
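
The point about PCA and degrees of freedom can be illustrated
numerically. This is a sketch with assumed sizes (10 cases, 200
variables) on random data, and it says nothing about SIMCA itself: after
centring, the data matrix has at most n - 1 non-zero singular values, so
at most n - 1 components can be extracted from n cases however many
variables there are.

```python
import numpy as np

rng = np.random.default_rng(2)
n_cases, n_vars = 10, 200                     # assumed sizes: far more variables than cases
X = rng.standard_normal((n_cases, n_vars))

# PCA works on column-centred data; centring removes one degree of freedom.
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)

# Count singular values that are non-zero up to floating-point tolerance.
tol = singular_values.max() * max(Xc.shape) * np.finfo(float).eps
n_components = int((singular_values > tol).sum())

print("variables:", n_vars, "cases:", n_cases)
print("components with non-zero variance:", n_components)
```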

It looked like some chemometrics cases involve looking for patterns in
spectra and the like. Now, if a 'case' is a full spectrum rather than a
single numeric value, you may well be able to estimate many more
variables than there are 'cases', because a 'case' is actually a large
number of observations.

That's the best I could do in a quick scan. Can you give an instance of
estimating a "number of variables usually far exceeding the number of
cases," using SIMCA or otherwise?

Many thanks,
Richard Ristow