SPSSX Discussion

Chi-squared and Chi-squared test for trend comparison


Charlotte-9

Chi-squared and Chi-squared test for trend comparison

Dear all,

I have used a chi-squared test and a chi-squared test for trend to analyse
data with ordered categories (e.g. success/failure in an exam against age,
partitioned into groups from youngest to oldest). Success appears to
decrease with increased age.

My question concerns the difference between the chi-squared and the chi-
squared_trend values. If the difference, which I understand to be a chi-
squared value, proves to be significant, what kind of things can I
deduce? Do I deduce that the observed variation is not just due to a
linear trend and that a non-linear relationship exists? Do I deduce that
other factors are influencing the outcome of success? What other kinds of
analysis should I be carrying out to investigate further?
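For readers following the arithmetic behind this question, here is a minimal sketch in Python. It assumes a 2 x k table, equally spaced scores 1..k, and the Cochran-Armitage form of the trend statistic; the thread does not say which formula was used, so that choice is an assumption (SPSS's linear-by-linear association statistic differs from it by a factor of (N-1)/N). The counts are made up for illustration. With this form, the overall Pearson chi-squared (k-1 df) splits into the trend component (1 df) plus a remainder (k-2 df) that measures departure from a straight-line trend in the proportions.

```python
def pearson_chi2(successes, totals):
    """Pearson chi-squared for a 2 x k table (k - 1 degrees of freedom)."""
    n_all = sum(totals)
    p_all = sum(successes) / n_all
    between = sum(n * (r / n - p_all) ** 2 for r, n in zip(successes, totals))
    return between / (p_all * (1 - p_all))


def trend_chi2(successes, totals, scores=None):
    """Cochran-Armitage chi-squared for linear trend (1 degree of freedom)."""
    if scores is None:
        scores = range(1, len(totals) + 1)  # assumption: equally spaced scores
    scores = list(scores)
    n_all = sum(totals)
    r_all = sum(successes)
    p_all = r_all / n_all
    sum_nx = sum(n * x for n, x in zip(totals, scores))
    sum_nx2 = sum(n * x * x for n, x in zip(totals, scores))
    sum_rx = sum(r * x for r, x in zip(successes, scores))
    numerator = (sum_rx - r_all * sum_nx / n_all) ** 2
    denominator = p_all * (1 - p_all) * (sum_nx2 - sum_nx ** 2 / n_all)
    return numerator / denominator


# Hypothetical counts: five age groups, youngest to oldest.
totals = [100, 100, 100, 100, 100]
successes = [50, 47, 38, 36, 29]

overall = pearson_chi2(successes, totals)
trend = trend_chi2(successes, totals)
departure = overall - trend          # chi-squared for departure from linearity
df_departure = len(totals) - 2

print(f"overall chi2   = {overall:.3f} on {len(totals) - 1} df")
print(f"trend chi2     = {trend:.3f} on 1 df")
print(f"departure chi2 = {departure:.3f} on {df_departure} df")
```

A large departure value says only that the group proportions do not fall on a straight line in the chosen scores. By itself it does not show that other factors are at work; that would need those factors in the model.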

Any advice would be appreciated, just to point me in the right direction.
I realise that this is a bit of a vague question, but I'm just trying to
learn about this stuff and never really have anyone on hand to ask.

Thanks in advance,

Lou

Burleson,Joseph A.

Re: Chi-squared and Chi-squared test for trend comparison

Lou:

I would strongly advise against using a chi-square test for ordered categories when you have a (presumably) parametric, hopefully normally distributed variable such as age that is hypothesized to correlate with a dichotomous variable.

First check the age distribution for skewness, and correct it with a square-root or natural-log transformation if necessary.

In any event, simply use the much more powerful t-test to test the difference in mean age between the success and failure groups. Using chi-square seems unnecessarily complicated, with an actual loss of power. The number of categories used to partition age is arbitrary: too few can hide the shape of the relationship, and too many costs power.
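The two steps described above can be sketched in Python as follows. This is an illustration only: the ages are invented, the function names are hypothetical, and the t statistic is the Welch (unequal-variance) form, which is an assumption because the post does not say which variant of the t-test is meant.

```python
import math
import statistics


def skewness(values):
    """Moment-based sample skewness (third standardized moment)."""
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m3 = sum((v - mean) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5


def welch_t(group_a, group_b):
    """Welch t statistic and its approximate degrees of freedom."""
    var_a = statistics.variance(group_a) / len(group_a)
    var_b = statistics.variance(group_b) / len(group_b)
    t = (statistics.fmean(group_a) - statistics.fmean(group_b)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(group_a) - 1) + var_b ** 2 / (len(group_b) - 1)
    )
    return t, df


# Invented ages for the two outcome groups.
success_ages = [51, 53, 54, 55, 57, 58, 60, 61, 63, 66]
failure_ages = [55, 58, 60, 62, 63, 65, 66, 68, 70, 73]

print("skewness of all ages:", round(skewness(success_ages + failure_ages), 3))
t_stat, dof = welch_t(success_ages, failure_ages)
print(f"Welch t = {t_stat:.3f} on about {dof:.1f} df")
```

If the skewness were large, the same comparison could be run on math.sqrt(age) or math.log(age) instead of the raw ages.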

Joe Burleson

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Lou
Sent: Wednesday, April 04, 2007 6:58 AM
To: [hidden email]
Subject: Chi-squared and Chi-squared test for trend comparison


Charlotte-9

Re: Chi-squared and Chi-squared test for trend comparison

In reply to this post by Charlotte-9

Hi Joe,

That's some interesting advice there. The problem is that I am writing a
report in which the readers, including my bosses, want me to determine
whether or not uptake of a cancer screening test reduces with increased
age. Simple plots suggest this to be the case. The ages have already
been partitioned into 5 groups (before I started working on this data),
but I do also have the original ages in my data set.

As for your suggestion of testing the mean ages in the success and
failure groups: while this would be informative, I can't see that it would
answer the question I have stated above. Could you elaborate a bit more?
All in all, are you suggesting that I would be advised to work only with
the raw ages and not the categorised ages?

The main problem I have is that my bosses give me no time to look into the
techniques properly and insist I do certain things (even though they have
very limited stats knowledge). It's all very frustrating!!

Thanks for your help,
Lou


Richard Ristow

Re: Chi-squared and Chi-squared test for trend comparison

In reply to this post by Burleson,Joseph A.

Another amicus remark - I seem to be making a lot, lately -

At 10:41 AM 4/4/2007, Burleson,Joseph A. wrote:

>[...] when you have a (presumably) parametric, hopefully normally
>distributed variable such as age [...]

If my principles and the gullibility of others permitted, I would get
very rich betting against ages being normally distributed, in any
study.

The big reason for expecting normal distributions to be common is the
central limit theorem. Its exact hypotheses will rarely be satisfied
with real data. However, if a quantity may reasonably be modeled as the
sum of a number of independent random quantities of similar variance,
it's reasonable to take it as approximately normal.

However ages are generated, this isn't how.

Burleson,Joseph A.

Re: Chi-squared and Chi-squared test for trend comparison

Richard:

The last 6 large (n = 400 to 6,000) clinical trials I analyzed all had
age perfectly normally distributed (i.e., skewness between -.20 and
+.20). On the other hand, I, too, have seen age not be normal (e.g.,
Poisson distributions, U-shaped distributions, etc.). One cannot assume
that it is one way or the other for no specific reason.

Sorry to be so nit-picky, but the Central Limit Theorem has nothing at
all to do with whether a population OR a single sample is normal or
non-normal. The CLT has to do with "sampling" distributions. Repeated
sampling of a certain sample size n, of a population (continuous
variable) distribution, whether the population distribution is normal or
not, but which has a mean and an SD, will produce a set of means which
is normally distributed (Pagano & Gauvreau, 2000; Hays, 1973).

The CLT is almost always borne out when the n of the sample that is
repeatedly taken is >=40, and at least 50 repeated samples are
taken. I am looking for the place where these bets are made!
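The sampling-distribution statement above can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the population is exponential (chosen because it is strongly skewed), the seed is arbitrary, and 2,000 repeated samples are drawn rather than 50 so that the skewness estimate is stable. The sample size of 40 matches the figure quoted above.

```python
import random
import statistics


def skewness(values):
    """Moment-based sample skewness (third standardized moment)."""
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m3 = sum((v - mean) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5


rng = random.Random(2007)   # arbitrary seed, for reproducibility only
SAMPLE_SIZE = 40            # n of each repeated sample
N_SAMPLES = 2000            # assumption: many more than 50 repeats

# Individual observations from a skewed population (theoretical skewness 2).
population_draws = [rng.expovariate(1.0) for _ in range(20000)]

# Means of repeated samples of size 40 from the same population.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_SAMPLES)
]

skew_population = skewness(population_draws)
skew_means = skewness(sample_means)

print(f"skewness of single observations: {skew_population:.2f}")
print(f"skewness of sample means (n=40): {skew_means:.2f}")
```

The means are much less skewed than the raw observations, but not perfectly symmetric: for independent draws the skewness of a mean shrinks in proportion to 1/sqrt(n), so some skew remains at n = 40.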

I accept cash, checks, VISA or future promises.

Joe Burleson

-----Original Message-----
From: Richard Ristow [mailto:[hidden email]]
Sent: Wednesday, April 04, 2007 2:01 PM
To: Burleson,Joseph A.; [hidden email]
Subject: Re: Chi-squared and Chi-squared test for trend comparison


Richard Ristow

Re: Chi-squared and Chi-squared test for trend comparison

At 03:18 PM 4/4/2007, Burleson,Joseph A. wrote:

>The last 6 large (n = 400 to 6,000) clinical trials I analyzed all had
>age perfectly normally distributed (i.e., skewness between -.20 and
>+.20).

Well... The skewness measure may not be conclusive. The skewness is
zero for any symmetric distribution. That includes uniform
distributions, or choosing each one of two values with probability 0.5,
or any number of easy to construct long-tailed distributions.
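A quick numerical illustration of this point, as a sketch with constructed (not real) data: an evenly spaced set of values stands in for a uniform distribution, and an equal mix of 0s and 1s stands in for the two-value case. Both have zero skewness, yet their excess kurtosis is far from the value of 0 that a normal distribution would give.

```python
def standardized_moment(values, order):
    """Central moment of the given order divided by sd**order."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    central = sum((v - mean) ** order for v in values) / len(values)
    return central / variance ** (order / 2)


uniform_like = list(range(100))   # evenly spaced, symmetric about 49.5
two_point = [0, 1] * 50           # two values, each with probability 0.5

for name, data in (("uniform-like", uniform_like), ("two-point", two_point)):
    skew = standardized_moment(data, 3)
    excess_kurtosis = standardized_moment(data, 4) - 3
    print(f"{name}: skewness = {skew:.3f}, excess kurtosis = {excess_kurtosis:.3f}")
```

So a skewness near zero rules out asymmetry but says nothing about whether the shape is normal.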

I've no idea of the design of the studies you were on, but most
clinical trials select an age range, explicitly or by implication. The
population pyramid being fairly flat over much of its range, that tends
toward a uniform, or nearly uniform, age distribution. If it's a very
wide age range in an adult population, you'll probably see some upward
skewing.

So, I might collect after all. Did you run a Kolmogorov-Smirnov, or
other specific, test for normality?
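As a rough idea of what such a check looks at, the sketch below computes the Kolmogorov-Smirnov distance between a set of ages and a normal distribution fitted to them. The ages are constructed (20 subjects at each whole year from 40 to 69, i.e. a flat age range) and the function name is hypothetical. One caveat I am sure of: because the mean and SD are estimated from the same data, the standard K-S significance levels do not apply as they stand (the Lilliefors correction addresses this), so the code reports only the distance.

```python
from statistics import NormalDist, fmean, pstdev


def ks_distance_to_fitted_normal(values):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(fmean(ordered), pstdev(ordered))
    distance = 0.0
    for i, value in enumerate(ordered):
        cdf = fitted.cdf(value)
        # Compare the fitted CDF with the empirical CDF just before
        # and just after the step at this observation.
        distance = max(distance, abs((i + 1) / n - cdf), abs(i / n - cdf))
    return distance


# Constructed, roughly uniform ages: 20 subjects per year from 40 to 69.
ages = [age for age in range(40, 70) for _ in range(20)]

ks_d = ks_distance_to_fitted_normal(ages)
print(f"n = {len(ages)}, K-S distance from fitted normal = {ks_d:.3f}")
```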

(I might add that an approximately uniform age distribution will be
just fine for analysis, and there was no need to go beyond the skewness
check, for your purposes. The worst problem would be age outliers;
people near the end of the observed age range have very different
medical problems, of course. But the selection procedures surely
excluded those.)

>I, too, have seen age not be normal (e.g., Poisson distributions,
>U-shaped distributions, etc.). One cannot assume that it is one way or
>the other for no specific reason.

No. On the other hand, the age distribution, whatever it is, is usually
that way because of a selection criterion applied to an overall
population pyramid, and a clear grasp of the explicit or implicit
selection rule, is crucial.

(Stories: Like a study of at-risk - premature - neonates, that showed a
strong negative correlation between birth weight, and gestational age
at birth.)

>Sorry to be so nit-picky, but the Central Limit Theorem has nothing at
>all to do with whether a population OR a single sample is normal or
>non-normal.

Actually, I've often seen it argued, that it does. Admittedly the
argument is a little hand-wavy, as it deals with effects that can only
be hypothesized to exist.

>The CLT has to do with "sampling" distributions.

First, no; the CLT has to do with distributions of the sums (or means)
of random variables; sampling distributions are one instance.

Now, bear with me, and I'm taking a point of view standard among
probability theorists, but that often seems strange to statisticians:
the observations are not selected from a 'population', considered as a
finite, potentially identifiable set of subjects; but are drawn,
generated, according to distribution and dependency rules.

Consider residuals, then - 'random variation' added to an underlying
value that we actually want. (This is the standard premise of linear
models.) Why would we remotely expect these to be normally distributed?

Here's the hand-wavy part: If there are actually many unobserved
factors whose effects add to form the residuals, they are statistically
independent, and their variances are comparable ("uniformly bounded" is
the correct notion), then the hypotheses of the CLT apply, and we may
with some justice expect approximately normal residuals.

This model, of residuals that are the sum of many small random effects,
suggests a likely problem: what if they aren't all of comparable size?
Indeed, one of the more common observed deviations from normal
residuals, is long 'tails' - probability of very large residuals much
greater than given by the normal distribution. That is what you get if
you have one, or a few, influences that occur rarely but have high
variance when they do occur.
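The two situations just described can be imitated with a simulation. This is a sketch under assumptions of my choosing: 30 independent uniform effects per residual, a rare extra influence that occurs 2% of the time with a standard deviation of 20, an arbitrary seed, and hypothetical function names. Excess kurtosis is used as the measure of tail weight (0 for a normal distribution).

```python
import random


def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    fourth = sum((v - mean) ** 4 for v in values) / len(values)
    return fourth / variance ** 2 - 3


rng = random.Random(42)   # arbitrary seed, for reproducibility only


def many_small_effects():
    """Residual built from 30 independent effects of comparable size."""
    return sum(rng.uniform(-1.0, 1.0) for _ in range(30))


def with_rare_large_effect():
    """Same residual, plus an influence that is rare but high-variance."""
    residual = many_small_effects()
    if rng.random() < 0.02:
        residual += rng.gauss(0.0, 20.0)
    return residual


plain = [many_small_effects() for _ in range(20000)]
long_tailed = [with_rare_large_effect() for _ in range(20000)]

kurt_plain = excess_kurtosis(plain)
kurt_long_tailed = excess_kurtosis(long_tailed)

print(f"comparable effects only: excess kurtosis = {kurt_plain:.2f}")
print(f"with rare large effect:  excess kurtosis = {kurt_long_tailed:.2f}")
```

The first set should come out close to normal in tail weight, while the second shows the long tails described above.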

This model also suggests circumstances where it's unwise to expect
normal residuals. For example, you've good hope that a scale made by
summing Likert-scale responses will be something like normally
distributed around its mean; but there's little chance that's true for
a single Likert item.

Which brings us back to age. Subject ages aren't 'generated'; subjects
really are selected from a population with a known, usually
nowhere-near-normal, distribution of ages. Further, the selection is
almost always for a sub-range of the distribution.

It's hard to argue that the resulting distribution should be normal.
Hard enough, that if I saw a normal distribution of ages in a study,
I'd look skeptically at the selection criterion.

Now, an unskewed distribution, that I can readily believe. But I think
it'll usually look much more like uniform than like normal.

I'm interested in your comments, and anybody's, on what age
distributions are common in real studies.