Dear all,
I have used a chi-squared test and a chi-squared test for trend to analyse data with ordered categories (e.g. success/failure in an exam against age, partitioned into groups from youngest to oldest). Success appears to decrease with increased age.

My question concerns the difference between the chi-squared and the chi-squared trend values. If the difference, which I understand to be a chi-squared value itself, proves to be significant, what kind of things can I deduce? Do I deduce that the observed variation is not due to a linear trend alone, and that a non-linear relationship exists? Do I deduce that other factors are influencing the outcome of success? What other kinds of analysis should I be carrying out to investigate further?

Any advice would be appreciated, just to point me in the right direction. I realise this is a bit of a vague question, but I'm just trying to learn about this stuff and never really have anyone on hand to ask.

Thanks in advance,

Lou
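The decomposition Lou is describing (overall chi-square = 1-df linear trend component + (k-2)-df departure-from-trend component) can be sketched in Python with scipy. The counts below are invented purely for illustration; the trend statistic here is the Cochran-Armitage form, which is what most packages report as the "chi-squared test for trend":

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

def trend_partition(successes, totals, scores=None):
    """Partition a 2xk chi-square into a 1-df linear trend component
    (Cochran-Armitage) and a (k-2)-df departure-from-trend component."""
    s = np.asarray(successes, dtype=float)
    n = np.asarray(totals, dtype=float)
    x = np.arange(len(s), dtype=float) if scores is None else np.asarray(scores, float)
    N, S = n.sum(), s.sum()
    p = S / N
    # Trend statistic: squared standardized score-weighted success count
    T = (x * s).sum()
    ET = p * (x * n).sum()
    VT = p * (1 - p) * ((n * x**2).sum() - (x * n).sum()**2 / N)
    chi_trend = (T - ET) ** 2 / VT
    # Overall Pearson chi-square on the full 2xk table
    table = np.vstack([s, n - s])
    chi_total = chi2_contingency(table, correction=False)[0]
    chi_dep = chi_total - chi_trend     # the "difference" Lou asks about
    k = len(s)
    return {
        "overall":   (chi_total, k - 1, chi2.sf(chi_total, k - 1)),
        "trend":     (chi_trend, 1,     chi2.sf(chi_trend, 1)),
        "departure": (chi_dep,   k - 2, chi2.sf(chi_dep,   k - 2)),
    }

# Invented counts: uptake out of 100 in each of five age bands, youngest first
res = trend_partition(successes=[90, 80, 70, 55, 40],
                      totals=[100, 100, 100, 100, 100])
for name, (stat, df, pval) in res.items():
    print(f"{name:9s} chi2 = {stat:7.3f}, df = {df}, p = {pval:.4g}")
```

A significant departure component suggests the proportions vary across categories in a way a straight-line trend does not capture; a large trend component with a non-significant departure (as in these made-up counts) is consistent with a simple linear decline.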
Lou:
I would strongly advise against using a chi-square test on ordered categories when you have a (presumably) continuous, hopefully normally distributed variable such as age which is hypothesized to correlate with a dichotomous variable.

First test for skewness in the age distribution, and fix it with a square-root or natural-log transformation if necessary.

In any event, simply use the much more powerful t-test to test the difference in mean age between the success and failure groups. Using chi-square seems unnecessarily complicated, with an actual loss of power: the number of categories into which age is partitioned is arbitrary, too few categories can get you in trouble with the relationship, and too many can get you in trouble with power.

Joe Burleson
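A minimal sketch of the procedure Joe suggests (skewness check first, then a t-test on the raw ages), using scipy; the ages below are simulated and the group sizes and means are invented for illustration:

```python
import numpy as np
from scipy.stats import skew, ttest_ind

rng = np.random.default_rng(0)

# Invented raw ages: non-uptakers run about four years older on average
age_success = rng.normal(52, 8, size=300)   # took the screening test
age_failure = rng.normal(56, 8, size=200)   # did not

# Joe's first step: check skewness before trusting the t-test.
# A square-root or log transform would only be warranted if these
# values were far from zero.
print("skewness:", skew(age_success), skew(age_failure))

# The t-test itself: compares mean age between the two outcome groups
t, p = ttest_ind(age_success, age_failure)
print(f"t = {t:.2f}, p = {p:.4g}")
```

Note that this reverses the roles of the variables relative to the chi-square analysis: age becomes the outcome being compared across the two groups, rather than the predictor.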
Hi Joe,
That's some interesting advice there. The problem is that I am writing a report in which the readers, including my bosses, want me to determine whether or not uptake of a cancer screening test decreases with increased age. Simple plots suggest this to be the case. The ages had already been partitioned into 5 groups before I started working on this data, but I do also have the original ages in my data set.

Your suggestion of testing the mean ages in the success and failure groups would be informative, but I can't see that it would answer the question I have stated above. Could you elaborate a bit more? All in all, are you suggesting that I would be advised to work only with the raw ages and not the categorised ages?

The main problem I have is that my bosses give me no time to look into the techniques properly and insist I do certain things (even though they have very limited stats knowledge). It's all very frustrating!

Thanks for your help,

Lou
Another amicus remark - I seem to be making a lot, lately -
At 10:41 AM 4/4/2007, Burleson,Joseph A. wrote:

>[...] when you have a (presumably) parametric, hopefully normally
>distributed variable such as age [...]

If my principles and the gullibility of others permitted, I would get very rich betting against the distribution of ages being normally distributed, in any study.

The big reason for expecting normal distributions to be common is the central limit theorem. Its exact hypotheses will rarely be satisfied with real data. However, if a quantity may reasonably be modeled as the sum of a number of independent random quantities of similar variance, it's reasonable to take it as approximately normal. However ages are generated, this isn't how.
Richard:
The last 6 large (n = 400 to 6,000) clinical trials I analyzed all had age perfectly normally distributed (i.e., skewness between -.20 and +.20). On the other hand, I, too, have seen age not be normal (e.g., Poisson distributions, U-shaped distributions, etc.). One cannot assume that it is one way or the other for no specific reason.

Sorry to be so nit-picky, but the Central Limit Theorem has nothing at all to do with whether a population OR a single sample is normal or non-normal. The CLT has to do with "sampling" distributions. Repeated sampling, at a given sample size n, from a population (continuous-variable) distribution that has a mean and an SD, whether that distribution is normal or not, will produce a set of means which is normally distributed (Pagano & Gauvreau, 2000; Hays, 1973). The CLT is almost always borne out when the n of each repeated sample is >= 40 and at least 50 repeated samples are taken.

I am looking for the place where these bets are made! I accept cash, checks, VISA or future promises.

Joe Burleson
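Joe's description of the CLT can be checked with a quick simulation; the choice of population (exponential, hence strongly right-skewed), sample size, and number of repeated samples below are arbitrary illustrative choices:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)

# A deliberately non-normal "population": exponential, strongly right-skewed
population = rng.exponential(scale=10.0, size=100_000)

# Joe's recipe: repeatedly draw samples of n = 40 and keep each sample mean
sample_means = np.array(
    [rng.choice(population, size=40).mean() for _ in range(1000)]
)

# The population is heavily skewed, but the distribution of the sample
# means is close to symmetric, as the CLT predicts
print("population skewness: ", skew(population))
print("sample-mean skewness:", skew(sample_means))
```

The residual skewness in the means shrinks roughly as 1/sqrt(n), so larger samples give a sampling distribution that is closer still to normal.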
At 03:18 PM 4/4/2007, Burleson,Joseph A. wrote:
>The last 6 large (n = 400 to 6,000) clinical trials I analyzed all had
>age perfectly normally distributed (i.e., skewness between -.20 and
>+.20).

Well... The skewness measure may not be conclusive. The skewness is zero for any symmetric distribution. That includes uniform distributions, or choosing each of two values with probability 0.5, or any number of easy-to-construct long-tailed distributions.

I've no idea of the design of the studies you were on, but most clinical trials select an age range, explicitly or by implication. The population pyramid being fairly flat over much of its range, that tends toward a uniform, or nearly uniform, age distribution. If it's a very wide age range in an adult population, you'll probably see some upward skewing. So, I might collect after all. Did you run a Kolmogorov-Smirnov, or other specific, test for normality?

(I might add that an approximately uniform age distribution will be just fine for analysis, and there was no need to go beyond the skewness check for your purposes. The worst problem would be age outliers; people near the end of the observed age range have very different medical problems, of course. But the selection procedures surely excluded those.)

>I, too, have seen age not be normal (e.g., Poisson distributions,
>U-shaped distributions, etc.). One cannot assume that it is one way or
>the other for no specific reason.

No. On the other hand, the age distribution, whatever it is, is usually that way because of a selection criterion applied to an overall population pyramid, and a clear grasp of the explicit or implicit selection rule is crucial. (Stories: like a study of at-risk (premature) neonates that showed a strong negative correlation between birth weight and gestational age at birth.)

>Sorry to be so nit-picky, but the Central Limit Theorem has nothing at
>all to do with whether a population OR a single sample is normal or
>non-normal.

Actually, I've often seen it argued that it does. Admittedly the argument is a little hand-wavy, as it deals with effects that can only be hypothesized to exist.

>The CLT has to do with "sampling" distributions.

First, no; the CLT has to do with distributions of the sums (or means) of random variables; sampling distributions are one instance.

Now, bear with me; I'm taking a point of view standard among probability theorists, but one that often seems strange to statisticians: the observations are not selected from a 'population', considered as a finite, potentially identifiable set of subjects, but are drawn, generated, according to distribution and dependency rules.

Consider residuals, then: 'random variation' added to an underlying value that we actually want. (This is the standard premise of linear models.) Why would we remotely expect these to be normally distributed? Here's the hand-wavy part: if there are actually many unobserved factors whose effects add to form the residuals, they are statistically independent, and their variances are comparable ("uniformly bounded" is the correct notion), then the hypotheses of the CLT apply, and we may with some justice expect approximately normal residuals.

This model, of residuals that are the sum of many small random effects, suggests a likely problem: what if they aren't all of comparable size? Indeed, one of the more common observed deviations from normal residuals is long 'tails': probability of very large residuals much greater than given by the normal distribution. That is what you get if you have one, or a few, influences that occur rarely but have high variance when they do occur.

This model also suggests circumstances where it's unwise to expect normal residuals. For example, you've good hope that a scale made by summing Likert-scale responses will be something like normally distributed around its mean; but there's little chance that's true for a single Likert scale.

Which brings us back to age. Subject ages aren't 'generated'; subjects really are selected from a population with a known, usually nowhere-near-normal, distribution of ages. Further, the selection is almost always for a sub-range of the distribution. It's hard to argue that the resulting distribution should be normal. Hard enough that if I saw a normal distribution of ages in a study, I'd look skeptically at the selection criterion. Now, an unskewed distribution, that I can readily believe. But I think it'll usually look much more like uniform than like normal.

I'm interested in your comments, and anybody's, on what age distributions are common in real studies.
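Richard's two points here (zero skewness does not imply normality, and a Kolmogorov-Smirnov test can tell the difference) can be illustrated with a simulated uniform age distribution; the eligibility range and sample size below are invented:

```python
import numpy as np
from scipy.stats import skew, kstest

rng = np.random.default_rng(1)

# Hypothetical trial: eligibility restricted to ages 50-70, and the
# population pyramid is flat over that range, so ages come out uniform
ages = rng.uniform(50, 70, size=2000)

# Skewness is near zero: the distribution is symmetric...
print("skewness:", skew(ages))

# ...but a K-S test against a normal fitted to the sample's own mean
# and SD still rejects normality. (Caveat: estimating the parameters
# from the same sample makes this p-value conservative; a Lilliefors
# correction would be stricter still.)
stat, p = kstest(ages, "norm", args=(ages.mean(), ages.std()))
print(f"D = {stat:.3f}, p = {p:.3g}")
```

So a skewness check alone would pass this sample, exactly as Richard suggests, while a specific test for normality would not.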