I am comparing 7 groups on multiple dimensions (demographics, attitudes, actions). Some comparisons are across all groups and some are among 2 or 3 or 4 of the groups. They include ANOVAs, chi-squares and t tests. I know with so many comparisons some will be significant by chance. How should I adjust for this? Also, should such an adjustment be made within types of questions (e.g. within demographics, within attitudes, and within actions) or across all items compared?

Thanks for any help you can provide.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command INFO REFCARD
Hi Martha,
I don't claim to be an expert about what follows; others know more. If you step back a minute and think about your results, I think there are two things to consider in addition to the multiple comparisons per se. One is correlation among results; the other is subset comparisons on the independent variable, i.e., comparisons involving differing numbers of groups. My only thought on the latter is to ignore the subset element and treat them as full-set comparisons, unless the set of all results includes both subset comparisons and full-set comparisons of the same DV. To the extent that the DVs are correlated, the test statistics will be correlated.

The traditional way of controlling for multiple comparisons has been the Bonferroni adjustment: if the nominal significance threshold is set at .05 and you do 10 tests, reset the threshold to .05/10=.005. Bonferroni is criticized because power is (much) reduced. An alternative is the false discovery rate (FDR). I think the procedures have undergone some development since first being published. The key names are Yoav Benjamini and Yosef Hochberg. I have also seen a reference to Holm, S., whose sequential procedure, like Bonferroni, controls the familywise error rate rather than the FDR. Look at the Wikipedia article on false discovery rate.

It's important to understand that there is a difference between Bonferroni and FDR in terms of what is being controlled for. Bonferroni controls the chance of making even one false rejection across the whole set of tests, while FDR controls the expected proportion of false rejections among the tests declared significant. The wiki article shows an adjustment to the FDR computation for correlated tests that I don't recall seeing in the Benjamini et al. articles.

The FDR procedure for uncorrelated tests is to rank the m p-values from smallest to largest and compare the k-th smallest to (k/m) x .05 (the assumed threshold): with 10 tests, the smallest is compared to .005, the second to .010, the third to .015, and so on up to .05 for the largest. Find the largest p-value that passes its own threshold; that test and every test ranked ahead of it are declared significant. I think there was a discussion on the list sometime earlier this year on FDR and correlated tests and you might find something about it in the archives.
Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Martha Hewett
Sent: Sunday, August 14, 2011 12:52 PM
To: [hidden email]
Subject: How should I account for multiple comparisons when looking at p values?
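For anyone who wants to try the two adjustments named in the reply above, here is a minimal Python sketch of Bonferroni and of the Benjamini-Hochberg FDR procedure as it is usually stated (compare the k-th smallest of m p-values to (k/m) times the threshold). The function names and the p-values are made up for illustration, and the FDR version assumes independent tests; it is not the correlated-tests adjustment mentioned in the Wikipedia article.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each test whose p-value is at or below alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure (assumes independent tests)."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * q.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_k = rank
    # Reject the largest_k smallest p-values, reported in the original order.
    reject = [False] * m
    for idx in order[:largest_k]:
        reject[idx] = True
    return reject


# Hypothetical p-values from 10 tests (invented, not from any real study).
p_vals = [0.001, 0.004, 0.012, 0.019, 0.030, 0.041, 0.200, 0.350, 0.600, 0.900]
print("Bonferroni rejections:", sum(bonferroni_reject(p_vals)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg_reject(p_vals)))
```

With these invented values Bonferroni keeps only the two results below .005, while the FDR procedure keeps four, which is the power difference the reply refers to.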
In reply to this post by Martha Hewett
What kind of statement are you making? Who are you making it to, and what do they expect?

What is the N? If you have tens of thousands, a journal like NEJM will suggest that you ignore all tests, and focus on "effect size", because your power would be so large. On the other hand, with 7 groups, observational data, and a much smaller N... you could have a serious problem with power for some analyses, especially if your group sizes are grossly unequal. I'll skip by those concerns.

If the study is exploratory, then it is probably fair to report the straight p-values -- with suitable warning to the readers. That is the simplest case.

Otherwise, corrections for multiple tests are needed for the important tests. If you want to actually *conclude* something, about *hypotheses*, then you should draw up your small number of hypotheses in advance, and figure out what variables or composite scores will be able to test them. As you describe it, there are dozens of possible hypotheses.

These should be arranged in a hierarchy: These few are *primary*, the main reason we collected the data; and (if any will be), these will be tested with correction; these next are also interesting in an exploratory mode, and are reported without correction.

When you are looking at a variable that might *bias* the other tests or comparisons, then it is proper, if not mandatory, to report the single test with its nominal p-value -- These are warnings, and you don't want to under-rate a warning. If a variable plays both roles (being a possible bias-factor, and being of intrinsic interest), you need to report both sorts of test-outcome if they are different. "Sex shows enough difference between groups that it could be an important biasing factor, even though the p-value is not significant after it is corrected for the multiple testing."

Some tests fall "under" the original important tests, so that they may be regarded as (say) explaining or detailing the reasons for the significant (or non-significant) results in the primary tests. You can point to the original, significant test as an "overall" test on the area, which then justifies using the nominal test size in followup.

- Of course, that organization of tests should have been done before you ever collected the data. Then, you probably would have done some things a little differently. After the fact, you can only try to achieve the same "fair" state of mind; do not try to draw on what you have seen in the results, because whatever critics exist will probably catch you at it.

Hope this helps.

-- Rich Ulrich
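The hierarchy described in this reply (a few primary tests corrected, exploratory ones reported at nominal p) could be sketched in Python roughly as follows. The reply does not say which correction to use; Holm's step-down method is swapped in here purely as an assumption, and the test names and p-values are invented.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare the k-th smallest p-value to alpha / (m - k + 1)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        # step is 0-based, so the divisor runs m, m-1, ..., 1.
        if p_values[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # once one test fails, every larger p-value fails too
    return reject


# Hypothetical results; names and p-values are invented for illustration.
primary = {"composite_A": 0.004, "composite_B": 0.030, "composite_C": 0.026}
exploratory = {"item_07": 0.012, "item_19": 0.048, "item_33": 0.300}

names = list(primary)
decisions = holm_reject([primary[n] for n in names])
for name, rejected in zip(names, decisions):
    print(f"{name}: p={primary[name]:.3f} (primary), significant after Holm: {rejected}")
for name, p in exploratory.items():
    print(f"{name}: p={p:.3f} (exploratory), nominal p reported without correction")
```

Keeping the corrected family small is the point: only the three primary composites share the .05 here, and the exploratory items are labelled as such rather than tested against an adjusted threshold.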
Rich - Thanks very much for your input. The respondent n's for the 7 groups surveyed range from 380 to 569. Mailout for each group was 800, except for the smallest group, where the max possible mailout was 441. Excluding that group, respondent n's range from 467 to 569. (We worked very hard to get these high response rates.) A key question is whether the treatments caused people to take any actions that fall within a broad category of actions. To examine this we asked about many actions (circa 40 without actually counting them). These are being tested individually and will probably also be put into groups of about 3 or 4 broad types of actions and tested as composites. I should explain that there is an objective measure of the impact of the treatments, and that is also being analyzed, but the examination of the self-reported actions attempts to determine why there is or isn't a measurable impact.
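To put a rough number on "some will be significant by chance" for the roughly 40 action items, here is a small Python calculation. It assumes a .05 threshold, that every null hypothesis is true, and (for the second figure) that the 40 tests are independent, which self-reported actions from the same respondents almost certainly are not; treat it as a back-of-the-envelope sketch only.

```python
alpha = 0.05
n_tests = 40  # "circa 40" action items, per the post

# Expected number of false positives if every null hypothesis were true.
expected_false_positives = n_tests * alpha

# Chance of at least one false positive, ASSUMING the tests are independent.
prob_at_least_one = 1 - (1 - alpha) ** n_tests

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one false positive): {prob_at_least_one:.3f}")
```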
Okay, here is your comment - certain reported actions will be (probably, you say) "put into groups of about 3 or 4 broad types of actions and tested as composites."

Both for hypothesis testing and for clarity of presentation, I would create the composites and test them for fairer tests (less multiplicity) and the important overview they provide. If you want to correct for multiple testing, this gives you "3" to correct for, instead of "40".

Any time I have the N for it, I factor analyze rating scales that come to me, even if there are well known factors. I want to know that my sample is similar to the norming samples, even if there is nothing else to be known. And I make sure that there are no screw-ups in the data, like reverse-scored items that I hadn't been warned of, or inexplicable non-loadings. In any case, going back to my start on rating scales in 1970, I don't remember when I have *ever* analyzed a 40 item scale without having subscales or factors of some kind that I included in tests. If there were only 10 items, maybe I could be satisfied to eyeball them, and not do a factoring; but I don't know why I should break what seems to be a good habit.

"Yes (you may be able to say), THIS composite shows a strong difference among groups. And the elements that contribute most were such-and-so," with some detail on separate effect sizes and p-values. If the broad category does or does not show any difference, that should carry weight on how readily you generalize to the items.

- In recent years, I moved to using "average item score" for my composites of commensurate items, because that allows everyone to point to the original anchor labels. For other composites, I finally started setting the result to mean=50, SD=10 (at baseline, if there are several periods). That made it easy to spot outlying groups and individuals, and to do it without writing scores with decimals.

In the same vein of improving power for tests, I wonder: are all 7 groups of equal importance? Would a couple of the groups provide a clearer *test* of whatever precise dynamics you are exploring? Or could you categorize the 7 groups so that there are only two or three categories of groups? There is a very good reason that experimental designs are so very often limited to two-group comparisons: with more groups, the ability to detect differences (as "significant") becomes so much poorer.

(Anything more than 5 SD away from the mean is unlikely to belong in any analysis, for one reason or another.)

-- Rich Ulrich

Date: Sun, 14 Aug 2011 15:10:41 -0500
From: [hidden email]
Subject: Re: How should I account for multiple comparisons when looking at p values?
To: [hidden email]
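The two composite scorings mentioned in this reply, "average item score" and rescaling to mean=50, SD=10, might look like this in Python. The ratings are invented, the function names are hypothetical, and using the sample SD of the whole set (rather than a baseline-only SD) is an assumption made to keep the sketch short.

```python
from statistics import mean, stdev


def average_item_score(item_responses):
    """Mean of one respondent's item scores, so the composite stays on the item scale."""
    return mean(item_responses)


def to_t_scores(scores):
    """Rescale scores to mean 50, SD 10 (sample SD used here as an assumption)."""
    m, s = mean(scores), stdev(scores)
    return [50 + 10 * (x - m) / s for x in scores]


# Hypothetical 1-5 ratings on a 4-item composite for five respondents.
respondents = [
    [1, 2, 2, 1],
    [3, 3, 4, 2],
    [4, 5, 4, 5],
    [2, 2, 3, 3],
    [5, 4, 5, 5],
]
composites = [average_item_score(r) for r in respondents]
t_scores = to_t_scores(composites)
for avg, t in zip(composites, t_scores):
    # Rounding follows the idea of reporting rescaled scores without decimals.
    print(f"average item score {avg:.2f} -> rescaled {round(t)}")
```

The average item score can still be read against the original anchor labels, while the rescaled version makes a group sitting near 30 or 70 stand out at a glance.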