How should I account for multiple comparisons when looking at p values?


How should I account for multiple comparisons when looking at p values?

Martha Hewett
I am comparing 7 groups on multiple dimensions (demographics, attitudes,
actions).  Some comparisons are across all groups and some are among 2 or 3
or 4 of the groups.  They include ANOVAs, chi-squares and t tests.  I know
with so many comparisons some will be significant by chance.  How should I
adjust for this?  Also, should such an adjustment be made within types of
questions (e.g. within demographics, within attitudes, and within actions)
or across all items compared?

Thanks for any help you can provide.


Re: How should I account for multiple comparisons when looking at p values?

Maguin, Eugene
Hi Martha,

I don't claim to be an expert about what follows; others know more. If you
step back a minute and think about your results, I think there are two things
to consider, in addition to the multiple comparisons per se. One is correlation
among results and the other is subset comparisons on the independent
variables, i.e., comparisons involving differing numbers of groups. My only
thought on that is to ignore the subset element and treat them as full-set
comparisons--unless the set of all results includes both subset comparisons
and full-set comparisons of the same DV. To the extent that the DVs are
correlated, the test statistics will be correlated.

The traditional way of controlling for multiple comparisons has been the
Bonferroni adjustment--if the nominal significance threshold is set at .05
and you do 10 tests, reset the threshold to .05/10=.005. Bonferroni is
criticized because power is (much) reduced. An alternative is the false
discovery rate (FDR). I think the procedures have undergone some development
since first being published. The key names are Yoav Benjamini and Yosef
Hochberg; I have also seen a reference to Holm, S. Look at the Wikipedia
article on false discovery rate. It's important to understand that there is
a difference between Bonferroni and FDR in terms of what is being controlled.
The wiki article shows an adjustment to the FDR computation for correlated
tests that I don't recall seeing in the Benjamini et al. articles. The
Benjamini-Hochberg procedure for uncorrelated tests is to rank the test
results from most significant to least significant and compare the k-th one
against k/m times the threshold--with 10 tests at .05, the first against
.05/10=.005, the second against .01, the third against .015, etc. Find the
largest k whose p-value passes its cutoff; that result and all more
significant ones are the "discoveries."
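
To make the two corrections concrete, here is a minimal Python sketch (the
p-values are made up for illustration; statsmodels' multipletests function
offers both corrections ready-made) comparing Bonferroni with the
Benjamini-Hochberg step-up:

import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    # Benjamini-Hochberg step-up FDR procedure, assuming independent tests.
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)                       # rank p-values, smallest first
    cutoffs = alpha * np.arange(1, m + 1) / m   # k/m * alpha for k = 1..m
    passed = p[order] <= cutoffs
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = passed.nonzero()[0].max()           # largest rank meeting its cutoff
        reject[order[:k + 1]] = True            # reject it and everything smaller
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals))                # BH keeps the two smallest
print(np.asarray(pvals) <= 0.05 / len(pvals))   # Bonferroni keeps only the smallest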

I think there was a discussion on the list sometime earlier this year on FDR
and correlated tests and you might find something about it in the archives.

Gene Maguin



Re: How should I account for multiple comparisons when looking at p values?

Rich Ulrich
In reply to this post by Martha Hewett
What kind of statement are you making?
Who are you making it to, and what do they expect?

What is the N?  If you have tens of thousands, a journal like NEJM
will suggest that you ignore all tests, and focus on "effect size",
because your power would be so large. 

On the other hand, with 7 groups, observational data, and a much
smaller N...  you could have a serious problem with power for some
analyses, especially if your group sizes are grossly unequal.  I'll skip
by those concerns.

If the study is exploratory, then it is probably fair to report the straight
p-values -- with suitable warning to the readers.  That is the simplest
case.  Otherwise, corrections for multiple tests are needed for the
important tests.

If you want to actually *conclude*  something, about *hypotheses*,
then you should draw up your small number of hypotheses in advance,
and figure out what variables or composite scores will be able to test
them.  As you describe it, there are dozens of possible hypotheses.
These should be arranged in a hierarchy: these few are *primary*, the
main reason we collected the data, and (if any are to be corrected) these
are the ones tested with correction; these next are also interesting, in an
exploratory mode, and are reported without correction.

When you are looking at a variable that might *bias* the other
tests or comparisons, then it is proper, if not mandatory, to report the
single test with its nominal p-value -- these are warnings, and you don't
want to under-rate a warning.  If a variable plays both roles (being a
possible bias-factor, and being of intrinsic interest), you need to report
both sorts of test-outcome if they are different.  "Sex shows enough
difference between groups that it could be an important biasing factor,
even though the p-value is not significant after it is corrected for the
multiple testing."

Some tests fall "under" the original important tests, so that they may
be regarded as (say) explaining or detailing the reasons for the
significant (or non-significant) results in the primary tests.  You can point
to the original, significant test as an "overall" test on the area, which
then justifies using the nominal test size in followup.

 - Of course, that organization of tests should have been done before
you ever collected the data.  Then, you probably would have done some
things a little differently.  After the fact, you can only try to achieve the
same "fair" state of mind; do not try to draw on what you have seen in
the results, because whatever critics exist will probably catch you at it.

Hope this helps.
--
Rich Ulrich




Re: How should I account for multiple comparisons when looking at p values?

Martha Hewett

Rich - Thanks very much for your input.

The respondent n's for the 7 groups surveyed range from 380 to 569.  The mailout for each group was 800, except for the smallest group, where the maximum possible mailout was 441.  Excluding that group, respondent n's range from 467 to 569.  (We worked very hard to get these high response rates.)

A key question is whether the treatments caused people to take any actions that fall within a broad category of actions.  To examine this we asked about many actions (circa 40 without actually counting them).  These are being tested individually and will probably also be put into groups of about 3 or 4 broad types of actions and tested as composites.

I should explain that there is an objective measure of the impact of the treatments, and that is also being analyzed, but the examination of the self-reported actions attempts to determine why there is or isn't a measurable impact.



Re: How should I account for multiple comparisons when looking at p values?

Rich Ulrich
Okay, here is your comment - certain reported actions will be (probably, you say)
"put into groups of about 3 or 4 broad types of actions and tested as composites."

Both for hypothesis testing and for clarity of presentation, I would create
the composites and test them: they give fairer tests (less multiplicity) and
the important overview.  If you want to correct for multiple testing,
this gives you "3" to correct for, instead of "40".
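
A sketch of what that might look like in Python -- the item names, the
three-way grouping, and the fabricated data are all placeholders for the
real questionnaire structure; scipy's f_oneway does the one-way ANOVA:

import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Fabricated stand-in: 300 respondents, 40 yes/no action items, 7 groups.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(300, 40)),
                  columns=[f"act{i:02d}" for i in range(1, 41)])
df["group"] = rng.integers(1, 8, size=300)

# Assumed assignment of the items to 3 broad composites (placeholder split).
composites = {
    "comp_a": [f"act{i:02d}" for i in range(1, 15)],
    "comp_b": [f"act{i:02d}" for i in range(15, 28)],
    "comp_c": [f"act{i:02d}" for i in range(28, 41)],
}

for name, items in composites.items():
    df[name] = df[items].mean(axis=1)     # average item score per respondent
    samples = [g[name].values for _, g in df.groupby("group")]
    F, p = f_oneway(*samples)             # one test per composite: 3, not 40
    print(f"{name}: F = {F:.2f}, p = {p:.4f}")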

Any time I have the N for it, I factor analyze rating scales that come to me,
even if there are well-known factors.  I want to know that my sample
is similar to the norming samples, even if there is nothing else to be learned.
And I make sure that there are no screw-ups in the data, like reverse-scored
items that I hadn't been warned of, or inexplicable non-loadings.

In any case, going back to my start on rating scales in 1970, I don't remember
when I have *ever* analyzed a 40-item scale without having subscales
or factors of some kind that I included in tests.  If there were only 10 items,
maybe I could be satisfied to eyeball them and not do a factoring; but I
don't know why I should break what seems to be a good habit.
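
For what it's worth, a bare-bones version of that habit in Python might look
like the following -- fabricated data and an unrotated solution from
scikit-learn's FactorAnalysis, purely as a sketch; a real check would use the
actual items and likely a rotated solution:

import numpy as np
from sklearn.decomposition import FactorAnalysis

# Fabricated stand-in for 500 respondents x 10 rating items.
rng = np.random.default_rng(1)
items = rng.normal(size=(500, 10))

fa = FactorAnalysis(n_components=2, random_state=0).fit(items)
loadings = fa.components_.T               # items x factors

# Eyeball the pattern: an item loading with the opposite sign from the rest
# of its scale is a candidate reverse-scored item; an item that loads on
# nothing is one of those "inexplicable non-loadings" worth a second look.
for j, row in enumerate(loadings, start=1):
    print(f"item {j:2d}: " + "  ".join(f"{v:+.2f}" for v in row))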


"Yes (you may be able to say), THIS composite shows a strong difference
among groups.  And the elements that contribute most were such-and-so,"
with some detail on separate effect sizes and p-values.  If the broad
category does or does not show any difference, that should carry weight on
how readily you generalize to the items.

 - In recent years, I moved to using "average item score" for my composites
of commensurate items, because that allows everyone to point to the original
anchor labels.  For other composites, I finally started setting the result to
mean=50, SD=10 (at baseline, if there are several periods).  That made it
easy to spot outlying groups and individuals, and to do it without writing
scores with decimals.
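
As a sketch of that scoring convention (the helper and its baseline_mask
argument are hypothetical; the mask would flag the baseline period's rows
when there are several periods):

import numpy as np

def tscore(x, baseline_mask=None):
    # Rescale a composite to mean 50, SD 10, anchored at baseline if given.
    x = np.asarray(x, dtype=float)
    ref = x[baseline_mask] if baseline_mask is not None else x
    return 50 + 10 * (x - ref.mean()) / ref.std(ddof=1)

raw = np.array([2.1, 3.4, 2.8, 4.0, 1.9, 3.1])
print(np.round(tscore(raw)).astype(int))   # whole-number scores, no decimals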

In the same vein of improving power for tests, I wonder: are all 7 groups
of equal importance?  Would a couple of the groups provide a clearer
*test*  of whatever precise dynamics you are exploring?  Or could you
categorize the 7 groups so that there are only two or three categories
of groups?  There is a very good reason that experimental designs are
so very often limited to two-group comparisons: as the number of groups
grows, the ability to detect differences (as "significant") becomes so
much poorer.  (And anything more than 5 SD away from the mean is unlikely
to belong in any analysis, for one reason or another.)

--
Rich Ulrich


