Multiple testing

Multiple testing

Katharina Reinecke
Hi all,

I have a question about when to correct for multiple testing. The scenario seems to be a very common one but we have been struggling to figure out the best way to analyze it, so any help would be greatly appreciated.

We have conducted a usability test in which 40 participants were asked to fulfill three tasks with two different software versions (the main experimental factor) in a repeated measures design.

The hypothesis was that one of the software versions would be more usable, and this was tested by collecting 3 objective measures (time, number of clicks, number of errors).

For each of the three objective measures and for each of the three tasks we then ran a one-tailed Wilcoxon signed-rank test for paired samples (because the data were not normally distributed) — nine tests in all.

Question: Do we have to correct for multiple testing in this setting, because the data was collected from the same participants?

Many thanks in advance,
Katharina

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Re: Multiple testing

Bruce Weaver
Katharina Reinecke wrote:
> [original question quoted above; snipped]
I don't think there is a single correct answer to your question.  For example, where does your study fall on the "exploratory to confirmatory" spectrum?  The closer you are to the confirmatory end, the greater the need to correct for multiple tests, IMO.  On the other hand, I know of at least one article that argues no correction needs to be applied in purely exploratory studies:

   http://plog.yejh.tc.edu.tw/gallery/53/%E5%88%A4%E6%96%B7%E5%A4%9A%E5%85%83%E8%A9%95%E9%87%8F.pdf

The importance of correcting for multiple tests is also related to the number of tests, obviously.  If you have only 3, it's not as big an issue as if you have dozens or hundreds.  In the latter case, it's possible to draw some pretty outlandish conclusions.  E.g.,

  http://prefrontal.org/files/posters/Bennett-Salmon-2009.pdf
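Bruce's point about the number of tests can be made concrete with the usual familywise error rate calculation: under independence, the probability of at least one false positive across k tests at per-test alpha is 1 - (1 - alpha)^k. A quick sketch in Python (an illustration, not from the thread):

```python
# Familywise error rate (FWER) for k independent tests at a given
# per-test alpha.  Shows why nine tests (3 measures x 3 tasks) already
# inflate the chance of at least one false positive well beyond 0.05.

def fwer(k, alpha=0.05):
    """P(at least one Type I error) across k independent tests."""
    return 1 - (1 - alpha) ** k

for k in (1, 3, 9, 100):
    print(f"k={k:3d}  FWER={fwer(k):.3f}")
# k=  1  FWER=0.050
# k=  3  FWER=0.143
# k=  9  FWER=0.370
# k=100  FWER=0.994
```

With the nine tests in Katharina's design, the chance of at least one spurious "significant" result would be roughly 37% if the tests were independent; correlation among the measures pulls this down somewhat, but it stays well above 5%.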

HTH.
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: Multiple testing

Maguin, Eugene
I thought of a couple of questions when I read this current thread. If
Katharina elects to correct for multiple tests, I'd guess that she'd use a
Bonferroni correction. Her data are repeated measures and I'd expect, a
priori, that the measures are correlated. And, when samples are drawn
repeatedly from a population and analyzed, the test statistics would also be
correlated. Given correlated data, is a Bonferroni correction correct, in
the sense of preserving a specific overall p value? I'd think not but maybe
I'm wrong. If I'm not, however, what would be the correct correction?

Gene Maguin


-----Original Message-----
From: Bruce Weaver
Sent: Thursday, August 26, 2010 9:39 AM
Subject: Re: Multiple testing

> [Bruce's reply quoted above; snipped]

Re: Multiple testing

Bruce Weaver
Gene Maguin wrote:
> [message quoted above; snipped]
Hi Gene.  Here are a couple of excerpts from the Bender & Lange article.

"Bonferroni corrections should only be used in cases where the number of tests is quite small (say, less than 5) and the correlations among the test statistics are quite low." (p. 345)

And a longer one, from pages 345-346:

--- start of excerpt ---
The case of multiple endpoints is one of the most common multiplicity problems in clinical trials [29,30]. There are several possible strategies to deal with multiple endpoints. The simplest approach, which should always be considered first, is to specify a single primary endpoint. This approach makes adjustments for multiple endpoints unnecessary. However, all other endpoints are then subsidiary and results concerning secondary endpoints can only have an exploratory rather than a confirmatory interpretation. The second possibility is to combine the outcomes in one aggregated endpoint (e.g., a summary score for quality of life data or the time to the first event in the case of survival data). The approach is adequate only if one is not interested in the results of the individual endpoints. Thirdly, for significance testing multivariate methods [e.g., multivariate analysis of variance (MANOVA) or Hotelling’s T^2 test] and global test statistics developed by O’Brien [31] and extended by Pocock et al. [32] can be used. Exact tests suitable for a large number of endpoints and small sample size have been developed by Läuter [33]. All these methods provide an overall assessment of effects in terms of statistical significance but offer no estimate of the magnitude of the effects. Again, information about the effects concerning the individual endpoints is lacking. In addition, Hotelling’s T^2 test lacks power since it tests for unstructured alternative hypotheses, when in fact one is really interested in evidence from several outcomes pointing in the same direction [34]. Hence, in the case of several equally important endpoints for which individual results are of interest, multiple test adjustments are required, either alone or in combination with previously mentioned approaches. Possible methods to adjust for multiple testing in the case of multiple endpoints are given by the general adjustment methods based upon P values [35] and the resampling methods [22] introduced above. It is also possible to allocate different type 1 error rates to several not equally important endpoints [36,37].
--- end of excerpt ---

Cheers,
Bruce
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: Multiple testing

Mike
Bruce's points below, especially about using Bonferroni corrections only
when one is doing few tests and the correlations among the measures are
low, are good advice.  As the number of tests increases and/or the
correlations increase, the Bonferroni correction becomes too conservative.
There are procedures that use the information about correlations among
measures, though I don't have a reference immediately at hand.  However,
the SISA software website does provide a web app to calculate the
corrected per-comparison alpha; see:
http://www.quantitativeskills.com/sisa/calculations/bonfer.htm
The output provides both Bonferroni and Sidak corrections for the case of
r=0.00 (independent measures) and for a user-specified correlation (i.e.,
the mean correlation among measures).  SISA provides background on this
page but does not give details on the calculations; see:
http://www.quantitativeskills.com/sisa/calculations/bonhlp.htm
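For the independent-tests case, the two corrections that SISA reports are easy to compute directly; here is a minimal sketch in Python (the correlation-adjusted variant SISA also offers is not reproduced here, since the site does not document its formula):

```python
# Per-comparison alpha under Bonferroni and Sidak corrections,
# for k tests at familywise level alpha (independent-tests case).

def bonferroni_alpha(k, alpha=0.05):
    """Bonferroni: simply divide alpha by the number of tests."""
    return alpha / k

def sidak_alpha(k, alpha=0.05):
    """Sidak: exact under independence; solves 1 - (1 - a)^k = alpha."""
    return 1 - (1 - alpha) ** (1 / k)

k = 9  # 3 measures x 3 tasks, as in the original question
print(f"Bonferroni: {bonferroni_alpha(k):.5f}")  # 0.00556
print(f"Sidak:      {sidak_alpha(k):.5f}")       # 0.00568
```

The Sidak cutoff is always slightly less conservative than Bonferroni; with correlated tests both are more conservative than necessary, which is Gene's point.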

For those with more technical statistical background: the situation we're
discussing comes up often in genetics research, and one article that
compares different adjustment procedures is available on the PubMed
website; see:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2276357/

The reference is:
Conneely, K.N., & Boehnke, M. (2007). So many correlated tests, so little
time! Rapid adjustment of p values for multiple correlated tests.
American Journal of Human Genetics, 81(6), 1158-1168.

The article points out that in genetics, permutation tests are often used
for correlated tests (the Bonferroni and Sidak adjustments are too
conservative for most situations), and the authors provide a new method
for calculating an adjusted p value (p_act) that takes less time to
compute than the permutation approach.

For people proficient in R, the authors provide their R code for
computing p_act on their web site:
http://csg.sph.umich.edu/boehnke/p_act.php
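The permutation idea can be sketched generically. The following is a single-step max-statistic (Westfall-Young style) adjustment on made-up paired data, written in plain Python; it only illustrates the general approach, not the authors' p_act method, and the data, function name, and settings are all invented for the example:

```python
import random

def maxT_adjusted_pvalues(diffs, n_perm=2000, seed=1):
    """Single-step max-statistic permutation adjustment for m correlated
    paired tests.  diffs[i] is subject i's vector of m paired differences.
    Signs are flipped per subject, jointly across all m tests, which is
    what preserves the correlation structure among the tests."""
    rng = random.Random(seed)
    n, m = len(diffs), len(diffs[0])
    # Observed statistic per test: |mean paired difference|
    obs = [abs(sum(d[j] for d in diffs) / n) for j in range(m)]
    exceed = [0] * m
    for _ in range(n_perm):
        signs = [rng.choice((-1, 1)) for _ in range(n)]
        stats = [abs(sum(s * d[j] for s, d in zip(signs, diffs)) / n)
                 for j in range(m)]
        max_stat = max(stats)  # adjust against the max across tests
        for j in range(m):
            if max_stat >= obs[j]:
                exceed[j] += 1
    return [e / n_perm for e in exceed]

# Toy example: 40 subjects, 3 correlated measures; measure 0 has a real effect
rng = random.Random(0)
diffs = []
for _ in range(40):
    base = rng.gauss(0, 1)                       # shared component -> correlation
    diffs.append([0.8 + base + rng.gauss(0, 1),  # true effect
                  base + rng.gauss(0, 1),        # null
                  base + rng.gauss(0, 1)])       # null
p_adj = maxT_adjusted_pvalues(diffs)
print([round(p, 3) for p in p_adj])
```

Replacing the permutation loop with a much faster analytic approximation is, roughly, what p_act offers; because the sign-flips respect the correlation among tests, this adjustment is less conservative than Bonferroni or Sidak when the tests are correlated.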

-Mike Palij
New York University
[hidden email]


----- Original Message -----
From: "Bruce Weaver"
Sent: Thursday, August 26, 2010 11:58 AM
Subject: Re: Multiple testing

> [Bruce's reply quoted above; snipped]