validity and screening of a large number of bivar correlations?


validity and screening of a large number of bivar correlations?

Ian Martin-2
I often have to produce a table of Pearson correlations between a
suite of environmental impact variables (water chemistry, etc.) and a
suite of biological monitoring variables (number of species, number
of organisms, etc.).

Typically, even with some data reduction (PCA or similar), this
results in several hundred correlations, even when I reduce the
matrix to only those correlations between the 2 suites of variables.
I realize each correlation is independent, but potentially there are
a worrisome number of falsely significant correlations flagged.

As a screening tool to identify associations of interest, I have
sometimes used a pseudo-Bonferroni correction to adjust the
significance level according to the number of bivariate correlations
in the table, and as well I usually spend quite a bit of time
generating scatterplots of possible associations.  However, I'm
wondering if this approach to screening the correlations is
defensible even in a pragmatic -- if not statistical -- sense, or
whether there is a better way to consider a large number of possible
associations between these variables?
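
To put rough numbers on the concern above, here is a minimal sketch of the arithmetic. The table size (m = 300) and the nominal level (alpha = 0.05) are assumed purely for illustration and are not taken from the post, and the family-wise figure assumes independent tests.

```python
# Illustrative arithmetic only: m and alpha are assumed values, not real data.
m = 300          # hypothetical number of bivariate correlations in the table
alpha = 0.05     # nominal per-test significance level

# Expected number of falsely significant results if every null hypothesis is true.
expected_false_positives = m * alpha

# Chance of at least one false positive, assuming the tests are independent.
familywise_error = 1 - (1 - alpha) ** m

# Bonferroni: test each correlation at alpha / m so that the
# family-wise error rate stays at or below alpha.
bonferroni_alpha = alpha / m

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one false positive): {familywise_error:.6f}")
print(f"Bonferroni per-test alpha: {bonferroni_alpha:.6f}")
```

With several hundred tests the Bonferroni threshold becomes very strict, which is the usual motivation for looking at less conservative screening criteria.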

I'd appreciate any thoughts or suggestions.

regards,
Ian

Ian D. Martin, Ph.D.

Tsuji Laboratory
University of Waterloo
Dept. of Environment & Resource Studies

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: validity and screening of a large number of bivar correlations?

SR Millis-3
You should consider using the false discovery rate method of Benjamini & Hochberg (1995) or the q-value approach developed by Storey (2002).
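
For readers who want to see the mechanics, the following is a minimal sketch of the Benjamini-Hochberg step-up adjustment in plain Python. The function name `benjamini_hochberg`, the target rate `q`, and the p-values in `p_demo` are all made up for illustration; this is not the SPSS program mentioned elsewhere in the thread.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return (BH-adjusted p-values, rejection flags) in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    # A test is flagged when its adjusted p-value is at or below q.
    rejected = [a <= q for a in adjusted]
    return adjusted, rejected


# Made-up p-values standing in for a small table of correlations.
p_demo = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
adj, flags = benjamini_hochberg(p_demo, q=0.05)
for p, a, f in zip(p_demo, adj, flags):
    print(f"p={p:.3f}  BH-adjusted={a:.3f}  flagged={f}")
```

In this made-up example the procedure flags the two smallest p-values at q = 0.05, whereas a Bonferroni threshold of 0.05/10 = 0.005 would flag only the first.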


Scott R Millis, PhD, ABPP (CN,CL,RP), CStat, CSci
Professor & Director of Research
Dept of Physical Medicine & Rehabilitation
Dept of Emergency Medicine
Wayne State University School of Medicine
261 Mack Blvd
Detroit, MI 48201
Email:  [hidden email]
Tel: 313-993-8085
Fax: 313-966-7682



Re: validity and screening of a large number of bivar correlations?

Ian Martin-2
Scott, thanks very much for your suggestion.  If possible, I'd
appreciate a little more detail on the references you mention, so
that I can look them up at the library.

regards,
Ian


Re: validity and screening of a large number of bivar correlations?

SR Millis-3
In reply to this post by Ian Martin-2
Ian,

Here are some references and resources:

Benjamini, Y., Drai, D., Elmer, G., Kafkafi, N., & Golani, I. (2001). Controlling the false discovery rate in behavior genetics research. Behav Brain Res, 125(1-2), 279-284.

Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 64(3), 479-498.

http://genomics.princeton.edu/storeylab/qvalue/




Re: validity and screening of a large number of bivar correlations?

Marta Garcia-Granero
In reply to this post by Ian Martin-2
Ian Martin wrote:
> Scott, thanks very much for your suggestion.  If possible, I'd
> appreciate a little more detail on the references you mention, so
> that I can look them up at the library.

You can find a lot of references at the end of this message I sent to
the list in January:

http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0801&L=spssx-l&P=46191

(I still keep a copy of the program mentioned in the message).

Regards,
Marta García-Granero

--
For miscellaneous statistical stuff, visit:
http://gjyp.nl/marta/


Re: Very short question

Marta Garcia-Granero
Ruben van den Berg wrote:
>
> Is it bad practice to use (oneway) ANOVA with a dichotomous dependent
> variable? The sampling distributions of the conditional (within group)
> means should follow normal distributions due to the central limit
> theorem, or am I missing something? Of course you'd normally use a
> chi2 independence test but there's no post hoc option (like Tukey's
> HSD) in there.

Try to find the thread concerning the Marascuilo procedure (April 7, or so).
A method to perform multiple pairwise comparisons for binary outcomes
was presented (syntax provided), and compared with CTABLES procedure
with Bonferroni adjustment.

Although ANOVA is quite robust to departures from normality, binary
outcomes tend to present heterogeneity of variances (since variance will
be related to the proportion of cases in a group: Var(p)=p*(1-p)/n), and
that's a worse problem than lack of normality: it precludes the use of
Tukey's HSD method, for instance.
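
As a small numeric illustration of the variance formula above, here is a sketch with invented proportions and an assumed common group size of n = 50. The helper name `var_of_proportion` is hypothetical.

```python
def var_of_proportion(p, n):
    """Variance of a sample proportion: p * (1 - p) / n."""
    return p * (1 - p) / n


n = 50  # assumed common group size, for illustration only
for p in (0.05, 0.20, 0.50):
    print(f"p={p:.2f}  Var(p)={var_of_proportion(p, n):.5f}")

# How much larger the variance is in a group near 0.50 than in one near 0.05.
ratio = var_of_proportion(0.50, n) / var_of_proportion(0.05, n)
print(f"variance ratio (p=0.50 vs p=0.05): {ratio:.2f}")
```

Even with equal group sizes, groups whose proportions differ this much have clearly unequal variances, which is the heterogeneity problem described above.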

I'd rather get these questions at the list, not privately, since more
people might contribute to or benefit from the thread. I will therefore
address the answer to the whole list.

Nice weekend to you, too,
marta
>
> TIA and have a nice weekend!
>



Re: validity and screening of a large number of bivar correlations?

Art Kendall
In reply to this post by Ian Martin-2
Have you tried exploring your data with canonical correlations rather than PCA?

IIRC the OVERALS procedure will do canonical correlation analysis on n sets of variables.
Macros ship with SPSS to do 2-set canonical correlations.

Art Kendall
Social Research Consultants.
