Inter rater reliability - Fleiss's Kappa?

Inter rater reliability - Fleiss's Kappa?

mcolmar
Hi

I am a 4th-year medical student writing my first research paper, and I have very minimal (or zero) experience with statistics!

My project involves the creation of a behavior marking system with 5 categories (e.g. leadership, communication). Each category is further broken down into subcategories, called 'elements', which define the behavior; the number of elements varies from 3 to 6 per category. Individuals are awarded a score of 1-5 for each element, with 5 representing excellent performance of the behavioral element. There are 3 separate raters.

I am hoping to assess the inter-rater reliability for each element. What is an appropriate measure of this? The little research I have done suggests that Fleiss's kappa would be best as there are 3 raters. Is this correct, and if so, can I use the SPSSX / PASW software to do this?

Any help would be very much appreciated!

Thanks,
Maddy

Re: Inter rater reliability - Fleiss's Kappa?

Rich Ulrich
Let me summarize what I understand here, in my own vocabulary.

You have 5 factor scores, each computed as a sum (or average, for easier interpretation) of several items; there are 3 to 6 items per factor. Each item is a measure of "excellence of performance" of a behavioral element. Presumably, you could also have a single overall composite or summative score for excellence.

Each behavior (of how many subjects?) is rated by 3 separate raters.

You are writing your first research paper, and you ask whether Fleiss's kappa, computed on each item, is appropriate.

First: Fleiss's kappa, according to Wikipedia, is used for nominal (categorical) ratings, not for your (presumably) interval ratings of excellence. So you won't use that.

Now, there are two distinct purposes for reliability statistics. One is to examine your data closely, especially at the pilot stage, in order to improve the scales and methods. The other is to provide an adequate summary for presentation.

Unless you have many raters and sparse duplications, you ordinarily want to *examine* raters in pairs. Then you can look both at similarities (correlation) and differences (paired t-test). You might look at each item as well as at the total/composite/summary scores, especially if this is the first use of the scale.
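
For one element, a minimal sketch in SPSS syntax might look like the following, assuming the data are laid out with one row per rated subject and hypothetical variables rater1, rater2, rater3 holding the three raters' scores on that element:

* Similarities: correlations between each pair of raters.
CORRELATIONS
  /VARIABLES=rater1 rater2 rater3
  /PRINT=TWOTAIL NOSIG.

* Differences: paired t-tests for systematic shifts between raters.
T-TEST PAIRS=rater1 rater2 rater3.

With a single variable list and no WITH keyword, T-TEST compares every pair of raters; the same block can be rerun on the factor or total scores.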


I never like the multi-rater version of reliability statistics unless the data are sparse in duplications. But they *are* useful for the tersest summary in a publication, and maybe necessary if there are a lot of raters. Some editors or reviewers seem to like them even when there are very few raters. However, unless the study requires some special emphasis, you would ordinarily compute them on the factor scores, and would seldom apply them to the individual items.

--
Rich Ulrich



Re: Inter rater reliability - Fleiss's Kappa?

bdates

Just to piggy-back on Rich Ulrich's comments: since your data are at least ordinal, and more likely interval, in nature, I would recommend that you consider the ICC. If you use Fleiss' kappa at all, it would be to look at the level of agreement, not the 'reliability'. The other potential advantage, if you choose it, is that Fleiss' kappa provides information on agreement for each category, so you could identify the ratings that raters found more difficult. Overall, though, Fleiss' kappa would serve to elucidate the results of the ICC; it should not be the primary source of information about reliability.
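
In SPSS / PASW, the ICC is available through the RELIABILITY procedure. A minimal sketch, again assuming one row per rated subject and hypothetical variables rater1, rater2, rater3 holding the three raters' scores on a single element (repeat for each element or factor score of interest):

RELIABILITY
  /VARIABLES=rater1 rater2 rater3
  /SCALE('Element 1') ALL
  /MODEL=ALPHA
  /ICC=MODEL(RANDOM) TYPE(ABSOLUTE) CIN=95 TESTVAL=0.

MODEL(RANDOM) with TYPE(ABSOLUTE) treats the 3 raters as a random sample and penalizes systematic differences between them; whether you want that, or a mixed-model/consistency ICC, depends on how the raters are meant to generalize. The output reports both single-measures and average-measures ICCs; the single-measures value is usually the one of interest if a single rater will be used in practice.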

 

Brian

Re: Inter rater reliability - Fleiss's Kappa?

mcolmar
Thank you both for your help - I chose to use the ICC in the end, as it appeared most appropriate. This is just a pilot study, so in the future I might take Rich Ulrich's advice and change my project in order to analyse the inter-rater reliability better.

Maddy