Hi
I am a 4th-year medical student, writing my first research paper, and I have very minimal (or zero) experience with statistics!

My project involves the creation of a behavior marking system which has 5 categories (e.g. leadership, communication, etc.). These categories are further broken down into subcategories that define the behavior (called 'elements'), the number of which varies from 3 to 6 per category. The individuals are awarded a score of 1-5 for each element, with 5 representing excellent performance of the behavioral element. There are 3 separate raters.

I am hoping to assess the inter-rater reliability for each element. What is an appropriate measurement of this? I have done a little bit of research, and it would suggest that Fleiss's kappa would be the best as there are 3 raters. Is this correct, and if so, can I use the SPSSX / PASW software to do this?

Any help would be very much appreciated!

Thanks,
Maddy
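One way to lay out the data described above, for each element: one row per rated individual and one column per rater, with each cell holding a 1-5 score. A minimal sketch in Python with pandas, using made-up scores (the subject labels and rater columns are placeholders, not real study data):

```python
import pandas as pd

# Made-up 1-5 scores for a single element (e.g. one communication element):
# one row per rated individual, one column per rater.
element_scores = pd.DataFrame(
    {
        "rater1": [4, 3, 5, 2, 4],
        "rater2": [4, 4, 5, 3, 3],
        "rater3": [5, 3, 4, 2, 4],
    },
    index=pd.Index(["subj1", "subj2", "subj3", "subj4", "subj5"], name="subject"),
)
print(element_scores)
```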
Let me summarize what I understand here, in my own vocabulary.
You have 5 factor scores which are computed as sums (or averages, for easier interpretation) of several items. There are 3 to 6 items for each of the 5 factors. Each is a measure of "excellence of performance" of a behavioral element. Presumably, you could have a single overall composite or summative score for excellence. Each behavior (of ? subjects) is rated by 3 separate raters. You are writing your first research paper, and you ask whether Fleiss's kappa, computed on each item, is appropriate.

First: Fleiss's kappa, according to Wikipedia, is used for nominal (categorical) ratings, not for your (presumably) interval ratings of excellence. So you won't use that.

Now, there are two alternate purposes for reliability statistics. One is to examine your data closely, especially at the pilot stage, in order to improve the scales and methods. The other is to provide an adequate summary for presentation.

Unless you have many raters and sparse duplications, you ordinarily want to *examine* raters in pairs. Then you can look both at similarities (correlation) and differences (paired t-test). You might look at each item as well as at the total/composite/summary scores, especially if this is the first use of the scale.

I never like the multi-rater version of reliability statistics, unless the data were sparse in duplications. But they *are* useful for the tersest summary in a publication, and maybe necessary if there are a lot of raters. Some editors or reviewers seem to like them even when there are very few raters. However, unless there is some special emphasis required by the study, you would ordinarily use them on the factor scores, and would seldom apply them to the individual items.

-- Rich Ulrich
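To make the pairwise examination concrete, here is a minimal sketch in Python with SciPy for a single element, reusing the made-up wide layout above (one column per rater): the correlation looks at how similarly two raters order the individuals, and the paired t-test looks for a systematic difference in level, i.e. one rater scoring consistently higher or lower.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import pearsonr, ttest_rel

# Made-up 1-5 scores for one element: one row per rated individual,
# one column per rater.
element_scores = pd.DataFrame(
    {
        "rater1": [4, 3, 5, 2, 4],
        "rater2": [4, 4, 5, 3, 3],
        "rater3": [5, 3, 4, 2, 4],
    }
)

# Look at each pair of raters: correlation for similarity of ordering,
# paired t-test for a systematic difference in leniency/severity.
for a, b in combinations(element_scores.columns, 2):
    r, r_p = pearsonr(element_scores[a], element_scores[b])
    t, t_p = ttest_rel(element_scores[a], element_scores[b])
    print(f"{a} vs {b}: r = {r:.2f} (p = {r_p:.3f}), "
          f"paired t = {t:.2f} (p = {t_p:.3f})")
```

Repeating this over the elements (and over the factor scores) gives the kind of close, pair-by-pair look described above for the pilot stage.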
Just to piggy-back on Rich Ulrich’s comments, I would recommend,
since your data are at least ordinal, and more likely interval in nature, that
you consider the ICC. If you choose Fleiss’ kappa at all, it would
be to look at the level of agreement, not the ‘reliability’. The
other potential advantage, if you do use it, is that Fleiss’ kappa provides
information on agreement for each category, so you could identify those ratings
which raters found more difficult. But overall, using Fleiss would be for
elucidation of the results of the ICC. It should not be used as the
primary source of information about reliability.

Brian
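On the software question: SPSS's Reliability Analysis procedure can report intraclass correlation coefficients. Purely as an illustration outside SPSS, here is a sketch in Python using the pingouin and statsmodels packages on the same made-up element scores; which ICC model and type to report depends on the rating design, so treat the output below as an example rather than a recommendation.

```python
import pandas as pd
import pingouin as pg
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Made-up 1-5 scores for one element: one row per rated individual,
# one column per rater.
wide = pd.DataFrame(
    {
        "rater1": [4, 3, 5, 2, 4],
        "rater2": [4, 4, 5, 3, 3],
        "rater3": [5, 3, 4, 2, 4],
    },
    index=pd.Index(["s1", "s2", "s3", "s4", "s5"], name="subject"),
)

# Intraclass correlation needs long format: one row per (subject, rater) pair.
long = wide.reset_index().melt(id_vars="subject",
                               var_name="rater", value_name="score")

# pingouin reports ICC1-ICC3 plus their average-measure versions; the row
# matching the design (e.g. the same three raters score everyone, so a
# two-way model) is the one to report.
icc = pg.intraclass_corr(data=long, targets="subject",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])

# Fleiss' kappa treats the 1-5 scores as unordered categories; it can
# supplement the ICC with a chance-corrected multi-rater agreement figure.
counts, _ = aggregate_raters(wide.to_numpy())   # subjects x categories counts
print("Fleiss' kappa:", fleiss_kappa(counts, method="fleiss"))
```

Because the same three raters score every individual in this layout, the two-way ICC rows are the ones of interest; Fleiss' kappa ignores the ordering of the 1-5 scores, which is one reason to use it as a supplement to the ICC rather than as the primary reliability figure.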
Thank you both for your help - I chose to use ICC in the end, as it appeared most appropriate. This is just a pilot study, so in future I might take Rich Ulrich's advice and change my project in order to be able to analyse the inter-rater reliability better.
Maddy