Let me summarize what I understand here, in my own vocabulary.
You have 5 factor scores, each computed as the sum (or the average,
for easier interpretation) of several items; each factor has 3 to 6
items. Each item is a measure of "excellence of performance" of a
behavioral element. Presumably, you could also form a single overall
composite or summative score for excellence.
Each behavior (for an unstated number of subjects) is rated by 3
separate raters.
You are writing your first research paper, and you ask whether
Fleiss's kappa, computed on each item, is appropriate.
First: Fleiss's kappa, according to Wikipedia, is used for nominal
(categorical) ratings, not for your (presumably) interval ratings of
excellence. So you won't use that.
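(If you want to convince yourself of that: Fleiss's kappa never uses
the ordering of the 1-5 scale, so a 4-versus-5 disagreement counts the
same as a 1-versus-5 one, and arbitrarily relabelling the categories
leaves kappa unchanged. Here is a small check of that, in Python
rather than SPSS, on invented ratings; nothing below is your data, and
statsmodels is just one convenient implementation.)

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(20, 3))   # 20 subjects x 3 raters, scores 1-5

# scramble the labels 1-5 in an arbitrary, order-destroying way
perm = {1: 3, 2: 5, 3: 1, 4: 2, 5: 4}
relabelled = np.vectorize(perm.get)(ratings)

k_raw = fleiss_kappa(aggregate_raters(ratings)[0])
k_scrambled = fleiss_kappa(aggregate_raters(relabelled)[0])
print(round(k_raw, 6), round(k_scrambled, 6))   # same value: the scale's order never enters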
Now, there are two distinct purposes for reliability statistics. One
is to examine your data closely, especially at the pilot stage, in
order to improve the scales and methods. The other is to provide an
adequate summary for presentation.
Unless you have many raters and sparse duplication, you ordinarily
want to *examine* raters in pairs. Then you can look both at
similarities (correlation) and at differences (paired t-test). You
might look at each item as well as at the total/composite/summary
scores, especially if this is the first use of the scale.
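For concreteness, the pairwise look could go something like this; the
numbers and names are invented, and I use Python/scipy here only
because it is compact (the same correlations and paired t-tests are
of course available in SPSS).

from itertools import combinations
import numpy as np
from scipy.stats import pearsonr, ttest_rel

rng = np.random.default_rng(1)
n = 25                                  # however many subjects you have
truth = rng.normal(10, 2, n)            # a made-up "true" factor score
# each rater = truth + noise + a small systematic bias
scores = {f"rater{i+1}": truth + rng.normal(0.3 * i, 1.0, n) for i in range(3)}

for a, b in combinations(scores, 2):
    r, _ = pearsonr(scores[a], scores[b])    # similarity: do they rank subjects alike?
    t, p = ttest_rel(scores[a], scores[b])   # difference: is one rater systematically higher?
    print(f"{a} vs {b}:  r = {r:.2f}   paired t = {t:.2f}  (p = {p:.3f})")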
I never like the multi-rater version of reliability statistics unless
the data are sparse in duplications. But they *are* useful as the
tersest summary in a publication, and may be necessary if there are
a lot of raters. Some editors or reviewers seem to like them even
when there are very few raters. However, unless the study calls for
some special emphasis, you would ordinarily compute them on the
factor scores, and would seldom apply them to the individual items.
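The multi-rater summary I have in mind for interval ratings is the
intraclass correlation (ICC); SPSS's RELIABILITY procedure can report
it, and the Shrout & Fleiss (1979) two-way versions are easy to
compute directly. A sketch on invented factor scores (the function
and the data are mine, purely for illustration):

import numpy as np

def icc_two_way_random(Y):
    """Shrout & Fleiss ICC(2,1) and ICC(2,k) from an (n subjects) x (k raters) matrix."""
    n, k = Y.shape
    grand = Y.mean()
    ms_rows = k * np.sum((Y.mean(axis=1) - grand) ** 2) / (n - 1)   # between-subject MS
    ms_cols = n * np.sum((Y.mean(axis=0) - grand) ** 2) / (k - 1)   # between-rater MS
    ss_err = np.sum((Y - grand) ** 2) - (n - 1) * ms_rows - (k - 1) * ms_cols
    ms_err = ss_err / ((n - 1) * (k - 1))                           # residual MS
    icc_single = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)    # ICC(2,1): one rater
    icc_average = (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)  # ICC(2,k): mean of k
    return icc_single, icc_average

rng = np.random.default_rng(2)
factor = rng.normal(15, 3, size=(25, 1)) + rng.normal(0, 1.5, size=(25, 3))  # fake scores
print(icc_two_way_random(factor))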
--
Rich Ulrich
> Date: Wed, 26 Sep 2012 10:11:53 -0700
> From: [hidden email]
> Subject: Inter rater reliability - Fleiss's Kappa?
> To: [hidden email]
> Hi
>
> I am a 4th year medical student, writing my first research paper and have
> very minimal (or zero) experience with statistics!
>
> My project involves the creation of a behavior marking system which has 5
> categories (eg leadership, communication etc). These categories are further
> broken down into subcategories to define the behavior (called 'elements'),
> the number of which is variable from 3-6 per category. The individuals are
> awarded a score of 1-5 for each element, with 5 representing excellent
> performance of the behavioral element. There are 3 separate raters.
>
> I am hoping to assess the inter rater reliability for each element. What is
> an appropriate measurement of this? I have done a little bit of research and
> it would suggest that Fleiss's kappa would be the best as there are 3
> raters. Is this correct, and if so can I use the SPSSX / PASW software to do
> this?
>
> Any help would be very much appreciated!
>...