Hi,
I have 18 raters who each scored a package of 44 ultrasound images twice using a Likert scale (0-3). I have to compute intra-rater and inter-rater reliability. Initially I used ICC for both, but the reviewers recommended Cohen's kappa for the intra-rater and Light's kappa for the inter-rater computations. I did so, but kappa was extremely low. Given that the prevalence of positive scores (1, 2, 3) was much higher in my package than in real life, I expected results of this kind, but not so small :( So I decided to report both kappa (as the reviewers requested) and ICC in my paper. I am now working to support my decision (to include, and prefer, ICC). Could you recommend some papers along these lines (inter-rater & intra-rater reliability / kappa vs. ICC)? Thank you!
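For concreteness, here is a minimal sketch of the three statistics involved (Cohen's kappa per rater for intra-rater reliability, Light's kappa for inter-rater reliability, and an ICC). The array names, the placeholder data, and the choice of ICC(2,1) are assumptions for illustration, not something fixed by the study or the reviewers:

```python
# Sketch: intra-rater Cohen's kappa, inter-rater Light's kappa, and ICC(2,1)
# for ratings on a 0-3 Likert scale. Hypothetical layout: 44 images x 18 raters,
# one array per scoring session.
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
ratings_t1 = rng.integers(0, 4, size=(44, 18))   # placeholder data, session 1
ratings_t2 = rng.integers(0, 4, size=(44, 18))   # placeholder data, session 2

# Intra-rater reliability: Cohen's kappa between each rater's two sessions.
intra_kappa = [cohen_kappa_score(ratings_t1[:, r], ratings_t2[:, r])
               for r in range(ratings_t1.shape[1])]

# Inter-rater reliability: Light's kappa = mean of all pairwise Cohen's kappas
# among the 18 raters (here computed on session 1 only).
pairwise = [cohen_kappa_score(ratings_t1[:, a], ratings_t1[:, b])
            for a, b in combinations(range(ratings_t1.shape[1]), 2)]
light_kappa = np.mean(pairwise)

def icc_2_1(x):
    """ICC(2,1): two-way random effects, absolute agreement, single rating
    (Shrout & Fleiss), from the usual two-way ANOVA decomposition."""
    n, k = x.shape                                   # n images, k raters
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_total = ((x - grand) ** 2).sum()
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

inter_icc = icc_2_1(ratings_t1.astype(float))
```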
For what it's worth, the only articles I've seen that employ both inter- and intra-rater reliability do so on the same basis as for inter-rater reliability alone. They use the approach that best fits the data, i.e., if it's nominal, then kappa; if it's ordered (ordinal, interval), then it's the ICC. I've never read anything that mandates kappa for intra- and another analysis for inter-.
Brian
I don't have articles to recommend, but in practice I have always tried to convince my consultees to favor intelligible indexes instead of kappas (which are hard to generalize across different table sizes and values of the marginal frequencies).

And even if reviewers want to see ICCs, it is apt to be better for your own information, in developing a scale or critiquing your own study, to look at both the similarities and the *differences*. For instance, for two raters and a single scaled score, a simple paired t-test gives you both (a) the similarity, in the correlation, and (b) the difference, in the t-test. The same information for a dichotomous variable is kappa or r for the similarity, and McNemar's test for the differences. The ICC for the simplest two-rater case effectively computes the correlation from a formula in which the combined mean replaces the separate raters' means, so the ICC is reduced from the interclass r according to the observed difference in those two means.

(I think it is a familiar shortcoming of articles on Cohen's kappa, including the one in Wikipedia, that they seldom mention testing the differences, because that usually *should* be a parallel interest. Wikipedia even gives an example where the difference immediately "explains" why two tables with the same number of agreements have different kappas. Maybe someone should add a note about that.)

Further, for my own information on scales to be treated as nominal, I do want to look at the several dichotomies, like dx1 vs. other, dx2 vs. other, and so on, and not merely the average of those scores. (Is that Light's kappa?)

-- Rich Ulrich
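To make the "similarity plus difference" point concrete, here is a sketch with hypothetical two-rater data (the variable names and the dichotomization at >0 are illustrative). The two-rater ICC is written in Fisher's pairwise form, where the combined mean of all ratings replaces the two separate rater means, which is what pulls it below the interclass Pearson r when the raters' means differ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
rater_a = rng.integers(0, 4, size=44).astype(float)                  # placeholder data
rater_b = np.clip(rater_a + rng.integers(-1, 2, size=44), 0, 3).astype(float)

# (a) Similarity: the ordinary (interclass) Pearson r, each rater centred
#     on their own mean.
r_inter, _ = stats.pearsonr(rater_a, rater_b)

# (b) Difference: paired t-test on the same two columns.
t_stat, t_p = stats.ttest_rel(rater_a, rater_b)

# Two-rater intraclass correlation, Fisher form: the *combined* mean of all
# 88 ratings replaces the two separate rater means, so any mean difference
# between raters reduces the ICC relative to the interclass r above.
m = np.concatenate([rater_a, rater_b]).mean()
icc_pair = ((rater_a - m) * (rater_b - m)).sum() / (
    0.5 * (((rater_a - m) ** 2).sum() + ((rater_b - m) ** 2).sum()))

# For a dichotomised score (any positive finding vs. none): kappa or r for
# the similarity, and McNemar's test on the discordant cells for the difference.
a_pos, b_pos = rater_a > 0, rater_b > 0
b_cell = np.sum(a_pos & ~b_pos)            # A positive, B negative
c_cell = np.sum(~a_pos & b_pos)            # A negative, B positive
mcnemar_chi2 = ((b_cell - c_cell) ** 2 / (b_cell + c_cell)
                if (b_cell + c_cell) else 0.0)
mcnemar_p = stats.chi2.sf(mcnemar_chi2, df=1)
```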
Rich,

If I understand your last question correctly: if you were assigning diagnoses to patients (e.g., dx1, dx2, etc.), the diagnoses would comprise the categories, and you are interested in getting category-based kappas to identify which diagnoses are or are not significant in their agreement. There is no adaptation of Cohen's kappa that does this. Light's kappa averages the pairwise two-rater kappas to arrive at an overall kappa. Fleiss' generalized kappa does analyze category-level as well as overall kappas. There are other approaches that do the same, but they're pretty obscure statistics.

There's also the issue of one's theoretical approach to agreement. The original approach presumes an equal distribution among categories (e.g., Bennett's S). The second approach presumes an underlying category distribution (e.g., Fleiss' generalized kappa, Krippendorff's alpha, Gwet's AC1). The last approach presumes an underlying distribution of rater-category interaction (e.g., Cohen's kappa, Hubert's kappa, Light's kappa). So it's also a matter of one's position on the nature of agreement, as well as looking for a category-based analysis. If you're willing to go with the underlying category-based approach, Fleiss would be my recommendation, as it is a frequently used index.

Before I ramble on too long: expanding one of the Cohen statistics to get a category analysis would be pretty simple, but it just hasn't been done by anyone up to this point.

Brian