Rich,
I agree with you wholeheartedly. A lot of information gets lost with global statistics like the ICC or overall measures of agreement. We had a thread a couple of weeks ago about generating two-rater Fleiss statistics, and some syntax was developed and shared. We could very well do that for all of these statistics, for rater pairs as well as for categories. We could identify raters who need further training and categories that could benefit from better definition, or even from collapsing. I've always been surprised that the IRT folks use that kind of information to expand, collapse, or amend their measures, but that kind of attention generally doesn't happen with other statistics.
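Not the syntax from that thread, but just to sketch the idea, something like this in Python would do the pairwise breakdown. The data layout (one row per subject, one column per rater) and the one-vs-rest handling of categories are only assumptions for the example:

# Sketch of the pairwise breakdown: Cohen's kappa for every pair of raters,
# and a one-vs-rest kappa for every category, averaged over rater pairs.
from itertools import combinations

import pandas as pd
from sklearn.metrics import cohen_kappa_score

# hypothetical data: rows = subjects, columns = raters, cells = category labels
ratings = pd.DataFrame({
    "rater_A": ["mild", "mild", "severe", "none", "severe"],
    "rater_B": ["mild", "severe", "severe", "none", "mild"],
    "rater_C": ["mild", "mild", "severe", "mild", "severe"],
})

# kappa for each pair of raters: low values flag raters who may need retraining
for r1, r2 in combinations(ratings.columns, 2):
    kappa = cohen_kappa_score(ratings[r1], ratings[r2])
    print(f"{r1} vs {r2}: kappa = {kappa:.2f}")

# kappa for each category (one-vs-rest), averaged over rater pairs:
# low values flag categories that may need sharper definition or collapsing
categories = pd.unique(ratings.values.ravel())
for cat in categories:
    pair_kappas = [
        cohen_kappa_score(ratings[r1] == cat, ratings[r2] == cat)
        for r1, r2 in combinations(ratings.columns, 2)
    ]
    print(f"category '{cat}': mean kappa = {sum(pair_kappas) / len(pair_kappas):.2f}")

The same loop structure would work for whatever two-rater statistic one prefers; kappa is just a convenient stand-in here.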
Art, thanks for remembering Proximities. It's a great resource and includes most of the statistics that Podani's article covers.
For two raters, I've long preferred looking at (not the ICC but) the ordinary r (because people are used to its size) along with a paired t-test to check for any difference. [For 2x2 data, that could be kappa and McNemar's test for changes.] For the data I usually dealt with ... for 3 or more raters, I looked at them in pairs.
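In Python, with made-up numbers, those checks might look something like this (McNemar's test standing in as the test for changes on the 2x2 side):

# Two raters, continuous scores: ordinary r for agreement in ordering,
# paired t-test for a systematic difference in level.
import numpy as np
from scipy.stats import pearsonr, ttest_rel
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

x = np.array([3.0, 5.0, 4.0, 6.0, 2.0, 5.0])   # rater 1 (illustrative values)
y = np.array([4.0, 5.5, 4.5, 6.5, 3.0, 5.0])   # rater 2
r, _ = pearsonr(x, y)
t, p = ttest_rel(x, y)
print(f"r = {r:.2f}, paired t = {t:.2f} (p = {p:.3f})")

# Two raters, 2x2 data: kappa for chance-corrected agreement,
# McNemar's test for a systematic difference (asymmetry of the off-diagonal)
a = np.array([0, 1, 1, 0, 1, 0, 1, 1])
b = np.array([0, 1, 0, 0, 1, 1, 1, 1])
kappa = cohen_kappa_score(a, b)
table = np.array([[np.sum((a == 0) & (b == 0)), np.sum((a == 0) & (b == 1))],
                  [np.sum((a == 1) & (b == 0)), np.sum((a == 1) & (b == 1))]])
result = mcnemar(table, exact=True)
print(f"kappa = {kappa:.2f}, McNemar p = {result.pvalue:.3f}")

For 3 or more raters, the same pair of checks would simply be run over every pair.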
My emphasis is right, I think, whenever you are developing your own ad hoc scales.
However, for the summaries that get published, or for people using the scales developed by others, what is required (publication) or sufficient (cross-check) is an overall number like the ICC.
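Just as a sketch of what that overall number looks like in practice, here is the ICC computed with pingouin's intraclass_corr on a made-up long-format table (the column names are illustrative):

# Overall ICC from long-format data: one row per (subject, rater) pair
import pandas as pd
import pingouin as pg

data = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "rater":   ["A", "B", "C"] * 4,
    "score":   [3, 4, 3, 5, 5, 6, 4, 5, 4, 6, 7, 6],
})

icc = pg.intraclass_corr(data=data, targets="subject", raters="rater",
                         ratings="score")
print(icc[["Type", "ICC", "CI95%"]])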
I think that you are right, if you are suggesting that much of the literature on reliability makes it easy for people to overlook or forget the possible complications of differences in level.
--
Rich Ulrich