Hello,
I performed a study in which 32 raters rated the severity (0-4: normal, mild, moderate) of several visual perceptual parameters for 4 different videos. All raters rated all parameters for all videos after being given clinical information about the patient. Raters received incorrect clinical information for 3 of the 4 videos and correct information for 1 video.

I am trying to answer the following question: "Is there a statistically significant difference in rater reliability when video stimuli are paired with matched versus mismatched clinical vignettes?" My plan was to make each combination of video and clinical vignette a unique variable, or a different "treatment." A permutation test will be used to assess the statistical significance of the difference between the two kappas, Km - Kmm. This analysis proceeds by building a null distribution for this difference (the null hypothesis being that agreement is the same for matched and mismatched scenarios) by considering all possible reassignments (permutations) of the labels "matched" and "mismatched" to the observed data (Mielke, 2007).

I do not know how to test this hypothesis in SPSS (version 24) on my Mac. I am also getting the error message "There are too few complete cases" with the following syntax when I try to examine inter-rater reliability for the entire group of raters:

STATS FLEISS KAPPA VARIABLES=Rater11 Rater12 Rater13 Rater14 Rater15 Rater16 Rater17 Rater18
    Rater21 Rater22 Rater23 Rater24 Rater25 Rater26 Rater27 Rater28
    Rater31 Rater32 Rater33 Rater34 Rater35 Rater36 Rater37 Rater38
    Rater41 Rater42 Rater43 Rater44 Rater45 Rater46 Rater47 Rater48
    /OPTIONS CILEVEL=95.

Any assistance would be greatly appreciated. Thanks again!

Cara Sauder
Doctoral Student
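One way to carry out the permutation procedure described above is to script it outside the SPSS dialogs, for example in Python, which SPSS Statistics can also drive through its Python integration (the same Essentials needed by STATS FLEISS KAPPA). The sketch below is only an illustration under assumed names and layout: ratings is taken to be a (units x raters) array of 0-4 codes, matched is a label vector for whichever units carry the matched/mismatched distinction in the actual design, and Fleiss' kappa comes from statsmodels.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def kappa_of(block):
    """Fleiss' kappa for a (units x raters) block of category codes 0-4."""
    counts, _ = aggregate_raters(block, n_cat=5)  # units x categories table, codes assumed 0..4
    return fleiss_kappa(counts, method='fleiss')

def perm_test_km_kmm(ratings, matched, n_perm=10000, seed=1):
    """Two-sided permutation test of Km - Kmm (assumed data layout, see note above).

    ratings: (units x raters) array of 0-4 codes.
    matched: boolean vector marking which units carry the "matched" label;
             if the label really attaches to rater-by-video combinations,
             permute those units instead - the logic is the same.
    """
    rng = np.random.default_rng(seed)
    ratings = np.asarray(ratings)
    matched = np.asarray(matched, dtype=bool)
    observed = kappa_of(ratings[matched]) - kappa_of(ratings[~matched])
    extreme = 0
    for _ in range(n_perm):
        relabel = rng.permutation(matched)        # random reassignment of the labels
        diff = kappa_of(ratings[relabel]) - kappa_of(ratings[~relabel])
        if abs(diff) >= abs(observed):
            extreme += 1
    p_value = (extreme + 1) / (n_perm + 1)        # Monte Carlo p-value
    return observed, p_value

# usage: obs_diff, p = perm_test_km_kmm(ratings, matched)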
Cara,
Any agreement statistic that was created for nominal data has difficulty with missing data, so Fleiss' kappa is not really a viable option here. Based on a 1973 article by Fleiss and Cohen (reference below) demonstrating the equivalence of weighted kappa and the intraclass correlation coefficient (ICC), I'd suggest the latter. Your measures suggest that the data are at least ordinal, if not interval, in nature, and that justifies using an ICC rather than a weighted kappa. Try that. You'll need to decide which model to use from the following list.

Model 1: Raters are a random sample from a specified population of raters, and each rater does not rate all subjects/objects; therefore each subject/object is rated by a potentially different set of raters.
Model 2: Raters are a random sample from a specified population of raters, and each rater rates each subject/object.
Model 3: Raters constitute the entire population of raters, and each rater rates each subject/object.

Good luck.

Brian

Fleiss, J. L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33, 613-619.
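In SPSS itself, these three models correspond to the One-Way Random, Two-Way Random, and Two-Way Mixed options under Analyze > Scale > Reliability Analysis > Statistics > Intraclass correlation coefficient (the RELIABILITY procedure). For anyone who wants to see the arithmetic behind the single-rater forms, here is a rough from-scratch sketch in Python using the standard Shrout and Fleiss mean-square formulas; the function and variable names are illustrative, and the rating matrix is assumed to have no missing cells.

import numpy as np

def single_rater_iccs(x):
    """ICC(1,1), ICC(2,1), ICC(3,1) for an (n subjects x k raters) matrix."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_total = ((x - grand) ** 2).sum()
    ss_subj = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_rater = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between raters
    ss_err = ss_total - ss_subj - ss_rater                # residual
    msr = ss_subj / (n - 1)
    msc = ss_rater / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    msw = (ss_rater + ss_err) / (n * (k - 1))             # within subjects
    icc1 = (msr - msw) / (msr + (k - 1) * msw)                        # Model 1
    icc2 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)  # Model 2
    icc3 = (msr - mse) / (msr + (k - 1) * mse)                        # Model 3
    return icc1, icc2, icc3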
With a sample of 4, you should expect huge error-variance in any measure like kappa or r or ICC, so you don't have enough data. Measures of reliability are properly remembered as "reliability in THIS sample."
I've analyzed data where psychiatric residents (called "students") rated videos of inpatient interviews for several "symptoms present." There were 5 sets of data, about 15 students in each class, and six videos for each class. "HALLUCINATIONS" showed /excellent/ correlation, the best of all agreements, in the /one/ class where one videoed patient (of 6) was hallucinating. "HALLU" showed essentially zero correlation (though "agreement" was nearly perfect, with ratings all near None) in the other 4 classes. So it turned out that talking about "points" as the effect size was more useful than correlations. A random N of 6 was not good for comparing correlations across symptoms when the base rate of any one symptom was arbitrary.
Measures of correlation are measures of co-variation, with the emphasis on "variation." It is too optimistic to hope to say much about correlations with N=4 unless, say, you have ensured that the full range of <whatever> is present in the sample.

I think that comparing "means" - the usual starting point - should be where you begin. Does creating or dropping an expectation cause a gain or loss in score? How many points? "Points" should be more meaningful to readers than a "change in reliability," even if you could measure the latter.

In any case, I'm sure I'm unsure about your design. I gather that most of the time (3/4), raters are misinformed about the diagnosis, which seems like an odd choice. I would have opted for the larger subset to have "good" information, so I could assess baseline, "best" performance before moving on to look at the effect of misinformation.
-- Rich Ulrich
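As a concrete illustration of the "points" framing Rich suggests (with fabricated numbers, not the study data), a few lines of Python on a hypothetical long-format table with one row per rating and made-up column names 'condition' and 'severity':

import pandas as pd

# Fabricated example rows; real data would have one row per rater x video x parameter.
df = pd.DataFrame({
    'condition': ['matched', 'matched', 'mismatched', 'mismatched', 'mismatched'],
    'severity':  [1, 2, 3, 2, 3],
})
means = df.groupby('condition')['severity'].mean()
shift = means['mismatched'] - means['matched']
print(means)
print(f"Misleading vignette shifts the mean rating by {shift:+.2f} points on the 0-4 scale")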
Thank you both for your insight. Rich, you are right in thinking that this question addresses a secondary hypothesis. The primary aim is to see whether rated severity increases along a dimension when the history leads a rater to expect more abnormality along that particular dimension (similar to the inattentional blindness designs in the psychology literature). That is a more straightforward analysis, and the results will be easier to interpret.

However, since rater reliability is notoriously poor in this area of study, I wondered whether the clinical information was a possible source of error. Following Rich's discussion, I also wonder whether reliability is really as poor as it seems based on the kappa statistic that is commonly reported; if percent absolute agreement were reported in these studies, it might not look so bad. I plan to report percent absolute agreement as well.

My null hypothesis for the secondary aim is that there is no difference in rater reliability when clinical information is matched versus mismatched to the video presentation. Before I began the study, the stats consultant recommended using Fleiss' kappa and a permutation test to test this hypothesis formally, rather than just reporting descriptive statistics for each group. However, he has since retired, and I did not get to follow up with him about how to carry out that analysis.

I realize it was not clear in my description that all 32 raters rated all 10 parameters for all 4 videos. The clinical information is matched to one video for each group of 8 raters. The reason the matched and mismatched conditions are not split 2 and 2 is related to the primary hypothesis. Am I correct in thinking that the general consensus is that this is not a good approach anyway?

Thanks again in advance!

Cara
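For the percent absolute agreement figure, one simple definition is the average, over all rater pairs, of the proportion of items on which the two raters assign identical codes. A minimal sketch, assuming an (items x raters) array of 0-4 codes (the layout and function name are placeholders):

import numpy as np
from itertools import combinations

def pairwise_percent_agreement(ratings):
    """Mean proportion of items on which each pair of raters gives the same code."""
    ratings = np.asarray(ratings)
    n_raters = ratings.shape[1]
    pair_rates = [(ratings[:, i] == ratings[:, j]).mean()
                  for i, j in combinations(range(n_raters), 2)]
    return float(np.mean(pair_rates))

# Fabricated check: 3 items rated by 4 raters.
print(pairwise_percent_agreement([[0, 0, 1, 0],
                                  [2, 2, 2, 3],
                                  [4, 4, 4, 4]]))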