Hello,
I performed a study in which 32 raters rated the severity (0-4: normal, mild, moderate) of several visual perceptual parameters for 4 different videos. All raters rated all parameters for all videos after being given clinical information about the patient. Raters received incorrect clinical information for 3 of the 4 videos and correct information for 1 video.

I am trying to answer the following question: "Is there a statistically significant difference in rater reliability when video stimuli are paired with matched versus mismatched clinical vignettes?" My plan was to make each combination of video and clinical vignette a unique variable, or a different "treatment." A permutation test will be used to assess the statistical significance of the difference between the two kappas, Km - Kmm. This analysis proceeds by building a null distribution for this difference (the null hypothesis being that agreement is the same for matched and mismatched scenarios) by considering all possible reassignments (permutations) of the labels "matched" and "mismatched" to the observed data (Mielke, 2007).

I do not know how to test this hypothesis in SPSS (version 24) on my Mac. I am also getting the error message "There are too few complete cases" with the following syntax when I try to examine inter-rater reliability for the entire group of raters:

STATS FLEISS KAPPA VARIABLES=Rater11 Rater12 Rater13 Rater14 Rater15 Rater16 Rater17 Rater18
    Rater21 Rater22 Rater23 Rater24 Rater25 Rater26 Rater27 Rater28
    Rater31 Rater32 Rater33 Rater34 Rater35 Rater36 Rater37 Rater38
    Rater41 Rater42 Rater43 Rater44 Rater45 Rater46 Rater47 Rater48
    /OPTIONS CILEVEL=95.

Any assistance would be greatly appreciated. Thanks again!

Cara Sauder
Doctoral Student
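One way to carry out the permutation procedure described above is to script it outside the SPSS dialogs, for example in Python, which SPSS Statistics can also drive through its Python integration (the same Essentials needed by STATS FLEISS KAPPA). The sketch below is only an illustration under assumed names and layout: ratings is taken to be a (units x raters) array of 0-4 codes, matched is a label vector for whichever units carry the matched/mismatched distinction in the actual design, and Fleiss' kappa comes from statsmodels.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def kappa_of(block):
    """Fleiss' kappa for a (units x raters) block of category codes 0-4."""
    counts, _ = aggregate_raters(block, n_cat=5)  # units x categories table, codes assumed 0..4
    return fleiss_kappa(counts, method='fleiss')

def perm_test_km_kmm(ratings, matched, n_perm=10000, seed=1):
    """Two-sided permutation test of Km - Kmm (assumed data layout, see note above).

    ratings: (units x raters) array of 0-4 codes.
    matched: boolean vector marking which units carry the "matched" label;
             if the label really attaches to rater-by-video combinations,
             permute those units instead - the logic is the same.
    """
    rng = np.random.default_rng(seed)
    ratings = np.asarray(ratings)
    matched = np.asarray(matched, dtype=bool)
    observed = kappa_of(ratings[matched]) - kappa_of(ratings[~matched])
    extreme = 0
    for _ in range(n_perm):
        relabel = rng.permutation(matched)        # random reassignment of the labels
        diff = kappa_of(ratings[relabel]) - kappa_of(ratings[~relabel])
        if abs(diff) >= abs(observed):
            extreme += 1
    p_value = (extreme + 1) / (n_perm + 1)        # Monte Carlo p-value
    return observed, p_value

# usage: obs_diff, p = perm_test_km_kmm(ratings, matched)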
Cara,
Any agreement statistic that was created for nominal data has difficulty with missing data, so Fleiss' kappa is not really a viable option here. Based on a 1973 article by Fleiss and Cohen (reference below) demonstrating the equivalence of weighted kappa and the intraclass correlation coefficient (ICC), I'd suggest the latter. Your measures suggest that the data are at least ordinal, if not interval, in nature, and that justifies using an ICC rather than a weighted kappa. Try that. You'll need to decide which model to use from the following list.

Model 1: Raters are a random sample from a specified population of raters, and each rater does not rate all subjects/objects; therefore each subject/object is rated by a potentially different set of raters.
Model 2: Raters are a random sample from a specified population of raters, and each rater rates each subject/object.
Model 3: Raters constitute the entire population of raters, and each rater rates each subject/object.

Good luck.

Brian

Fleiss, J. L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33, 613-619.
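In SPSS itself, these three models correspond to the One-Way Random, Two-Way Random, and Two-Way Mixed options under Analyze > Scale > Reliability Analysis > Statistics > Intraclass correlation coefficient (the RELIABILITY procedure). For anyone who wants to see the arithmetic behind the single-rater forms, here is a rough from-scratch sketch in Python using the standard Shrout and Fleiss mean-square formulas; the function and variable names are illustrative, and the rating matrix is assumed to have no missing cells.

import numpy as np

def single_rater_iccs(x):
    """ICC(1,1), ICC(2,1), ICC(3,1) for an (n subjects x k raters) matrix."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_total = ((x - grand) ** 2).sum()
    ss_subj = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_rater = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between raters
    ss_err = ss_total - ss_subj - ss_rater                # residual
    msr = ss_subj / (n - 1)
    msc = ss_rater / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    msw = (ss_rater + ss_err) / (n * (k - 1))             # within subjects
    icc1 = (msr - msw) / (msr + (k - 1) * msw)                        # Model 1
    icc2 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)  # Model 2
    icc3 = (msr - mse) / (msr + (k - 1) * mse)                        # Model 3
    return icc1, icc2, icc3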
With a sample of 4, you should expect huge error-variance in any measure like kappa or r or ICC, so you don't have enough data. Measures of reliability are properly remembered as "reliability in THIS sample."
I've analyzed data where psychiatric residents (called "students") rated videos of inpatient interviews for several "symptoms present." There were 5 sets of data, about 15 students in each class, and six videos for each class. "HALLUCINATIONS" showed /excellent/ correlation, the best of all agreements, in the /one/ class where one videoed patient (of 6) was hallucinating. "HALLU" showed essentially zero correlation (though "agreement" was nearly perfect, with ratings all near None) in the other 4 classes. So it turned out that talking about "points" as the effect size was more useful than correlations. A random N of 6 was not good for comparing correlations across symptoms when the base rate of any one symptom was arbitrary.
Measures of correlation are measures of co-variation, with the emphasis on "variation." It is too optimistic to hope to say much about correlations with N=4 unless, say, you have ensured that the full range of <whatever> is present in the sample.

I think that comparing "means" - the usual starting point - should be where you begin. Does creating or dropping an expectation cause a gain or loss in score? How many points? "Points" should be more meaningful to readers than a "change in reliability," even if you could measure the latter.

In any case, I'm sure I'm unsure about your design. I gather that most of the time (3/4), raters are misinformed about the diagnosis, which seems like an odd choice. I would have opted for the larger subset to have "good" information, so I could assess baseline, "best" performance before moving on to look at the effect of misinformation.
-- Rich Ulrich
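As a concrete illustration of the "points" framing Rich suggests (with fabricated numbers, not the study data), a few lines of Python on a hypothetical long-format table with one row per rating and made-up column names 'condition' and 'severity':

import pandas as pd

# Fabricated example rows; real data would have one row per rater x video x parameter.
df = pd.DataFrame({
    'condition': ['matched', 'matched', 'mismatched', 'mismatched', 'mismatched'],
    'severity':  [1, 2, 3, 2, 3],
})
means = df.groupby('condition')['severity'].mean()
shift = means['mismatched'] - means['matched']
print(means)
print(f"Misleading vignette shifts the mean rating by {shift:+.2f} points on the 0-4 scale")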
Thank you both for your insight. Rich, you are right in thinking that this question addresses a secondary hypothesis. The primary aim is to see whether rated severity increases along a dimension when the history leads a rater to expect more abnormality along that particular dimension (similar to the inattentional blindness designs in the psychology literature). That is a more straightforward analysis, and the results will be easier to interpret.

However, since rater reliability is notoriously poor in this area of study, I wondered whether the clinical information was a possible source of error. Following Rich's discussion, I also wonder whether reliability is really as poor as it seems based on the kappa statistic that is commonly reported; if percent absolute agreement were reported in these studies, it might not look so bad. I plan to report percent absolute agreement as well.

My null hypothesis for the secondary aim is that there is no difference in rater reliability when clinical information is matched versus mismatched to the video presentation. Before I began the study, the stats consultant recommended using Fleiss' kappa and a permutation test to test this hypothesis formally, rather than just reporting descriptive statistics for each group. However, he has since retired, and I did not get to follow up with him about how to carry out that analysis.

I realize it was not clear in my description that all 32 raters rated all 10 parameters for all 4 videos. The clinical information is matched to one video for each group of 8 raters. The reason the matched and mismatched conditions are not split 2 and 2 is related to the primary hypothesis. Am I correct in thinking that the general consensus is that this is not a good approach anyway?

Thanks again in advance!

Cara
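For the percent absolute agreement figure, one simple definition is the average, over all rater pairs, of the proportion of items on which the two raters assign identical codes. A minimal sketch, assuming an (items x raters) array of 0-4 codes (the layout and function name are placeholders):

import numpy as np
from itertools import combinations

def pairwise_percent_agreement(ratings):
    """Mean proportion of items on which each pair of raters gives the same code."""
    ratings = np.asarray(ratings)
    n_raters = ratings.shape[1]
    pair_rates = [(ratings[:, i] == ratings[:, j]).mean()
                  for i, j in combinations(range(n_raters), 2)]
    return float(np.mean(pair_rates))

# Fabricated check: 3 items rated by 4 raters.
print(pairwise_percent_agreement([[0, 0, 1, 0],
                                  [2, 2, 2, 3],
                                  [4, 4, 4, 4]]))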