
Re: inter-rater reliability with multiple raters

Posted by Rich Ulrich on Jun 14, 2014; 5:36am
URL: http://spssx-discussion.165.s1.nabble.com/inter-rater-reliability-with-multiple-raters-tp5726465p5726467.html

This is not really about using SPSS, but about analyzing data.  You can
continue this by email if you want, especially for the clinical aspects. 
My own work experience was in psychiatric research (though "outpatient"
for the most part).

Whatever you do that fits some good model is going to use only a small
part of the data that exists.  For instance, you *might* look at the first
pair of ratings for each patient (selecting where a pair exists), and
nothing else, in order to produce a simple, fairly ordinary ICC.  That is
mainly useful if you have at least a few dozen patients.  There is, of
course, a difference in what you should expect if the two ratings are not
based on observation that the raters share.  That is, in my experience
with psychiatric data, most ratings came from interviews, or from viewing
tapes of interactions.  Two raters who interact with a patient
independently during a shift will have different experiences.
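
If you do go the first-pair route, the SPSS side is simple.  Here is a
minimal sketch, assuming you have restructured to one row per patient,
with the two totals from the first shift that has a pair held in
hypothetical variables rating1 and rating2:

  * One row per patient; rating1 and rating2 are the two nurses' totals from the first shared shift.
  RELIABILITY
    /VARIABLES=rating1 rating2
    /SCALE('FIRST PAIR') ALL
    /MODEL=ALPHA
    /ICC=MODEL(RANDOM) TYPE(ABSOLUTE) CIN=95 TESTVAL=0.

MODEL(RANDOM) with TYPE(ABSOLUTE) gives the two-way random,
absolute-agreement ICC; MODEL(ONEWAY) is the alternative if the two
raters are not the same consistent pair across patients, which, with
your rotating staff, they are not.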

Whatever you do, you should start by documenting how much data
you actually have:  How many patients?  How many raters?  How many
ratings?  How many periods with at least a pair of ratings? 
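
A minimal sketch of that bookkeeping, assuming a long file with one row
per rating and hypothetical variables PatientID, RaterID, ShiftDate, and
ShiftTime:

  * Count ratings per patient per shift, then see how many shifts have at least a pair.
  AGGREGATE
    /OUTFILE=* MODE=ADDVARIABLES
    /BREAK=PatientID ShiftDate ShiftTime
    /n_ratings=N.
  FREQUENCIES VARIABLES=n_ratings.

The frequency table for n_ratings tells you at a glance how often you
actually got the "at least two raters per shift" you were planning on.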

And then:  Who are you trying to impress with the data?  What comes next? 
Is this regarded as a pilot experience for something else?  Would you
consider it as a tool for training the raters to achieve better consistency,
or for discussing *differences* so that you might review and revise the
anchors that describe the behaviors?  (Have you looked at the manuals
for IMPS and BPRS?  Did you start with them?)


Just to see where the variation lies, I would run a set of ANOVAs testing
PatientID, RaterID, and something to do with duration of stay.
I would also run those separately for "first week only" and "later weeks".
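
A minimal sketch of that, assuming a long file with hypothetical
variables total_score (the 0-32 sum), PatientID, RaterID, and
week_of_stay:

  * Rough variance breakdown across patients and raters.
  UNIANOVA total_score BY PatientID RaterID
    /DESIGN=PatientID RaterID.
  * Same thing restricted to the first week of stay.
  TEMPORARY.
  SELECT IF (week_of_stay = 1).
  UNIANOVA total_score BY PatientID RaterID
    /DESIGN=PatientID RaterID.

This is exploratory only -- PatientID has many levels and the raters are
not fully crossed with the patients, so look at the relative sizes of the
mean squares rather than at the p-values.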

Assuming that there are new admissions, psychiatric patients show
the most pathology in the first week.  It might be that the useful variation
only exists within the first week, and that you can ignore the data after that
with very little loss of generality...  or, alternatively, with a special point
to be made about rating differences that show up later in the stay.

As to "validity" -- the early-admission ratings should correlate with diagnosis,
assuming that there is some wide variation in diagnosis... which is not entirely
likely if these are all from one unit with the same basic Dx.  Anything to do with
outcome might be a little bit interesting, but you cannot separate cause from
effect, since you have to assume that the doctors and nurses do pay *some*
attention to the experience and opinions of each other.
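
If you do have some spread in diagnosis, a minimal sketch (with
hypothetical dx_group coding the primary diagnosis and total_score being
the first-week rating):

  * Do first-week totals differ across diagnostic groups.
  ONEWAY total_score BY dx_group
    /STATISTICS=DESCRIPTIVES.

The group means in the DESCRIPTIVES output are usually more informative
here than the overall F, especially with uneven group sizes.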

--
Rich Ulrich



> Date: Fri, 13 Jun 2014 19:54:56 -0700

> From: [hidden email]
> Subject: inter-rater reliability with multiple raters
> To: [hidden email]
>
> Hi everyone! I need help with a research assignment. I'm new to IBM SPSS
> Statistics, and actually to statistics in general, so I'm pretty overwhelmed.
>
> My coworkers and I created a new observation scale to improve the concise
> transfer of information between nurses and other psychiatric staff. This
> scale is designed to facilitate clinical care and outcomes related research.
> Nurses and other staff members on our particular inpatient unit will use
> standard clinical observations to rate patient behaviors in eight categories
> (abnormal motor activity, activities of daily living, bizarre/disorganized
> behavior, medication adherence, aggression, observation status,
> participation in assessment, and quality of social interactions). Each
> category will be given a score 0-4, and those ratings will be summed to
> create a total rating. At least two nurses will rate each patient during
> each shift, morning and evening (so one patient should theoretically have at
> least four ratings per day).
>
> My assignment is to examine the reliability and validity of this new scale,
> and determine its utility for transfer of information.
>
> Right now I'm trying to figure out how to examine inter-rater reliability.
> IBM SPSS doesn't have a program to calculate Fleiss kappa (that I know of)
> and I'm not sure if that's what I should be calculating anyway...I'm
> confused because there are multiple raters, multiple patients, and multiple
> dates/times/shifts. The raters differ from day to day even on the same
> patient's chart, so there is a real lack of consistency in the data. Also
> sometimes only one rating is done on a shift...sometimes the nurses skip a
> shift of rating altogether. Also there are different lengths of stay for
> each patient, so the amount of data collected for each one differs
> dramatically.
>
> I've attached a screenshot of part of our unidentified data. Can anyone
> please help me figure out how to determine inter-rater reliability? (Or if
> anyone has any insight into how to determine validity, that'd be great too!)
>
> Thanks so much!
>
> <http://spssx-discussion.1045642.n5.nabble.com/file/n5726465/deidentified_data.jpg>
>
>