inter-rater reliability with multiple raters


inter-rater reliability with multiple raters

redeyedfrog
Hi everyone! I need help with a research assignment. I'm new to IBM SPSS Statistics, and actually statistics in general, so I'm pretty overwhelmed.

My coworkers and I created a new observation scale to improve the concise transfer of information between nurses and other psychiatric staff. This scale is designed to facilitate clinical care and outcomes-related research.  Nurses and other staff members on our particular inpatient unit will use standard clinical observations to rate patient behaviors in eight categories (abnormal motor activity, activities of daily living, bizarre/disorganized behavior, medication adherence, aggression, observation status, participation in assessment, and quality of social interactions).  Each category will be given a score of 0-4, and those ratings will be summed to create a total rating. At least two nurses will rate each patient during each shift, morning and evening (so one patient should theoretically have at least four ratings per day).

My assignment is to examine the reliability and validity of this new scale, and determine its utility for transfer of information.

Right now I'm trying to figure out how to examine inter-rater reliability. IBM SPSS doesn't have a program to calculate Fleiss kappa (that I know of) and I'm not sure if that's what I should be calculating anyway...I'm confused because there are multiple raters, multiple patients, and multiple dates/times/shifts. The raters differ from day to day even on the same patient's chart, so there is a real lack of consistency in the data. Also sometimes only one rating is done on a shift...sometimes the nurses skip a shift of rating altogether. Also there are different lengths of stay for each patient, so the amount of data collected for each one differs dramatically.

I've attached a screenshot of part of our unidentified data. Can anyone please help me figure out how to determine inter-rater reliability? (Or if anyone has any insight into how to determine validity, that'd be great too!)

Thanks so much!


Re: inter-rater reliability with multiple raters

MaxJasper

Check these out; they may help:

ftp://ftp.boulder.ibm.com/software/analytics/spss/support/Stats/Docs/Statistics/Macros/Iccsf.htm

[quoted original post snipped]



Re: inter-rater reliability with multiple raters

Rich Ulrich
In reply to this post by redeyedfrog
This is not really about using SPSS, but about analyzing data.  You can
continue this by email if you want, especially for the clinical aspects. 
My own work experience was psychiatric research (though "outpatient"
for the most part).

Whatever you do that fits some good model is going to select a small
part of the data that exists.  For instance, you *might* look at the first
pair of ratings for each patient (selecting where a pair exists), and 
nothing else, in order to produce a simple, fairly ordinary ICC.  That is
mainly useful if you have at least a few dozen patients.  There is, of
course, a difference in what you should expect if the two ratings are
not based on "observation that they share."  That is, in my
experience with psychiatric data, most ratings came from interviews, or
from viewing tapes of interactions.  Two raters independently interacting
during a shift will have different experiences.

Whatever you do, you should start by documenting how much data
you actually have:  How many patients?  How many raters?  How many
ratings?  How many periods with at least a pair of ratings? 

And then:  Who are you trying to impress with the data?  What comes next? 
Is this regarded as a pilot experience for something else?  Would you
consider it as a tool for training the raters to achieve better consistency,
or for discussing *differences* so that you might review and revise the
anchors that describe the behaviors?  (Have you looked at the manuals
for IMPS and BPRS?  Did you start with them?)


Just to see how the variation exists, I would do a set of ANOVAs that
tests PatientID, RaterID, and something to do with duration of stay.
I would test those, also, for "first week only" and "later weeks". 
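
A minimal sketch of that kind of ANOVA, assuming the ratings are stacked in long format with one row per rating and hypothetical variable names total_rating, PatientID, RaterID, and week (none of these names come from the original data):

UNIANOVA total_rating BY PatientID RaterID week
  /METHOD=SSTYPE(3)
  /PRINT=ETASQ
  /DESIGN=PatientID RaterID week.

* Repeat on the first-week subset only (week coded 1 for the first week of each stay).
TEMPORARY.
SELECT IF (week = 1).
UNIANOVA total_rating BY PatientID RaterID
  /METHOD=SSTYPE(3)
  /PRINT=ETASQ
  /DESIGN=PatientID RaterID.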

Assuming that there are new admissions, psychiatric patients show
most pathology in the first week.  It might be that the useful variation only
exists within the first week, and that you can ignore the data after that
with very little loss of generality...  or, with a special point to be made about
ratings differences that exist later on during the stay.

As to "validity" -- the early-admission ratings should correlate with diagnosis,
assuming that there is some wide variation in diagnosis... which is not entirely
likely if these are all from one unit with the same basic Dx.  Anything to do with
outcome might be a little bit interesting, but you cannot separate cause from
effect, since you have to assume that the doctors and nurses do pay *some*
attention to the experience and opinions of each other.

--
Rich Ulrich



> Date: Fri, 13 Jun 2014 19:54:56 -0700
> From: [hidden email]
> Subject: inter-rater reliability with multiple raters
> To: [hidden email]
>
> [quoted original post snipped]
> <http://spssx-discussion.1045642.n5.nabble.com/file/n5726465/deidentified_data.jpg>


Re: inter-rater reliability with multiple raters

tzoblotsky
In reply to this post by redeyedfrog
There was a discussion of this topic in an earlier thread: http://spssx-discussion.1045642.n5.nabble.com/Fleiss-Kappa-Inter-rater-reliability-questions-td5717299.html

You may also want to look into G Theory as an option, as it allows you to look at different combinations of sources of variance in your outcomes (e.g., raters, occasions).  For example, see http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=85843 (sorry, but could not find a link to a free version of the article).
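
A minimal one-facet G-study sketch along those lines, assuming the ratings have been stacked in long format with hypothetical variables rating, patient, and rater (VARCOMP returns the variance components that the G coefficients are built from):

VARCOMP rating BY patient rater
  /RANDOM=patient rater
  /METHOD=SSTYPE(3)
  /PRINT=SS
  /DESIGN=patient rater
  /INTERCEPT=INCLUDE.

The relative and absolute G coefficients are then computed by hand from the printed variance components: patient variance divided by patient variance plus the error variance (and, for the absolute coefficient, also the rater variance) divided by the number of raters.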

Re: inter-rater reliability with multiple raters

Bruce Weaver
Administrator
tzoblotsky wrote
There was a discussion of this topic in an earlier thread: http://spssx-discussion.1045642.n5.nabble.com/Fleiss-Kappa-Inter-rater-reliability-questions-td5717299.html

You may also want to look into G Theory as an option, as it allows you to look at different combinations of sources of variance in you outcomes (e.g., raters, occasions).  For example, see http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=85843 (sorry, but could not find a link to a free version of the article).
Here's a free version:  http://www.researchgate.net/publication/11770441_G_theory_and_the_reliability_of_psychophysiological_measures_a_tutorial/file/9fcfd50dd7cbc6d6bf.pdf
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: inter-rater reliability with multiple raters

Jon K Peck
In reply to this post by MaxJasper
"IBM SPSS doesn't have a program to calculate Fleiss kappa (that I know of) "

See the STATS FLEISS KAPPA custom dialog.  It requires the Python Essentials and can be downloaded from the Utilities menu in Statistics 22 or from the Extension Commands collection of the SPSS Community website (www.ibm.com/developerworks/spssdevcentral).

It provides an overall estimate of kappa, along with its asymptotic standard error, Z statistic, significance (p value) under the null hypothesis of chance agreement, and a confidence interval for kappa. (Standard errors are based on Fleiss et al., 1979 and Fleiss et al., 2003; the test statistic is based on Fleiss et al., 2003.) It also provides these statistics for individual categories, as well as conditional probabilities for categories, which according to Fleiss (1971, p. 381) are the probabilities of a second object being assigned to a category given that the first object was assigned to that category.






Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621




From: Max Jasper <[hidden email]>
To: [hidden email]
Date: 06/16/2014 08:33 AM
Subject: Re: [SPSSX-L] inter-rater reliability with multiple raters
Sent by: "SPSSX(r) Discussion" <[hidden email]>

[quoted message snipped]



Re: inter-rater reliability with multiple raters

bdates
In reply to this post by MaxJasper

Fleiss’ kappa was designed for nominal data.  If your data are ordinal, interval, or ratio, then use the ICC or another related procedure for continuous data.  The ICC provides analyses which have been found to be analogous to Fleiss’ weighted kappa (Fleiss and Cohen, 1973).  The syntax that Max refers to looks like the most promising alternative, as long as you know what model you have.  If you need Fleiss’ kappa syntax because you have nominal data, I can send that to you offline.
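
A minimal sketch of the built-in ICC for that continuous case, assuming the paired shift ratings have been restructured into two hypothetical columns, rater1 and rater2, with one row per patient-shift (the variable names are placeholders):

RELIABILITY
  /VARIABLES=rater1 rater2
  /SCALE('TOTAL') ALL
  /MODEL=ALPHA
  /ICC=MODEL(RANDOM) TYPE(ABSOLUTE) CIN=95 TESTVAL=0.

TYPE(CONSISTENCY) can be substituted for TYPE(ABSOLUTE) if systematic mean differences between raters should not count against agreement.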

 

Brian

 


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Max Jasper
Sent: Friday, June 13, 2014 11:26 PM
To: [hidden email]
Subject: Re: inter-rater reliability with multiple raters

[quoted message snipped]


Re: inter-rater reliability with multiple raters

Ryan
An ICC to estimate interrater reliability can be calculated using the MIXED procedure in SPSS, which can handle various designs. I believe I have posted on this topic in the past for at least one scenario, perhaps two. 
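
A minimal sketch of that MIXED approach, assuming the ratings are stacked in long format with hypothetical variables rating and patient; the ICC is then the patient variance divided by the patient variance plus the residual, read from the covariance parameters table:

MIXED rating BY patient
  /FIXED=| SSTYPE(3)
  /METHOD=REML
  /PRINT=TESTCOV
  /RANDOM=patient | COVTYPE(VC).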

Ryan


On Mon, Jun 16, 2014 at 10:49 AM, Dates, Brian <[hidden email]> wrote:

[quoted messages snipped]



Re: inter-rater reliability with multiple raters

Rich Ulrich
Estimators of ICCs generally want "balanced designs", don't they?
Did you have an answer to that complication?

 - Another poster mentioned G-theory, and I would expect the same
problem.  I never used G-theory, and what I read was long ago, but
I don't remember the issue being dealt with. 

--
Rich Ulrich


Date: Mon, 16 Jun 2014 11:11:44 -0400
From: [hidden email]
Subject: Re: inter-rater reliability with multiple raters
To: [hidden email]

[quoted messages snipped]



Re: inter-rater reliability with multiple raters

Ryan
Okay. Let me start by saying I'm a bit (okay maybe very) under the weather and paid work is catching up with me, so apologies for any typos/mistakes.  Having said that, this topic is quite interesting as it relates to showing some connections between generalizability coefficients, various forms of ICCs, and coefficient alpha. 

Before discussing those connections, however, the short answer to the question about whether there are valid estimators of an ICC in an unbalanced design would be a solid "it depends." I would argue that to obtain a valid ICC we need to appropriately decompose the variance to obtain between subject variance and variance attributable to all other sources. This could prove challenging.

At any rate, if the raters tend to agree in their ratings of each subject, then the between subject variance will tend to be much larger than other sources of variances, and the ICC should approach 1.0. 

With that said, the ICC is defined as:

ICC = var(between Ss) / Total Variance

where,

Total Variance = 
(1) between Ss variance
(2) between Rater Variance
(3) error variance

If one initially planned a crossed design (all subjects were intended to be rated by all raters) but due to random circumstances some raters were unable to rate some subjects, and those missing data can be assumed to be missing at random (MAR), then one could theoretically obtain a valid estimate of the ICC using an ML estimator via the MIXED procedure in SPSS, with the estimated variance components inserted into the following ICC equation:

ICC = var(between Ss )/ [var(between Ss) + error variance]

I believe there are more sophisticated ways to deal with unbalanced designs that have been published in the past 5 years, but I am not fully versed in such methods. With that said, please see below for a small demonstration using SPSS syntax that might help make connections between generalizability coefficients from a one-facet design, ICC, and coefficient alpha using various procedures in SPSS:

DATA LIST LIST / rater1 rater2 rater3 (3f1.0).
BEGIN DATA
3 4 4
1 2 3
4 4 5
7 6 10
1 2 3
END DATA.

*Calculate ICC via the Reliability procedure: MIXED model with CONSISTENCY type. 
RELIABILITY
  /VARIABLES=rater1 rater2 rater3
  /SCALE('ALL VARIABLES') ALL
  /MODEL=ALPHA
  /ICC=MODEL(MIXED) TYPE(CONSISTENCY) CIN=95 TESTVAL=0.

*Calculate ICC via the Reliability procedure: RANDOM model with ABSOLUTE type. 
RELIABILITY
  /VARIABLES=rater1 rater2 rater3
  /SCALE('ALL VARIABLES') ALL
  /MODEL=ALPHA
  /ICC=MODEL(RANDOM) TYPE(ABSOLUTE) CIN=95 TESTVAL=0.

*Restructure dataset from wide to long to more easily obtain variance components.
VARSTOCASES
  /ID=id
  /MAKE rating FROM rater1 rater2 rater3
  /INDEX=rater(3) 
  /KEEP=
  /NULL=KEEP.

MIXED rating BY id rater
  /FIXED=| SSTYPE(3)
  /METHOD=REML
  /RANDOM=id rater | COVTYPE(VC).

*ICC version where all subjects are rated by same/consistent random subset of all possible raters. 
*ICC matches the Single Measure ICC via Reliability Procedure: RANDOM model with ABSOLUTE type. 
*Note: ICC_1 = var(between Ss) / [var(between Ss) + var(raters) + error].
COMPUTE ICC_1 = 5.200 / (5.200 + 0.767 + 0.633).
EXECUTE.

MIXED rating BY id rater
  /FIXED=rater| SSTYPE(3)
  /METHOD=REML
  /RANDOM=id | COVTYPE(VC).

*ICC version where subjects are rated by the assumed population of all possible raters. 
*Matches Single Measure ICC from Reliability Procedure.
*Note: ICC_2 = var(between Ss) / [var(between Ss) + error].
COMPUTE ICC_2= 5.200 / (5.200 + 0.633).
EXECUTE.

MIXED rating BY id rater
  /FIXED=| SSTYPE(3)
  /METHOD=REML
  /RANDOM=id | COVTYPE(VC).

*ICC version where subjects are assumed to be randomly assigned to a sample of raters, assuming a balanced design.
*In the MIXED model above, note that rater is removed from the model entirely.
COMPUTE ICC_3 = 4.944 / (4.944 + 1.400).
EXECUTE.

*Next, let's employ a one-facet G theory model using ANOVA.
DATASET DECLARE vc.
VARCOMP rating BY rater id 
  /RANDOM=rater id
  /OUTFILE=VAREST (vc)
  /METHOD=SSTYPE(3)
  /PRINT=SS
  /DESIGN=rater id
  /INTERCEPT=INCLUDE.

DATASET ACTIVATE vc.
*Using variance components estimated from the mean squares of the ANOVA above.
*the relative and absolute G coefficients are estimated. 
*Note that the relative G coefficient = coefficient alpha.
*Note, also, ICC matches the average measure ICC via Reliability Procedure:. 
*RANDOM model with CONSISTENCY type.
*VC1 = var(rater).
*VC2 = var(subject).
*VC3 = error variance.
*Based on the equation below, it becomes apparent that coefficient alpha assumes.
*that between rater variance ("VC1") = zero. 
COMPUTE relative_g_coeff_3raters = VC2 / (VC2 + VC3/3).

*If we assume the number of raters is 1 (one), we obtain the.
*ICC for single measures calculated from Reliability Procedure.
*and the second MIXED procedure.
COMPUTE relative_g_coeff_1rater = VC2 / (VC2 + VC3/1).

*Next, compute the absolute g coefficient assuming 3 raters.
*Note: Also, ICC matches the average measure ICC via Reliability Procedure:. 
*RANDOM model with ABSOLUTE type. 
COMPUTE absolute_g_coeff_3raters = VC2 / (VC1/3 + VC2 + VC3/3).

*Next, compute the absolute g coefficient assuming 1 rater.
*Note: Matches ICC estimated from first MIXED model which assumes.
*that subjects and raters are random.
*Note: Also, ICC matches the single measure ICC via Reliability Procedure:. 
*RANDOM model with ABSOLUTE type. 
COMPUTE absolute_g_coeff_1rater = VC2 / (VC1/1 + VC2 + VC3/1).
EXECUTE.



Re: inter-rater reliability with multiple raters

Rich Ulrich

ICCs with unequal Ns -

I got a reference in 1995   for computing a simple ICC with unequal Ns from the
Usenet stats group, sci.stat.consult.  I put the formula (below) in my stats-FAQ,
which I maintained from 1997 to about 2006.  No one ever provided newer references.

This is owing to Ernest Haggard, Intraclass Correlation and the Analysis of Variance
(1958), as posted by Michael Bailey and reformatted and adapted by me.  I hope I have
not screwed it up.

Let, R=intraclass correlation,
    BSMS=between Subject mean square,
    WMS=mean square within,
    c=number of Subjects, and
    ki=number for the ith Subject. Then:

         R = (BSMS - WMS) / ( BSMS + (k' - 1)*WMS )
         where k' = [ sum(ki) - (sum(ki**2))/sum(ki) ] / (c - 1)


The value of k' does need to work out to something in the range of an average ki
number of ratings for the subjects.  I think that I remember using the "reciprocal
mean" of the counts, but I don't remember using this formula for k' to get it.



--
Rich Ulrich


Date: Tue, 17 Jun 2014 22:49:46 -0400
From: [hidden email]
Subject: Re: inter-rater reliability with multiple raters
To: [hidden email]

[quoted message snipped]

Re: inter-rater reliability with multiple raters

Ryan
Okay, I'm going to try that formula out on a small dataset and compare it to the mixed model (ML estimation) approach I suggested. More later. 

Sent from my iPhone

On Jun 18, 2014, at 12:48 PM, Rich Ulrich <[hidden email]> wrote:


[quoted messages snipped]

Re: inter-rater reliability with multiple raters

Bruce Weaver
Administrator
In reply to this post by Rich Ulrich
Rich, do you still have all of the files for your FAQ?  If so, someone really ought to get them back on a website somewhere.  


Rich Ulrich wrote
[quoted message snipped]
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."


Re: inter-rater reliability with multiple raters

Rich Ulrich
Yes, I have them.  Everything (old version, new version) is in a couple of
directories, with a fair amount of duplication, and with the same date of
3/29/2004.  I think the proper content amounts to 150 files, 1.5 MB or so.
I suppose that much disk space can be rented at trivial cost.  I don't know
how to go about doing that - who to use, etc.

I just now realized how easy it would be to install them -- almost all the
file links are relative to "this directory", so I can go to the indexing file
and click on them right now.

It could use some updating on some things.  It became easier to Google
than to use my own indexes, even when Google would point to my FAQ.
And Google would give the more recent posts.

I know that I have seen some better references cited on a number of topics
since I stopped adding things in about 1998.  And sometimes, some better
comments.  One initial approach might be that I could just insert comments
*saying* that this can be improved.  Then I would want a "Comments"
section where people could submit suggestions.  Or mail me old posts
that make good points.

Someone want to give me advice about doing this?

--
Rich Ulrich


> Date: Wed, 18 Jun 2014 18:39:59 -0700
> From: [hidden email]
> Subject: Re: inter-rater reliability with multiple raters
> To: [hidden email]
>
> Rich, do you still have all of the files for your FAQ? If so, someone really
> ought to get them back on a website somewhere.
>
[snip]

Re: inter-rater reliability with multiple raters

Mike
Just a suggestion: a wiki(pedia)-type website might be a good
format.  This would allow new information to be added easily as
well as maintaining some degree of control over content. One
source on this is the following:
 
Whether this is the best format will depend upon the resources
you have available (e.g., time, whether you can get free hosting,
colleagues to act as editors/admins).
 
-Mike Palij
New York University
 
 
----- Original Message -----
Sent: Thursday, June 19, 2014 2:26 AM
Subject: Re: inter-rater reliability with multiple raters

[quoted message snipped]

Re: inter-rater reliability with multiple raters

Bruce Weaver
Administrator
In reply to this post by Rich Ulrich
The current version of my stats web-pages was set up via Google Sites (https://sites.google.com/a/lakeheadu.ca/bweaver/Home/statistics).  I moved (most) things there (from Angelfire) when my university started using a version of Gmail.  As part of the new e-mail system, each user could set up a personal website.  I've never had a "regular" Gmail account, but would guess that users of those accounts also have access to Google Sites.  So that would be one option to consider.  

If John Hall is reading this, maybe he can chip in with his thoughts about the "weebly" site where his survey analysis workshop is housed (http://surveyresearch.weebly.com/1-survey-analysis-workshop.html).  

And there's wordpress.com, where Andy has his site (http://andrewpwheeler.wordpress.com/).

There are a few, for a start.


Rich Ulrich wrote
[quoted message snipped]
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."


Re: inter-rater reliability with multiple raters

Andy W
I use Wordpress + Dropbox. A very simple solution (no programming required) is to just save the files via Dropbox, and then use Wordpress (or any blogging platform) as a front page to point to the files with a bit of organization. Dropbox is free for smaller file sizes (but at this point I pay a subscription and back everything up with it) and Wordpress is free as well (unless you want to buy your own domain name).

If you are just posting text snippets of code - sharing Dropbox links will be fine. If you are uploading large data files though it can become problematic because (in the free version) they throttle your downloads after so many megabytes.

If you want something nicer looking it will take more work, but that is my minimal and free suggestion to anyone starting out. The wiki suggestion by Mike is interesting as well - with Wordpress you can have multiple editors, so it can be wiki-like. Dropbox has a system like this as well, but I have not used it.

Post back to the list, Rich, when you get a site up. I should probably add this to my blog, but I basically set up the tag wiki on the CrossValidated site for my suggested SPSS references, http://stats.stackexchange.com/tags/spss/info.
Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

Re: inter-rater reliability with multiple raters

bdates
In reply to this post by Mike
I've just been contracted to set up a learning community around performance measurement/management.  We've set up a Google Blog, and we're attaching our files for upload on a Google Drive.  I'm offering that as something to investigate...and also asking for feedback on how pedestrian it might be. I should be finished with my edit of the monograph on interrater agreement for nominal scales in the next two weeks, including the edits of all the syntax.  I'm trying to find a suitable place to post the info so listserv members can access it.  Marta Garcia Granero set up a website for her contributions, I believe, so maybe that's the way to go?

Brian

From: SPSSX(r) Discussion [[hidden email]] on behalf of Mike Palij [[hidden email]]
Sent: Thursday, June 19, 2014 7:24 AM
To: [hidden email]
Subject: Re: inter-rater reliability with multiple raters

[quoted messages snipped]

Re: inter-rater reliability with multiple raters

Ryan
In reply to this post by Rich Ulrich
The linear MIXED modeling (LMM) procedure is a preferred approach to estimating parameters using REML for unbalanced designs where the data are assumed to be missing at random (MAR). Consistent with the underlying theory of LMM, I have generated data below for which the number of ratings per subject varies according to a random uniform variate. I then estimate the between subject variance and residual variance using the default REML estimation method offered by the MIXED procedure, which is all that is needed to calculate an ICC. 

It should be noted that the linear MIXED model I have employed conforms to one way of estimating an ICC for a specific design described by:

Shrout, P.E. and Fleiss, J.L (1979). Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 86, 420-428

where,

1. Each subject is rated by multiple raters
2. Raters are assumed to be randomly assigned to subjects
3. All subjects have the same number of raters

Note that this is consistent with one of the examples in my previous post:


The data generated below conform to 1. and 2., but obviously not 3. However, as I mentioned  before, as long as the data are MAR, the REML estimators should reasonably recover the parameters. Using the example below, the population between subjects VC = .300 and the population ICC=.231. The MIXED model estimates the VC = .306 and the estimated ICC = 0.230.

Note: I generated data that do not specify which raters were assigned to each subject. This post is not intended to compare the proposed method to other methods, but simply to provide a concrete example of how one might estimate an ICC where each subject has been rated by a varying random number of raters.

Ryan
--

/*Generate Data*/.
SET SEED 987879546.
NEW FILE.
INPUT PROGRAM.
COMPUTE subject_ID = -99.
COMPUTE intercept= -99.
COMPUTE nobs = -99.
LEAVE subject_ID to nobs.
LOOP subject_ID = 1 to 100.
COMPUTE intercept = sqrt(.30)*rv.normal(0,1).
COMPUTE nobs = rnd(rv.uniform(12,35)).
LOOP j=1 to nobs.
COMPUTE y = intercept + rv.normal(0,1).
END CASE.
END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.

*Fit Linear Mixed Model.
MIXED y BY subject_ID
  /FIXED= | SSTYPE(3)
  /METHOD=REML
  /PRINT= G 
  /RANDOM=subject_ID.

*Estimate ICC.
COMPUTE ICC =.30628546692761 / (.30628546692761 + 1.02273871556984 ).
EXECUTE.


On Wed, Jun 18, 2014 at 12:48 PM, Rich Ulrich <[hidden email]> wrote:

[quoted messages snipped]
