Interrater reliability


Interrater reliability

Lovins, Brian (lovinsbk)

Good afternoon,

I am looking to calculate kappa as a measure of interrater reliability. I currently have 125 subjects being rated by different pairs of staff. Each pair assesses the same person, but there are 125 different pairs of staff. I want to calculate the overall kappa for the entire group. I can do it for the individual pairs and average the scores, but I was hoping there was a syntax/macro I could use to calculate the overall kappa. The data are formatted as follows, but I can restructure the data if needed:

 

Rater1a              Rater1b              Rater2a              Rater2b
Item 1 - Subject 1   Item 1 - Subject 1   Item 1 - Subject 2   Item 1 - Subject 2
Item 2 - Subject 1   Item 2 - Subject 1   Item 2 - Subject 2   Item 2 - Subject 2

Thanks

Brian
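For a single pair, kappa can be obtained with CROSSTABS; a minimal sketch, assuming the layout above really is one row per item with the two ratings for subject 1 held in Rater1a and Rater1b:

CROSSTABS
  /TABLES=Rater1a BY Rater1b
  /STATISTICS=KAPPA
  /CELLS=COUNT.

Repeating this for each pair and averaging the kappas is the per-pair approach described above; a single overall coefficient needs the data restructured, as the replies below discuss.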


Re: Interrater reliability

Maurice Vergeer
I can't help you with the kappa coefficient, but you might want to read Hayes & Krippendorff (2007), "Answering the Call for a Standard Reliability Measure for Coding Data," Communication Methods and Measures, 1(1), 77–89, which argues in favor of Krippendorff's alpha.
Follow this link for the PDF and an SPSS macro for calculating alpha: http://www.afhayes.com/spss-sas-and-mplus-macros-and-code.html
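Once the macro file has been run in a session it defines a KALPHA command; going by the description in the paper, a call looks roughly like the sketch below, with hypothetical variables r1 to r4 holding each coder's ratings (one row per unit), level = 1 for nominal data, and boot giving the number of bootstrap samples. The exact subcommands should be checked against the documentation that comes with the macro.

KALPHA judges = r1 r2 r3 r4 /level = 1 /detail = 0 /boot = 10000.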

HTH
Maurice

On Wed, Aug 3, 2011 at 8:41 PM, Lovins, Brian (lovinsbk) <[hidden email]> wrote:

--- Original message snipped ---




--
___________________________________________________________________
Maurice Vergeer
Department of communication, Radboud University  (www.ru.nl)
PO Box 9104, NL-6500 HE Nijmegen, The Netherlands

Visiting Professor Yeungnam University, Gyeongsan, South Korea

Recent publications:
-Vergeer, M., Eisinga, R. & Franses, Ph.H. (forthcoming). Supply and demand effects in television viewing. A time series analysis. Communications - The European Journal of Communication Research.
-Vergeer, M., Lim, Y.S., & Park, H.W. (forthcoming). Mediated Relations: New Methods to Study Online Social Capital. Asian Journal of Communication.
-Vergeer, M., Hermans, L., & Sams, S. (forthcoming). Online social networks and micro-blogging in political campaigning: The exploration of a new campaign tool and a new campaign style. Party Politics.
-Pleijter, A., Hermans, L. & Vergeer, M. (forthcoming). Journalists and journalism in the Netherlands. In D. Weaver & L. Willnat, The Global Journalist in the 21st Century. London: Routledge.

Webspace
www.mauricevergeer.nl
http://blog.mauricevergeer.nl/
www.journalisteninhetdigitaletijdperk.nl
maurice.vergeer (skype)
___________________________________________________________________






Re: Interrater reliability

Bruce Weaver
In reply to this post by Lovins, Brian (lovinsbk)
One of my former bosses would no doubt urge you to consider using G-theory (generalizability theory).  You can read about it in chapter 9 of Streiner & Norman's Health Measurement Scales.  Here's the Google Books link:

http://books.google.ca/books?id=UbKijeRqndwC&printsec=frontcover&dq=health+measurement+scales+streiner&hl=en&ei=bas5TuDDHorMsQKDvsg0&sa=X&oi=book_result&ct=result&resnum=1&ved=0CDAQ6AEwAA#v=onepage&q&f=false

As you'll see there, Norman & Streiner recommend a (free) program called G_String (http://fhsperd.mcmaster.ca/g_string/index.html).

HTH.


Lovins, Brian (lovinsbk) wrote
--- Original message snipped ---
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: Interrater reliability

Ryan
In reply to this post by Lovins, Brian (lovinsbk)
What types of ratings were made?

Ryan

On Wed, Aug 3, 2011 at 2:41 PM, Lovins, Brian (lovinsbk) <[hidden email]> wrote:

--- Original message snipped ---



Re: Interrater reliability

Lovins, Brian (lovinsbk)
In reply to this post by Lovins, Brian (lovinsbk)
Ryan
Most of the ratings are dichotomous, but a few have 3 categories. There are 36 items in total.
Thanks
Brian
Sent from my Samsung Epic™ 4G


R B wrote:

What types of ratings were made?

Ryan

On Wed, Aug 3, 2011 at 2:41 PM, Lovins, Brian (lovinsbk) <[hidden email]> wrote:

--- Original message snipped ---



Validity of K Means Cluster

Jeanne Eidex

Hi Everyone,

 

This might be an overly simple question, but the output of this simple clustering syntax doesn’t offer much information to determine how reliable these clusters are.  Any suggestions?

 

QUICK CLUSTER q21 q22 q23 q24 q25 q26 q27 q28 q29 q30 q31 q32 q33 q34
  /MISSING=PAIRWISE
  /CRITERIA=CLUSTER(3) MXITER(10) CONVERGE(0)
  /METHOD=KMEANS(NOUPDATE)
  /PRINT INITIAL.

 

Thanks,

Jeanne

 

 


Re: Interrater reliability

Rich Ulrich
In reply to this post by Lovins, Brian (lovinsbk)
First, I must say that I am puzzled by the data description, and possibly by the layout.

You say there are 125 subjects.  You say there are 125 "different pairs" of staff.  Are there only 125 ratings, each with a new pair?  How many staff are involved?

Do the data identify, at all, which staff made which ratings?  The way I read the data format, there is just one line per item and two columns per subject; there is *no* identification of rater except as a/b; and each line has 125 x 2 scores.  It looks like there are multiple subjects on one line, which is certainly not a reasonable form for prospective statistical analyses.

You certainly cannot do any reasonable analog of a kappa if the raters are not identified.  One usual and reasonable way to organize the data, at the start, would be as one line per rating, with SubjectID and RaterID followed by a set of items.

If you want a pair of ratings on one line, then the line should include the subject ID and the RaterID with each rating.

I remember reading the observation that direct computation of kappa (dichotomous or weighted - the only respectable versions) is not really necessary for complicated designs, since it usually agrees, to two decimals, with the intraclass correlation.  That means you would have a good approximation of kappa if you identify Subject and Rater, do the two-way ANOVA, and compute the ICC.  Perhaps you can try that.

Unfortunately, computing the ICC with unequal Ns is not well documented.  As a further misfortune, any newcomer's computation of the ICC is plagued with errors, because there is an extremely strong tendency to swap the d.f. counts (groups, subjects) in the formulas.
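If the file were reorganized with one row per subject and the two raters' scores on a given item side by side (hypothetical variables rating_a and rating_b), RELIABILITY will report the ICC; this is only a sketch, and the MODEL/TYPE choice (a two-way random model with absolute agreement here) has to be matched to the actual design:

RELIABILITY
  /VARIABLES=rating_a rating_b
  /SCALE('ALL VARIABLES') ALL
  /MODEL=ALPHA
  /ICC=MODEL(RANDOM) TYPE(ABSOLUTE) CIN=95 TESTVAL=0.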

--
Rich Ulrich





On Wed, 3 Aug 2011 14:41:37 -0400, Lovins, Brian (lovinsbk) wrote:

--- Original message snipped ---


Re: Validity of K Means Cluster

Art Kendall
In reply to this post by Jeanne Eidex
What does the word "reliable" mean to you?
How many cases do you have?
What is the nature of your data?
Are you sure that some other PROXIMITY measure might not be preferable?

I notice you did not save the cluster assignments.  Since k-means is very dependent on case order, it is good practice to try some runs with the cases in random order.
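A sketch of both steps in syntax, saving the memberships from two runs so they can be compared; the saved variable names and the random sort key are just placeholders:

* Run 1: save the cluster membership.
QUICK CLUSTER q21 q22 q23 q24 q25 q26 q27 q28 q29 q30 q31 q32 q33 q34
  /MISSING=PAIRWISE
  /CRITERIA=CLUSTER(3) MXITER(10) CONVERGE(0)
  /METHOD=KMEANS(NOUPDATE)
  /SAVE CLUSTER(qcl_run1)
  /PRINT INITIAL.

* Put the cases in a random order and run again.
COMPUTE sortkey = RV.UNIFORM(0,1).
SORT CASES BY sortkey.
QUICK CLUSTER q21 q22 q23 q24 q25 q26 q27 q28 q29 q30 q31 q32 q33 q34
  /MISSING=PAIRWISE
  /CRITERIA=CLUSTER(3) MXITER(10) CONVERGE(0)
  /METHOD=KMEANS(NOUPDATE)
  /SAVE CLUSTER(qcl_run2)
  /PRINT INITIAL.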

Art Kendall
Social Research Consultants

On 8/5/2011 10:33 AM, Jeanne Eidex wrote:

--- Original message snipped ---

Re: Validity of K Means Cluster

Hector Maletta

Clusters, obtained by k-means or otherwise, are not meant to be “reliable” in themselves. In the case of k-means, as Art Kendall observes, reordering the cases may alter the results, because the initial cases are taken as initial cluster centers and the other cases are added sequentially to the various clusters. A practical way to implement Art’s suggestion would be to save the cluster allocation from the first run, then sort the cases, re-run the procedure, save again, and compare the two results (ideally, they should agree almost perfectly).

Another possible meaning of “reliable” could be the explanatory power of the clusters with respect to some criterion variable. Suppose you cluster cases by location, occupation and education, and use income as a criterion. A good cluster solution should minimize intra-cluster variance and maximize inter-cluster variance, so you may apply ANOVA to the results and watch for possible differences between the two solutions.

A further possible variation is varying the number of clusters: you specified three clusters in your syntax, but you may try four and see which is better suited to your purposes.

Clustering is not an “analytical” procedure but a “heuristic” one. Its analytical significance or usefulness should be judged by external criteria, or by its stability under variation in the clustering parameters or the ordering of cases.
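A rough sketch of that comparison, assuming the memberships from two runs have been saved as qcl_run1 and qcl_run2 (as in the earlier sketch) and that a criterion variable, here hypothetically called income, is in the file:

* Cross-tabulate the two solutions; a stable clustering shows heavy agreement (possibly with relabelled clusters).
CROSSTABS
  /TABLES=qcl_run1 BY qcl_run2
  /CELLS=COUNT.

* One-way ANOVA of the criterion across the clusters from the first run.
ONEWAY income BY qcl_run1
  /STATISTICS DESCRIPTIVES.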

 

Hector

 

On Monday, August 08, 2011 09:00, Art Kendall wrote:

--- Quoted messages snipped ---


