Hi everyone, I am trying to determine my options for statistics to assess the concurrent validity of a set of items by different raters against a “gold standard” set of items completed by a professional. From what I know, in cases with scale/continuous variables, this is typically done by performing a Pearson’s correlation and associated significance test. However, I have a set of categorical items with no order. I have 66 participants that I would like to compare against one rater. Each item contains three possible answers. Is anyone aware of a statistic that would be suitable for comparing a group of raters against a single rater in these circumstances? Any help would be greatly appreciated. Kind regards.
Your overall information for each participant is "rate of agreement" with the gold standard -- like, grading a multiple-choice test.
(For information on "difficulty" -- or to check whether the pro got an item right -- also look at how many errors there were for each choice.)
If your categories are separately interesting, you could score them separately for agreement with answers 1, 2, and 3; a step further would be to look separately at "sensitivity" and "specificity" for answers 1, 2, and 3. Are the answers such that it would be interesting to know whether one of them is overused?
For an overall "agreement" statistic that lumps together the different sorts of disagreement, you could use kappa for the pro versus each participant.
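For concreteness, a minimal sketch of that per-participant scoring -- Python rather than SPSS syntax, with made-up data and hypothetical variable names:

import numpy as np
from sklearn.metrics import cohen_kappa_score

gold = np.array([1, 2, 1, 3, 2, 1])            # professional's answers (1=Yes, 2=No, 3=Not scored)
raters = {
    "p01": np.array([1, 2, 1, 1, 2, 1]),       # each participant's answers to the same items
    "p02": np.array([2, 2, 1, 3, 2, 3]),
}

for pid, answers in raters.items():
    agreement = np.mean(answers == gold)       # rate of agreement, like grading a test
    kappa = cohen_kappa_score(gold, answers)   # chance-corrected agreement, pro vs. this rater
    print(f"{pid}: agreement = {agreement:.2f}, kappa = {kappa:.2f}")

Per-answer agreement (for answers 1, 2, and 3 separately) can be read off the same comparison by conditioning on the pro's answer.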
-- Rich Ulrich
In reply to this post by Benjamin Spivak (Med)
Please explain what you mean by "each item has three possible answers". Is it the same 3 for all items? If so, what are the possible answers?
If the 3 are different for each item, please provide some examples of items and responses.
Art Kendall
Social Research Consultants
It is the same for all items -- yes, no, not scored. Thanks.
In reply to this post by Rich Ulrich
Hi Rich, I went with calculating a Fleiss kappa after transforming scores to agree vs. disagree with the gold standard. I don't want to calculate individual agreement between pairs because I have a relatively large number of raters (>70). However, my overall kappa is quite low, likely because I have a disproportionate number of agreements compared with disagreements. Is there a way to calculate a maximum kappa for the Fleiss variant with n>2 raters? That might help interpretation. The problem is that I can't find any references where this sort of thing is calculated. Thanks.
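A sketch of that agree/disagree Fleiss-kappa computation -- Python/statsmodels rather than SPSS, with invented data:

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

gold = np.array([1, 2, 1, 3, 2, 1])                  # gold-standard answer per item
ratings = np.array([[1, 2, 1, 1, 2, 1],              # raters x items
                    [2, 2, 1, 3, 2, 3],
                    [1, 2, 2, 3, 2, 1]])

agree = (ratings == gold).astype(int).T              # items x raters, 1 = agrees with gold
table, _ = aggregate_raters(agree)                   # per-item counts of disagree/agree
print("Fleiss kappa:", fleiss_kappa(table))

When the agree/disagree split is very lopsided, the chance-agreement term in kappa is large, so kappa can come out low (or even negative) despite high raw agreement -- the familiar prevalence problem with kappa-type statistics.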
First, you need to figure out what it is that you want to know. I assumed that you would want information about individual raters, no matter how many.
Are you interested in evaluating and reporting on items? raters? the TOTAL score for items?
Here is a good starting article which is relatively brief: http://www.john-uebersax.com/stat/agree.htm
Second, you probably don't want kappa for multiple raters because it has no provision for a Gold Standard.
- If the Gold Standard pro did use Not Marked, you have correlations.
- If the Gold Standard pro did not use it, then a kappa between pro and rater should probably be 2x2, using Yes/No, with every Not Marked recoded to whichever choice is wrong on the item.
But handle the "First" first: what is it that you want to know?
-- Rich Ulrich
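The 2x2 recode suggested above could look like this -- a hypothetical sketch in Python, assuming the coding 1 = Yes, 2 = No, 3 = Not scored and a pro who never uses 3:

import numpy as np
from sklearn.metrics import cohen_kappa_score

gold  = np.array([1, 2, 1, 2, 2, 1])           # pro's Yes/No key
rater = np.array([1, 2, 3, 2, 3, 1])           # one participant, including Not scored

wrong   = np.where(gold == 1, 2, 1)            # the incorrect Yes/No choice for each item
recoded = np.where(rater == 3, wrong, rater)   # Not scored counted as the wrong choice

print("2x2 kappa, pro vs rater:", cohen_kappa_score(gold, recoded))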
I may be sounding like I'm coming out of left field here, but the type of problem/situation being discussed (judgments against a gold standard or "true state") sounds a lot like a signal detection theory (SDT) problem. Rich's link to John Uebersax's site doesn't really connect to SDT analysis, but if you go to the related page on Raw Agreement Indices, the basic ideas are laid out even though Uebersax does not refer to SDT analyses.
SDT has a theoretical/mathematical basis and a bunch of assumptions (though the analysis can be modified to take specific assumptions into account), but one of the key results is represented in the Receiver Operating Characteristic (ROC) curve, which has true positives (or Sensitivity) on the y-axis and false positives (or 1 - Specificity) on the x-axis. See the Wikipedia entry on the ROC curve for more detail.
A number of statistics can be calculated in this situation, but perhaps the best known is the Area Under the Curve (AUC), or in SDT terms, A' (A-prime). The ROC plot is a unit square because the x and y axes are probabilities ranging from 0 to 1. The diagonal from (0,0) to (1,1) cuts the area inside the square in half, giving .50 of the area below it. This is also called the "chance diagonal" because along it the probability of a false positive equals the probability of a true positive, meaning the responses being made are random. A single rater provides one pair of false positive and true positive rates, which either falls on the chance diagonal (meaning random performance or no discrimination), above the diagonal (toward the upper left), which means the person performs better than chance, or below the diagonal (toward the lower right corner), which indicates systematically BAD performance (worse than chance). With a pair of false positive and true positive values, one can calculate the AUC for a person, and "good performance" will be some value between 0.50 and 1.00 (the closer to 1, the better). This is used in radiology (reading x-rays or scans) and many other medical areas.
As the Wiki entry points out, other statistics can be calculated in this situation, and Cohen's kappa and Fleiss' kappa are among them (see refs 3 and 22 for the entry).
There is a large literature on this, starting with SDT's origination in psychophysics in the early 1950s through its application to diagnostic issues starting in the 1970s-80s.
Just something to think about.
-Mike Palij
New York University
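Taking the pro's Yes/No as the true state, each rater yields one sensitivity/specificity pair, and the AUC through that single operating point reduces to (sensitivity + specificity) / 2. A rough sketch in Python with made-up data:

import numpy as np

gold  = np.array([1, 2, 1, 2, 2, 1, 1, 2])     # pro: 1 = Yes (positive), 2 = No (negative)
rater = np.array([1, 2, 2, 2, 1, 1, 1, 2])     # one participant's calls on the same items

tp = np.sum((rater == 1) & (gold == 1))
fn = np.sum((rater == 2) & (gold == 1))
fp = np.sum((rater == 1) & (gold == 2))
tn = np.sum((rater == 2) & (gold == 2))

sensitivity = tp / (tp + fn)                   # true-positive rate (y-axis)
specificity = tn / (tn + fp)                   # 1 - false-positive rate (x-axis)
auc_point = (sensitivity + specificity) / 2    # area under the two-segment "curve"
print(sensitivity, specificity, auc_point)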
For those suggesting ROC curves, can you show an example of how to go from the OP's data to an ROC curve? I don't understand how you go from the three categorical inputs to a continuous score necessary to compute the sensitivity and specificity.
It would help the OP and those of us trying to respond if there were much more context, e.g., describing the task as presented to the 'expert(s?)' and to the respondents. Were there subgroups of items? Of respondents? The OP mentioned 'not scored'. This could have many meanings, e.g., it could mean 'does not apply' or 'Respondent skipped but went on' or 'Respondent stopped responding', etc.
Art Kendall
Social Research Consultants
I agree with Art to the extent that we still need more data. First, if I understand correctly, there is a gold standard. If so, it needs to be treated as a separate rater, and then a series of kappas generated between each rater and the standard. If all the raters are included, then the resultant kappa is their collective agreement and is inseparable from each rater's level of agreement with the gold standard. Additionally, Fleiss kappa and Cohen's kappa are the same for two raters, so it's really Cohen and not Fleiss that's being carried out.
At this point, there would need to be an average of all the two-rater kappas produced. This is the same as Light's kappa, an extension of Cohen's kappa to more than two raters, so it's allowable, at least from a literature and research background. Finally, there's the matter of including 'not scored' as a category. If it sheds light on the 'difficulty' of an item to rate, then maybe it should be included as a category. If it can be interpreted in a number of ways, as Art suggests, then it probably ought to be treated as missing data. BTW, Gwet has developed a version of his AC1 statistic to include a gold standard. The difficulty is that his solutions are in SAS, not SPSS, so unless the OP wants to translate SAS to SPSS syntax, it's probably not doable.
Brian Dates, M.A.
Director of Evaluation and Research, Southwest Counseling Solutions
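The averaging step is simple to sketch -- Python rather than SPSS, with 'gold' and the rater arrays invented for illustration:

import numpy as np
from sklearn.metrics import cohen_kappa_score

gold = np.array([1, 2, 1, 3, 2, 1])
raters = [np.array([1, 2, 1, 1, 2, 1]),
          np.array([2, 2, 1, 3, 2, 3]),
          np.array([1, 2, 2, 3, 2, 1])]

pairwise = [cohen_kappa_score(gold, r) for r in raters]   # each rater vs. the gold standard
print("Light's kappa vs the standard:", np.mean(pairwise))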
In reply to this post by Andy W
On Monday, November 28, 2016 11:40 AM, Andy W wrote:
> For those suggesting ROC curves, can you show an example of how to go
> from the OP's data to an ROC curve? I don't understand how you go from
> the three categorical inputs to a continuous score necessary to compute
> the sensitivity and specificity.
The real issue is whether the three categories are ordered -- in which case this can be treated like a rating scale (0 = not scored, 1 = No, 2 = Yes) and the procedures have long been worked out for this case -- or they are unordered categories. It seems to me to be something of a stretch to claim that the responses are ordinal. Things would be a lot simpler if the "not scored" response could be ignored (e.g., treated as missing data), but we'll have to wait on the OP to provide more information.
So we're left with the situation where one has three categories or, to use current parlance, classes, and we have a multiclass classifier problem. Multiclass ROC analysis has been under development since the late 1990s. One source on this is the following article:
Tom Fawcett, An introduction to ROC analysis, Pattern Recognition Letters, Volume 27, Issue 8, June 2006, Pages 861-874. http://dx.doi.org/10.1016/j.patrec.2005.10.010 (see Section 9, "Decision problems with more than two classes," p. 872).
With more than two classes the math becomes hairy, and one suggestion is to break the analysis down into pairwise comparisons; see:
Thomas C.W. Landgrebe, Robert P.W. Duin, Approximating the multiclass ROC by pairwise analysis, Pattern Recognition Letters, Volume 28, Issue 13, 1 October 2007, Pages 1747-1758. http://dx.doi.org/10.1016/j.patrec.2007.05.001
There appears to be a good-sized literature on this case as a classification problem (in contrast to a discrimination problem), and one recent publication provides some sense of where this area is now and where it is going; see:
Simon Bernard, Clément Chatelain, Sébastien Adam, Robert Sabourin, The Multiclass ROC Front method for cost-sensitive classification, Pattern Recognition, Volume 52, April 2016, Pages 46-60. http://dx.doi.org/10.1016/j.patcog.2015.10.010
That being said, the bad news is that the analyses provided in the above articles can't be easily done in SPSS (if at all). Stata appears to have several procedures for doing ROC analysis (including one with a "gold standard": rocgold), but these all appear to be for binary responses. I'm not a Stata person, so I don't know whether these have been extended to the multiclass case, but the Stata fora should be able to provide answers.
-Mike Palij
New York University
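With hard (non-probabilistic) ratings, one simple concession to the multiclass problem is a class-by-class, one-vs-rest breakdown, which yields a single (FPR, TPR) point per class rather than a full curve. A rough Python sketch with invented three-category data:

import numpy as np

gold  = np.array([1, 2, 3, 1, 2, 3, 1, 2])     # gold-standard class per item
rater = np.array([1, 2, 3, 2, 2, 1, 1, 3])     # one rater's class per item

for c in (1, 2, 3):                            # treat each class as "positive" in turn
    tp = np.sum((rater == c) & (gold == c))
    fp = np.sum((rater == c) & (gold != c))
    fn = np.sum((rater != c) & (gold == c))
    tn = np.sum((rater != c) & (gold != c))
    tpr = tp / (tp + fn)                       # sensitivity for class c
    fpr = fp / (fp + tn)                       # 1 - specificity for class c
    print(f"class {c}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")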
Mike, it isn't obvious to me how you apply any of those papers to this situation.
Multi-class ROC curves are for "predicting" multiple classes, not for using categorical "independent" variables. That is a red herring as far as I can tell. If you have a gold standard, you want to see whether the extra raters match the gold standard. So the outcome is "Predicted Right" or "Predicted Wrong" -- still a plain old binary outcome, is it not?
Even if that is not the case, just pretend for a moment that it is -- how do you get an ROC curve from only three input guesses, for which you agree there is no natural ordering? Rich originally suggested coding along an ordinal scale and calculating sensitivity/specificity. That would give you an ROC curve with only one point (so it would not be a curve at all). What is the point of the area-under-the-curve statistic in that situation? It is a bit facile, since the inputs don't allow interpolation along the line to alter predictions. You can only have three potential predictions if you only have three potential inputs.
Are you suggesting an ROC curve for every rater? For every item? Or do you just get one ROC curve? I'm still really confused about how you turn the OP's data into an ROC curve. Just sketch out a simple example, even if you don't know how to do it in software, to help me understand.
Even if it were a situation where everyone agreed on how to generate the ROC curve, some critics are beginning to cast doubt on the usefulness of AUC as a measure of screening test quality. E.g., the authors of the following article suggest that it would generally be more useful to report sensitivity for a desired level of specificity (e.g., when good specificity is needed), or specificity for a given level of sensitivity (when good sensitivity is needed).
https://www.ncbi.nlm.nih.gov/pubmed/24407586 HTH.
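For what it's worth, the reporting style that article recommends is easy to do when a continuous score exists (which the OP's categorical ratings do not provide); a short Python/scikit-learn sketch with invented data:

import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])               # 1 = disease present
score  = np.array([.1, .3, .2, .6, .7, .8, .4, .9, .65, .35])   # continuous test result

fpr, tpr, thresholds = roc_curve(y_true, score)
meets_spec = (1 - fpr) >= 0.99                 # operating points with specificity >= 99%
print("best sensitivity at specificity >= .99:", tpr[meets_spec].max())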
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."
Y'all are killing me slowly with this advice I can't make sense of. I can understand being critical of just one number, like AUC, but Bruce, how is reporting single values of sensitivity and specificity any better than plotting the actual ROC curve, which shows the entire range of sensitivity/specificity trade-offs? That is the whole motivation for the ROC plot to begin with!
Hi Andy. I didn't (intend to) say that one should not plot ROC curves. All I was suggesting was that AUC is probably not nearly as useful a measure as one might think, given how frequently it is reported.
But now that you've got me started, I would suggest that if one must report AUC, they should consider reporting the Gini coefficient too (or instead): Gini coefficient = 2*AUC - 1. Conceptually, it has been described as a chance-corrected AUC. Given how frequently Cohen's kappa (which is described as a chance-corrected measure of agreement) is touted as being far superior to raw percent agreement, it surprises me that the Gini coefficient has not caught on more.
Regarding the other point about reporting sensitivity for a given specificity (or vice-versa), I was just suggesting that in many uses of diagnostic tests, it is important to achieve very high sensitivity (e.g., when ruling out disease) or very high specificity (when ruling in disease). In such cases, one would surely choose a cut-point that guarantees the needed level of sensitivity or specificity, and then report the other test property at that cut-point. If 99% specificity is required for a given test in a given situation, I don't really care what the sensitivity would be at another cut-point that yields 75% specificity.
I fear this is veering way too far from the OP's question, so will stop there. ;-)
Cheers, Bruce
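The rescaling itself is a one-liner; a small Python/scikit-learn sketch with the same sort of invented binary data as above:

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
score  = np.array([.1, .3, .2, .6, .7, .8, .4, .9, .65, .35])

auc  = roc_auc_score(y_true, score)
gini = 2 * auc - 1                 # 0 at chance (AUC = .5), 1 at perfect discrimination
print(f"AUC = {auc:.2f}, Gini = {gini:.2f}")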
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."