The Intraclass Correlation Coefficient (ICC) can be used to measure the strength of inter-rater agreement when the rating scale is continuous or ordinal. It is suitable for studies with two or more raters. Note that the ICC can also be used for test-retest (repeated measures of the same subject) and intra-rater (multiple scores from the same rater) reliability analyses.
Generally speaking, the ICC determines the reliability of ratings by comparing the variability of different ratings of the same individuals to the total variation across all ratings and all individuals.
- A high ICC (close to 1) indicates high similarity between values from the same group.
- A low ICC (close to zero) means that values from the same group are not similar.
There are multiple forms of ICC (Koo and Li 2016). This article describes how to:
- choose the correct ICC form for inter-rater reliability studies.
- compute the intraclass correlation coefficient in R.
How to choose the correct ICC forms
There are different forms of ICC that can give different results when applied to the same set of data (Koo and Li 2016). The forms of ICC can be defined based on the:
- model: one-way random effects, two-way random effects or two-way mixed effects.
- unit: single rater or the mean of k raters
- type of relationship considered to be important: consistency or absolute agreement
There are three models:
- ICC1: One-way random-effects model. In this model, each subject is rated by a different set of randomly chosen raters. Here, raters are considered as random effects. In practice, this model is rarely used in clinical reliability analysis, because the majority of such studies involve the same set of raters measuring all individuals. An exception would be multicenter studies, in which the physical distance between centers makes it impossible to use the same set of raters to rate all subjects. In such situations, the one-way random-effects model should be used (Koo and Li 2016).
- ICC2: Two-way random-effects model. A set of k raters is randomly selected from a larger population of raters with similar characteristics; then each subject is rated by the same set of k raters. In this model, both subjects and raters are viewed as random effects. The two-way random-effects model is chosen if we plan to generalize our reliability results to any raters who possess the same characteristics as the selected raters in the reliability study. This model is appropriate for evaluating rater-based clinical assessment methods that are designed for routine clinical use.
- ICC3: Two-way mixed effects model. Here the raters are considered as fixed. We should use the two-way mixed-effects model if the selected raters are the only raters of interest. With this model, the results only represent the reliability of the specific raters involved in the reliability experiment. They cannot be generalized to other raters even if those raters have similar characteristics as the selected raters in the reliability experiment. The two-way mixed-effects model is less commonly used in inter-rater reliability analysis.
Unit of ratings. For each of these 3 models, reliability can be estimated for a single rating or for the average of k ratings. The selection between “single” and “average” depends on how the measurement protocol will be conducted in the actual application (Koo and Li 2016). For example:
- If we plan to use the mean value of k raters as an assessment basis, the experimental design of the reliability study should involve k raters, and the “average of k raters” type should be selected.
- Conversely, if we plan to use the measurement from a single rater as the basis of the actual measurement, “single rater” type should be considered even though the reliability experiment involves 2 or more raters.
Note that, in the next sections, we’ll use the terms:
- ICC1, ICC2 and ICC3 to specify the reliability for a single rating; and
- ICC1k, ICC2k and ICC3k to denote the reliability for the average of k ratings.
Consistency or absolute agreement. In the one-way model, the ICC is always a measure of absolute agreement. In the two-way models, a choice can be made between two types: consistency, when systematic differences between raters are irrelevant, and absolute agreement, when systematic differences are relevant. In other words, absolute agreement measures the extent to which different raters assign the same score to the same subject, whereas the consistency type concerns whether raters’ scores for the same group of subjects are correlated in an additive manner (Koo and Li 2016).
Note that the two-way mixed-effects model and absolute agreement are recommended for both test-retest and intra-rater reliability studies (Koo and Li 2016).
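To make these choices concrete, the sketch below shows how the model, type and unit decisions map to the arguments of the icc() function from the irr package (covered in detail later in this article). The ratings matrix is simulated purely for illustration; note also that icc() only distinguishes “oneway” and “twoway” models, so the random versus mixed distinction affects the interpretation rather than the arguments.
library("irr")
# Hypothetical ratings: 20 subjects (rows) scored by 3 raters (columns)
set.seed(123)
ratings <- matrix(sample(1:6, 20 * 3, replace = TRUE), ncol = 3)
# Two-way model, absolute agreement, single rater (an ICC2-type analysis)
icc(ratings, model = "twoway", type = "agreement", unit = "single")
# Two-way model, consistency, average of k raters
icc(ratings, model = "twoway", type = "consistency", unit = "average")
# One-way model, single rater (ICC1); the one-way model is always absolute agreement
icc(ratings, model = "oneway", unit = "single")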
ICC Interpretation
Koo and Li (2016) give the following guideline for interpreting the ICC (a small helper applying these cut-offs is sketched after the list):
- below 0.50: poor
- between 0.50 and 0.75: moderate
- between 0.75 and 0.90: good
- above 0.90: excellent
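As a small illustration, these cut-offs can be wrapped in a helper function. The function name interpret_icc is ours, not from any package:
# Hypothetical helper applying the Koo and Li (2016) cut-offs listed above
interpret_icc <- function(icc_value) {
  if (icc_value < 0.50) "poor"
  else if (icc_value < 0.75) "moderate"
  else if (icc_value < 0.90) "good"
  else "excellent"
}
interpret_icc(0.198)  # "poor"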
Example of data
We’ll use the anxiety data [irr package], which contains the anxiety ratings of 20 subjects, rated by 3 raters. Values range from 1 (not anxious at all) to 6 (extremely anxious).
data("anxiety", package = "irr")
head(anxiety, 4)
##   rater1 rater2 rater3
## 1      3      3      2
## 2      3      6      1
## 3      3      4      4
## 4      4      6      4
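Before computing the ICC, it can be useful to take a quick descriptive look at the data, for example to spot systematic differences between raters. A minimal sketch, assuming the anxiety data loaded above:
str(anxiety)            # 20 rows (subjects), 3 columns (raters)
sapply(anxiety, mean)   # mean rating per rater; large gaps hint at systematic rater effects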
We want to compute the inter-rater agreement using ICC2.
Computing ICC in R
There are many functions and R packages for computing the ICC. Here, we’ll consider the function icc() [irr package] and the function ICC() [psych package].
Using the irr package
Recall that there are different forms of ICC calculation. When considering which form is appropriate for a given set of data, several decisions have to be made (Shrout and Fleiss 1979):
- Should only the subjects be considered as random effects (“oneway” model), or are both subjects and raters randomly chosen from a bigger pool (“twoway” model)?
- If differences in the raters’ mean ratings are of interest, inter-rater “agreement” instead of “consistency” should be computed.
- If the unit of analysis is a mean of several ratings, unit should be set to “average”. In most cases, however, single ratings (unit = “single”) are used.
You can specify the different parameters as follow:
library("irr")
icc(
  anxiety, model = "twoway",
  type = "agreement", unit = "single"
)
## Single Score Intraclass Correlation
##
## Model: twoway
## Type : agreement
##
## Subjects = 20
## Raters = 3
## ICC(A,1) = 0.198
##
## F-Test, H0: r0 = 0 ; H1: r0 > 0
## F(19,39.7) = 1.83 , p = 0.0543
##
## 95%-Confidence Interval for ICC Population Values:
## -0.039 < ICC < 0.494
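The icc() function returns a list, so the individual components can be extracted for reporting instead of being read off the printed output. A short sketch, assuming the call above is stored in an object:
res <- icc(anxiety, model = "twoway", type = "agreement", unit = "single")
res$value    # ICC estimate (0.198 here)
res$lbound   # lower bound of the 95% confidence interval
res$ubound   # upper bound of the 95% confidence interval
res$p.value  # p-value of the F-test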
Using the psych package
If you use the ICC() function [psych package], you don’t need to specify the model: R computes all the forms at once and you just select the appropriate one. The output looks like this:
# install.packages("psych")
library(psych)
ICC(anxiety)
## Call: ICC(x = anxiety)
##
## Intraclass correlation coefficients
##                          type  ICC   F df1 df2     p lower bound upper bound
## Single_raters_absolute   ICC1 0.18 1.6  19  40 0.094      -0.077        0.48
## Single_random_raters     ICC2 0.20 1.8  19  38 0.056      -0.039        0.49
## Single_fixed_raters      ICC3 0.22 1.8  19  38 0.056      -0.046        0.52
## Average_raters_absolute ICC1k 0.39 1.6  19  40 0.094      -0.275        0.74
## Average_random_raters   ICC2k 0.43 1.8  19  38 0.056      -0.127        0.75
## Average_fixed_raters    ICC3k 0.45 1.8  19  38 0.056      -0.153        0.77
##
## Number of subjects = 20 Number of Judges = 3
The rows of the table correspond to the following ICC, respectively: ICC1, ICC2, ICC3, ICC1k, ICC2k and ICC3k. In our example, we will consider the ICC2 form.
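If you need the ICC2 estimate programmatically, the ICC() output stores the table as a data frame in the results component. A sketch, assuming the row names shown in the output above:
res <- ICC(anxiety)
res$results                                  # the full table printed above, as a data frame
res$results["Single_random_raters", "ICC"]   # the ICC2 estimate (about 0.20)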
Note that, by default, the ICC() function uses the lmer() function, which can handle missing data and unbalanced designs.
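As a rough illustration of that point, the sketch below introduces a single missing rating. The missing and lmer arguments used here follow the psych documentation, but the exact behaviour should be checked on your own data:
anxiety_na <- anxiety
anxiety_na[1, 2] <- NA                         # one rating set to NA, for illustration only
ICC(anxiety_na, missing = FALSE, lmer = TRUE)  # keep the incomplete row; the lmer-based
                                               # variance components accommodate it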
Report
The intraclass correlation coefficient was computed to assess the agreement between three doctors in rating the anxiety levels of 20 individuals. There was poor absolute agreement between the three doctors, using the two-way random-effects model and the “single rater” unit, ICC = 0.20, p = 0.056.
Summary
This chapter explains the basics of the intraclass correlation coefficient (ICC), which can be used to measure the agreement between multiple raters rating on ordinal or continuous scales. We also show how to compute and interpret ICC values in R.
References
Koo, Terry, and Mae Li. 2016. “A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research.” Journal of Chiropractic Medicine 15 (March). doi:10.1016/j.jcm.2016.02.012.
Shrout, P.E., and J.L. Fleiss. 1979. “Intraclass Correlation: Uses in Assessing Rater Reliability.” Psychological Bulletin 86: 420–28.