Definition

Cohen’s Kappa is a statistic that measures inter-rater reliability (or inter-annotator agreement) between two raters classifying categorical items. Unlike simple percent agreement, Kappa takes into account the possibility of agreement occurring by chance. Cohen’s Kappa assumes the two raters have rated the same set of items.

Formula

Let

  • $X = \{x_1, \ldots, x_n\}$ be a dataset of $n$ input values,
  • $C = \{c_1, \ldots, c_k\}$ the set of possible categorical outputs,
  • $f_1$, $f_2$ two classifiers from $X$ to $C$,
  • $M$ the $k \times k$ confusion matrix of the classifications from $f_1$ and $f_2$, where $M_{ij}$ is the number of items assigned to category $c_i$ by $f_1$ and to $c_j$ by $f_2$.

Then the observed proportion of agreement between the classifiers is:

$$p_o = \frac{1}{n} \sum_{i=1}^{k} M_{ii}$$

Let

  • $n_i^{(1)} = \sum_{j} M_{ij}$ be the number of samples classified by $f_1$ into category $c_i$ (and likewise $n_i^{(2)} = \sum_{j} M_{ji}$ for $f_2$),
  • $\hat{p}_i^{(1)} = n_i^{(1)} / n$ the estimated probability that $f_1$ will classify an item into $c_i$ (and likewise $\hat{p}_i^{(2)}$ for $f_2$),
  • $\hat{p}_i^{(1)} \hat{p}_i^{(2)}$ the estimated probability that both $f_1$ and $f_2$ will classify the same item into $c_i$.

Then the expected proportion of agreement by chance is:

$$p_e = \sum_{i=1}^{k} \hat{p}_i^{(1)} \hat{p}_i^{(2)}$$

and finally the Kappa score is:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
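
For a concrete reference, here is a minimal sketch of these formulas in Python. The function name cohens_kappa and the NumPy-based layout are illustrative choices, not part of any particular library:

```python
import numpy as np

def cohens_kappa(confusion) -> float:
    """Compute Cohen's kappa from a k x k confusion matrix M, where
    M[i, j] counts items put into category i by rater 1 and
    category j by rater 2."""
    M = np.asarray(confusion, dtype=float)
    n = M.sum()

    # Observed agreement: proportion of items on the diagonal.
    p_o = np.trace(M) / n

    # Marginal probabilities for each rater.
    p1 = M.sum(axis=1) / n  # rater 1 (rows)
    p2 = M.sum(axis=0) / n  # rater 2 (columns)

    # Expected agreement by chance.
    p_e = np.sum(p1 * p2)

    return (p_o - p_e) / (1 - p_e)
```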

Example

Let’s consider a binary classification problem where a model predicts whether an email is “Spam” or “Not Spam”. We have 100 emails that have been classified by both our model and a human annotator (ground truth). The results are summarized in the following confusion matrix:

                   Model: Spam    Model: Not Spam    Row total
Human: Spam                 20                 10           30
Human: Not Spam              5                 65           70
Column Total                25                 75          100

1. Calculate Observed Agreement ($p_o$)

The observed agreement is the proportion of instances where the model and the human annotator agreed. This occurs for “Spam” (20 instances) and “Not Spam” (65 instances):

$$p_o = \frac{20 + 65}{100} = 0.85$$

2. Calculate Expected Agreement ($p_e$)

First, we calculate the probability of agreeing on “Spam” by chance:

  • The model classified 25 out of 100 as “Spam” (Probability = 0.25).
  • The human classified 30 out of 100 as “Spam” (Probability = 0.30).
  • The probability of both randomly choosing “Spam” is $0.25 \times 0.30 = 0.075$.

Next, we calculate the probability of agreeing on “Not Spam” by chance:

  • The model classified 75 out of 100 as “Not Spam” (Probability = 0.75).
  • The human classified 70 out of 100 as “Not Spam” (Probability = 0.70).
  • The probability of both randomly choosing “Not Spam” is $0.75 \times 0.70 = 0.525$.

The total expected agreement by chance is the sum of these probabilities:

$$p_e = 0.075 + 0.525 = 0.60$$

3. Calculate Cohen’s Kappa

Now we can plug $p_o$ and $p_e$ into the Kappa formula:

$$\kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.85 - 0.60}{1 - 0.60} = \frac{0.25}{0.40} = 0.625$$

So, the Cohen’s Kappa score for this model is 0.625, which indicates a “substantial” level of agreement.
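
As a quick check on the arithmetic above, the same result can be reproduced with scikit-learn’s cohen_kappa_score, which takes the two label sequences rather than a confusion matrix. The label names "spam" and "ham" below are just placeholders:

```python
from sklearn.metrics import cohen_kappa_score

# Expand the confusion matrix into per-email labels:
# 20 agree on "spam", 65 agree on "ham" (not spam),
# 10 are human-"spam"/model-"ham", 5 are human-"ham"/model-"spam".
human = ["spam"] * 20 + ["spam"] * 10 + ["ham"] * 5 + ["ham"] * 65
model = ["spam"] * 20 + ["ham"] * 10 + ["spam"] * 5 + ["ham"] * 65

print(cohen_kappa_score(human, model))  # ~0.625
```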

Usage

  • Interpretation of the Score: The Kappa value ranges from -1 to 1.

    • 1: Perfect agreement between the raters.
    • 0: The agreement is equivalent to what would be expected by chance.
    • <0: The agreement is weaker than what would be expected by chance, which is rare.
  • Imbalanced Datasets: Cohen’s Kappa is particularly useful for classification tasks with imbalanced classes. Accuracy can be misleading in these scenarios because a model can achieve high accuracy by simply predicting the majority class. Kappa helps to mitigate this by accounting for chance agreement; a short illustration follows this list.

  • Limitations:

    • Number of Raters: Cohen’s Kappa is designed for two raters. For more than two, a different statistic like Fleiss’ Kappa is used.
    • Subjectivity of Thresholds: The interpretation of what constitutes a “good” Kappa score can be context-dependent and subjective. A Kappa of 0.6 might be acceptable in some fields but considered poor in others where higher precision is required, such as medical diagnostics.
    • Ordinal Data: Standard Cohen’s Kappa does not differentiate between degrees of disagreement for ordinal data (e.g., rating scales). For instance, a disagreement between “Slightly Relevant” and “Very Relevant” is treated the same as a disagreement between “Slightly Relevant” and “Very Irrelevant”. A weighted Kappa can be used in these situations to account for the severity of disagreements (see the weighted-Kappa sketch after this list).
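
To make the imbalanced-data point concrete, here is a small sketch with made-up data comparing accuracy and Kappa for a model that always predicts the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical imbalanced dataset: 95 negatives, 5 positives.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # always predict the majority class

print(accuracy_score(y_true, y_pred))     # 0.95 -- looks impressive
print(cohen_kappa_score(y_true, y_pred))  # 0.0  -- no better than chance
```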
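
For the ordinal-data limitation, scikit-learn’s cohen_kappa_score accepts a weights argument ("linear" or "quadratic") that penalizes large disagreements more than small ones. The ratings below are invented purely for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 relevance ratings from two annotators.
rater_a = [1, 2, 3, 4, 5, 3, 2, 4]
rater_b = [1, 2, 4, 4, 3, 3, 2, 5]

print(cohen_kappa_score(rater_a, rater_b))                       # unweighted: every disagreement counts equally
print(cohen_kappa_score(rater_a, rater_b, weights="linear"))     # adjacent categories penalized less
print(cohen_kappa_score(rater_a, rater_b, weights="quadratic"))  # distant categories penalized much more
```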