Definition

Cohen’s Kappa is a statistic that measures inter-rater reliability (or inter-annotator agreement) between two raters classifying categorical items. Unlike simple percent agreement, Kappa takes into account the possibility of agreement occurring by chance. Cohen’s Kappa assumes the two raters have rated the same set of items.

Formula

Let

  • $X = \{x_1, \ldots, x_n\}$ be a dataset of $n$ input values,
  • $C = \{c_1, \ldots, c_k\}$ the set of possible categorical outputs,
  • $f_1$, $f_2$ two classifiers from $X$ to $C$,
  • $M$ the $k \times k$ confusion matrix of the classifications from $f_1$ and $f_2$, where $M_{ij}$ is the number of items assigned to category $c_i$ by $f_1$ and to $c_j$ by $f_2$.

Then the observed proportion of agreement between the classifiers is:

$$p_o = \frac{1}{n} \sum_{i=1}^{k} M_{ii}$$

Let

  • $n_i^{(1)} = \sum_{j} M_{ij}$ be the number of samples classified by $f_1$ into category $c_i$ (and likewise $n_i^{(2)} = \sum_{j} M_{ji}$ for $f_2$),
  • $\hat{p}_i^{(1)} = n_i^{(1)} / n$ the estimated probability that $f_1$ will classify an item into $c_i$ (and likewise $\hat{p}_i^{(2)}$ for $f_2$),
  • $\hat{p}_i^{(1)} \hat{p}_i^{(2)}$ the estimated probability that both $f_1$ and $f_2$ will classify the same item into $c_i$.

Then the expected proportion of agreement by chance is:

$$p_e = \sum_{i=1}^{k} \hat{p}_i^{(1)} \hat{p}_i^{(2)}$$

and finally the Kappa score is:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
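
For a concrete reference, here is a minimal sketch of these formulas in Python. The function name cohens_kappa and the NumPy-based layout are illustrative choices, not part of any particular library:

```python
import numpy as np

def cohens_kappa(confusion) -> float:
    """Compute Cohen's kappa from a k x k confusion matrix M, where
    M[i, j] counts items put into category i by rater 1 and
    category j by rater 2."""
    M = np.asarray(confusion, dtype=float)
    n = M.sum()

    # Observed agreement: proportion of items on the diagonal.
    p_o = np.trace(M) / n

    # Marginal probabilities for each rater.
    p1 = M.sum(axis=1) / n  # rater 1 (rows)
    p2 = M.sum(axis=0) / n  # rater 2 (columns)

    # Expected agreement by chance.
    p_e = np.sum(p1 * p2)

    return (p_o - p_e) / (1 - p_e)
```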

Example

Let’s consider a binary classification problem where a model predicts whether an email is “Spam” or “Not Spam”. We have 100 emails that have been classified by both our model and a human annotator (ground truth). The results are summarized in the following confusion matrix:

                   Model: Spam    Model: Not Spam    Row total
Human: Spam                 20                 10           30
Human: Not Spam              5                 65           70
Column Total                25                 75          100

1. Calculate Observed Agreement ($p_o$)

The observed agreement is the proportion of instances where the model and the human annotator agreed. This occurs for “Spam” (20 instances) and “Not Spam” (65 instances):

$$p_o = \frac{20 + 65}{100} = 0.85$$

2. Calculate Expected Agreement ($p_e$)

First, we calculate the probability of agreeing on “Spam” by chance:

  • The model classified 25 out of 100 as “Spam” (Probability = 0.25).
  • The human classified 30 out of 100 as “Spam” (Probability = 0.30).
  • The probability of both randomly choosing “Spam” is $0.25 \times 0.30 = 0.075$.

Next, we calculate the probability of agreeing on “Not Spam” by chance:

  • The model classified 75 out of 100 as “Not Spam” (Probability = 0.75).
  • The human classified 70 out of 100 as “Not Spam” (Probability = 0.70).
  • The probability of both randomly choosing “Not Spam” is $0.75 \times 0.70 = 0.525$.

The total expected agreement by chance is the sum of these probabilities:

$$p_e = 0.075 + 0.525 = 0.60$$

3. Calculate Cohen’s Kappa

Now we can plug $p_o$ and $p_e$ into the Kappa formula:

$$\kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.85 - 0.60}{1 - 0.60} = \frac{0.25}{0.40} = 0.625$$

So, the Cohen’s Kappa score for this model is 0.625, which indicates a “substantial” level of agreement.
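
As a quick check on the arithmetic above, the same result can be reproduced with scikit-learn’s cohen_kappa_score, which takes the two label sequences rather than a confusion matrix. The label names "spam" and "ham" below are just placeholders:

```python
from sklearn.metrics import cohen_kappa_score

# Expand the confusion matrix into per-email labels:
# 20 agree on "spam", 65 agree on "ham" (not spam),
# 10 are human-"spam"/model-"ham", 5 are human-"ham"/model-"spam".
human = ["spam"] * 20 + ["spam"] * 10 + ["ham"] * 5 + ["ham"] * 65
model = ["spam"] * 20 + ["ham"] * 10 + ["spam"] * 5 + ["ham"] * 65

print(cohen_kappa_score(human, model))  # ~0.625
```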

Usage

  • Interpretation of the Score: The Kappa value ranges from -1 to 1.

    • 1: Perfect agreement between the raters.
    • 0: The agreement is equivalent to what would be expected by chance.
    • <0: The agreement is weaker than what would be expected by chance, which is rare.
  • Imbalanced Datasets: Cohen’s Kappa is particularly useful for classification tasks with imbalanced classes. Accuracy can be misleading in these scenarios because a model can achieve high accuracy by simply predicting the majority class. Kappa helps to mitigate this by accounting for chance agreement; a short illustration follows this list.

  • Limitations:

    • Number of Raters: Cohen’s Kappa is designed for two raters. For more than two, a different statistic like Fleiss’ Kappa is used.
    • Subjectivity of Thresholds: The interpretation of what constitutes a “good” Kappa score can be context-dependent and subjective. A Kappa of 0.6 might be acceptable in some fields but considered poor in others where higher precision is required, such as medical diagnostics.
    • Ordinal Data: Standard Cohen’s Kappa does not differentiate between degrees of disagreement for ordinal data (e.g., rating scales). For instance, a disagreement between “Slightly Relevant” and “Very Relevant” is treated the same as a disagreement between “Slightly Relevant” and “Very Irrelevant”. A weighted Kappa can be used in these situations to account for the severity of disagreements (see the weighted-Kappa sketch after this list).
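
To make the imbalanced-data point concrete, here is a small sketch with made-up data comparing accuracy and Kappa for a model that always predicts the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical imbalanced dataset: 95 negatives, 5 positives.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # always predict the majority class

print(accuracy_score(y_true, y_pred))     # 0.95 -- looks impressive
print(cohen_kappa_score(y_true, y_pred))  # 0.0  -- no better than chance
```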
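
For the ordinal-data limitation, scikit-learn’s cohen_kappa_score accepts a weights argument ("linear" or "quadratic") that penalizes large disagreements more than small ones. The ratings below are invented purely for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 relevance ratings from two annotators.
rater_a = [1, 2, 3, 4, 5, 3, 2, 4]
rater_b = [1, 2, 4, 4, 3, 3, 2, 5]

print(cohen_kappa_score(rater_a, rater_b))                       # unweighted: every disagreement counts equally
print(cohen_kappa_score(rater_a, rater_b, weights="linear"))     # adjacent categories penalized less
print(cohen_kappa_score(rater_a, rater_b, weights="quadratic"))  # distant categories penalized much more
```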