maj@k

Definition

maj@k measures the performance of a model based on a majority vote. For each problem, k solutions are generated. The most frequent solution is identified, and it is marked as correct if it passes the problem’s tests. The final $maj @ k$ score is the percentage of problems for which the most frequent solution was correct. $maj @ k$ assesses the model’s ability to produce a consistent and correct answer.

Formula

Let $D = \{(q_{i}, a_{i})\}_{i = 1}^{m}$ be a test dataset

of size $m$,
with questions $q_{i}$,
and answers $a_{i}$.

Let $D_{G} = \{(q_{i}, \{\hat{a}_{ij}\}_{j = 1}^{n})\}_{i = 1}^{m}$ be a dataset of independently generated answers where

$n$ is the number of generated answers per question,
$\hat{a}_{ij}$ is the model’s $j$-th final answer for $q_{i}$,
$c_{i}$ is the number of correct solutions for $q_{i}$ in $\{\hat{a}_{ij}\}_{j = 1}^{n}$.

Then, for a single question $q_{i}$, the most frequent answer, or mode, is:

$$\hat{a}_{i, maj} = \mathrm{mode}(\{\hat{a}_{ij}\}_{j = 1}^{k})$$

If there is a tie for the most frequent answer, one is selected at random.

We then define $\mathrm{is\_correct}(\hat{a}_{i, maj})$, an indicator function equal to 1 if the majority-voted answer $\hat{a}_{i, maj}$ is correct, and 0 otherwise.

The overall $maj @ k$ metric is the average correctness of these majority-voted answers over the entire dataset:

$$maj @ k = \frac{1}{m} \sum_{i = 1}^{m} \mathrm{is\_correct}(\hat{a}_{i, maj})$$
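The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `majority_answer` and `maj_at_k`, the `is_correct` callback, and the choice to vote over the first `k` answers of each list are all assumptions made for the example.

```python
import random
from collections import Counter


def majority_answer(answers, rng=None):
    """Return the most frequent answer; ties are broken uniformly at random."""
    if rng is None:
        rng = random.Random()
    counts = Counter(answers)
    top = max(counts.values())
    # Every answer that reaches the highest count is a candidate.
    tied = [answer for answer, count in counts.items() if count == top]
    return rng.choice(tied)


def maj_at_k(generated, is_correct, k, rng=None):
    """Average correctness of the majority-voted answer over all questions.

    generated:  list of lists, generated[i] holds the answers for question i
    is_correct: callable (question_index, answer) -> bool (hypothetical checker)
    k:          number of answers per question that take part in the vote
    """
    total = 0
    for i, answers in enumerate(generated):
        voted = majority_answer(answers[:k], rng)
        total += int(is_correct(i, voted))
    return total / len(generated)
```

Passing a seeded `random.Random` instance as `rng` makes the tie-breaking reproducible between evaluation runs.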

Example

Let’s walk through a simple example with a dataset of size $m = 3$.

Imagine a language model is tasked with solving mathematical equations. To evaluate its performance on this task, we generate $k = 5$ answers per equation.

For the first equation, the results are as follows:

Answer $a_{1}$ (incorrect)
Answer $b_{1}$ (correct)
Answer $b_{1}$ (correct)
Answer $b_{1}$ (correct)
Answer $c_{1}$ (incorrect)

Then $\hat{a}_{1, maj} = b_{1}$.

For the second equation, the results are as follows:

Answer $a_{2}$ (incorrect)
Answer $b_{2}$ (incorrect)
Answer $b_{2}$ (incorrect)
Answer $a_{2}$ (incorrect)
Answer $c_{2}$ (correct)

$a_{2}$ and $b_{2}$ are tied with two occurrences each. Let’s break the tie by selecting at random: $\hat{a}_{2, maj} = a_{2}$.

For the third equation, the results are as follows:

Answer $a_{3}$ (incorrect)
Answer $b_{3}$ (incorrect)
Answer $c_{3}$ (correct)
Answer $c_{3}$ (correct)
Answer $c_{3}$ (correct)

Then $\hat{a}_{3, maj} = c_{3}$.

Since $b_{1}$ and $c_{3}$ are correct in our example, and $a_{2}$ is not:

$$maj @ k = \frac{1}{3} (1 + 0 + 1) = \frac{2}{3}$$
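The same example can be checked with a short script. It is a sketch under stated assumptions: the strings `"a1"`, `"b1"`, and so on stand in for the answers above, and `vote` is a hypothetical helper that breaks ties by taking the answer seen first instead of drawing at random. That shortcut does not change the score here, because both tied answers for the second equation are incorrect.

```python
from collections import Counter
from fractions import Fraction

# Answers from the example, k = 5 per equation (placeholder strings).
generated = [
    ["a1", "b1", "b1", "b1", "c1"],
    ["a2", "b2", "b2", "a2", "c2"],
    ["a3", "b3", "c3", "c3", "c3"],
]
# The correct answer for each equation, as marked in the example.
correct = ["b1", "c2", "c3"]


def vote(answers):
    # most_common(1) returns a single most frequent answer; on a tie this
    # is a deterministic pick, not the random draw described in the text.
    return Counter(answers).most_common(1)[0][0]


voted = [vote(answers) for answers in generated]
hits = sum(v == c for v, c in zip(voted, correct))
score = Fraction(hits, len(generated))
print(voted, score)  # the score is 2/3
```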

Usage

$maj @ k$ measures the model’s reliability in producing a correct answer as its most frequent suggestion. A high $maj @ k$ score suggests that the model is not just getting lucky with one of its generations, but is consistently producing the correct solution.
$maj @ k$ is robust to occasional, non-recurring errors. A single anomalous incorrect answer will not affect the outcome as long as the correct answer holds the majority.
$maj @ k$ is a sweet spot between pass@k and pass^k: the former is generous in its scoring (is there at least one correct answer?), while the latter is strict (are all the answers correct?).
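To make the comparison concrete, the sketch below scores the example data with all three criteria. It makes simplifying assumptions: pass@k is computed as the plain empirical "at least one of the $k$ answers is correct" rate rather than with a sampling-based estimator, pass^k as "all $k$ answers are correct", and ties in the vote go to the answer seen first. The variable and function names are illustrative.

```python
from collections import Counter

# Same placeholder data as in the example section, k = 5 per equation.
generated = [
    ["a1", "b1", "b1", "b1", "c1"],
    ["a2", "b2", "b2", "a2", "c2"],
    ["a3", "b3", "c3", "c3", "c3"],
]
correct = ["b1", "c2", "c3"]


def per_question_flags(answers, reference):
    """Return (any correct, majority correct, all correct) for one question."""
    flags = [answer == reference for answer in answers]
    majority = Counter(answers).most_common(1)[0][0]
    return any(flags), majority == reference, all(flags)


rows = [per_question_flags(a, c) for a, c in zip(generated, correct)]
m = len(rows)
# zip(*rows) regroups the flags by criterion; summing booleans counts successes.
pass_at_k, maj_at_k, pass_hat_k = (sum(column) / m for column in zip(*rows))
print(pass_at_k, maj_at_k, pass_hat_k)
```

On this data pass@k is 1, maj@k is 2/3, and pass^k is 0. For any single question, "all answers correct" implies "majority answer correct", which in turn implies "at least one answer correct", so under these definitions the three scores are always ordered pass^k ≤ maj@k ≤ pass@k.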

Antoine Déchappe
