Definition

maj@k measures the performance of a model based on a majority vote. For each problem, k solutions are generated. The most frequent solution is identified, and it is marked as correct if it passes the problem’s tests. The final score is the percentage of problems for which the most frequent solution was correct. maj@k therefore assesses the model’s ability to produce a consistent and correct answer.

Formula

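Let $D$ be a test dataset

  • of size $n$,
  • with questions $q_1, \dots, q_n$,
  • and answers $a_1, \dots, a_n$,

let $G$ be a dataset of independently generated answers where

  • $k$ is the number of generated answers per question,
  • $\hat{a}_{i,j}$ is the model’s $j$-th final answer for $q_i$,
  • $c_i$ is the number of correct solutions for $q_i$ in $G$,

then, for a single question $q_i$, the most frequent answer, or mode, is:

$$\hat{m}_i = \operatorname{mode}(\hat{a}_{i,1}, \dots, \hat{a}_{i,k}) = \arg\max_{a} \sum_{j=1}^{k} \mathbb{1}(\hat{a}_{i,j} = a)$$

If there is a tie for the most frequent answer, one is selected at random.

We then define $\mathbb{1}(\hat{m}_i = a_i)$, an indicator function equal to 1 if the majority-voted answer is correct, and 0 otherwise.

The overall metric is the average correctness of these majority-voted answers over the entire dataset:

$$\text{maj@}k = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(\hat{m}_i = a_i)$$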
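The majority vote and the averaging step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `majority_answer` and `maj_at_k` are hypothetical, and correctness is assumed to mean exact equality with a reference answer (a test-based check would replace the `==` comparison).

```python
import random
from collections import Counter


def majority_answer(answers, rng):
    """Return the most frequent answer; ties are broken uniformly at random."""
    counts = Counter(answers)
    top = max(counts.values())
    tied = [answer for answer, count in counts.items() if count == top]
    return rng.choice(tied)


def maj_at_k(generated, references, seed=0):
    """Fraction of questions whose majority-voted answer equals the reference.

    generated:  one list of k answers per question
    references: one reference answer per question (assumed exact-match scoring)
    """
    rng = random.Random(seed)  # seeded so that tie-breaking is reproducible
    hits = sum(
        majority_answer(answers, rng) == reference
        for answers, reference in zip(generated, references)
    )
    return hits / len(references)
```

Seeding the random generator is a design choice made here so that repeated evaluations give the same score when ties occur.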

Example

Let’s walk through a simple example with a dataset of size $n = 3$.

Imagine a language model is tasked with solving mathematical equations. To evaluate its performance on this task, we generate $k = 5$ answers per equation.

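For the first equation, the results are as follows:

  • Answer $\hat{a}_{1,1}$ (incorrect)
  • Answer $\hat{a}_{1,2}$ (correct)
  • Answer $\hat{a}_{1,3}$ (correct)
  • Answer $\hat{a}_{1,4}$ (correct)
  • Answer $\hat{a}_{1,5}$ (incorrect)

Then the majority-voted answer is the correct one: $\hat{m}_1 = a_1$.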

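For the second equation, the results are as follows:

  • Answer $\hat{a}_{2,1}$ (incorrect)
  • Answer $\hat{a}_{2,2}$ (incorrect)
  • Answer $\hat{a}_{2,3}$ (incorrect)
  • Answer $\hat{a}_{2,4}$ (incorrect)
  • Answer $\hat{a}_{2,5}$ (correct)

Let’s break the tie by selecting at random. The selected answer is incorrect: $\hat{m}_2 \neq a_2$.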

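For the third equation, the results are as follows:

  • Answer $\hat{a}_{3,1}$ (incorrect)
  • Answer $\hat{a}_{3,2}$ (incorrect)
  • Answer $\hat{a}_{3,3}$ (correct)
  • Answer $\hat{a}_{3,4}$ (correct)
  • Answer $\hat{a}_{3,5}$ (correct)

Then the majority-voted answer is again the correct one: $\hat{m}_3 = a_3$.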

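Since the majority-voted answers $\hat{m}_1$ and $\hat{m}_3$ are correct in our example, and $\hat{m}_2$ is not:

$$\text{maj@}5 = \frac{1}{3}(1 + 0 + 1) = \frac{2}{3} \approx 0.67$$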
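The same three-equation example can be replayed in code. The numeric answers below are hypothetical placeholders: only the correct/incorrect pattern follows the example, and the second equation is assumed to have two incorrect answers tied for first place.

```python
import random
from collections import Counter

# Hypothetical values; only the correct/incorrect pattern mirrors the example.
references = [4, 7, 9]
generated = [
    [3, 4, 4, 4, 5],  # correct answer holds the majority
    [1, 1, 2, 2, 7],  # two incorrect answers tie for most frequent
    [8, 6, 9, 9, 9],  # correct answer holds the majority
]

rng = random.Random(0)  # seeded so that the random tie-break is reproducible
votes = []
for answers in generated:
    counts = Counter(answers)
    top = max(counts.values())
    tied = [answer for answer, count in counts.items() if count == top]
    votes.append(rng.choice(tied))

# 1 if the majority-voted answer matches the reference, 0 otherwise.
correct = [int(vote == ref) for vote, ref in zip(votes, references)]
score = sum(correct) / len(references)
print(correct, round(score, 2))  # prints: [1, 0, 1] 0.67
```

Whichever of the two tied answers is drawn for the second equation, it is incorrect, so the score does not depend on the seed in this case.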

Usage

  • maj@k measures the model’s reliability in producing a correct answer as its most frequent suggestion. A high score suggests that the model is not just getting lucky with one of its generations, but is consistently producing the correct solution.
  • maj@k is robust to occasional, non-recurring errors. A single anomalous incorrect answer will not affect the outcome as long as the correct answer holds the majority.
  • maj@k is a good middle ground between pass@k and pass^k, the former being too generous in its scoring (is there at least one good answer?) and the latter too strict (are all the answers correct?).
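To make the comparison concrete, the three criteria can be evaluated side by side on a single question. This is a sketch under simplifying assumptions: `question_scores` is a hypothetical helper, pass@k and pass^k are taken in the plain "at least one correct" and "all correct" sense described above rather than through any statistical estimator, and correctness is assumed to be exact match against a reference.

```python
import random
from collections import Counter


def question_scores(answers, reference, rng):
    """Score one question under the three criteria (1 = success, 0 = failure)."""
    flags = [answer == reference for answer in answers]
    counts = Counter(answers)
    top = max(counts.values())
    # Majority vote, with ties broken uniformly at random.
    vote = rng.choice([a for a, c in counts.items() if c == top])
    return {
        "pass@k": int(any(flags)),        # at least one correct answer
        "maj@k": int(vote == reference),  # most frequent answer is correct
        "pass^k": int(all(flags)),        # every answer is correct
    }


# Hypothetical answers: three of the five generations match the reference.
scores = question_scores([3, 4, 4, 4, 5], reference=4, rng=random.Random(0))
print(scores)  # prints: {'pass@k': 1, 'maj@k': 1, 'pass^k': 0}
```

With three correct answers out of five, the question counts as solved under pass@k and maj@k but not under pass^k.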