Definition

pass^k measures the probability of a model generating a correct solution in all of its $k$ independent attempts. It demands consistent correctness across all $k$ samples, and so serves as a measure of a model’s reliability.

Formula

let $D = \{(q_i, a_i)\}_{i=1}^{N}$ be a test dataset

  • of size $N$,
  • with questions $q_i$,
  • and answers $a_i$,

let $S = \{s_{i,j}\}$ be a dataset of independently generated answers where

  • $n$ is the number of generated answers per question,
  • $s_{i,j}$ is the model’s $j$-th final answer for $q_i$,
  • $c_i$ is the number of correct solutions for $q_i$ in $S$

then

$$\text{pass}^k = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{c_i}{n}\right)^{k}$$

let’s break this formula down step-by-step:

  • $\frac{c_i}{n}$ computes the probability that a single, randomly chosen sample is correct,
  • $\left(\frac{c_i}{n}\right)^{k}$ computes the probability of this event occurring consecutively $k$ times.
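As a quick sketch, the averaging above can be written in a few lines of Python (the function name `pass_hat_k` and its signature are illustrative, not from any particular library):

```python
from typing import Sequence

def pass_hat_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Estimate pass^k as the mean of (c_i / n)^k over all questions.

    correct_counts: c_i, the number of correct samples for each question.
    n: number of generated samples per question.
    k: number of consecutive samples that must all be correct.
    """
    return sum((c / n) ** k for c in correct_counts) / len(correct_counts)
```

For a single question with 10 correct samples out of 200, `pass_hat_k([10], 200, 1)` returns 0.05, matching the worked example below.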

Example

Let’s walk through a simple example with a dataset of size $N = 1$. For a real dataset, the individual $\left(\frac{c_i}{n}\right)^{k}$ values are averaged across all questions.

A language model is tasked with writing a Python function to check if a number is prime. We generate 200 code samples ($n = 200$). After running unit tests, we find that 10 of them are correct ($c = 10$).

The probability of any single sample being correct is $\frac{c}{n} = \frac{10}{200} = 0.05$.

Now, let’s calculate pass^1, pass^5, and pass^10.

Calculating pass^1:

This tells us the probability that a single, randomly chosen sample is correct.

Using the formula:

$$\text{pass}^1 = \left(\frac{10}{200}\right)^{1} = 0.05$$

So, there is a 5% chance that the first generated sample is correct. This is identical to pass@1.

Calculating pass^5:

This is the probability that all 5 of the first 5 samples are correct.

Using the formula:

$$\text{pass}^5 = \left(\frac{10}{200}\right)^{5} = (0.05)^{5} = 3.125 \times 10^{-7}$$

This means there is an extremely small (approximately 0.00003%) chance that all 5 of the first 5 generated samples would be correct.

Calculating pass^10:

This is the probability that all 10 of the first 10 samples are correct.

The calculation continues in the same manner:

$$\text{pass}^{10} = \left(\frac{10}{200}\right)^{10} = (0.05)^{10} \approx 9.77 \times 10^{-14}$$

The resulting value is astronomically small, highlighting the metric’s strictness.

Usage

  • pass^k is a measure of a model’s reliability and consistency. A high score (for a large $k$) would indicate an exceptionally robust model that is correct with high frequency.
  • pass^k is the conceptual opposite of pass@k. While pass@k values increase toward 1.0 as $k$ grows, pass^k values decrease exponentially, quickly approaching zero.
  • pass^k could be valuable for safety-critical or high-stakes applications where every output in a batch must be correct. For instance, if a model were used to automatically patch a security vulnerability across thousands of codebases, a single failure could be catastrophic.
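The opposite trends of the two metrics can be seen side by side in a short sketch. Here pass@k uses the commonly cited unbiased estimator $1 - \binom{n-c}{k}/\binom{n}{k}$, while pass^k uses the $(c/n)^k$ estimate from this article; the comparison setup itself is an illustration, not part of either metric’s definition:

```python
from math import comb

n, c = 200, 10  # running example: 10 of 200 samples correct

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k: probability that at least one of k drawn samples is correct
    (unbiased estimator 1 - C(n-c, k) / C(n, k))."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """pass^k: probability that all k drawn samples are correct."""
    return (c / n) ** k

# pass@k climbs toward 1.0 while pass^k collapses toward 0.0
for k in (1, 5, 10, 50):
    print(f"k={k:>2}  pass@k={pass_at_k(n, c, k):.4f}  pass^k={pass_hat_k(n, c, k):.2e}")
```

At $k = 1$ the two metrics coincide at 0.05; they diverge rapidly from there.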