Definition

pass^k measures the probability of a model generating a correct solution in all of its $k$ independent attempts. It demands consistent correctness across all $k$ samples, and so serves as a measure of a model’s reliability.

Formula

let $D = \{(q_i, a_i)\}_{i=1}^{N}$ be a test dataset

  • of size $N$,
  • with questions $q_i$,
  • and answers $a_i$,

let $S = \{s_{i,j}\}$ be a dataset of independently generated answers where

  • $n$ is the number of generated answers per question,
  • $s_{i,j}$ is the model’s $j$-th final answer for $q_i$,
  • $c_i$ is the number of correct solutions for $q_i$ in $S$

then

$$\text{pass}^k = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{c_i}{n}\right)^{k}$$

let’s break this formula down step-by-step:

  • $\frac{c_i}{n}$ computes the probability that a single, randomly chosen sample is correct,
  • $\left(\frac{c_i}{n}\right)^{k}$ computes the probability of this event occurring consecutively $k$ times.
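As a quick sketch, the averaging above can be written in a few lines of Python (the function name `pass_hat_k` and its signature are illustrative, not from any particular library):

```python
from typing import Sequence

def pass_hat_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Estimate pass^k as the mean of (c_i / n)^k over all questions.

    correct_counts: c_i, the number of correct samples for each question.
    n: number of generated samples per question.
    k: number of consecutive samples that must all be correct.
    """
    return sum((c / n) ** k for c in correct_counts) / len(correct_counts)
```

For a single question with 10 correct samples out of 200, `pass_hat_k([10], 200, 1)` returns 0.05, matching the worked example below.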

Example

Let’s walk through a simple example with a dataset of size $N = 1$. For a real dataset, the individual $\left(\frac{c_i}{n}\right)^{k}$ values are averaged across all questions.

A language model is tasked with writing a Python function to check if a number is prime. We generate 200 code samples ($n = 200$). After running unit tests, we find that 10 of them are correct ($c = 10$).

The probability of any single sample being correct is $\frac{c}{n} = \frac{10}{200} = 0.05$.

Now, let’s calculate pass^1, pass^5, and pass^10.

Calculating pass^1:

This tells us the probability that a single, randomly chosen sample is correct.

Using the formula:

$$\text{pass}^1 = \left(\frac{10}{200}\right)^{1} = 0.05$$

So, there is a 5% chance that the first generated sample is correct. This is identical to pass@1.

Calculating pass^5:

This is the probability that all 5 of the first 5 samples are correct.

Using the formula:

$$\text{pass}^5 = \left(\frac{10}{200}\right)^{5} = (0.05)^{5} = 3.125 \times 10^{-7}$$

This means there is an extremely small (approximately 0.00003%) chance that all 5 of the first 5 generated samples would be correct.

Calculating pass^10:

This is the probability that all 10 of the first 10 samples are correct.

The calculation continues in the same manner:

$$\text{pass}^{10} = \left(\frac{10}{200}\right)^{10} = (0.05)^{10} \approx 9.77 \times 10^{-14}$$

The resulting value is astronomically small, highlighting the metric’s strictness.

Usage

  • pass^k is a measure of a model’s reliability and consistency. A high score (for a large $k$) would indicate an exceptionally robust model that is correct with high frequency.
  • pass^k is the conceptual opposite of pass@k. While pass@k values increase toward 1.0 as $k$ grows, pass^k values decrease exponentially, quickly approaching zero.
  • pass^k could be valuable for safety-critical or high-stakes applications where every output in a batch must be correct. For instance, if a model were used to automatically patch a security vulnerability across thousands of codebases, a single failure could be catastrophic.
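The opposite trends of the two metrics can be seen side by side in a short sketch. Here pass@k uses the commonly cited unbiased estimator $1 - \binom{n-c}{k}/\binom{n}{k}$, while pass^k uses the $(c/n)^k$ estimate from this article; the comparison setup itself is an illustration, not part of either metric’s definition:

```python
from math import comb

n, c = 200, 10  # running example: 10 of 200 samples correct

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k: probability that at least one of k drawn samples is correct
    (unbiased estimator 1 - C(n-c, k) / C(n, k))."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """pass^k: probability that all k drawn samples are correct."""
    return (c / n) ** k

# pass@k climbs toward 1.0 while pass^k collapses toward 0.0
for k in (1, 5, 10, 50):
    print(f"k={k:>2}  pass@k={pass_at_k(n, c, k):.4f}  pass^k={pass_hat_k(n, c, k):.2e}")
```

At $k = 1$ the two metrics coincide at 0.05; they diverge rapidly from there.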