Definition

pass@k measures the probability of finding at least one correct solution within a random sample of $k$ solutions, drawn from a larger pool of $n$ independently generated attempts.

Formula

let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ be a test dataset

  • of size $N$,
  • with questions $x_i$,
  • and answers $y_i$,

let $\hat{Y} = \{\hat{y}_{i,j}\}$ be a dataset of independently generated answers where

  • $n$ is the number of generated answers per question,
  • $\hat{y}_{i,j}$ is the model’s $j$-th final answer for $x_i$,
  • $c_i$ is the number of correct solutions for $x_i$ in $\hat{y}_{i,1}, \dots, \hat{y}_{i,n}$,

then

$$\text{pass@}k = \frac{1}{N} \sum_{i=1}^{N} \left( 1 - \frac{\binom{n - c_i}{k}}{\binom{n}{k}} \right)$$

Let’s break this formula down step by step:

  • $\binom{n - c_i}{k}$ counts the ways to choose $k$ samples that are all incorrect,
  • $\binom{n}{k}$ counts the total ways to choose $k$ samples,
  • $\binom{n - c_i}{k} / \binom{n}{k}$ is therefore the probability of drawing only incorrect samples,
  • $1 - \binom{n - c_i}{k} / \binom{n}{k}$ is the probability of the complementary event: drawing at least one correct sample.
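
To make the formula concrete, here is a minimal Python sketch of the per-question term (the function name pass_at_k and the NumPy dependency are our own choices here, not part of the metric’s definition). Evaluating the binomial coefficients directly produces very large integers, so the sketch uses the algebraically equivalent product $\binom{n-c}{k} / \binom{n}{k} = \prod_{i=n-c+1}^{n} \left(1 - \frac{k}{i}\right)$:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Per-question pass@k: 1 - C(n-c, k) / C(n, k).

    n: total number of generated samples
    c: number of correct samples among them
    k: number of samples drawn at random
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples must contain at least one correct one.
        return 1.0
    # C(n-c, k) / C(n, k) == prod_{i=n-c+1..n} (1 - k/i),
    # which avoids forming huge binomial coefficients.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

Averaging this quantity over all $N$ questions yields the reported pass@k.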

Example

Let’s walk through a simple example with a dataset of size $N = 1$. For a real dataset, the individual per-question pass@k values are averaged across all questions.

Imagine a language model is tasked with writing a Python function to check if a number is prime. To evaluate its performance on this task, we generate 200 code samples ($n = 200$). After running unit tests on all of them, we find that 10 of the samples are correct ($c = 10$).

Now, let’s calculate pass@1, pass@5, and pass@10.

Calculating pass@1:

This tells us the probability that a single, randomly chosen sample is correct.

Using the formula:

$$\text{pass@}1 = 1 - \frac{\binom{200 - 10}{1}}{\binom{200}{1}} = 1 - \frac{190}{200} = 0.05$$

So, there is a 5% chance that any single generated sample is correct. This is identical to $c / n$.
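
A quick check with exact rational arithmetic (Python’s fractions module) confirms both the value and the equality with $c / n$:

```python
from fractions import Fraction
from math import comb

n, c = 200, 10
pass_at_1 = 1 - Fraction(comb(n - c, 1), comb(n, 1))
print(pass_at_1)                    # 1/20, i.e. 0.05
print(pass_at_1 == Fraction(c, n))  # True
```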

Calculating pass@5:

This is the probability that at least one of 5 randomly chosen samples is correct.

Using the formula:

$$\text{pass@}5 = 1 - \frac{\binom{200 - 10}{5}}{\binom{200}{5}} = 1 - \frac{\binom{190}{5}}{\binom{200}{5}} \approx 1 - 0.7717 = 0.2283$$

So, there is approximately a 23% chance of finding a correct solution among 5 randomly drawn samples.
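
The two-step structure of the formula is visible if we compute the ratio first (math.comb gives exact binomial coefficients):

```python
from math import comb

n, c, k = 200, 10, 5
ratio = comb(n - c, k) / comb(n, k)  # probability that all 5 drawn samples are incorrect
print(round(ratio, 4))      # 0.7717
print(round(1 - ratio, 4))  # 0.2283
```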

Calculating pass@10:

This is the probability that at least one of 10 randomly chosen samples is correct.

The calculation continues in the same manner:

$$\text{pass@}10 = 1 - \frac{\binom{190}{10}}{\binom{200}{10}} \approx 1 - 0.5915 = 0.4085$$

The resulting value, roughly 41%, is higher than pass@5, indicating an increased likelihood of finding a correct solution as we consider more samples.
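
Putting the three calculations side by side (again with exact binomials via math.comb) shows the monotone climb:

```python
from math import comb

n, c = 200, 10
for k in (1, 5, 10):
    print(f"pass@{k} = {1 - comb(n - c, k) / comb(n, k):.4f}")
# pass@1 = 0.0500
# pass@5 = 0.2283
# pass@10 = 0.4085
```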

Usage

  • pass@k can be misleading, as it may inflate a model’s perceived performance. The metric measures the probability of finding at least one correct solution within $k$ attempts, which can mask a low rate of first-try success: as the sketch after this list illustrates, the value often rises sharply with $k$.
  • Although the sample size ($n$) is not part of the metric’s name, it is a critical factor in its interpretation. For a fixed success rate $c / n$, a larger sample size provides a more rigorous evaluation and results in a more conservative (i.e., lower) score.
  • The metric is most suitable for tasks where solutions can be verified automatically and inexpensively. A prime example is code generation, where unit tests can validate solutions without human intervention. In such scenarios, pass@k serves as an excellent measure of a model’s practical utility: the likelihood that it will provide a working solution within a few attempts.
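
Both cautions above are easy to verify numerically. The sketch below reuses the hypothetical numbers from the example, with an exact math.comb variant of the earlier pass_at_k:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n-c, k) / C(n, k), with exact binomial coefficients."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples exist
    return 1 - comb(n - c, k) / comb(n, k)

# pass@k rises sharply with k (n=200, c=10 from the example):
# it climbs from 0.05 at k=1 to roughly 0.95 by k=50.
for k in (1, 2, 5, 10, 25, 50):
    print(f"pass@{k} = {pass_at_k(200, 10, k):.3f}")

# The same 5% success rate, scored against growing sample pools:
for n, c in ((20, 1), (200, 10), (2000, 100)):
    print(f"n={n}, c={c} -> pass@5 = {pass_at_k(n, c, 5):.4f}")
# n=20, c=1 -> pass@5 = 0.2500
# n=200, c=10 -> pass@5 = 0.2283
# n=2000, c=100 -> pass@5 = 0.2264
```

As $n \to \infty$ with the success rate fixed at 5%, pass@5 approaches $1 - 0.95^5 \approx 0.226$ from above, so small sample pools systematically overestimate it.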