Definition

Perplexity is one of the most common metrics for evaluating autoregressive language models. It is a measure of how well a probability model predicts a sample.

Formula

Let $\theta$ be the parameters of an autoregressive model. If we have a tokenized sequence $W = (w_1, w_2, \ldots, w_T)$,

then the perplexity of $W$ with respect to $\theta$ is:

$$\mathrm{PPL}(W) = \exp\left(-\frac{1}{T}\sum_{i=1}^{T} \log p_\theta(w_i \mid w_{<i})\right)$$

Let’s break this formula down step by step:

  • $p_\theta(w_i \mid w_{<i})$ is the likelihood of the $i$-th token conditioned on the preceding tokens according to our model
  • $-\log p_\theta(w_i \mid w_{<i})$ is the negative log of this likelihood
  • $\frac{1}{T}\sum_{i=1}^{T} -\log p_\theta(w_i \mid w_{<i})$ computes the average of the negative log-likelihoods
  • $\exp(\cdot)$ computes the exponential of the average

Note: The base of the logarithm and the exponentiation must match. While the natural logarithm ($\ln$, base $e$) is common, perplexity is often reported using base 2.
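The formula can be sketched in a few lines of Python. Here `perplexity` is a hypothetical helper (not a library function), and the example also illustrates the note above: as long as the logarithm and the exponentiation use the same base, the result is identical.

```python
import math

def perplexity(token_probs, base=math.e):
    """Perplexity of a sequence, given the model's probability
    for each correct token. Hypothetical helper for illustration."""
    t = len(token_probs)
    # Average negative log-likelihood in the chosen base.
    avg_nll = -sum(math.log(p, base) for p in token_probs) / t
    # Exponentiate in the same base.
    return base ** avg_nll

probs = [0.6, 0.5, 0.4]
print(perplexity(probs))          # natural log, base e
print(perplexity(probs, base=2))  # base 2 gives the same value
```

Both calls print the same number, because the base cancels between the logarithm and the exponentiation.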

Example

Let’s calculate the perplexity of a short sentence given the output probabilities of a hypothetical autoregressive model.

  • Vocabulary: {“the”, “cat”, “couch”, “sat”, “on”} (plus a <start> token)
  • Test Sequence (W): (“the”, “cat”, “sat”)
  • Sequence Length (T): 3

Our model will process this sequence one token at a time and give us a probability for the correct next token.

Step 1: First token, “the”

  • Context: <start>
  • Model predicts a probability for each word in the vocabulary.

We only care about the probability it assigns to the actual token; let’s say $p_\theta(\text{“the”} \mid \text{<start>}) = 0.6$.

Step 2: Second token, “cat”

  • Context: “the”
  • Model predicts a probability for each word given “the”.

The probability of the actual token is $p_\theta(\text{“cat”} \mid \text{“the”}) = 0.5$.

Step 3: Third token, “sat”

  • Context: “the”, “cat”
  • Model predicts a probability for each word given “the cat”.

The probability of the actual token is $p_\theta(\text{“sat”} \mid \text{“the cat”}) = 0.4$.

Step 4: Calculation

Now we have the probabilities of the correct tokens at each step: [0.6, 0.5, 0.4]. Let’s plug them into the formula using base 2.

  1. Calculate the log probabilities: $\log_2 0.6 \approx -0.737$, $\log_2 0.5 = -1$, $\log_2 0.4 \approx -1.322$
  2. Calculate the average negative log-likelihood (the cross-entropy): $-\frac{1}{3}(-0.737 - 1 - 1.322) \approx 1.02$
  3. Calculate the perplexity by exponentiating: $2^{1.02} \approx 2.03$
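The arithmetic above can be checked in a few lines of Python:

```python
import math

# Per-token probabilities of the correct tokens from Steps 1-3.
probs = [0.6, 0.5, 0.4]

log_probs = [math.log2(p) for p in probs]     # ≈ [-0.737, -1.0, -1.322]
cross_entropy = -sum(log_probs) / len(probs)  # ≈ 1.02 bits per token
ppl = 2 ** cross_entropy
print(round(ppl, 2))  # → 2.03
```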

Conclusion

The perplexity of the sequence “the cat sat” for our model is approximately 2.03.

Usage

  • The higher the likelihood of the correct next token, the lower its negative log-likelihood; conversely, an unexpected correct token with a very small likelihood will have a large impact on the average negative log-likelihood, and hence on the perplexity.
  • While a lower perplexity is always better, there is no universal “good” perplexity score. A PPL of 50 might be state-of-the-art for a complex domain like legal or scientific text, while a PPL of 50 for a simple domain like children’s stories would be considered very poor.
  • Perplexity is a good proxy for language understanding, but it’s not the same as performance on a downstream task like summarization, translation, or question answering.
  • Perplexity scores are only comparable if the underlying conditions are identical. This means:
    • Same Test Set: You must use the exact same test data.
    • Same Tokenization: The way you split text into tokens (e.g., WordPiece, BPE) must be identical. A different tokenizer creates a different vocabulary and sequence length, making PPL scores incomparable.
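In practice, a model outputs logits rather than probabilities: the per-token likelihoods come from a softmax over the vocabulary, and perplexity is the exponentiated average negative log-likelihood. A minimal pure-Python sketch of that pipeline follows; the logits and target indices are invented for illustration, not taken from a real model.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits over a 5-token vocabulary at each of 3 steps,
# and the vocabulary index of the correct next token at each step.
logits_per_step = [
    [2.0, 0.5, -1.0, 0.0, 0.1],
    [0.3, 1.8, -0.5, 0.2, 0.0],
    [0.1, 0.2, 0.0, 1.5, -0.3],
]
targets = [0, 1, 3]

nll = 0.0
for logits, target in zip(logits_per_step, targets):
    p = softmax(logits)[target]  # likelihood of the correct token
    nll -= math.log(p)           # accumulate negative log-likelihood
ppl = math.exp(nll / len(targets))  # exponentiate the average
print(ppl)
```

Real evaluation harnesses do the same computation with the model's cross-entropy loss, which is exactly the average negative log-likelihood accumulated here.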

Additional readings

The Gradient Pub - Understanding evaluation metrics for language models