Definition

Perplexity is one of the most common metrics for evaluating autoregressive language models. It is a measure of how well a probability model predicts a sample.

Formula

Let $\theta$ be the parameters of an autoregressive model. If we have a tokenized sequence $W = (w_1, w_2, \ldots, w_T)$,

then the perplexity of $W$ with respect to $\theta$ is:

$$\mathrm{PPL}(W) = \exp\left(-\frac{1}{T}\sum_{i=1}^{T} \log p_\theta(w_i \mid w_{<i})\right)$$

Let’s break this formula down step by step:

  • $p_\theta(w_i \mid w_{<i})$ is the likelihood of the $i$-th token conditioned on the preceding tokens according to our model
  • $-\log p_\theta(w_i \mid w_{<i})$ is the negative log of this likelihood
  • $\frac{1}{T}\sum_{i=1}^{T} -\log p_\theta(w_i \mid w_{<i})$ computes the average of the negative log-likelihoods
  • $\exp(\cdot)$ computes the exponential of the average

Note: The base of the logarithm and the exponentiation must match. While the natural logarithm ($\ln$, base $e$) is common, perplexity is often reported using base 2.
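The formula can be sketched in a few lines of Python. Here `perplexity` is a hypothetical helper (not a library function), and the example also illustrates the note above: as long as the logarithm and the exponentiation use the same base, the result is identical.

```python
import math

def perplexity(token_probs, base=math.e):
    """Perplexity of a sequence, given the model's probability
    for each correct token. Hypothetical helper for illustration."""
    t = len(token_probs)
    # Average negative log-likelihood in the chosen base.
    avg_nll = -sum(math.log(p, base) for p in token_probs) / t
    # Exponentiate in the same base.
    return base ** avg_nll

probs = [0.6, 0.5, 0.4]
print(perplexity(probs))          # natural log, base e
print(perplexity(probs, base=2))  # base 2 gives the same value
```

Both calls print the same number, because the base cancels between the logarithm and the exponentiation.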

Example

Let’s calculate the perplexity of a short sentence given the output probabilities of a hypothetical autoregressive model.

  • Vocabulary: {“the”, “cat”, “couch”, “sat”, “on”} (plus a <start> token)
  • Test Sequence (W): (“the”, “cat”, “sat”)
  • Sequence Length (T): 3

Our model will process this sequence one token at a time and give us a probability for the correct next token.

Step 1: First token, “the”

  • Context: <start>
  • Model predicts a probability for each word in the vocabulary.

We only care about the probability it assigns to the actual token; let’s say $p_\theta(\text{“the”} \mid \text{<start>}) = 0.6$.

Step 2: Second token, “cat”

  • Context: “the”
  • Model predicts a probability for each word given “the”.

The probability of the actual token is $p_\theta(\text{“cat”} \mid \text{“the”}) = 0.5$.

Step 3: Third token, “sat”

  • Context: “the”, “cat”
  • Model predicts a probability for each word given “the cat”.

The probability of the actual token is $p_\theta(\text{“sat”} \mid \text{“the cat”}) = 0.4$.

Step 4: Calculation

Now we have the probabilities of the correct tokens at each step: [0.6, 0.5, 0.4]. Let’s plug them into the formula using base 2.

  1. Calculate the log probabilities: $\log_2 0.6 \approx -0.737$, $\log_2 0.5 = -1$, $\log_2 0.4 \approx -1.322$
  2. Calculate the average negative log-likelihood (the cross-entropy): $-\frac{1}{3}(-0.737 - 1 - 1.322) \approx 1.02$
  3. Calculate the perplexity by exponentiating: $2^{1.02} \approx 2.03$
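The arithmetic above can be checked in a few lines of Python:

```python
import math

# Per-token probabilities of the correct tokens from Steps 1-3.
probs = [0.6, 0.5, 0.4]

log_probs = [math.log2(p) for p in probs]     # ≈ [-0.737, -1.0, -1.322]
cross_entropy = -sum(log_probs) / len(probs)  # ≈ 1.02 bits per token
ppl = 2 ** cross_entropy
print(round(ppl, 2))  # → 2.03
```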

Conclusion

The perplexity of the sequence “the cat sat” for our model is approximately 2.03.

Usage

  • The higher the likelihood of the correct next token, the lower its negative log-likelihood; conversely, an unexpected correct token with a very small likelihood will have a large impact on the average negative log-likelihood, and hence on the perplexity.
  • While a lower perplexity is always better, there is no universal “good” perplexity score. A PPL of 50 might be state-of-the-art for a complex domain like legal or scientific text, while a PPL of 50 for a simple domain like children’s stories would be considered very poor.
  • Perplexity is a good proxy for language understanding, but it’s not the same as performance on a downstream task like summarization, translation, or question answering.
  • Perplexity scores are only comparable if the underlying conditions are identical. This means:
    • Same Test Set: You must use the exact same test data.
    • Same Tokenization: The way you split text into tokens (e.g., WordPiece, BPE) must be identical. A different tokenizer creates a different vocabulary and sequence length, making PPL scores incomparable.
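In practice, a model outputs logits rather than probabilities: the per-token likelihoods come from a softmax over the vocabulary, and perplexity is the exponentiated average negative log-likelihood. A minimal pure-Python sketch of that pipeline follows; the logits and target indices are invented for illustration, not taken from a real model.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits over a 5-token vocabulary at each of 3 steps,
# and the vocabulary index of the correct next token at each step.
logits_per_step = [
    [2.0, 0.5, -1.0, 0.0, 0.1],
    [0.3, 1.8, -0.5, 0.2, 0.0],
    [0.1, 0.2, 0.0, 1.5, -0.3],
]
targets = [0, 1, 3]

nll = 0.0
for logits, target in zip(logits_per_step, targets):
    p = softmax(logits)[target]  # likelihood of the correct token
    nll -= math.log(p)           # accumulate negative log-likelihood
ppl = math.exp(nll / len(targets))  # exponentiate the average
print(ppl)
```

Real evaluation harnesses do the same computation with the model's cross-entropy loss, which is exactly the average negative log-likelihood accumulated here.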

Additional readings

The Gradient Pub - Understanding evaluation metrics for language models