Definition
Perplexity is one of the most common metrics for evaluating autoregressive language models. It is a measure of how well a probability model predicts a sample: the better the model predicts the text, the lower its perplexity.
Formula
Let $\theta$ be the parameters of an autoregressive model. If we have a tokenized sequence $W = (w_1, w_2, \ldots, w_T)$,
then the perplexity of $W$ with respect to $\theta$ is:
$$\mathrm{PPL}_\theta(W) = \exp\left(-\frac{1}{T}\sum_{i=1}^{T} \log p_\theta(w_i \mid w_{<i})\right)$$
Let’s break this formula down step by step:
- $p_\theta(w_i \mid w_{<i})$ is the likelihood of the $i$-th token conditioned on the preceding tokens according to our model
- $-\log p_\theta(w_i \mid w_{<i})$ is the negative log of this likelihood
- $\frac{1}{T}\sum_{i=1}^{T}$ computes the average of the negative log-likelihoods
- $\exp(\cdot)$ computes the exponential of the average.
Note: The base of the logarithm and the exponentiation must match. While the natural logarithm ($\ln$, base $e$) is common, perplexity is often reported using base 2.
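To make the formula concrete, here is a minimal Python sketch that computes perplexity directly from the per-token probabilities $p_\theta(w_i \mid w_{<i})$. The function name and the idea of passing raw probabilities (rather than model logits) are assumptions made for illustration.

```python
import math

def perplexity(token_probs, base=math.e):
    """Perplexity of a sequence, given the probabilities the model
    assigned to each actual token, i.e. p(w_i | w_<i) for i = 1..T."""
    T = len(token_probs)
    # Average negative log-likelihood (cross-entropy) in the chosen base
    avg_nll = -sum(math.log(p, base) for p in token_probs) / T
    # Perplexity is the exponential of the cross-entropy, in the same base
    return base ** avg_nll
```

Because the logarithm and the exponentiation share the same base, `perplexity(probs)` and `perplexity(probs, base=2)` return the same value.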
Example
Let’s calculate the perplexity of a short sentence given the output probabilities from a hypothetical autoregressive model.
- Vocabulary: {“the”, “cat”, “couch”, “sat”, “on”} (plus a <start> token)
- Test Sequence ($W$): (“the”, “cat”, “sat”)
- Sequence Length ($T$): 3
Our model will process this sequence one token at a time and give us a probability for the correct next token.
Step 1: First token, “the”
- Context: <start>
- Model predicts a probability for each word in the vocabulary.
We only care about the probability of the actual token: $p_\theta(\text{“the”} \mid \text{<start>}) = 0.6$.
Step 2: Second token, “cat”
- Context: “the”
- Model predicts the probability for the next word given “the”.
The probability of the actual token is $p_\theta(\text{“cat”} \mid \text{“the”}) = 0.5$.
Step 3: Third token, “sat”
- Context: “the”, “cat”
- Model predicts the probability for the next word given “the cat”.
The probability of the actual token is $p_\theta(\text{“sat”} \mid \text{“the”, “cat”}) = 0.4$.
Step 4: Calculation
Now we have the probabilities for the correct tokens at each step: [0.6, 0.5, 0.4]. Let’s plug them into the formula using base 2.
- Calculate the log probabilities: $\log_2 0.6 \approx -0.737$, $\log_2 0.5 = -1.0$, $\log_2 0.4 \approx -1.322$
- Calculate the average negative log-likelihood (the cross-entropy): $H = \frac{0.737 + 1.0 + 1.322}{3} \approx 1.02$
- Calculate the perplexity by exponentiating: $\mathrm{PPL} = 2^{H} = 2^{1.02} \approx 2.03$
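As a quick sanity check, here is the same arithmetic in a few lines of Python, using the probabilities from the steps above:

```python
import math

# Probabilities the model assigned to the actual tokens "the", "cat", "sat"
probs = [0.6, 0.5, 0.4]

# Average negative log-likelihood in base 2 (the cross-entropy)
cross_entropy = -sum(math.log2(p) for p in probs) / len(probs)

# Exponentiate in the same base to get the perplexity
ppl = 2 ** cross_entropy

print(round(cross_entropy, 2), round(ppl, 2))  # 1.02 2.03
```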
Conclusion
The perplexity of the sequence “the cat sat” for our model is approximately 2.03.
Usage
- The higher the likelihood of the correct next token, the lower its negative log-likelihood; conversely, a correct token that the model assigned a very small likelihood has a considerable impact on the average of the negative log-likelihoods and hence on the perplexity.
- While a lower perplexity is always better, there is no universal “good” perplexity score. A PPL of 50 might be state-of-the-art for a complex domain like legal or scientific text, while a PPL of 50 for a simple domain like children’s stories would be considered very poor.
- Perplexity is a good proxy for language understanding, but it’s not the same as performance on a downstream task like summarization, translation, or question answering.
- Perplexity scores are only comparable if the underlying conditions are identical. This means:
- Same Test Set: You must use the exact same test data.
- Same Tokenization: The way you split text into tokens (e.g., WordPiece, BPE) must be identical. A different tokenizer creates a different vocabulary and sequence length, making PPL scores incomparable.
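In practice, perplexity is rarely computed by hand; it is derived from a model’s cross-entropy loss. The snippet below is one common recipe using Hugging Face transformers, with GPT-2 as a stand-in model; treat it as a sketch for short inputs (long texts are usually evaluated with a sliding window over fixed-length chunks).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM; PPL is only comparable across runs sharing the tokenizer and test set
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tokenizer("the cat sat on the couch", return_tensors="pt")

with torch.no_grad():
    # Passing the input ids as labels makes the model return the average
    # cross-entropy (natural log) over its next-token predictions
    loss = model(**inputs, labels=inputs["input_ids"]).loss

ppl = torch.exp(loss)  # exponentiate in the same base (e) as the loss
print(f"Perplexity: {ppl.item():.2f}")
```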
Additional readings
The Gradient Pub - Understanding evaluation metrics for language models