Perplexity: Information-Theoretic Measure of Language Model Prediction Quality
PPL(W) = exp(−(1/N) Σ log P(w_i|w_{<i})) = exp(cross-entropy per token); GPT-2 117M zero-shot: 35.1 PPL on Penn Treebank (Radford et al., 2019); GPT-3 175B zero-shot: 20.5 PPL (Brown et al., 2020); 4-gram KN baseline: 141.2 PPL; human-level estimated ~10–20 PPL.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Perplexity formula | PPL(W) = exp(−(1/N) Σ log P(w_i|w_{<i})) | — | Equivalent to exp(CE), where CE is cross-entropy per token in nats; lower PPL = better model |
| GPT-2 117M on Penn Treebank (zero-shot) | 35.1 | perplexity | Radford et al. (2019): zero-shot on PTB; prior supervised LSTM best was ~78 PPL |
| GPT-3 175B on Penn Treebank (zero-shot) | 20.5 | perplexity | Brown et al. (2020): zero-shot; 4-gram KN LM baseline is 141.2 PPL on PTB |
| Perplexity branching factor intuition | PPL ≈ effective vocabulary size at each step | — | PPL=35 means the model is as uncertain as choosing uniformly among 35 equiprobable tokens |
| Human perplexity estimate (English) | ~10–20 | perplexity | Domain-dependent: ~10 PPL on formal news text; for comparison, strong models reach ~25 PPL on diverse web text |
Perplexity is the canonical intrinsic evaluation metric for language models, measuring how well a model predicts held-out text. It is the exponential of the average cross-entropy loss per token — an information-theoretic quantity expressing the model’s average surprise at each observed token.
The Formula
Given a token sequence W = (w_1, …, w_N) and an autoregressive model P:
PPL(W) = exp(−(1/N) Σ_{i=1}^{N} log P(w_i | w_1, …, w_{i-1}))
Equivalently: PPL(W) = exp(H) where H is the cross-entropy per token in nats.
In bits: PPL(W) = 2^{H_bits} where H_bits is cross-entropy per token in bits.
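As a concrete sketch (plain Python, no model dependencies), perplexity can be computed directly from per-token probabilities, and the nats/bits forms agree:

```python
import math

def perplexity(token_probs):
    """Perplexity from per-token probabilities P(w_i | w_<i)."""
    n = len(token_probs)
    ce_nats = -sum(math.log(p) for p in token_probs) / n  # cross-entropy in nats
    return math.exp(ce_nats)

# A model that assigns probability 1/35 to every token has PPL = 35.
probs = [1 / 35] * 10
print(perplexity(probs))  # 35.0 up to float rounding

# Equivalence of the two forms: exp(CE in nats) == 2 ** (CE in bits)
ce_nats = -sum(math.log(p) for p in probs) / len(probs)
ce_bits = -sum(math.log2(p) for p in probs) / len(probs)
assert abs(math.exp(ce_nats) - 2 ** ce_bits) < 1e-9
```

A perfect model (probability 1.0 on every observed token) gives `perplexity([1.0] * n) == 1.0`, the metric's floor.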
Historical Progress on Penn Treebank
| Model | PPL (PTB) | Year | Notes |
|---|---|---|---|
| 4-gram Kneser-Ney | 141.2 | Pre-2010 | N-gram with smoothing |
| LSTM LM | ~78 | 2013 | Basic LSTM language model |
| AWD-LSTM | 57.3 | 2017 | Merity et al. (2018); strong supervised baseline |
| Transformer-XL | 21.8 | 2019 | Extended context transformer |
| GPT-2 117M (zero-shot) | 35.1 | 2019 | Radford et al.; no PTB training |
| GPT-3 175B (zero-shot) | 20.5 | 2020 | Brown et al.; matches supervised SOTA |
GPT-3’s zero-shot perplexity (20.5) beats not only the strong supervised AWD-LSTM baseline (57.3) but also supervised Transformer-XL (21.8), despite never being trained explicitly on PTB.
Cross-Entropy and Perplexity
| Metric | Formula | Typical Values |
|---|---|---|
| Cross-entropy (nats) | −(1/N) Σ log_e P(w_i) | 2.0–4.5 |
| Cross-entropy (bits) | −(1/N) Σ log₂ P(w_i) | 1.5–6.5 |
| Perplexity | exp(CE in nats) | 7–100+ |
A cross-entropy of 3.5 nats corresponds to PPL ≈ 33; a CE of 3.0 nats corresponds to PPL ≈ 20.
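The conversion is a one-liner; a minimal check of the figures above:

```python
import math

def ppl_from_ce(ce_nats: float) -> float:
    """Perplexity from cross-entropy per token, measured in nats."""
    return math.exp(ce_nats)

print(round(ppl_from_ce(3.5), 1))  # 33.1
print(round(ppl_from_ce(3.0), 1))  # 20.1
```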
Scaling Laws and Perplexity
Cross-entropy loss follows smooth power-law scaling with model size and training data (Kaplan et al., 2020; see scaling-laws), and perplexity inherits this via PPL = exp(L):
- L(N) ≈ (N_c / N)^{0.076} — loss decreases predictably with non-embedding parameter count
- L(D) ≈ (D_c / D)^{0.095} — loss decreases predictably with training token count
This predictability is why perplexity is used as the primary objective for scaling law research.
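As an illustrative sketch, the parameter power law can be evaluated numerically. N_c ≈ 8.8×10^13 is the fitted constant Kaplan et al. (2020) report for loss on their WebText-like training distribution, so the predicted values are not directly comparable to the PTB numbers above:

```python
import math

# Power law L(N) = (N_c / N) ** alpha_N for loss in nats per token.
# Constants from Kaplan et al. (2020); treat the outputs as illustrative.
N_C, ALPHA_N = 8.8e13, 0.076

def predicted_loss(n_params: float) -> float:
    """Predicted cross-entropy loss (nats/token) at n_params parameters."""
    return (N_C / n_params) ** ALPHA_N

for n in (1.17e8, 1.75e11):  # roughly GPT-2 117M and GPT-3 175B scale
    loss = predicted_loss(n)
    print(f"N={n:.2e}: loss ≈ {loss:.2f} nats, PPL ≈ {math.exp(loss):.1f}")
```

The key property is monotone, predictable improvement: a 1000× larger model shaves loss by a fixed multiplicative factor, which is what makes loss (and hence perplexity) the natural target for scaling-law fits.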
Limitations of Perplexity as an Evaluation Metric
| Limitation | Explanation |
|---|---|
| Tokenizer-dependent | Different vocabularies yield different PPL for same underlying model |
| Domain-sensitive | Cannot compare values across datasets with different entropy |
| Imperfectly correlated with generation quality | Low PPL ≠ high-quality, factual generation |
| Contamination-blind | Does not detect if model saw test data during training |
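The tokenizer dependence is easy to see numerically: perplexity normalizes per token, so the same text with the same total sequence probability yields different PPL under different token counts (a toy sketch, not a real tokenizer):

```python
# Toy illustration: identical text, identical total probability, but a
# finer-grained tokenizer (more tokens) reports a lower perplexity.
seq_prob = 1e-12            # total probability the model assigns the text
for n_tokens in (10, 20):   # e.g., word-level vs subword token counts
    ppl = seq_prob ** (-1 / n_tokens)
    print(f"{n_tokens} tokens -> PPL {ppl:.1f}")  # 15.8, then 4.0
```

Neither number is "wrong"; they are simply not comparable, which is why PPL comparisons require identical tokenization.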
Related Pages
See next-token-prediction for the training objective that perplexity directly measures, scaling-laws for how loss (and hence perplexity) shrinks predictably with N and D, and hallucination-mechanisms for why low perplexity does not prevent factual errors.
Sources
- Radford et al. (2019) — Language Models are Unsupervised Multitask Learners. OpenAI Blog
- Brown et al. (2020) — Language Models are Few-Shot Learners. NeurIPS 2020
- Merity et al. (2018) — Regularizing and Optimizing LSTM Language Models. ICLR 2018
Frequently Asked Questions
What does a perplexity of 35 mean intuitively?
Perplexity is the geometric mean of the reciprocal of predicted probabilities per token. PPL=35 means the model is, on average, as uncertain as choosing uniformly among 35 equally probable options. It assigns an average probability of 1/35 ≈ 2.9% to the correct next token. Perfect prediction (correct token always at probability 1.0) gives PPL=1. A 4-gram language model achieves ~141 PPL on Penn Treebank; GPT-3 achieves ~20.5 PPL — GPT-3 is roughly 7× more certain per token than a 4-gram model.
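The geometric-mean reading can be checked directly (hypothetical per-token probabilities, for illustration only):

```python
import math

probs = [0.5, 0.1, 0.04, 0.2]  # hypothetical P(w_i | w_<i) for four tokens
n = len(probs)

# Geometric mean of the reciprocal probabilities ...
geo_mean = math.prod(1 / p for p in probs) ** (1 / n)
# ... equals exp of the average negative log-probability (the PPL formula).
ppl = math.exp(-sum(math.log(p) for p in probs) / n)

assert abs(geo_mean - ppl) < 1e-9
print(round(ppl, 2))  # 7.07
```

So PPL ≈ 7.07 here: on average the model is as uncertain as a uniform choice among about 7 options, even though no single token was assigned probability 1/7.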
Why can you not compare perplexity numbers across different test sets?
Perplexity depends on the entropy of the test data itself. A PPL of 35 on Penn Treebank (formal newspaper text, low entropy) is not comparable to PPL=35 on WikiText-103 (diverse Wikipedia text, higher entropy). A model might score 20 PPL on clean news and 80 PPL on code-mixed social media while being strictly better on both domains. Meaningful comparison requires both models to use identical tokenizers (vocabulary differences change PPL directly) and the exact same test split.