Perplexity: Information-Theoretic Measure of Language Model Prediction Quality

Category: evaluation Updated: 2026-02-27

PPL(W) = exp(−(1/N) Σ log P(w_i|w_{<i})) = exp(cross-entropy per token); GPT-2 117M zero-shot: 35.1 PPL on Penn Treebank (Radford et al., 2019); GPT-3 175B zero-shot: 20.5 PPL (Brown et al., 2020); 4-gram KN baseline: 141.2 PPL; human-level estimated ~10–20 PPL.

Key Data Points
| Measure | Value | Unit | Notes |
| --- | --- | --- | --- |
| Perplexity formula | PPL(W) = exp(−(1/N) Σ log P(w_i \| w_{<i})) | — | Equivalent to exp(CE), where CE is cross-entropy per token in nats; lower PPL = better model |
| GPT-2 117M on Penn Treebank (zero-shot) | 35.1 | perplexity | Radford et al. (2019): zero-shot on PTB; prior supervised LSTM best was ~78 PPL |
| GPT-3 175B on Penn Treebank (zero-shot) | 20.5 | perplexity | Brown et al. (2020): zero-shot; 4-gram KN LM baseline is 141.2 PPL on PTB |
| Perplexity branching-factor intuition | PPL ≈ effective vocabulary size at each step | — | PPL = 35 means the model is as uncertain as choosing uniformly among 35 equiprobable tokens |
| Human perplexity estimate (English) | ~10–20 | perplexity | Domain-dependent: formal news text ~10 PPL; diverse web text ~25 PPL for strong models |

Perplexity is the canonical intrinsic evaluation metric for language models, measuring how well a model predicts held-out text. It is the exponential of the average cross-entropy loss per token — an information-theoretic quantity expressing the model’s average surprise at each observed token.

The Formula

Given a token sequence W = (w_1, …, w_N) and an autoregressive model P:

PPL(W) = exp(−(1/N) Σ_{i=1}^{N} log P(w_i | w_1, …, w_{i-1}))

Equivalently: PPL(W) = exp(H) where H is the cross-entropy per token in nats.

In bits: PPL(W) = 2^{H_bits} where H_bits is cross-entropy per token in bits.
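
As a minimal sketch (plain Python, no modeling libraries), perplexity can be computed directly from per-token log probabilities; the `token_logprobs` values below are hypothetical stand-ins for whatever log P(w_i | w_{<i}) a model assigns to held-out tokens:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities log P(w_i | w_<i)."""
    n = len(token_logprobs)
    cross_entropy_nats = -sum(token_logprobs) / n  # average negative log-likelihood
    return math.exp(cross_entropy_nats)

# Hypothetical example: three tokens predicted with probabilities 0.2, 0.05, 0.1
logps = [math.log(0.2), math.log(0.05), math.log(0.1)]
print(perplexity(logps))  # ≈ 10.0, the geometric mean of 1/0.2, 1/0.05, 1/0.1
```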

Historical Progress on Penn Treebank

| Model | PPL (PTB) | Year | Notes |
| --- | --- | --- | --- |
| 4-gram Kneser-Ney | 141.2 | Pre-2010 | N-gram with smoothing |
| LSTM LM | ~78 | 2013 | Basic LSTM language model |
| AWD-LSTM | 57.3 | 2017 | Merity et al. (2018); strong supervised baseline |
| Transformer-XL | 21.8 | 2019 | Extended-context transformer |
| GPT-2 117M (zero-shot) | 35.1 | 2019 | Radford et al.; no PTB training |
| GPT-3 175B (zero-shot) | 20.5 | 2020 | Brown et al.; matches supervised SOTA |

GPT-3’s zero-shot perplexity of 20.5 far surpasses the supervised AWD-LSTM (57.3) and edges out the supervised Transformer-XL (21.8), despite GPT-3 never being trained explicitly on PTB.

Cross-Entropy and Perplexity

| Metric | Formula | Typical values |
| --- | --- | --- |
| Cross-entropy (nats) | −(1/N) Σ log_e P(w_i) | 2.0–4.5 |
| Cross-entropy (bits) | −(1/N) Σ log₂ P(w_i) | 2.9–6.5 |
| Perplexity | exp(CE in nats) | 7–100+ |

A cross-entropy of 3.5 nats corresponds to PPL ≈ 33; a CE of 3.0 nats corresponds to PPL ≈ 20.
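
These conversions are just the identities PPL = exp(CE_nats) = 2^{CE_bits}; a quick check in Python (the input values are illustrative, not measurements):

```python
import math

def ppl_from_nats(ce_nats):
    """Perplexity from cross-entropy measured in nats per token."""
    return math.exp(ce_nats)

def nats_to_bits(ce_nats):
    """Convert cross-entropy from nats to bits."""
    return ce_nats / math.log(2)

print(ppl_from_nats(3.5))      # ≈ 33.1
print(ppl_from_nats(3.0))      # ≈ 20.1
print(2 ** nats_to_bits(3.0))  # ≈ 20.1, same value via the bits route
```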

Scaling Laws and Perplexity

Perplexity improves smoothly and predictably with scale because the cross-entropy loss it exponentiates follows power laws in model size and training data (see scaling-laws):

  • L(N) ≈ (N_c / N)^{0.076} — loss decreases predictably with parameter count
  • L(D) ≈ (D_c / D)^{0.095} — loss decreases predictably with training token count

Since PPL = exp(L), this predictability carries over directly, which is why cross-entropy loss (and hence perplexity) is used as the primary objective in scaling-law research.
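
A small sketch of what those power laws imply for perplexity. The constants below are in the style of Kaplan et al. (2020) and should be read as assumptions for illustration; the predicted loss applies to a model's own held-out distribution, not to a specific benchmark such as PTB:

```python
import math

# Assumed constants in the style of Kaplan et al. (2020); treat as illustrative.
ALPHA_N, N_C = 0.076, 8.8e13  # parameter-count exponent and critical scale
ALPHA_D, D_C = 0.095, 5.4e13  # token-count exponent and critical scale

def loss_from_params(n_params):
    """Predicted cross-entropy (nats/token) from non-embedding parameter count."""
    return (N_C / n_params) ** ALPHA_N

def loss_from_tokens(n_tokens):
    """Predicted cross-entropy (nats/token) from training-token count."""
    return (D_C / n_tokens) ** ALPHA_D

for n_params in (1.17e8, 1.5e9, 1.75e11):  # roughly GPT-2 117M, GPT-2 1.5B, GPT-3 scale
    ce = loss_from_params(n_params)
    print(f"N={n_params:.2e}  CE≈{ce:.2f} nats  PPL≈{math.exp(ce):.1f}")

print(f"D=3.00e+11 tokens  CE≈{loss_from_tokens(3.0e11):.2f} nats")
```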

Limitations of Perplexity as an Evaluation Metric

| Limitation | Explanation |
| --- | --- |
| Tokenizer-dependent | Different vocabularies yield different PPL for the same underlying model |
| Domain-sensitive | Values cannot be compared across datasets with different entropy |
| Imperfect proxy for generation quality | Low PPL does not guarantee high-quality, factual generation |
| Contamination-blind | Does not detect whether the model saw the test data during training |
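
The tokenizer-dependence row has a standard workaround: normalize the total negative log-likelihood by character (or byte) count rather than token count, giving bits per character. A sketch with hypothetical numbers, showing two tokenizations of the same text that disagree on per-token PPL but agree on bits per character:

```python
import math

def bits_per_char(total_nll_nats, num_chars):
    """Tokenizer-independent score: total negative log-likelihood per character, in bits."""
    return total_nll_nats / (num_chars * math.log(2))

# Hypothetical: the same 200-character text scored under two tokenizers.
# Tokenizer A: 60 tokens at 3.0 nats/token -> per-token PPL ≈ 20.1, total NLL 180 nats
# Tokenizer B: 45 tokens at 4.0 nats/token -> per-token PPL ≈ 54.6, total NLL 180 nats
print(bits_per_char(60 * 3.0, 200))  # ≈ 1.30 bpc
print(bits_per_char(45 * 4.0, 200))  # ≈ 1.30 bpc: identical despite different per-token PPLs
```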

See next-token-prediction for the training objective that perplexity directly measures, scaling-laws for how perplexity decreases predictably with N and D, and hallucination-mechanisms for why low perplexity does not prevent factual errors.


Frequently Asked Questions

What does a perplexity of 35 mean intuitively?

Perplexity is the geometric mean of the reciprocals of the probabilities the model assigns to each token. PPL=35 means the model is, on average, as uncertain as choosing uniformly among 35 equally probable options. It assigns an average probability of 1/35 ≈ 2.9% to the correct next token. Perfect prediction (correct token always at probability 1.0) gives PPL=1. A 4-gram language model achieves ~141 PPL on Penn Treebank; GPT-3 achieves ~20.5 PPL — GPT-3 is roughly 7× more certain per token than a 4-gram model.
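
A quick arithmetic check of the claims above (the geometric-mean reading and the rough 7× figure):

```python
import math

# If every token were predicted at probability 1/35, perplexity is exactly 35.
probs = [1 / 35] * 100
ppl = math.exp(-sum(math.log(p) for p in probs) / len(probs))
print(ppl)           # 35.0: average per-token probability is 1/35 ≈ 2.9%
print(141.2 / 20.5)  # ≈ 6.9: the "roughly 7×" gap between the 4-gram LM and GPT-3 on PTB
```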

Why can you not compare perplexity numbers across different test sets?

Perplexity depends on the entropy of the test data itself. A PPL of 35 on Penn Treebank (formal newspaper text, low entropy) is not comparable to PPL=35 on WikiText-103 (diverse Wikipedia text, higher entropy). A model might score 20 PPL on clean news and 80 PPL on code-mixed social media while still being strictly better than a competing model on both domains. Meaningful comparison requires both models to use identical tokenizers (vocabulary differences change PPL directly) and the exact same test split.
