Perplexity: Information-Theoretic Measure of Language Model Prediction Quality
PPL(W) = exp(−(1/N) Σ log P(w_i|w_{<i})) = exp(cross-entropy per token); GPT-2 117M zero-shot: 35.1 PPL on Penn Treebank (Radford et al., 2019); GPT-3 175B zero-shot: 20.5 PPL (Brown et al., 2020); 4-gram KN baseline: 141.2 PPL; human-level estimated ~10–20 PPL.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Perplexity formula | PPL(W) = exp(−(1/N) Σ log P(w_i|w_{<i})) | — | Equivalent to exp(CE), where CE is cross-entropy per token in nats; lower PPL = better model |
| GPT-2 117M on Penn Treebank (zero-shot) | 35.1 | perplexity | Radford et al. (2019): zero-shot on PTB; prior supervised LSTM best was ~78 PPL |
| GPT-3 175B on Penn Treebank (zero-shot) | 20.5 | perplexity | Brown et al. (2020): zero-shot; 4-gram KN LM baseline is 141.2 PPL on PTB |
| Perplexity branching factor intuition | PPL ≈ effective vocabulary size at each step | — | PPL=35 means the model is as uncertain as choosing uniformly among 35 equiprobable tokens |
| Human perplexity estimate (English) | ~10–20 | perplexity | Domain-dependent: ~10 PPL on formal news text; for comparison, strong models reach ~25 PPL on diverse web text |
Perplexity is the canonical intrinsic evaluation metric for language models, measuring how well a model predicts held-out text. It is the exponential of the average cross-entropy loss per token — an information-theoretic quantity expressing the model’s average surprise at each observed token.
The Formula
Given a token sequence W = (w_1, …, w_N) and an autoregressive model P:
PPL(W) = exp(−(1/N) Σ_{i=1}^{N} log P(w_i | w_1, …, w_{i-1}))
Equivalently: PPL(W) = exp(H) where H is the cross-entropy per token in nats.
In bits: PPL(W) = 2^{H_bits} where H_bits is cross-entropy per token in bits.
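As a concrete sketch (plain Python, no model dependencies), perplexity can be computed directly from per-token probabilities, and the nats/bits forms agree:

```python
import math

def perplexity(token_probs):
    """Perplexity from per-token probabilities P(w_i | w_<i)."""
    n = len(token_probs)
    ce_nats = -sum(math.log(p) for p in token_probs) / n  # cross-entropy in nats
    return math.exp(ce_nats)

# A model that assigns probability 1/35 to every token has PPL = 35.
probs = [1 / 35] * 10
print(perplexity(probs))  # 35.0 up to float rounding

# Equivalence of the two forms: exp(CE in nats) == 2 ** (CE in bits)
ce_nats = -sum(math.log(p) for p in probs) / len(probs)
ce_bits = -sum(math.log2(p) for p in probs) / len(probs)
assert abs(math.exp(ce_nats) - 2 ** ce_bits) < 1e-9
```

A perfect model (probability 1.0 on every observed token) gives `perplexity([1.0] * n) == 1.0`, the metric's floor.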
Historical Progress on Penn Treebank
| Model | PPL (PTB) | Year | Notes |
|---|---|---|---|
| 4-gram Kneser-Ney | 141.2 | Pre-2010 | N-gram with smoothing |
| LSTM LM | ~78 | 2013 | Basic LSTM language model |
| AWD-LSTM | 57.3 | 2017 | Merity et al. (2018); strong supervised baseline |
| Transformer-XL | 21.8 | 2019 | Extended context transformer |
| GPT-2 117M (zero-shot) | 35.1 | 2019 | Radford et al.; no PTB training |
| GPT-3 175B (zero-shot) | 20.5 | 2020 | Brown et al.; matches supervised SOTA |
GPT-3’s zero-shot perplexity (20.5) beats not only the strong supervised AWD-LSTM baseline (57.3) but also supervised Transformer-XL (21.8), despite never being trained explicitly on PTB.
Cross-Entropy and Perplexity
| Metric | Formula | Typical Values |
|---|---|---|
| Cross-entropy (nats) | −(1/N) Σ log_e P(w_i) | 2.0–4.5 |
| Cross-entropy (bits) | −(1/N) Σ log₂ P(w_i) | 1.5–6.5 |
| Perplexity | exp(CE in nats) | 7–100+ |
A cross-entropy of 3.5 nats corresponds to PPL ≈ 33; a CE of 3.0 nats corresponds to PPL ≈ 20.
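The conversion is a one-liner; a minimal check of the figures above:

```python
import math

def ppl_from_ce(ce_nats: float) -> float:
    """Perplexity from cross-entropy per token, measured in nats."""
    return math.exp(ce_nats)

print(round(ppl_from_ce(3.5), 1))  # 33.1
print(round(ppl_from_ce(3.0), 1))  # 20.1
```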
Scaling Laws and Perplexity
Cross-entropy loss follows smooth power-law scaling with model size and training data (Kaplan et al., 2020; see scaling-laws), and perplexity inherits this via PPL = exp(L):
- L(N) ≈ (N_c / N)^{0.076} — loss decreases predictably with non-embedding parameter count
- L(D) ≈ (D_c / D)^{0.095} — loss decreases predictably with training token count
This predictability is why perplexity is used as the primary objective for scaling law research.
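As an illustrative sketch, the parameter power law can be evaluated numerically. N_c ≈ 8.8×10^13 is the fitted constant Kaplan et al. (2020) report for loss on their WebText-like training distribution, so the predicted values are not directly comparable to the PTB numbers above:

```python
import math

# Power law L(N) = (N_c / N) ** alpha_N for loss in nats per token.
# Constants from Kaplan et al. (2020); treat the outputs as illustrative.
N_C, ALPHA_N = 8.8e13, 0.076

def predicted_loss(n_params: float) -> float:
    """Predicted cross-entropy loss (nats/token) at n_params parameters."""
    return (N_C / n_params) ** ALPHA_N

for n in (1.17e8, 1.75e11):  # roughly GPT-2 117M and GPT-3 175B scale
    loss = predicted_loss(n)
    print(f"N={n:.2e}: loss ≈ {loss:.2f} nats, PPL ≈ {math.exp(loss):.1f}")
```

The key property is monotone, predictable improvement: a 1000× larger model shaves loss by a fixed multiplicative factor, which is what makes loss (and hence perplexity) the natural target for scaling-law fits.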
Limitations of Perplexity as an Evaluation Metric
| Limitation | Explanation |
|---|---|
| Tokenizer-dependent | Different vocabularies yield different PPL for same underlying model |
| Domain-sensitive | Cannot compare values across datasets with different entropy |
| Imperfectly correlated with generation quality | Low PPL ≠ high-quality, factual generation |
| Contamination-blind | Does not detect if model saw test data during training |
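The tokenizer dependence is easy to see numerically: perplexity normalizes per token, so the same text with the same total sequence probability yields different PPL under different token counts (a toy sketch, not a real tokenizer):

```python
# Toy illustration: identical text, identical total probability, but a
# finer-grained tokenizer (more tokens) reports a lower perplexity.
seq_prob = 1e-12            # total probability the model assigns the text
for n_tokens in (10, 20):   # e.g., word-level vs subword token counts
    ppl = seq_prob ** (-1 / n_tokens)
    print(f"{n_tokens} tokens -> PPL {ppl:.1f}")  # 15.8, then 4.0
```

Neither number is "wrong"; they are simply not comparable, which is why PPL comparisons require identical tokenization.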
Related Pages
See next-token-prediction for the training objective that perplexity directly measures, scaling-laws for how loss (and hence perplexity) shrinks predictably with N and D, and hallucination-mechanisms for why low perplexity does not prevent factual errors.
Sources
- Radford et al. (2019) — Language Models are Unsupervised Multitask Learners. OpenAI Blog
- Brown et al. (2020) — Language Models are Few-Shot Learners. NeurIPS 2020
- Merity et al. (2018) — Regularizing and Optimizing LSTM Language Models. ICLR 2018
Frequently Asked Questions
What does a perplexity of 35 mean intuitively?
Perplexity is the geometric mean of the reciprocal of predicted probabilities per token. PPL=35 means the model is, on average, as uncertain as choosing uniformly among 35 equally probable options. It assigns an average probability of 1/35 ≈ 2.9% to the correct next token. Perfect prediction (correct token always at probability 1.0) gives PPL=1. A 4-gram language model achieves ~141 PPL on Penn Treebank; GPT-3 achieves ~20.5 PPL — GPT-3 is roughly 7× more certain per token than a 4-gram model.
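The geometric-mean reading can be checked directly (hypothetical per-token probabilities, for illustration only):

```python
import math

probs = [0.5, 0.1, 0.04, 0.2]  # hypothetical P(w_i | w_<i) for four tokens
n = len(probs)

# Geometric mean of the reciprocal probabilities ...
geo_mean = math.prod(1 / p for p in probs) ** (1 / n)
# ... equals exp of the average negative log-probability (the PPL formula).
ppl = math.exp(-sum(math.log(p) for p in probs) / n)

assert abs(geo_mean - ppl) < 1e-9
print(round(ppl, 2))  # 7.07
```

So PPL ≈ 7.07 here: on average the model is as uncertain as a uniform choice among about 7 options, even though no single token was assigned probability 1/7.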
Why can you not compare perplexity numbers across different test sets?
Perplexity depends on the entropy of the test data itself. A PPL of 35 on Penn Treebank (formal newspaper text, low entropy) is not comparable to PPL=35 on WikiText-103 (diverse Wikipedia text, higher entropy). A model might score 20 PPL on clean news and 80 PPL on code-mixed social media while being strictly better on both domains. Meaningful comparison requires both models to use identical tokenizers (vocabulary differences change PPL directly) and the exact same test split.