Top-p (Nucleus) Sampling: Adaptive Vocabulary Truncation for Language Model Decoding
Nucleus (top-p) sampling: select the smallest V' ⊆ V such that Σ_{w∈V'} p(w|context) ≥ p, renormalize, and sample; Holtzman et al. (ICLR 2020) showed that nucleus sampling produces text rated as more natural and coherent by human evaluators than top-k, temperature-only, or greedy decoding.
| Measure | Value | Notes |
|---|---|---|
| Recommended nucleus probability | p = 0.9–0.95 | p = 0.9 retains the top 90% of probability mass; widely adopted default for natural text generation |
| Dynamic vocabulary size | 1 to ~50,000 tokens per step | Peaked distributions → few tokens included; flat distributions → many tokens included |
| Human evaluation ranking | top-p > top-k > temperature-only > greedy | Holtzman et al. (2020): nucleus sampling most preferred by human raters in story generation |
| Renormalization formula | p̃_i = p_i / Σ_{j∈nucleus} p_j | After truncation to the nucleus, probabilities are rescaled to sum to 1 before sampling |
Nucleus sampling (top-p sampling), introduced by Holtzman et al. (2020), selects a context-dependent subset of the vocabulary before sampling. Unlike top-k sampling, which always selects a fixed number of candidates, nucleus sampling adapts to the shape of the predicted distribution at each decoding step.
The Algorithm
Given predicted logits z and temperature T:
- Compute probabilities: p_i = exp(z_i / T) / Σ exp(z_j / T)
- Sort tokens by probability (descending): p_{(1)} ≥ p_{(2)} ≥ … ≥ p_{(V)}
- Find nucleus: smallest prefix N of the sorted tokens such that Σ_{i∈N} p_{(i)} ≥ p (typically p = 0.9)
- Renormalize: p̃_i = p_i / Σ_{j∈N} p_j for i ∈ N; zero for all others
- Sample token from renormalized distribution p̃
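The steps above can be sketched in NumPy; this is a minimal illustration, and the function name `nucleus_sample` and its signature are mine, not from the paper:

```python
import numpy as np

def nucleus_sample(logits, p=0.9, temperature=1.0, rng=None):
    """Sample one token index via top-p (nucleus) truncation."""
    rng = rng or np.random.default_rng()
    # Step 1: temperature-scaled softmax (max-shifted for numerical stability).
    z = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    # Step 2: sort descending; smallest prefix with cumulative mass >= p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    nucleus = order[: int(np.searchsorted(cumulative, p) + 1)]
    # Step 3: renormalize over the nucleus and sample from it.
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```

On a sharply peaked distribution (e.g. `logits = [10, 0, 0, 0]`) the nucleus collapses to a single token, so the call becomes deterministic.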
Dynamic Vocabulary Adaptation
| Distribution Type | Top-k (k=50) | Nucleus (p=0.9) |
|---|---|---|
| Peaked (one token > 90%) | Includes 49 near-zero tokens | Includes 1–3 tokens |
| Moderate (top token ~30%) | Includes 50 of ~3000 plausible tokens | Includes ~10–20 tokens |
| Flat (uniform over 1000 tokens) | Includes 50 of 1000 | Includes ~900 tokens |
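The adaptivity in the table can be checked numerically. A quick sketch, where the `nucleus_size` helper and the toy distributions are mine:

```python
import numpy as np

def nucleus_size(probs, p=0.9):
    """Tokens in the smallest prefix with cumulative mass >= p."""
    cumulative = np.cumsum(np.sort(probs)[::-1])
    return int(np.searchsorted(cumulative, p) + 1)

# Peaked: one token at 0.95 -> nucleus of a single token.
peaked = np.array([0.95] + [0.05 / 999] * 999)
# Flat: uniform over 1000 tokens -> nucleus of ~900 tokens.
flat = np.full(1000, 1.0 / 1000)

print(nucleus_size(peaked), nucleus_size(flat))
```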
Comparison of Decoding Methods
| Method | Vocabulary Size | Determinism | Human Rating |
|---|---|---|---|
| Greedy (T→0) | 1 | Fully deterministic | Worst (repetitive) |
| Top-k (k=50) | Fixed k per step | Stochastic | Good |
| Top-p (p=0.9) | 1 to ~50K | Stochastic | Best (Holtzman et al.) |
| Beam search (k=5) | Full, 5 hypotheses | Near-deterministic | Good for MT, poor for story |
The “Unreliable Tail” Problem
Without truncation, language models assign non-negligible probability to thousands of semantically inappropriate tokens. At each step, the probability mass is distributed as:
- Reliable nucleus (~90%): tokens the model genuinely considers plausible
- Unreliable tail (~10%): low-probability tokens that are semantically inconsistent
Sampling from the tail even occasionally degrades coherence over many decoding steps, since errors compound across the sequence. Top-p removes the tail while respecting the model’s confidence in each specific context.
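To illustrate how few tokens carry the reliable mass, here is a sketch on a synthetic Zipf-like distribution over a 50,000-token vocabulary (the distribution is an assumption for illustration, not the output of any particular model):

```python
import numpy as np

# Synthetic Zipf-like distribution over a 50,000-token vocabulary.
ranks = np.arange(1, 50_001)
probs = (1.0 / ranks) / (1.0 / ranks).sum()  # already sorted descending

cumulative = np.cumsum(probs)
nucleus_count = int(np.searchsorted(cumulative, 0.9) + 1)
tail_count = len(probs) - nucleus_count
tail_mass = probs[nucleus_count:].sum()

# The tail holds far more tokens than the nucleus, yet only ~10% of the mass.
print(nucleus_count, tail_count, round(float(tail_mass), 3))
```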
Combining Temperature and Nucleus Sampling
In practice, temperature is applied before nucleus selection:
- Scale logits: z̃_i = z_i / T
- Compute softmax: p = softmax(z̃)
- Select nucleus at cumulative threshold p
- Renormalize and sample
Lower T narrows the distribution before nucleus selection, yielding a smaller nucleus. Higher T widens it, yielding a larger nucleus.
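A small check of this interaction, using random logits as a stand-in for a real model's output:

```python
import numpy as np

def nucleus_size(logits, p=0.9, temperature=1.0):
    """Tokens in the smallest prefix with cumulative mass >= p after scaling by T."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    cumulative = np.cumsum(np.sort(probs)[::-1])
    return int(np.searchsorted(cumulative, p) + 1)

logits = np.random.default_rng(0).normal(size=1000)  # stand-in logits
for T in (0.5, 1.0, 2.0):
    print(f"T={T}: nucleus size = {nucleus_size(logits, temperature=T)}")
```

Higher temperatures flatten the softmax, so the printed nucleus sizes grow with T.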
Related Pages
See temperature-sampling for the logit scaling step applied before nucleus selection, beam-search for deterministic multi-hypothesis search, and autoregressive-decoding for how these methods fit into the generation loop.
Sources
- Holtzman et al. (2020) — The Curious Case of Neural Text Degeneration. ICLR 2020
- Fan et al. (2018) — Hierarchical Neural Story Generation. ACL 2018
Frequently Asked Questions
Why does nucleus sampling outperform top-k sampling?
Top-k always samples from exactly k tokens regardless of distribution shape. If the distribution is very peaked (one token has 99% probability), top-k with k=50 wastes probability mass on 49 near-zero tokens. If the distribution is very flat, top-k may exclude many reasonable candidates. Nucleus sampling adapts: peaked distributions yield a small nucleus; flat distributions yield a larger one. This matches the model's actual uncertainty rather than imposing a fixed candidate count.
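A direct comparison of candidate counts on a peaked distribution (toy numbers, not model outputs):

```python
import numpy as np

# Peaked toy distribution: one token at 0.99, 999 tokens share the rest.
probs = np.array([0.99] + [0.01 / 999] * 999)

k_candidates = 50  # top-k (k=50): always exactly 50 candidates
cumulative = np.cumsum(np.sort(probs)[::-1])
p_candidates = int(np.searchsorted(cumulative, 0.9) + 1)

print(k_candidates, p_candidates)  # 50 1
```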
What does setting p=1.0 do in nucleus sampling?
p=1.0 includes all tokens in the vocabulary (the smallest set with cumulative probability ≥ 1.0 is the entire vocabulary). This degenerates to standard temperature sampling with no truncation, including all low-probability 'tail' tokens. Holtzman et al. (2020) identified sampling from the unreliable tail as the primary source of degenerate text — the nucleus specifically excludes this tail, which is why values p < 1.0 produce better outputs.
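A tiny check of the p=1.0 edge case on a toy distribution (dyadic probabilities chosen so the cumulative sums are exact in floating point; real implementations should guard against rounding drift here):

```python
import numpy as np

probs = np.array([0.5, 0.25, 0.125, 0.125])  # toy distribution, sorted descending
cumulative = np.cumsum(probs)                # [0.5, 0.75, 0.875, 1.0]
# Smallest prefix with cumulative mass >= 1.0 is the entire vocabulary:
size = int(np.searchsorted(cumulative, 1.0) + 1)
print(size)  # 4 -- every token is in the nucleus when p = 1.0
```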