Top-p (Nucleus) Sampling: Adaptive Vocabulary Truncation for Language Model Decoding

Category: inference Updated: 2026-02-27

Nucleus (top-p) sampling selects the smallest set V′ ⊆ V such that Σ_{w∈V′} p(w|context) ≥ p, renormalizes over that set, and samples from it. Holtzman et al. (ICLR 2020) showed that top-p = 0.9 produces text more preferred by humans than top-k, temperature-only, or greedy decoding across all evaluated metrics.

Key Data Points
| Measure | Value | Unit | Notes |
| --- | --- | --- | --- |
| Recommended nucleus probability | p = 0.9–0.95 | — | p = 0.9 retains the top 90% of probability mass; widely adopted default for natural text generation |
| Dynamic vocabulary size range | 1 to ~50,000 per step | tokens | Peaked distributions → few tokens included; flat distributions → many tokens included |
| Human evaluation ranking | top-p > top-k > temperature-only > greedy | — | Holtzman et al. (2020): nucleus sampling most preferred by human raters across all story generation tasks |
| Renormalization formula | p̃_i = p_i / Σ_{j∈nucleus} p_j | — | After truncation to the nucleus, probabilities are rescaled to sum to 1.0 before sampling |

Nucleus sampling (top-p sampling), introduced by Holtzman et al. (2020), selects a context-dependent subset of the vocabulary before sampling. Unlike top-k sampling which always selects a fixed number of candidates, nucleus sampling adapts to the predicted distribution’s shape at each decoding step.

The Algorithm

Given predicted logits z and temperature T:

  1. Compute probabilities: p_i = exp(z_i / T) / Σ exp(z_j / T)
  2. Sort tokens by probability (descending): p_{(1)} ≥ p_{(2)} ≥ … ≥ p_{(V)}
  3. Find nucleus: smallest set N such that Σ_{i∈N} p_{(i)} ≥ p (typically p=0.9)
  4. Renormalize: p̃_i = p_i / Σ_{j∈N} p_j for i ∈ N; zero for all others
  5. Sample token from renormalized distribution p̃
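The five steps above can be sketched in a few lines of plain Python. This is an illustrative implementation, not a library API; the function name `nucleus_sample` and its signature are this page's own naming:

```python
import math
import random

def nucleus_sample(logits, p=0.9, temperature=1.0, rng=random):
    """Top-p (nucleus) sampling: softmax, sort, truncate, renormalize, sample."""
    # 1. Temperature-scaled softmax (max-subtracted for numerical stability)
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 2. Sort token indices by probability, descending
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # 3. Nucleus: smallest prefix whose cumulative probability reaches p
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= p:
            break

    # 4. Renormalize over the nucleus so the kept probabilities sum to 1.0
    mass = sum(probs[i] for i in nucleus)
    renorm = [probs[i] / mass for i in nucleus]

    # 5. Sample a token index from the truncated, renormalized distribution
    return rng.choices(nucleus, weights=renorm, k=1)[0]
```

For a sharply peaked distribution the nucleus collapses to a single token, so the call becomes deterministic: `nucleus_sample([10.0, 0.0, 0.0, 0.0], p=0.9)` always returns index 0.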

Dynamic Vocabulary Adaptation

| Distribution Type | Top-k (k=50) | Nucleus (p=0.9) |
| --- | --- | --- |
| Peaked (one token > 90%) | Includes 49 near-zero tokens | Includes 1–3 tokens |
| Moderate (top token ~30%) | Includes 50 of ~3,000 plausible | Includes ~10–20 tokens |
| Flat (uniform over 1,000 tokens) | Includes 50 of 1,000 | Includes ~900 tokens |
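The peaked and flat regimes can be reproduced with a toy helper. This is a sketch; `nucleus_size` is this page's own naming, not a library function:

```python
def nucleus_size(probs, p=0.9):
    """Size of the smallest descending-sorted prefix with cumulative mass >= p."""
    cum, n = 0.0, 0
    for q in sorted(probs, reverse=True):
        n += 1
        cum += q
        if cum >= p:
            break
    return n

# Peaked: one token at 95%, the remaining 5% spread over 999 tokens
peaked = [0.95] + [0.05 / 999] * 999
# Flat: uniform over 1,000 tokens
flat = [1.0 / 1000] * 1000

print(nucleus_size(peaked))  # 1 token already covers p = 0.9
print(nucleus_size(flat))    # ~900 tokens needed to reach p = 0.9
```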

Comparison of Decoding Methods

| Method | Vocabulary Size | Determinism | Human Rating |
| --- | --- | --- | --- |
| Greedy (T→0) | 1 | Fully deterministic | Worst (repetitive) |
| Top-k (k=50) | Fixed k per step | Stochastic | Good |
| Top-p (p=0.9) | 1 to ~50K | Stochastic | Best (Holtzman et al.) |
| Beam search (k=5) | Full vocabulary, 5 hypotheses | Near-deterministic | Good for MT, poor for stories |

The “Unreliable Tail” Problem

Without truncation, language models assign non-negligible probability to thousands of semantically inappropriate tokens. At each step, the probability mass is distributed as:

  • Reliable nucleus (~90%): tokens the model genuinely considers plausible
  • Unreliable tail (~10%): low-probability tokens that are semantically inconsistent

Sampling from the tail even occasionally degrades coherence over many decoding steps, since errors compound across the sequence. Top-p removes the tail while respecting the model’s confidence in each specific context.
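As a sanity check, truncating a distribution of this shape at p = 0.9 removes the entire long tail in one cut, however many tokens it is spread across. A minimal sketch (`tail_mass` is an illustrative helper, not established terminology):

```python
def tail_mass(probs, p=0.9):
    """Probability mass left outside the nucleus after top-p truncation."""
    cum = 0.0
    for q in sorted(probs, reverse=True):
        cum += q
        if cum >= p:
            break
    return 1.0 - cum

# A few plausible tokens, plus a long tail of individually negligible ones
probs = [0.55, 0.25, 0.12] + [0.08 / 800] * 800
print(tail_mass(probs))  # ~0.08: all 800 tail tokens are excluded together
```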

Combining Temperature and Nucleus Sampling

In practice, temperature is applied before nucleus selection:

  1. Scale logits: z̃_i = z_i / T
  2. Compute softmax: p_i = softmax(z̃_i)
  3. Select nucleus at cumulative threshold p
  4. Renormalize and sample

Lower T narrows the distribution before nucleus selection, yielding a smaller nucleus. Higher T widens it, yielding a larger nucleus.
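This interaction is easy to verify numerically: the same logits yield a smaller nucleus at low temperature and a larger one at high temperature. A sketch, where `softmax` and `nucleus_size` are illustrative helper names:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax (max-subtracted for numerical stability)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, p=0.9):
    """Size of the smallest descending-sorted prefix with cumulative mass >= p."""
    cum, n = 0.0, 0
    for q in sorted(probs, reverse=True):
        n += 1
        cum += q
        if cum >= p:
            break
    return n

logits = [3.0, 2.0, 1.0, 0.0, -1.0, -2.0]
sizes = [nucleus_size(softmax(logits, T)) for T in (0.5, 1.0, 2.0)]
# Lower T -> smaller nucleus; higher T -> larger nucleus
print(sizes)  # [2, 3, 4] for these logits
```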

See temperature-sampling for the logit scaling step applied before nucleus selection, beam-search for deterministic multi-hypothesis search, and autoregressive-decoding for how these methods fit into the generation loop.


Frequently Asked Questions

Why does nucleus sampling outperform top-k sampling?

Top-k always samples from exactly k tokens regardless of distribution shape. If the distribution is very peaked (one token has 99% probability), top-k with k = 50 still allocates sampling probability to 49 near-zero tokens. If the distribution is very flat, top-k may exclude many reasonable candidates. Nucleus sampling adapts: peaked distributions yield a small nucleus; flat distributions yield a larger one. This matches the model's actual uncertainty rather than imposing a fixed vocabulary size.

What does setting p=1.0 do in nucleus sampling?

p=1.0 includes all tokens in the vocabulary (the smallest set with cumulative probability ≥ 1.0 is the entire vocabulary). This degenerates to standard temperature sampling with no truncation, including all low-probability 'tail' tokens. Holtzman et al. (2020) identified sampling from the unreliable tail as the primary source of degenerate text — the nucleus specifically excludes this tail, which is why values p < 1.0 produce better outputs.
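The degenerate case is easy to check: with p = 1.0 the cumulative threshold is reached no earlier than the last token, so the nucleus is the entire vocabulary. A sketch reusing the illustrative `nucleus_size` helper:

```python
def nucleus_size(probs, p):
    """Size of the smallest descending-sorted prefix with cumulative mass >= p."""
    cum, n = 0.0, 0
    for q in sorted(probs, reverse=True):
        n += 1
        cum += q
        if cum >= p:
            return n
    return n  # threshold never reached early: whole vocabulary kept

vocab_probs = [0.5, 0.25, 0.2, 0.03, 0.02]
print(nucleus_size(vocab_probs, p=1.0))  # 5: every token is in the nucleus
print(nucleus_size(vocab_probs, p=0.9))  # 3: truncation only occurs for p < 1.0
```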
