Top-p (Nucleus) Sampling: Adaptive Vocabulary Truncation for Language Model Decoding
Nucleus (top-p) sampling: select the smallest V' ⊆ V such that Σ_{w∈V'} p(w|context) ≥ p, renormalize, and sample; Holtzman et al. (ICLR 2020) showed that nucleus sampling produces text rated as more natural and coherent by human evaluators than top-k, temperature-only, or greedy decoding.
| Measure | Value | Notes |
|---|---|---|
| Recommended nucleus probability | p = 0.9–0.95 | p = 0.9 retains the top 90% of probability mass; widely adopted default for natural text generation |
| Dynamic vocabulary size | 1 to ~50,000 tokens per step | Peaked distributions → few tokens included; flat distributions → many tokens included |
| Human evaluation ranking | top-p > top-k > temperature-only > greedy | Holtzman et al. (2020): nucleus sampling most preferred by human raters in story generation |
| Renormalization formula | p̃_i = p_i / Σ_{j∈nucleus} p_j | After truncation to the nucleus, probabilities are rescaled to sum to 1 before sampling |
Nucleus sampling (top-p sampling), introduced by Holtzman et al. (2020), selects a context-dependent subset of the vocabulary before sampling. Unlike top-k sampling, which always selects a fixed number of candidates, nucleus sampling adapts to the shape of the predicted distribution at each decoding step.
The Algorithm
Given predicted logits z and temperature T:
- Compute probabilities: p_i = exp(z_i / T) / Σ exp(z_j / T)
- Sort tokens by probability (descending): p_{(1)} ≥ p_{(2)} ≥ … ≥ p_{(V)}
- Find nucleus: smallest prefix N of the sorted tokens such that Σ_{i∈N} p_{(i)} ≥ p (typically p = 0.9)
- Renormalize: p̃_i = p_i / Σ_{j∈N} p_j for i ∈ N; zero for all others
- Sample token from renormalized distribution p̃
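The steps above can be sketched in NumPy; this is a minimal illustration, and the function name `nucleus_sample` and its signature are mine, not from the paper:

```python
import numpy as np

def nucleus_sample(logits, p=0.9, temperature=1.0, rng=None):
    """Sample one token index via top-p (nucleus) truncation."""
    rng = rng or np.random.default_rng()
    # Step 1: temperature-scaled softmax (max-shifted for numerical stability).
    z = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    # Step 2: sort descending; smallest prefix with cumulative mass >= p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    nucleus = order[: int(np.searchsorted(cumulative, p) + 1)]
    # Step 3: renormalize over the nucleus and sample from it.
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```

On a sharply peaked distribution (e.g. `logits = [10, 0, 0, 0]`) the nucleus collapses to a single token, so the call becomes deterministic.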
Dynamic Vocabulary Adaptation
| Distribution Type | Top-k (k=50) | Nucleus (p=0.9) |
|---|---|---|
| Peaked (one token > 90%) | Includes 49 near-zero tokens | Includes 1–3 tokens |
| Moderate (top token ~30%) | Includes 50 of ~3000 plausible tokens | Includes ~10–20 tokens |
| Flat (uniform over 1000 tokens) | Includes 50 of 1000 | Includes ~900 tokens |
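The adaptivity in the table can be checked numerically. A quick sketch, where the `nucleus_size` helper and the toy distributions are mine:

```python
import numpy as np

def nucleus_size(probs, p=0.9):
    """Tokens in the smallest prefix with cumulative mass >= p."""
    cumulative = np.cumsum(np.sort(probs)[::-1])
    return int(np.searchsorted(cumulative, p) + 1)

# Peaked: one token at 0.95 -> nucleus of a single token.
peaked = np.array([0.95] + [0.05 / 999] * 999)
# Flat: uniform over 1000 tokens -> nucleus of ~900 tokens.
flat = np.full(1000, 1.0 / 1000)

print(nucleus_size(peaked), nucleus_size(flat))
```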
Comparison of Decoding Methods
| Method | Vocabulary Size | Determinism | Human Rating |
|---|---|---|---|
| Greedy (T→0) | 1 | Fully deterministic | Worst (repetitive) |
| Top-k (k=50) | Fixed k per step | Stochastic | Good |
| Top-p (p=0.9) | 1 to ~50K | Stochastic | Best (Holtzman et al.) |
| Beam search (k=5) | Full, 5 hypotheses | Near-deterministic | Good for MT, poor for story |
The “Unreliable Tail” Problem
Without truncation, language models assign non-negligible probability to thousands of semantically inappropriate tokens. At each step, the probability mass is distributed as:
- Reliable nucleus (~90%): tokens the model genuinely considers plausible
- Unreliable tail (~10%): low-probability tokens that are semantically inconsistent
Sampling from the tail even occasionally degrades coherence over many decoding steps, since errors compound across the sequence. Top-p removes the tail while respecting the model’s confidence in each specific context.
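To illustrate how few tokens carry the reliable mass, here is a sketch on a synthetic Zipf-like distribution over a 50,000-token vocabulary (the distribution is an assumption for illustration, not the output of any particular model):

```python
import numpy as np

# Synthetic Zipf-like distribution over a 50,000-token vocabulary.
ranks = np.arange(1, 50_001)
probs = (1.0 / ranks) / (1.0 / ranks).sum()  # already sorted descending

cumulative = np.cumsum(probs)
nucleus_count = int(np.searchsorted(cumulative, 0.9) + 1)
tail_count = len(probs) - nucleus_count
tail_mass = probs[nucleus_count:].sum()

# The tail holds far more tokens than the nucleus, yet only ~10% of the mass.
print(nucleus_count, tail_count, round(float(tail_mass), 3))
```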
Combining Temperature and Nucleus Sampling
In practice, temperature is applied before nucleus selection:
- Scale logits: z̃_i = z_i / T
- Compute softmax: p = softmax(z̃)
- Select nucleus at cumulative threshold p
- Renormalize and sample
Lower T narrows the distribution before nucleus selection, yielding a smaller nucleus. Higher T widens it, yielding a larger nucleus.
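A small check of this interaction, using random logits as a stand-in for a real model's output:

```python
import numpy as np

def nucleus_size(logits, p=0.9, temperature=1.0):
    """Tokens in the smallest prefix with cumulative mass >= p after scaling by T."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    cumulative = np.cumsum(np.sort(probs)[::-1])
    return int(np.searchsorted(cumulative, p) + 1)

logits = np.random.default_rng(0).normal(size=1000)  # stand-in logits
for T in (0.5, 1.0, 2.0):
    print(f"T={T}: nucleus size = {nucleus_size(logits, temperature=T)}")
```

Higher temperatures flatten the softmax, so the printed nucleus sizes grow with T.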
Related Pages
See temperature-sampling for the logit scaling step applied before nucleus selection, beam-search for deterministic multi-hypothesis search, and autoregressive-decoding for how these methods fit into the generation loop.
Sources
- Holtzman et al. (2020) — The Curious Case of Neural Text Degeneration. ICLR 2020
- Fan et al. (2018) — Hierarchical Neural Story Generation. ACL 2018
Frequently Asked Questions
Why does nucleus sampling outperform top-k sampling?
Top-k always samples from exactly k tokens regardless of distribution shape. If the distribution is very peaked (one token has 99% probability), top-k with k=50 wastes probability mass on 49 near-zero tokens. If the distribution is very flat, top-k may exclude many reasonable candidates. Nucleus sampling adapts: peaked distributions yield a small nucleus; flat distributions yield a larger one. This matches the model's actual uncertainty rather than imposing a fixed candidate count.
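A direct comparison of candidate counts on a peaked distribution (toy numbers, not model outputs):

```python
import numpy as np

# Peaked toy distribution: one token at 0.99, 999 tokens share the rest.
probs = np.array([0.99] + [0.01 / 999] * 999)

k_candidates = 50  # top-k (k=50): always exactly 50 candidates
cumulative = np.cumsum(np.sort(probs)[::-1])
p_candidates = int(np.searchsorted(cumulative, 0.9) + 1)

print(k_candidates, p_candidates)  # 50 1
```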
What does setting p=1.0 do in nucleus sampling?
p=1.0 includes all tokens in the vocabulary (the smallest set with cumulative probability ≥ 1.0 is the entire vocabulary). This degenerates to standard temperature sampling with no truncation, including all low-probability 'tail' tokens. Holtzman et al. (2020) identified sampling from the unreliable tail as the primary source of degenerate text — the nucleus specifically excludes this tail, which is why values p < 1.0 produce better outputs.
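A tiny check of the p=1.0 edge case on a toy distribution (dyadic probabilities chosen so the cumulative sums are exact in floating point; real implementations should guard against rounding drift here):

```python
import numpy as np

probs = np.array([0.5, 0.25, 0.125, 0.125])  # toy distribution, sorted descending
cumulative = np.cumsum(probs)                # [0.5, 0.75, 0.875, 1.0]
# Smallest prefix with cumulative mass >= 1.0 is the entire vocabulary:
size = int(np.searchsorted(cumulative, 1.0) + 1)
print(size)  # 4 -- every token is in the nucleus when p = 1.0
```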