Beam Search: Approximate Sequence Optimization in Autoregressive Models

Category: inference · Updated: 2026-02-27

Beam search maintains k hypotheses; at each step expands k×|V| continuations, retains top-k by Σ log P(y_t|y_{<t}); Sutskever et al. (NeurIPS 2014) used k=2–12 for seq2seq MT; Holtzman et al. (2020) showed beam-decoded text has lower human preference than nucleus-sampled text for open generation.

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Standard beam width for neural MT | k = 4–5 | beams | Sutskever et al. (2014): k=2 already improves significantly over greedy; k>10 yields diminishing returns |
| Compute cost vs greedy | O(k × \|V\|) per step | — | Each step evaluates k beams × full vocabulary \|V\|; memory scales as k × T × d_model |
| Length penalty (Google NMT) | lp(Y) = (5 + \|Y\|)^α / 6^α, α = 0.6–0.7 | — | Wu et al. (2016): without penalty, beam search favors short sequences; α corrects length bias |
| BLEU improvement from greedy to beam k=5 | +1.5–2.0 | BLEU points | Typical gain on WMT English–German benchmarks; beam k=5 vs k=1 greedy baseline |

Beam search is an approximate search algorithm for finding high-probability output sequences in autoregressive models. Rather than exhaustively enumerating all |V|^T possible sequences (exponential in length T), it maintains a fixed-size set of k hypotheses at each step, pruning aggressively while typically recovering higher-probability sequences than greedy decoding.

Algorithm

At each decoding step t with k active beams:

  1. For each beam b_i (i=1..k), compute logits over full vocabulary |V|
  2. Generate k × |V| candidate extensions
  3. Score all candidates by cumulative log-probability: score(y_{1:t}) = Σ_{j=1}^{t} log P(y_j | y_{<j}, x)
  4. Retain top-k by score
  5. Continue until all k beams reach EOS token or max length

Final output: highest-scoring completed beam (optionally after length normalization).
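The loop above can be sketched in a few lines of Python. Here `next_log_probs` is a stand-in for a real model's forward pass, and `toy_model` uses made-up probabilities chosen to show the case beam search exists for: greedy commits to the locally best first token, while the best full sequence starts with the second-best one.

```python
import math
from heapq import nlargest

def beam_search(next_log_probs, k=3, eos=0, max_len=10):
    """Minimal beam search sketch (no length normalization).

    next_log_probs(prefix) -> {token: log P(token | prefix)} stands in
    for an autoregressive model's forward pass.
    """
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    completed = []
    for _ in range(max_len):
        # Expand every beam with every vocabulary item, then keep top-k
        candidates = [(seq + (tok,), score + lp)
                      for seq, score in beams
                      for tok, lp in next_log_probs(seq).items()]
        beams = nlargest(k, candidates, key=lambda c: c[1])
        # Retire hypotheses that emitted EOS
        completed += [b for b in beams if b[0][-1] == eos]
        beams = [b for b in beams if b[0][-1] != eos]
        if not beams:
            break
    return max(completed + beams, key=lambda c: c[1])

# Hypothetical toy model: greedy takes token 1 first (p=0.6), but the
# best complete sequence is (2, EOS) with p = 0.4 * 0.9 = 0.36 > 0.3.
def toy_model(prefix):
    table = {
        (): {1: 0.6, 2: 0.4},
        (1,): {0: 0.5, 1: 0.5},
        (2,): {0: 0.9, 1: 0.1},
    }
    probs = table.get(prefix, {0: 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}
```

With k=2 the search returns (2, 0) at probability 0.36; with k=1 (greedy) it ends up on a sequence of probability 0.30, because the first decision is irrevocable.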

Beam Width vs BLEU Score (WMT English-German)

| Beam Width (k) | BLEU Score | Notes |
|---|---|---|
| 1 (greedy) | ~26.0 | Baseline; misses many good sequences |
| 2 | ~27.2 | Large gain for minimal additional cost |
| 5 | ~28.0–28.4 | Standard; most MT systems use k=4–5 |
| 10 | ~28.3–28.5 | Diminishing returns |
| 50 | ~28.3 | Sometimes degrades due to length bias |

Length Normalization

Without length correction, beam search favors short sequences (fewer terms in the cumulative sum). The Google NMT length penalty (Wu et al., 2016):

lp(Y) = (5 + |Y|)^α / 6^α

with α = 0.6–0.7 normalizes scores by sequence length. A coverage penalty and n-gram repetition blocking are further heuristics used in practice.
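A small numeric illustration of the penalty (the hypothesis scores below are made-up): dividing the cumulative log-probability by lp(Y) can flip the ranking between a short and a long hypothesis.

```python
def gnmt_length_penalty(length, alpha=0.6):
    # Wu et al. (2016): lp(Y) = (5 + |Y|)^alpha / 6^alpha
    return (5.0 + length) ** alpha / 6.0 ** alpha

def normalized_score(cum_log_prob, length, alpha=0.6):
    # Dividing by lp(Y) keeps longer hypotheses competitive:
    # raw cumulative log-probs only ever decrease with length.
    return cum_log_prob / gnmt_length_penalty(length, alpha)

# Hypothetical hypotheses as (cumulative log-prob, length) pairs.
# The short one has the higher raw score, the long one wins normalized.
short = (-3.5, 3)
long_ = (-6.5, 20)
```

Note lp(Y) = 1 at |Y| = 1, so the penalty only rescales comparisons between hypotheses of different lengths rather than shifting all scores.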

When Beam Search Works and When It Fails

| Task | Beam Search | Stochastic Sampling |
|---|---|---|
| Machine translation | Effective (k=4–5) | Worse (adds noise to precise translation) |
| Abstractive summarization | Effective (k=4) | Comparable at high quality |
| Open-ended story generation | Degenerate (repetitive) | Preferred (nucleus/temperature) |
| Dialogue response | Dull, generic | More engaging |
| Structured output (SQL, JSON) | Preferred (deterministic) | Risk of invalid structure |

The Text Degeneration Problem

In open-ended generation, beam search tends to collapse into repeated n-gram loops. The failure mechanism: once a high-probability phrase appears, repeating it has high conditional probability, so beam search reinforces the loop. N-gram blocking (forbidding any extension that would repeat an n-gram of length ≥ 3) partially mitigates this, at the cost of imposing a hard constraint on the output.
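A minimal sketch of the blocking check (the function name is ours): before extending a beam, test whether the candidate token would complete an n-gram already present in the hypothesis.

```python
def would_repeat_ngram(seq, candidate, n=3):
    """True if appending `candidate` to `seq` repeats an existing n-gram."""
    if len(seq) < n:  # no full n-gram exists yet, so nothing can repeat
        return False
    # The only new n-gram is the last n-1 tokens plus the candidate
    new_ngram = tuple(seq[-(n - 1):]) + (candidate,)
    seen = {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}
    return new_ngram in seen
```

In practice decoders usually apply this as a logit mask, setting the log-probability of offending tokens to -inf before the top-k selection (e.g. the `no_repeat_ngram_size` option in Hugging Face `transformers`), rather than discarding whole beams after the fact.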

See kv-cache for how past attention states are cached during beam search, temperature-sampling for the stochastic alternative, and autoregressive-decoding for the token-by-token generation framework.


Frequently Asked Questions

Why does beam search fail for open-ended text generation?

Beam search finds sequences with high joint probability under the model. For machine translation, where a close reference exists, maximizing probability aligns with quality. For open-ended generation, high-probability sequences are often generic and repetitive: the model's distribution over plausible continuations is broad, and the single highest-probability path corresponds to safe, uninteresting text. Holtzman et al. (2020) showed that typical language models assign human-written text substantially lower per-token probability than beam-decoded text, so pure probability maximization diverges from human-like writing.

What is the difference between beam search and greedy decoding?

Greedy decoding selects the single highest-probability token at each step — equivalent to beam search with k=1. This can miss sequences where a slightly lower-probability first token leads to a much higher-probability overall sequence. Beam search explores k alternatives simultaneously, pruning the search space to retain the k highest-probability partial sequences at each step. The trade-off: k-beam search requires k× the computation and memory of greedy decoding, with diminishing accuracy returns above k=5–10.
