Temperature Sampling: Controlling Randomness in Autoregressive Language Model Generation

Category: inference Updated: 2026-02-27

Temperature T scales logits before softmax: p_i = exp(z_i/T) / Σ exp(z_j/T); T→0 approaches greedy decoding; T=1 is standard softmax; T>1 increases entropy. Holtzman et al. (ICLR 2020) showed T-based sampling produces incoherent text at high values without truncation.

Key Data Points

| Measure | Value | Notes |
| --- | --- | --- |
| Temperature formula | p_i = exp(z_i/T) / Σ exp(z_j/T) | Standard softmax when T=1; reduces to argmax as T→0; equivalent to scaling logits by 1/T |
| Greedy decoding threshold | T → 0 (argmax) | Deterministic; always selects the highest-probability token; produces repetition loops in long generation |
| Typical creative generation range | 0.7–1.0 | Balances diversity and coherence; values above 1.2 typically degrade semantic and grammatical coherence |
| Entropy at low vs. high temperature | H(T=0.5) < H(T=1) < H(T=2) | Low T concentrates mass on top tokens; high T spreads mass toward uniform; entropy increases monotonically with T |

Temperature sampling controls the randomness of language model generation by scaling the logit vector before applying softmax. The technique originates from the Boltzmann distribution in statistical mechanics and was adapted to neural language generation as a direct, differentiable control over output entropy.

The Formula

Standard softmax computes: p_i = exp(z_i) / Σ exp(z_j)

With temperature T: p_i = exp(z_i / T) / Σ exp(z_j / T)

The effect is equivalent to dividing all logits by T before the softmax operation. This does not change the ranking of tokens — only the sharpness of the resulting distribution.
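The scaled softmax can be sketched in a few lines of NumPy. The function name here is illustrative, not from any particular library; the max-subtraction is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Scale logits by 1/T, then apply a numerically stable softmax."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()              # stability: shift so the largest scaled logit is 0
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.5, -1.0]
sharp = softmax_with_temperature(logits, T=0.5)  # mass concentrates on the top token
flat = softmax_with_temperature(logits, T=2.0)   # mass spreads toward uniform
assert np.argmax(sharp) == np.argmax(flat)       # ranking is unchanged, only sharpness
```

Note that `sharp[0] > flat[0]`: lowering T increases the top token's share of the probability mass without ever reordering the tokens.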

Temperature Effects on Distribution

| Temperature | Behavior | Typical Use Case |
| --- | --- | --- |
| T → 0 | Greedy: always the argmax token | Factual Q&A, structured output |
| T = 0.3–0.7 | Concentrated; low diversity | Code generation, factual writing |
| T = 1.0 | Standard softmax (unchanged) | Default generation |
| T = 1.2–1.5 | Higher diversity, more creative risk | Story generation, brainstorming |
| T > 2.0 | Near-uniform distribution | Mostly noise; rarely useful |

Repetition and Greedy Decoding

Greedy decoding (T→0) consistently selects the maximum-probability token at each step. While this produces locally coherent text, it creates global repetition: once a high-probability phrase begins, the model assigns high probability to its continuation indefinitely. Holtzman et al. (2020) showed that humans judge greedy-decoded text as significantly worse than human text, even when per-token probabilities are high — the “most probable” sequence is not the most human-like.

Relationship to Other Sampling Methods

Temperature is typically the first step in a multi-stage pipeline:

| Method | Mechanism | Interaction with Temperature |
| --- | --- | --- |
| Top-k sampling | Sample from the k highest-probability tokens only | Applied after temperature scaling |
| Nucleus (top-p) sampling | Sample from the smallest set with cumulative prob ≥ p | Applied after temperature scaling |
| Beam search | Maintain k hypotheses | Does not use sampling; temperature irrelevant |

Standard practice: apply temperature scaling first, then top-k or top-p truncation, then sample from the resulting renormalized distribution.
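The pipeline above can be sketched as a single function. This is an assumed reference implementation in NumPy, not the code of any specific library; `sample_token` and its parameters are illustrative, and in practice top-k and top-p are usually used one at a time.

```python
import numpy as np

def sample_token(logits, T=1.0, top_k=None, top_p=None, rng=None):
    """Temperature scaling, then optional top-k / top-p truncation, then sampling."""
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float) / T        # step 1: temperature scaling
    z -= z.max()
    p = np.exp(z)
    p /= p.sum()
    if top_k is not None:                          # step 2a: keep the k most probable (ties may keep more)
        cutoff = np.sort(p)[-top_k]
        p = np.where(p >= cutoff, p, 0.0)
    if top_p is not None:                          # step 2b: smallest prefix with cumulative mass >= top_p
        order = np.argsort(p)[::-1]
        csum = np.cumsum(p[order])
        keep = order[: np.searchsorted(csum, top_p) + 1]
        mask = np.zeros_like(p)
        mask[keep] = 1.0
        p = p * mask
    p /= p.sum()                                   # step 3: renormalize and sample
    return int(rng.choice(len(p), p=p))
```

With `top_k=1` this degenerates to greedy decoding regardless of T, which is a quick sanity check for the ordering-preserving property of temperature.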

Entropy and Temperature: Formal Relationship

For a distribution with entropy H at T=1, the entropy at temperature T is approximately:

H(T) ≈ H(1) + (T − 1) · Var(log p) / T

At T=1 the distribution is unchanged. At T→∞ the entropy approaches log|V| (uniform over vocabulary). At T→0 entropy approaches 0 (deterministic).
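These limits can be checked numerically. The sketch below (helper name illustrative) computes the exact entropy of a temperature-scaled softmax over a random logit vector and confirms it rises with T toward log|V|:

```python
import numpy as np

def entropy_at_temperature(logits, T):
    """Shannon entropy (nats) of softmax(logits / T)."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()              # numerical stability
    p = np.exp(z)
    p /= p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

logits = np.random.default_rng(0).normal(size=50)
hs = [entropy_at_temperature(logits, T) for T in (0.5, 1.0, 2.0, 100.0)]
# Entropy is monotone non-decreasing in T and bounded above by log|V|
assert hs[0] < hs[1] < hs[2] < hs[3] <= np.log(50) + 1e-9
```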

See top-p-sampling for the nucleus sampling approach that pairs with temperature, beam-search for deterministic sequence search, and autoregressive-decoding for the overall token-by-token generation loop.


Frequently Asked Questions

What happens to the probability distribution when temperature approaches zero?

As T→0, exp(z_i/T) diverges for the highest logit while all other terms approach zero. The result approaches a one-hot distribution placing all probability mass on the argmax token, equivalent to greedy decoding. This is deterministic but produces repetitive text because high-probability sequences often loop: once a frequent phrase begins, the model assigns high probability to its continuation indefinitely.
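The collapse to a one-hot distribution is easy to verify numerically (a toy three-token vocabulary, written with the usual max-subtraction so the computation stays stable at small T):

```python
import numpy as np

logits = np.array([3.0, 1.0, 0.0])

def probs(T):
    z = logits / T
    z -= z.max()              # stability: avoids overflow as T shrinks
    p = np.exp(z)
    return p / p.sum()

# As T shrinks, probability mass collapses onto the argmax token (index 0)
p_mild, p_cold = probs(1.0), probs(0.01)
assert p_mild[0] < 0.9 < 0.999 < p_cold[0]
```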

Why does high temperature produce incoherent text?

At T≫1, all logits are divided by a large number, compressing them toward zero before softmax. This makes the output distribution near-uniform: even highly unlikely tokens gain substantial probability. The model effectively ignores its learned predictions, treating all tokens as nearly equally likely, which destroys semantic and grammatical coherence. Holtzman et al. (2020) showed that sampling degenerates this way unless truncation methods (top-k or top-p) constrain the candidate vocabulary.
