Temperature Sampling: Controlling Randomness in Autoregressive Language Model Generation
Temperature T scales logits before softmax: p_i = exp(z_i/T) / Σ exp(z_j/T); T→0 approaches greedy decoding; T=1 is standard softmax; T>1 increases entropy. Holtzman et al. (ICLR 2020) showed that temperature-based sampling alone produces incoherent text at high temperatures, motivating truncation methods such as top-k and nucleus sampling.
| Measure | Value | Notes |
|---|---|---|
| Temperature formula | p_i = exp(z_i/T) / Σ exp(z_j/T) | Standard softmax when T=1; reduces to argmax as T→0; equivalent to scaling logits by 1/T |
| Greedy decoding threshold | T → 0 (argmax) | Deterministic; always selects highest-probability token; produces repetition loops in long generation |
| Typical creative generation range | T = 0.7–1.0 | Balances diversity and coherence; values above 1.2 typically degrade semantic and grammatical coherence |
| Entropy ordering | H(T=0.5) < H(T=1) < H(T=2) | Low T concentrates mass on top tokens; high T spreads mass toward uniform; entropy increases monotonically with T |
Temperature sampling controls the randomness of language model generation by scaling the logit vector before applying softmax. The technique originates from the Boltzmann distribution in statistical mechanics and was adapted to neural language generation as a direct, differentiable control over output entropy.
The Formula
Standard softmax computes: p_i = exp(z_i) / Σ exp(z_j)
With temperature T: p_i = exp(z_i / T) / Σ exp(z_j / T)
The effect is equivalent to dividing all logits by T before the softmax operation. This does not change the ranking of tokens — only the sharpness of the resulting distribution.
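Both properties are easy to verify numerically. The sketch below (with hypothetical logits for a 4-token vocabulary) applies the temperature-scaled softmax at several values of T; the argmax token is the same at every temperature, while the mass placed on it changes:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Scale logits by 1/T, then apply a numerically stable softmax."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                     # shift so the largest logit is 0 (avoids overflow)
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5, -1.0]       # hypothetical logits for a 4-token vocabulary
for T in (0.5, 1.0, 2.0):
    p = softmax_with_temperature(logits, T)
    print(f"T={T}: p={np.round(p, 3)}, argmax={p.argmax()}")
```

At T=0.5 the distribution sharpens around token 0; at T=2.0 it flattens, but token 0 remains the most probable, illustrating that temperature changes sharpness, not ranking.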
Temperature Effects on Distribution
| Temperature | Behavior | Typical Use Case |
|---|---|---|
| T → 0 | Greedy: always argmax token | Factual Q&A, structured output |
| T = 0.3–0.7 | Concentrated; low diversity | Code generation, factual writing |
| T = 1.0 | Standard softmax (unchanged) | Default generation |
| T = 1.2–1.5 | Higher diversity, more creative risk | Story generation, brainstorming |
| T > 2.0 | Near-uniform distribution | Mostly noise; rarely useful |
Repetition and Greedy Decoding
Greedy decoding (T→0) consistently selects the maximum-probability token at each step. While this produces locally coherent text, it creates global repetition: once a high-probability phrase begins, the model assigns high probability to its continuation indefinitely. Holtzman et al. (2020) showed that humans judge greedy-decoded text as significantly worse than human text, even when per-token probabilities are high — the “most probable” sequence is not the most human-like.
Relationship to Other Sampling Methods
Temperature is typically the first step in a multi-stage pipeline:
| Method | Mechanism | Interaction with Temperature |
|---|---|---|
| Top-k sampling | Sample from k highest-probability tokens only | Applied after temperature scaling |
| Nucleus (top-p) sampling | Sample from smallest set with cumulative prob ≥ p | Applied after temperature scaling |
| Beam search | Maintain k hypotheses | Does not use sampling; temperature irrelevant |
Standard practice: apply temperature scaling first, then top-k or top-p truncation, then sample from the resulting renormalized distribution.
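A minimal sketch of that pipeline, using hypothetical logits and top-k truncation (top-p would replace step 2 with a cumulative-probability cutoff); production decoders such as those in Hugging Face transformers apply the same ordering inside their generation loops:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_token(logits, T=0.8, k=3):
    # 1. Temperature: divide logits by T, then softmax (numerically stable).
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()
    p = np.exp(z) / np.exp(z).sum()
    # 2. Top-k truncation: keep only the k highest-probability tokens.
    top = np.argsort(p)[-k:]
    # 3. Renormalize over the surviving tokens.
    p_top = p[top] / p[top].sum()
    # 4. Sample from the renormalized distribution.
    return int(rng.choice(top, p=p_top))

logits = [4.0, 3.5, 1.0, 0.2, -2.0]  # hypothetical 5-token vocabulary
tokens = [sample_token(logits) for _ in range(20)]
print(tokens)
```

Every sampled token comes from the top-3 set {0, 1, 2}; tokens 3 and 4 are excluded by truncation no matter how much temperature flattens the distribution.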
Entropy and Temperature: Formal Relationship
For a distribution with entropy H(1) at T=1, a first-order expansion around T=1 gives:
H(T) ≈ H(1) + (T − 1) · Var(log p) / T
where Var(log p) is the variance of the token log-probabilities under the T=1 distribution; the approximation is accurate only near T=1. At T=1 the distribution is unchanged. As T→∞ the entropy approaches log|V| (uniform over the vocabulary of size |V|). As T→0 entropy approaches 0 (deterministic).
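The limiting behavior and the monotonic growth of entropy with T can be checked directly (hypothetical logits; a 4-token vocabulary, so the T→∞ limit is log 4 ≈ 1.386):

```python
import numpy as np

def entropy_at_T(logits, T):
    """Shannon entropy (nats) of the temperature-T softmax distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p)).sum())

logits = [3.0, 1.0, 0.0, -1.0]       # hypothetical 4-token vocabulary
hs = [entropy_at_T(logits, T) for T in (0.5, 1.0, 2.0, 100.0)]
print(np.round(hs, 3))               # entropy grows with T, approaching log(4)
```

The entropies increase strictly with T, and at T=100 the distribution is nearly uniform, so its entropy sits just below log|V|.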
Related Pages
See top-p-sampling for the nucleus sampling approach that pairs with temperature, beam-search for deterministic sequence search, and autoregressive-decoding for the overall token-by-token generation loop.
Sources
- Holtzman et al. (2020) — The Curious Case of Neural Text Degeneration. ICLR 2020
- Ackley et al. (1985) — A Learning Algorithm for Boltzmann Machines. Cognitive Science 9(1)
Frequently Asked Questions
What happens to the probability distribution when temperature approaches zero?
As T→0, exp(z_i/T) diverges for the highest logit while all other terms approach zero. The result approaches a one-hot distribution placing all probability mass on the argmax token, equivalent to greedy decoding. This is deterministic but produces repetitive text because high-probability sequences often loop: once a frequent phrase begins, the model assigns high probability to its continuation indefinitely.
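The collapse toward a one-hot distribution is visible even at modest temperatures (hypothetical 3-token logits):

```python
import numpy as np

def softmax_T(logits, T):
    """Temperature-T softmax, stabilized by shifting logits."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()
    return np.exp(z) / np.exp(z).sum()

logits = [2.0, 1.0, 0.0]             # hypothetical 3-token vocabulary
for T in (1.0, 0.1, 0.01):
    print(f"T={T}: {np.round(softmax_T(logits, T), 4)}")
```

By T=0.01 essentially all probability mass sits on the argmax token, matching the greedy-decoding limit described above.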
Why does high temperature produce incoherent text?
At T≫1, all logits are divided by a large number, compressing them toward zero before softmax. This makes the output distribution near-uniform — even highly unlikely tokens gain substantial probability. The model effectively ignores its learned predictions, treating all tokens as nearly equally likely, which destroys semantic and grammatical coherence. Holtzman et al. (2020) showed that untruncated sampling degenerates in this way, which is why truncation methods (top-k or top-p) are used to constrain the candidate vocabulary before sampling.