Temperature Sampling: Controlling Randomness in Autoregressive Language Model Generation
Temperature T scales logits before softmax: p_i = exp(z_i/T) / Σ exp(z_j/T); T→0 approaches greedy decoding; T=1 is standard softmax; T>1 increases entropy. Holtzman et al. (ICLR 2020) showed that temperature-based sampling alone produces incoherent text at high temperatures, motivating truncation methods such as top-k and nucleus sampling.
| Measure | Value | Notes |
|---|---|---|
| Temperature formula | p_i = exp(z_i/T) / Σ exp(z_j/T) | Standard softmax when T=1; reduces to argmax as T→0; equivalent to scaling logits by 1/T |
| Greedy decoding threshold | T → 0 (argmax) | Deterministic; always selects highest-probability token; produces repetition loops in long generation |
| Typical creative generation range | T = 0.7–1.0 | Balances diversity and coherence; values above 1.2 typically degrade semantic and grammatical coherence |
| Entropy ordering | H(T=0.5) < H(T=1) < H(T=2) | Low T concentrates mass on top tokens; high T spreads mass toward uniform; entropy increases monotonically with T |
Temperature sampling controls the randomness of language model generation by scaling the logit vector before applying softmax. The technique originates from the Boltzmann distribution in statistical mechanics and was adapted to neural language generation as a direct, differentiable control over output entropy.
The Formula
Standard softmax computes: p_i = exp(z_i) / Σ exp(z_j)
With temperature T: p_i = exp(z_i / T) / Σ exp(z_j / T)
The effect is equivalent to dividing all logits by T before the softmax operation. This does not change the ranking of tokens — only the sharpness of the resulting distribution.
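Both properties are easy to verify numerically. The sketch below (with hypothetical logits for a 4-token vocabulary) applies the temperature-scaled softmax at several values of T; the argmax token is the same at every temperature, while the mass placed on it changes:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Scale logits by 1/T, then apply a numerically stable softmax."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                     # shift so the largest logit is 0 (avoids overflow)
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5, -1.0]       # hypothetical logits for a 4-token vocabulary
for T in (0.5, 1.0, 2.0):
    p = softmax_with_temperature(logits, T)
    print(f"T={T}: p={np.round(p, 3)}, argmax={p.argmax()}")
```

At T=0.5 the distribution sharpens around token 0; at T=2.0 it flattens, but token 0 remains the most probable, illustrating that temperature changes sharpness, not ranking.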
Temperature Effects on Distribution
| Temperature | Behavior | Typical Use Case |
|---|---|---|
| T → 0 | Greedy: always argmax token | Factual Q&A, structured output |
| T = 0.3–0.7 | Concentrated; low diversity | Code generation, factual writing |
| T = 1.0 | Standard softmax (unchanged) | Default generation |
| T = 1.2–1.5 | Higher diversity, more creative risk | Story generation, brainstorming |
| T > 2.0 | Near-uniform distribution | Mostly noise; rarely useful |
Repetition and Greedy Decoding
Greedy decoding (T→0) consistently selects the maximum-probability token at each step. While this produces locally coherent text, it creates global repetition: once a high-probability phrase begins, the model assigns high probability to its continuation indefinitely. Holtzman et al. (2020) showed that humans judge greedy-decoded text as significantly worse than human text, even when per-token probabilities are high — the “most probable” sequence is not the most human-like.
Relationship to Other Sampling Methods
Temperature is typically the first step in a multi-stage pipeline:
| Method | Mechanism | Interaction with Temperature |
|---|---|---|
| Top-k sampling | Sample from k highest-probability tokens only | Applied after temperature scaling |
| Nucleus (top-p) sampling | Sample from smallest set with cumulative prob ≥ p | Applied after temperature scaling |
| Beam search | Maintain k hypotheses | Does not use sampling; temperature irrelevant |
Standard practice: apply temperature scaling first, then top-k or top-p truncation, then sample from the resulting renormalized distribution.
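A minimal sketch of that pipeline, using hypothetical logits and top-k truncation (top-p would replace step 2 with a cumulative-probability cutoff); production decoders such as those in Hugging Face transformers apply the same ordering inside their generation loops:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_token(logits, T=0.8, k=3):
    # 1. Temperature: divide logits by T, then softmax (numerically stable).
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()
    p = np.exp(z) / np.exp(z).sum()
    # 2. Top-k truncation: keep only the k highest-probability tokens.
    top = np.argsort(p)[-k:]
    # 3. Renormalize over the surviving tokens.
    p_top = p[top] / p[top].sum()
    # 4. Sample from the renormalized distribution.
    return int(rng.choice(top, p=p_top))

logits = [4.0, 3.5, 1.0, 0.2, -2.0]  # hypothetical 5-token vocabulary
tokens = [sample_token(logits) for _ in range(20)]
print(tokens)
```

Every sampled token comes from the top-3 set {0, 1, 2}; tokens 3 and 4 are excluded by truncation no matter how much temperature flattens the distribution.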
Entropy and Temperature: Formal Relationship
For a distribution with entropy H(1) at T=1, a first-order expansion around T=1 gives:
H(T) ≈ H(1) + (T − 1) · Var(log p) / T
where Var(log p) is the variance of the token log-probabilities under the T=1 distribution; the approximation is accurate only near T=1. At T=1 the distribution is unchanged. As T→∞ the entropy approaches log|V| (uniform over the vocabulary of size |V|). As T→0 entropy approaches 0 (deterministic).
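The limiting behavior and the monotonic growth of entropy with T can be checked directly (hypothetical logits; a 4-token vocabulary, so the T→∞ limit is log 4 ≈ 1.386):

```python
import numpy as np

def entropy_at_T(logits, T):
    """Shannon entropy (nats) of the temperature-T softmax distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p)).sum())

logits = [3.0, 1.0, 0.0, -1.0]       # hypothetical 4-token vocabulary
hs = [entropy_at_T(logits, T) for T in (0.5, 1.0, 2.0, 100.0)]
print(np.round(hs, 3))               # entropy grows with T, approaching log(4)
```

The entropies increase strictly with T, and at T=100 the distribution is nearly uniform, so its entropy sits just below log|V|.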
Related Pages
See top-p-sampling for the nucleus sampling approach that pairs with temperature, beam-search for deterministic sequence search, and autoregressive-decoding for the overall token-by-token generation loop.
Sources
- Holtzman et al. (2020) — The Curious Case of Neural Text Degeneration. ICLR 2020
- Ackley et al. (1985) — A Learning Algorithm for Boltzmann Machines. Cognitive Science 9(1)
Frequently Asked Questions
What happens to the probability distribution when temperature approaches zero?
As T→0, exp(z_i/T) diverges for the highest logit while all other terms approach zero. The result approaches a one-hot distribution placing all probability mass on the argmax token, equivalent to greedy decoding. This is deterministic but produces repetitive text because high-probability sequences often loop: once a frequent phrase begins, the model assigns high probability to its continuation indefinitely.
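The collapse toward a one-hot distribution is visible even at modest temperatures (hypothetical 3-token logits):

```python
import numpy as np

def softmax_T(logits, T):
    """Temperature-T softmax, stabilized by shifting logits."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()
    return np.exp(z) / np.exp(z).sum()

logits = [2.0, 1.0, 0.0]             # hypothetical 3-token vocabulary
for T in (1.0, 0.1, 0.01):
    print(f"T={T}: {np.round(softmax_T(logits, T), 4)}")
```

By T=0.01 essentially all probability mass sits on the argmax token, matching the greedy-decoding limit described above.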
Why does high temperature produce incoherent text?
At T≫1, all logits are divided by a large number, compressing them toward zero before softmax. This makes the output distribution near-uniform — even highly unlikely tokens gain substantial probability. The model effectively ignores its learned predictions, treating all tokens as nearly equally likely, which destroys semantic and grammatical coherence. Holtzman et al. (2020) showed that untruncated sampling degenerates in this way, which is why truncation methods (top-k or top-p) are used to constrain the candidate vocabulary before sampling.