Softmax Function: Formula, Temperature Scaling, and Numerical Stability
Softmax σ(z_i) = e^{z_i}/Σ_j e^{z_j} converts logits (e.g. attention scores) into a probability distribution; temperature T < 1 sharpens the distribution (T → 0 approaches argmax/greedy), T → ∞ flattens it toward uniform; it is numerically stabilized by subtracting max(z) before exponentiating.
| Concept | Expression | Notes |
|---|---|---|
| Softmax formula | σ(z_i) = e^{z_i} / Σ_j e^{z_j} | Input z ∈ ℝ^K; output is a probability vector summing to 1 |
| Temperature-scaled softmax | σ(z_i / T) | T = 1 standard; T → 0 argmax (greedy); T → ∞ uniform distribution |
| Numerically stable computation | σ(z_i − max(z)) | Subtracting max(z) prevents overflow without changing the output |
| Gradient (Jacobian) | ∂σ_i/∂z_i = σ_i(1 − σ_i); ∂σ_i/∂z_j = −σ_i·σ_j for i ≠ j | Diagonal and off-diagonal entries of the Jacobian |
| Attention logit scaling (d_k = 64) | ÷√d_k = ÷8 | Dividing by √d_k prevents large logits that saturate softmax gradients |
The softmax function maps a vector of arbitrary real numbers (logits) to a probability distribution over K categories. It is ubiquitous in language models: converting raw attention scores to attention weights, converting final hidden states to next-token probability distributions, and controlling sampling behavior via temperature.
The Formula
For a vector z = (z₁, z₂, …, z_K):
σ(z)_i = e^{z_i} / Σⱼ₌₁ᴷ e^{z_j}
Properties:
- Each output is in (0, 1) — strictly positive
- Outputs sum to exactly 1.0
- Order-preserving: if z_i > z_j, then σ(z)_i > σ(z)_j
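The formula and its properties can be sketched directly in NumPy (this is the naive form; the stabilized variant covered in the next section is what production code should use):

```python
import numpy as np

def softmax(z):
    """Direct transcription of the formula: exp(z_i) / sum_j exp(z_j)."""
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p)          # strictly positive, order-preserving
print(p.sum())    # 1.0 (up to float rounding)
```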
Numerical Stability
Naive computation overflows for large logits. The numerically stable equivalent uses the identity:
σ(z)_i = e^{z_i − c} / Σⱼ e^{z_j − c}
where c = max(z). Subtracting c sets the maximum exponent to e⁰ = 1, preventing overflow.
| z_max | e^{z_max} | Stable? |
|---|---|---|
| 10 | 22,026 | Yes (float32 OK) |
| 50 | 5.18 × 10²¹ | Yes (float32 OK) |
| 88 | 1.65 × 10³⁸ | Borderline |
| 100 | 2.69 × 10⁴³ | Overflow → NaN |
| 100 (stabilized) | e^0 = 1 | Always stable |
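A minimal sketch of the overflow behavior in the table, contrasting the naive computation in float32 with the max-subtracted form:

```python
import numpy as np

def softmax_stable(z):
    """Stable softmax: shifting by max(z) leaves the output unchanged."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([100.0, 5.0, 1.0], dtype=np.float32)
with np.errstate(over="ignore"):
    naive = np.exp(z)              # exp(100) > float32 max -> inf
print(np.isinf(naive).any())       # True: naive form overflows
p = softmax_stable(z)
print(p, p.sum())                  # finite probabilities summing to 1
```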
Temperature Scaling
| Temperature | Effect | Use Case |
|---|---|---|
| T → 0 | Argmax (greedy) | Deterministic decoding |
| T = 0.7 | Sharpened | High-confidence outputs |
| T = 1.0 | Original distribution | Standard sampling |
| T = 1.5 | Flattened | More diverse/creative outputs |
| T → ∞ | Uniform distribution | Maximum randomness |
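The effect of each temperature regime can be sketched by dividing the logits by T before a (stabilized) softmax; the example values below are illustrative, not from the source:

```python
import numpy as np

def softmax_temperature(z, T):
    """Temperature-scaled softmax: softmax(z / T), computed stably."""
    scaled = z / T
    scaled = scaled - scaled.max()
    e = np.exp(scaled)
    return e / e.sum()

z = np.array([3.0, 1.0, 0.5])
for T in (0.5, 1.0, 2.0):
    # Lower T concentrates mass on the top logit; higher T spreads it out.
    print(f"T={T}: {np.round(softmax_temperature(z, T), 3)}")
```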
Softmax in Attention
In self-attention, the score matrix Q·Kᵀ is divided by √d_k before softmax. For d_k=64, this scale factor is 8. Without it, large logits push the softmax into its saturation region where gradients are near zero.
| d_k | √d_k (scale) | Logit variance before scaling | After scaling |
|---|---|---|---|
| 16 | 4 | d_k = 16 | 1 |
| 64 | 8 | d_k = 64 | 1 |
| 256 | 16 | d_k = 256 | 1 |
| 512 | 22.6 | d_k = 512 | 1 |
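The variance claim in the table can be checked empirically: for q and k with i.i.d. unit-variance components, the dot product q·k has variance d_k, and dividing by √d_k brings it back to 1 (a sketch with randomly drawn vectors, not the attention implementation itself):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
n = 10_000

# n random query/key pairs with i.i.d. standard-normal components.
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))
scores = (q * k).sum(axis=1)           # q_i . k_i for each pair

print(scores.var())                     # ~ d_k = 64 before scaling
print((scores / np.sqrt(d_k)).var())    # ~ 1 after dividing by sqrt(d_k)
```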
Gradient Properties
For a single output σ_i with respect to input z_j:
- Self-gradient: ∂σ_i/∂z_i = σ_i(1 − σ_i) — maximum at σ_i = 0.5
- Cross-gradient: ∂σ_i/∂z_j = −σ_i·σ_j for i≠j
The Jacobian is dense: each output depends on all inputs. This dense coupling is essential for attention — a high score for one key reduces attention on all others.
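The full Jacobian combines both cases as J = diag(σ) − σσᵀ; a sketch that builds it and verifies it against central finite differences:

```python
import numpy as np

def softmax(z):
    shifted = z - z.max()
    e = np.exp(shifted)
    return e / e.sum()

def softmax_jacobian(z):
    """J[i, j] = sigma_i * (delta_ij - sigma_j): diagonal minus outer product."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 2.0, 0.5])
J = softmax_jacobian(z)

# Check column j against a central finite difference in z_j.
eps = 1e-6
num = np.stack(
    [(softmax(z + eps * np.eye(3)[j]) - softmax(z - eps * np.eye(3)[j])) / (2 * eps)
     for j in range(3)],
    axis=1,
)
print(np.abs(J - num).max())   # near zero: analytic and numeric gradients agree
```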
Related Pages
See self-attention-mechanism for where softmax appears in attention computation, and temperature-sampling for how temperature controls output diversity during inference.
Sources
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
- Goodfellow et al. (2016) — Deep Learning. MIT Press (Chapter 6)
- Ackley et al. (1985) — A Learning Algorithm for Boltzmann Machines. Cognitive Science
Frequently Asked Questions
Why is the softmax numerically unstable without the max subtraction?
For large z_i, e^{z_i} can exceed float32's maximum (≈3.4×10³⁸ at z≈88). If any component overflows to infinity, the division produces NaN. Subtracting max(z) from all components before exponentiation keeps the largest exponent at e^0=1, guaranteeing no overflow while preserving the output distribution identically.
How does temperature affect softmax in language models?
Temperature T scales the logits before softmax: σ(z_i/T). At T=1 (default), the model's trained distribution is used. T<1 sharpens the distribution — at T=0, it becomes argmax (always picks the highest-probability token). T>1 flattens it toward uniform, increasing randomness and diversity. Most practical inference uses T between 0.7 and 1.2.
Why does attention use softmax specifically?
Attention requires a probability distribution over key positions — weights that are non-negative and sum to 1. Softmax is the standard way to achieve this from arbitrary real-valued scores. The exponential function ensures all weights are strictly positive, and the normalization ensures they sum to exactly 1.0. Alternative normalizations (sigmoid, sparsemax) have been explored but softmax remains dominant.