Softmax Function: Formula, Temperature Scaling, and Numerical Stability
Softmax σ(z_i) = e^{z_i}/Σ_j e^{z_j} converts logits (e.g. attention scores) into a probability distribution; temperature T < 1 sharpens the distribution (T → 0 approaches argmax/greedy), T → ∞ flattens it toward uniform; it is numerically stabilized by subtracting max(z) before exponentiating.
| Concept | Expression | Notes |
|---|---|---|
| Softmax formula | σ(z_i) = e^{z_i} / Σ_j e^{z_j} | Input z ∈ ℝ^K; output is a probability vector summing to 1 |
| Temperature-scaled softmax | σ(z_i / T) | T = 1 standard; T → 0 argmax (greedy); T → ∞ uniform distribution |
| Numerically stable computation | σ(z_i − max(z)) | Subtracting max(z) prevents overflow without changing the output |
| Gradient (Jacobian) | ∂σ_i/∂z_i = σ_i(1 − σ_i); ∂σ_i/∂z_j = −σ_i·σ_j for i ≠ j | Diagonal and off-diagonal entries of the Jacobian |
| Attention logit scaling (d_k = 64) | ÷√d_k = ÷8 | Dividing by √d_k prevents large logits that saturate softmax gradients |
The softmax function maps a vector of arbitrary real numbers (logits) to a probability distribution over K categories. It is ubiquitous in language models: converting raw attention scores to attention weights, converting final hidden states to next-token probability distributions, and controlling sampling behavior via temperature.
The Formula
For a vector z = (z₁, z₂, …, z_K):
σ(z)_i = e^{z_i} / Σⱼ₌₁ᴷ e^{z_j}
Properties:
- Each output is in (0, 1) — strictly positive
- Outputs sum to exactly 1.0
- Order-preserving: if z_i > z_j, then σ(z)_i > σ(z)_j
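The formula and its properties can be sketched directly in NumPy (this is the naive form; the stabilized variant covered in the next section is what production code should use):

```python
import numpy as np

def softmax(z):
    """Direct transcription of the formula: exp(z_i) / sum_j exp(z_j)."""
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p)          # strictly positive, order-preserving
print(p.sum())    # 1.0 (up to float rounding)
```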
Numerical Stability
Naive computation overflows for large logits. The numerically stable equivalent uses the identity:
σ(z)_i = e^{z_i − c} / Σⱼ e^{z_j − c}
where c = max(z). Subtracting c sets the maximum exponent to e⁰ = 1, preventing overflow.
| z_max | e^{z_max} | Stable? |
|---|---|---|
| 10 | 22,026 | Yes (float32 OK) |
| 50 | 5.18 × 10²¹ | Yes (float32 OK) |
| 88 | 1.65 × 10³⁸ | Borderline |
| 100 | 2.69 × 10⁴³ | Overflow → NaN |
| 100 (stabilized) | e^0 = 1 | Always stable |
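A minimal sketch of the overflow behavior in the table, contrasting the naive computation in float32 with the max-subtracted form:

```python
import numpy as np

def softmax_stable(z):
    """Stable softmax: shifting by max(z) leaves the output unchanged."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([100.0, 5.0, 1.0], dtype=np.float32)
with np.errstate(over="ignore"):
    naive = np.exp(z)              # exp(100) > float32 max -> inf
print(np.isinf(naive).any())       # True: naive form overflows
p = softmax_stable(z)
print(p, p.sum())                  # finite probabilities summing to 1
```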
Temperature Scaling
| Temperature | Effect | Use Case |
|---|---|---|
| T → 0 | Argmax (greedy) | Deterministic decoding |
| T = 0.7 | Sharpened | High-confidence outputs |
| T = 1.0 | Original distribution | Standard sampling |
| T = 1.5 | Flattened | More diverse/creative outputs |
| T → ∞ | Uniform distribution | Maximum randomness |
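The effect of each temperature regime can be sketched by dividing the logits by T before a (stabilized) softmax; the example values below are illustrative, not from the source:

```python
import numpy as np

def softmax_temperature(z, T):
    """Temperature-scaled softmax: softmax(z / T), computed stably."""
    scaled = z / T
    scaled = scaled - scaled.max()
    e = np.exp(scaled)
    return e / e.sum()

z = np.array([3.0, 1.0, 0.5])
for T in (0.5, 1.0, 2.0):
    # Lower T concentrates mass on the top logit; higher T spreads it out.
    print(f"T={T}: {np.round(softmax_temperature(z, T), 3)}")
```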
Softmax in Attention
In self-attention, the score matrix Q·Kᵀ is divided by √d_k before softmax. For d_k=64, this scale factor is 8. Without it, large logits push the softmax into its saturation region where gradients are near zero.
| d_k | √d_k (scale) | Logit variance before scaling | After scaling |
|---|---|---|---|
| 16 | 4 | d_k = 16 | 1 |
| 64 | 8 | d_k = 64 | 1 |
| 256 | 16 | d_k = 256 | 1 |
| 512 | 22.6 | d_k = 512 | 1 |
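The variance claim in the table can be checked empirically: for q and k with i.i.d. unit-variance components, the dot product q·k has variance d_k, and dividing by √d_k brings it back to 1 (a sketch with randomly drawn vectors, not the attention implementation itself):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
n = 10_000

# n random query/key pairs with i.i.d. standard-normal components.
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))
scores = (q * k).sum(axis=1)           # q_i . k_i for each pair

print(scores.var())                     # ~ d_k = 64 before scaling
print((scores / np.sqrt(d_k)).var())    # ~ 1 after dividing by sqrt(d_k)
```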
Gradient Properties
For a single output σ_i with respect to input z_j:
- Self-gradient: ∂σ_i/∂z_i = σ_i(1 − σ_i) — maximum at σ_i = 0.5
- Cross-gradient: ∂σ_i/∂z_j = −σ_i·σ_j for i≠j
The Jacobian is dense: each output depends on all inputs. This dense coupling is essential for attention — a high score for one key reduces attention on all others.
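The full Jacobian combines both cases as J = diag(σ) − σσᵀ; a sketch that builds it and verifies it against central finite differences:

```python
import numpy as np

def softmax(z):
    shifted = z - z.max()
    e = np.exp(shifted)
    return e / e.sum()

def softmax_jacobian(z):
    """J[i, j] = sigma_i * (delta_ij - sigma_j): diagonal minus outer product."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 2.0, 0.5])
J = softmax_jacobian(z)

# Check column j against a central finite difference in z_j.
eps = 1e-6
num = np.stack(
    [(softmax(z + eps * np.eye(3)[j]) - softmax(z - eps * np.eye(3)[j])) / (2 * eps)
     for j in range(3)],
    axis=1,
)
print(np.abs(J - num).max())   # near zero: analytic and numeric gradients agree
```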
Related Pages
See self-attention-mechanism for where softmax appears in attention computation, and temperature-sampling for how temperature controls output diversity during inference.
Sources
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
- Goodfellow et al. (2016) — Deep Learning. MIT Press (Chapter 6)
- Ackley et al. (1985) — A Learning Algorithm for Boltzmann Machines. Cognitive Science
Frequently Asked Questions
Why is the softmax numerically unstable without the max subtraction?
For large z_i, e^{z_i} can exceed float32's maximum (≈3.4×10³⁸ at z≈88). If any component overflows to infinity, the division produces NaN. Subtracting max(z) from all components before exponentiation keeps the largest exponent at e^0=1, guaranteeing no overflow while preserving the output distribution identically.
How does temperature affect softmax in language models?
Temperature T scales the logits before softmax: σ(z_i/T). At T=1 (default), the model's trained distribution is used. T<1 sharpens the distribution — at T=0, it becomes argmax (always picks the highest-probability token). T>1 flattens it toward uniform, increasing randomness and diversity. Most practical inference uses T between 0.7 and 1.2.
Why does attention use softmax specifically?
Attention requires a probability distribution over key positions — weights that are non-negative and sum to 1. Softmax is the standard way to achieve this from arbitrary real-valued scores. The exponential function ensures all weights are strictly positive, and the normalization ensures they sum to exactly 1.0. Alternative normalizations (sigmoid, sparsemax) have been explored but softmax remains dominant.