Gradient Descent and Adam Optimizer: Update Rules and Hyperparameters
SGD updates θ ← θ − η∇L; Adam adapts per-parameter learning rates using m_t = β₁m_{t-1}+(1−β₁)g_t and v_t = β₂v_{t-1}+(1−β₂)g_t²; typical transformer settings: β₁=0.9, β₂=0.95–0.999, ε=1e-8 (Kingma & Ba, 2015).
| Measure | Value | Notes |
|---|---|---|
| SGD update rule | θ_{t+1} = θ_t − η · ∇_θ L | η = learning rate; ∇_θ L = gradient of loss w.r.t. parameters |
| Adam first-moment decay (β₁) | 0.9 | Exponential moving average of gradients; controls gradient smoothing |
| Adam second-moment decay (β₂) | 0.999 | Exponential moving average of squared gradients; controls adaptive scaling |
| Adam epsilon (ε) | 1e-8 | Numerical stability term; prevents division by zero when v_t ≈ 0 |
| Transformer warmup steps (original) | 4,000 steps | lr = d_model^{−0.5} · min(step^{−0.5}, step · warmup_steps^{−1.5}) |
| Typical peak learning rate (large models) | 1e-4 to 3e-4 | With warmup + cosine decay; larger models typically use smaller peak rates |
Gradient descent is the fundamental optimization algorithm for training neural networks. Starting from random weights, each iteration computes the gradient of the loss with respect to all parameters and takes a step in the negative gradient direction. Modern language model training universally uses variants of Adam, which adapts the learning rate per parameter using gradient history.
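The basic loop can be sketched in a few lines. This is a minimal illustration on a toy quadratic loss, not training code; the names (`target`, `lr`, `steps`) and the loss itself are illustrative choices.

```python
import numpy as np

# Minimal sketch of gradient descent on a toy quadratic loss
# L(theta) = ||theta - target||^2, whose gradient is 2*(theta - target).

def gradient_descent(theta, grad_fn, lr=0.1, steps=100):
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)   # step against the gradient
    return theta

target = np.array([3.0, -2.0])
grad_fn = lambda th: 2.0 * (th - target)      # gradient of the squared error

theta = gradient_descent(np.zeros(2), grad_fn)
# theta converges toward target
```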
Gradient Descent Variants
| Algorithm | Update Rule | Key Property |
|---|---|---|
| Vanilla SGD | θ ← θ − η·g | Simple; requires careful tuning |
| SGD + Momentum | v ← μv − η·g; θ ← θ + v | Accelerates in consistent directions |
| AdaGrad | θ ← θ − η·g/(√G + ε), G = Σg² | Adapts to rare features; effective learning rate never increases |
| RMSProp | θ ← θ − η·g/(√EMA[g²] + ε) | Decaying average of squared gradients |
| Adam | See below | Combines momentum + RMSProp; dominant in practice |
| AdamW | Adam + decoupled weight decay | Corrects L2 regularization in Adam |
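The momentum row in the table can be transcribed directly. The following is a sketch under illustrative hyperparameters (`mu`, `lr`) on the same kind of toy quadratic objective, not a production optimizer.

```python
import numpy as np

# SGD + momentum as in the table: v <- mu*v - lr*g; theta <- theta + v.

def sgd_momentum(theta, grad_fn, lr=0.05, mu=0.9, steps=300):
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = mu * v - lr * grad_fn(theta)  # accumulate a velocity term
        theta = theta + v                 # move along the velocity
    return theta

target = np.array([3.0, -2.0])
theta = sgd_momentum(np.zeros(2), lambda th: 2.0 * (th - target))
```

The velocity term keeps pushing in directions where successive gradients agree, which is what "accelerates in consistent directions" means in the table.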
Adam Update Rules (Kingma & Ba, 2015)
At step t, with gradient g_t = ∇_θ L:
- First moment: m_t = β₁·m_{t-1} + (1−β₁)·g_t
- Second moment: v_t = β₂·v_{t-1} + (1−β₂)·g_t²
- Bias correction: m̂_t = m_t/(1−β₁ᵗ), v̂_t = v_t/(1−β₂ᵗ)
- Update: θ_{t+1} = θ_t − η · m̂_t / (√v̂_t + ε)
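The four equations above translate line-for-line into code. This is a direct transcription for illustration; the toy quadratic objective, learning rate, and step count are assumptions, not values from the paper.

```python
import numpy as np

# One Adam step (Kingma & Ba, 2015), matching the equations above.
def adam_step(theta, g, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g                       # first moment
    v = beta2 * v + (1 - beta2) * g**2                    # second moment
    m_hat = m / (1 - beta1**t)                            # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # parameter update
    return theta, m, v

# Toy run: minimize ||theta - target||^2.
target = np.array([3.0, -2.0])
theta = np.zeros(2)
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 3001):       # t starts at 1 so bias correction is defined
    g = 2.0 * (theta - target)
    theta, m, v = adam_step(theta, g, m, v, t)
```

Note that the step size is roughly ±lr per parameter while gradient signs are consistent, regardless of gradient magnitude; this is the adaptive scaling the bias-corrected ratio m̂/√v̂ provides.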
Hyperparameter Defaults
| Hyperparameter | Adam Default | Typical LLM Training |
|---|---|---|
| β₁ | 0.9 | 0.9 |
| β₂ | 0.999 | 0.95–0.999 |
| ε | 1e-8 | 1e-8 |
| Weight decay λ (AdamW) | 0 | 0.01–0.1 |
| Peak learning rate | — | 1e-4 to 3e-4 |
| Gradient clipping | — | Norm ≤ 1.0 |
Gradient Clipping
Gradient clipping prevents exploding gradients by rescaling the gradient vector when its norm exceeds a threshold:
if ‖g‖₂ > max_norm: g ← g × (max_norm / ‖g‖₂)
Large language model training almost always applies gradient clipping, typically with max_norm = 1.0. Without clipping, occasional large gradient spikes (common in attention layers) can destabilize training.
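The rescaling rule above is a one-liner in practice. This sketch assumes the gradient is stored as a list of arrays (one per parameter tensor) and clips by the global norm across all of them:

```python
import numpy as np

# Global-norm gradient clipping: rescale when ||g||_2 exceeds max_norm,
# preserving the gradient's direction.
def clip_by_global_norm(grads, max_norm=1.0):
    total_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

grads = [np.array([3.0, 4.0])]      # ||g|| = 5.0 > 1.0, so it gets rescaled
clipped = clip_by_global_norm(grads)
```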
Learning Rate Schedule (Transformer)
Vaswani et al. (2017) define:
lr(step) = d_model^{−0.5} · min(step^{−0.5}, step · 4000^{−1.5})
| Step | Learning Rate (d_model=512) |
|---|---|
| 1 | ~1.7 × 10⁻⁷ |
| 1,000 | ~1.7 × 10⁻⁴ |
| 4,000 (peak) | ~7.0 × 10⁻⁴ |
| 10,000 | ~4.4 × 10⁻⁴ |
| 100,000 | ~1.4 × 10⁻⁴ |
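The schedule is simple enough to transcribe directly; the function below is a plain reading of the formula, with the paper's d_model=512 and warmup_steps=4000 as defaults.

```python
# Transformer learning rate schedule (Vaswani et al., 2017):
# linear warmup to the peak at step == warmup_steps, then inverse-sqrt decay.
def transformer_lr(step, d_model=512, warmup_steps=4000):
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = transformer_lr(4000)   # ≈ 512^(-0.5) * 4000^(-0.5) ≈ 7.0e-4
```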
Related Pages
See backpropagation for how gradients are computed, and neural-network-fundamentals for the broader optimization landscape.
Sources
- Kingma & Ba (2015) — Adam: A Method for Stochastic Optimization. ICLR 2015
- Loshchilov & Hutter (2019) — Decoupled Weight Decay Regularization (AdamW). ICLR 2019
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
Frequently Asked Questions
Why does Adam outperform vanilla SGD for transformer training?
Transformers have parameters across very different scales — embedding weights, attention projections, and FFN weights all have different gradient magnitudes. Adam's per-parameter adaptive learning rates normalize these differences: parameters with consistently large gradients get smaller effective learning rates, while parameters with small gradients get larger effective steps. This adaptive scaling makes Adam much more robust to the choice of global learning rate and is why it dominates in language model training.
What is AdamW and why is it used instead of Adam?
Original Adam implements L2 regularization by adding λθ to the gradient, which interacts with the adaptive learning rate scaling in an undesirable way — parameters updated infrequently get less regularization. Decoupled weight decay (AdamW, Loshchilov & Hutter 2019) applies weight decay directly to the parameters: θ_{t+1} = (1 − ηλ)θ_t − η·Adam_update. This correctly decouples weight decay from the gradient-based update, improving generalization. Most large model training uses AdamW.
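The decoupling is easiest to see in code. This sketch modifies the Adam equations from earlier in the page by applying the (1 − ηλ) shrinkage to the parameters outside the adaptive update; the hyperparameter values are illustrative.

```python
import numpy as np

# One AdamW step (Loshchilov & Hutter, 2019): weight decay is applied
# directly to theta, decoupled from the adaptive gradient update.
def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta * (1 - lr * weight_decay)               # decoupled decay
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # Adam update
    return theta, m, v

# With a zero gradient, only the decay term acts: theta shrinks by (1 - lr*wd).
theta, m, v = adamw_step(np.array([1.0]), np.array([0.0]),
                         np.zeros(1), np.zeros(1), t=1)
```

Because the decay no longer flows through the m̂/√v̂ scaling, every parameter is regularized at the same rate regardless of its gradient history.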
What is the transformer learning rate schedule?
Vaswani et al. (2017) introduced a warmup-then-decay schedule: lr(step) = d_model^{-0.5} × min(step^{-0.5}, step × warmup_steps^{-1.5}). This linearly increases the learning rate for the first warmup_steps, then decreases proportionally to the inverse square root of the step count. The 4,000-step warmup in the original paper is roughly 4% of training — modern practice varies from 1% to 10% of total steps depending on model size and batch size.