Gradient Descent and Adam Optimizer: Update Rules and Hyperparameters
SGD updates θ ← θ − η∇L; Adam adapts per-parameter learning rates using m_t = β₁m_{t-1}+(1−β₁)g_t and v_t = β₂v_{t-1}+(1−β₂)g_t²; typical transformer settings: β₁=0.9, β₂=0.95–0.999, ε=1e-8 (Kingma & Ba, 2015).
| Measure | Value | Notes |
|---|---|---|
| SGD update rule | θ_{t+1} = θ_t − η · ∇_θ L | η = learning rate; ∇_θ L = gradient of loss w.r.t. parameters |
| Adam first-moment decay (β₁) | 0.9 | Exponential moving average of gradients; controls gradient smoothing |
| Adam second-moment decay (β₂) | 0.999 | Exponential moving average of squared gradients; controls adaptive scaling |
| Adam epsilon (ε) | 1e-8 | Numerical stability term; prevents division by zero when v_t ≈ 0 |
| Transformer warmup steps (original) | 4,000 steps | lr = d_model^{−0.5} · min(step^{−0.5}, step · warmup_steps^{−1.5}) |
| Typical peak learning rate (large models) | 1e-4 to 3e-4 | With warmup + cosine decay; larger models typically use smaller peak rates |
Gradient descent is the fundamental optimization algorithm for training neural networks. Starting from random weights, each iteration computes the gradient of the loss with respect to all parameters and takes a step in the negative gradient direction. Modern language model training universally uses variants of Adam, which adapts the learning rate per parameter using gradient history.
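The basic loop can be sketched in a few lines. This is a minimal illustration on a toy quadratic loss, not training code; the names (`target`, `lr`, `steps`) and the loss itself are illustrative choices.

```python
import numpy as np

# Minimal sketch of gradient descent on a toy quadratic loss
# L(theta) = ||theta - target||^2, whose gradient is 2*(theta - target).

def gradient_descent(theta, grad_fn, lr=0.1, steps=100):
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)   # step against the gradient
    return theta

target = np.array([3.0, -2.0])
grad_fn = lambda th: 2.0 * (th - target)      # gradient of the squared error

theta = gradient_descent(np.zeros(2), grad_fn)
# theta converges toward target
```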
Gradient Descent Variants
| Algorithm | Update Rule | Key Property |
|---|---|---|
| Vanilla SGD | θ ← θ − η·g | Simple; requires careful tuning |
| SGD + Momentum | v ← μv − η·g; θ ← θ + v | Accelerates in consistent directions |
| AdaGrad | θ ← θ − η·g/(√G + ε), G = Σg² | Adapts to rare features; effective learning rate never increases |
| RMSProp | θ ← θ − η·g/(√EMA[g²] + ε) | Decaying average of squared gradients |
| Adam | See below | Combines momentum + RMSProp; dominant in practice |
| AdamW | Adam + decoupled weight decay | Corrects L2 regularization in Adam |
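The momentum row in the table can be transcribed directly. The following is a sketch under illustrative hyperparameters (`mu`, `lr`) on the same kind of toy quadratic objective, not a production optimizer.

```python
import numpy as np

# SGD + momentum as in the table: v <- mu*v - lr*g; theta <- theta + v.

def sgd_momentum(theta, grad_fn, lr=0.05, mu=0.9, steps=300):
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = mu * v - lr * grad_fn(theta)  # accumulate a velocity term
        theta = theta + v                 # move along the velocity
    return theta

target = np.array([3.0, -2.0])
theta = sgd_momentum(np.zeros(2), lambda th: 2.0 * (th - target))
```

The velocity term keeps pushing in directions where successive gradients agree, which is what "accelerates in consistent directions" means in the table.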
Adam Update Rules (Kingma & Ba, 2015)
At step t, with gradient g_t = ∇_θ L:
- First moment: m_t = β₁·m_{t-1} + (1−β₁)·g_t
- Second moment: v_t = β₂·v_{t-1} + (1−β₂)·g_t²
- Bias correction: m̂_t = m_t/(1−β₁ᵗ), v̂_t = v_t/(1−β₂ᵗ)
- Update: θ_{t+1} = θ_t − η · m̂_t / (√v̂_t + ε)
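The four equations above translate line-for-line into code. This is a direct transcription for illustration; the toy quadratic objective, learning rate, and step count are assumptions, not values from the paper.

```python
import numpy as np

# One Adam step (Kingma & Ba, 2015), matching the equations above.
def adam_step(theta, g, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g                       # first moment
    v = beta2 * v + (1 - beta2) * g**2                    # second moment
    m_hat = m / (1 - beta1**t)                            # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # parameter update
    return theta, m, v

# Toy run: minimize ||theta - target||^2.
target = np.array([3.0, -2.0])
theta = np.zeros(2)
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 3001):       # t starts at 1 so bias correction is defined
    g = 2.0 * (theta - target)
    theta, m, v = adam_step(theta, g, m, v, t)
```

Note that the step size is roughly ±lr per parameter while gradient signs are consistent, regardless of gradient magnitude; this is the adaptive scaling the bias-corrected ratio m̂/√v̂ provides.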
Hyperparameter Defaults
| Hyperparameter | Adam Default | Typical LLM Training |
|---|---|---|
| β₁ | 0.9 | 0.9 |
| β₂ | 0.999 | 0.95–0.999 |
| ε | 1e-8 | 1e-8 |
| Weight decay λ (AdamW) | 0 | 0.01–0.1 |
| Peak learning rate | — | 1e-4 to 3e-4 |
| Gradient clipping | — | Norm ≤ 1.0 |
Gradient Clipping
Gradient clipping prevents exploding gradients by rescaling the gradient vector when its norm exceeds a threshold:
if ‖g‖₂ > max_norm: g ← g × (max_norm / ‖g‖₂)
Large language model training almost always applies gradient clipping, typically with max_norm = 1.0. Without clipping, occasional large gradient spikes (common in attention layers) can destabilize training.
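The rescaling rule above is a one-liner in practice. This sketch assumes the gradient is stored as a list of arrays (one per parameter tensor) and clips by the global norm across all of them:

```python
import numpy as np

# Global-norm gradient clipping: rescale when ||g||_2 exceeds max_norm,
# preserving the gradient's direction.
def clip_by_global_norm(grads, max_norm=1.0):
    total_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

grads = [np.array([3.0, 4.0])]      # ||g|| = 5.0 > 1.0, so it gets rescaled
clipped = clip_by_global_norm(grads)
```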
Learning Rate Schedule (Transformer)
Vaswani et al. (2017) define:
lr(step) = d_model^{−0.5} · min(step^{−0.5}, step · 4000^{−1.5})
| Step | Learning Rate (d_model=512) |
|---|---|
| 1 | ~1.7 × 10⁻⁷ |
| 1,000 | ~1.7 × 10⁻⁴ |
| 4,000 (peak) | ~7.0 × 10⁻⁴ |
| 10,000 | ~4.4 × 10⁻⁴ |
| 100,000 | ~1.4 × 10⁻⁴ |
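The schedule is simple enough to transcribe directly; the function below is a plain reading of the formula, with the paper's d_model=512 and warmup_steps=4000 as defaults.

```python
# Transformer learning rate schedule (Vaswani et al., 2017):
# linear warmup to the peak at step == warmup_steps, then inverse-sqrt decay.
def transformer_lr(step, d_model=512, warmup_steps=4000):
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = transformer_lr(4000)   # ≈ 512^(-0.5) * 4000^(-0.5) ≈ 7.0e-4
```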
Related Pages
See backpropagation for how gradients are computed, and neural-network-fundamentals for the broader optimization landscape.
Sources
- Kingma & Ba (2015) — Adam: A Method for Stochastic Optimization. ICLR 2015
- Loshchilov & Hutter (2019) — Decoupled Weight Decay Regularization (AdamW). ICLR 2019
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
Frequently Asked Questions
Why does Adam outperform vanilla SGD for transformer training?
Transformers have parameters across very different scales — embedding weights, attention projections, and FFN weights all have different gradient magnitudes. Adam's per-parameter adaptive learning rates normalize these differences: parameters with consistently large gradients get smaller effective learning rates, while parameters with small gradients get larger effective steps. This adaptive scaling makes Adam much more robust to the choice of global learning rate and is why it dominates in language model training.
What is AdamW and why is it used instead of Adam?
Original Adam implements L2 regularization by adding λθ to the gradient, which interacts with the adaptive learning rate scaling in an undesirable way — parameters updated infrequently get less regularization. Decoupled weight decay (AdamW, Loshchilov & Hutter 2019) applies weight decay directly to the parameters: θ_{t+1} = (1 − ηλ)θ_t − η·Adam_update. This correctly decouples weight decay from the gradient-based update, improving generalization. Most large model training uses AdamW.
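The decoupling is easiest to see in code. This sketch modifies the Adam equations from earlier in the page by applying the (1 − ηλ) shrinkage to the parameters outside the adaptive update; the hyperparameter values are illustrative.

```python
import numpy as np

# One AdamW step (Loshchilov & Hutter, 2019): weight decay is applied
# directly to theta, decoupled from the adaptive gradient update.
def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta * (1 - lr * weight_decay)               # decoupled decay
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # Adam update
    return theta, m, v

# With a zero gradient, only the decay term acts: theta shrinks by (1 - lr*wd).
theta, m, v = adamw_step(np.array([1.0]), np.array([0.0]),
                         np.zeros(1), np.zeros(1), t=1)
```

Because the decay no longer flows through the m̂/√v̂ scaling, every parameter is regularized at the same rate regardless of its gradient history.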
What is the transformer learning rate schedule?
Vaswani et al. (2017) introduced a warmup-then-decay schedule: lr(step) = d_model^{-0.5} × min(step^{-0.5}, step × warmup_steps^{-1.5}). This linearly increases the learning rate for the first warmup_steps, then decreases proportionally to the inverse square root of the step count. The 4,000-step warmup in the original paper is roughly 4% of training — modern practice varies from 1% to 10% of total steps depending on model size and batch size.