Attention Is All You Need: The Transformer Paper — Key Results and Impact
Vaswani et al. (NeurIPS 2017) introduced the transformer architecture. The 213M-parameter big model achieved 28.4 BLEU on WMT EN-DE, surpassing all prior models including ensembles, after 3.5 days of training on 8 P100 GPUs; the 65M-parameter base model reached 27.3 BLEU in just 12 hours.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| WMT EN-DE BLEU (transformer big) | 28.4 | BLEU | State of the art at publication; surpassed all prior single-model and ensemble results |
| WMT EN-FR BLEU (transformer big) | 41.8 | BLEU | Trained on 36M sentence pairs; outperformed all prior models |
| Base model training time | 12 | hours | 100K steps on 8 × NVIDIA P100 GPUs; big model trained for 3.5 days |
| Training cost (base model) | 3.3 × 10¹⁸ | FLOPs | Big model: 2.3 × 10¹⁹ FLOPs; dramatically less than prior LSTM-based systems |
| Previous SOTA (GNMT+RL ensemble) | 26.30 | BLEU | Wu et al. (2016) Google NMT ensemble on EN-DE; the transformer exceeded this with a single model |
| Paper citation count | 130,000+ | citations | As of 2025; one of the most-cited machine learning papers |
“Attention Is All You Need” by Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin (NeurIPS 2017) introduced the transformer architecture and demonstrated that recurrence could be entirely eliminated from sequence modeling. The paper is foundational to all subsequent large language model development.
Core Contribution
Prior state-of-the-art neural machine translation used encoder-decoder architectures with LSTMs augmented by attention (Bahdanau et al., 2015; Wu et al., 2016). These models processed tokens sequentially — each hidden state depended on the previous — preventing training parallelization.
The transformer replaced all recurrent components with self-attention, enabling:
- Full parallelization across sequence positions during training
- Direct long-range connections between any two positions in O(1) steps
- Significantly faster training — 3.5 days vs weeks for comparable LSTM systems
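The mechanism enabling this parallelism is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V (Eq. 1 of the paper). A minimal NumPy sketch (function name illustrative, no masking or learned projections) shows how every position's output is computed in one matrix product rather than a sequential scan:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Eq. 1 of the paper)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_q, seq_k) similarity matrix, all pairs at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # each output position mixes information from every position

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 64))                 # 5 positions, d_k = 64
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(out.shape)                             # (5, 64)
```

Because the whole sequence is processed as one matrix multiplication, no timestep waits on the previous one, which is exactly what RNNs could not avoid.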
Benchmark Results (Table 2 — Vaswani et al.)
| Model | WMT EN-DE BLEU | WMT EN-FR BLEU | Training Cost (FLOPs) |
|---|---|---|---|
| GNMT+RL (ensemble, 2016) | 26.30 | 41.16 | ~10²⁰ |
| ConvS2S (ensemble, 2017) | 26.36 | 41.29 | — |
| Transformer (base) | 27.3 | 38.1 | 3.3 × 10¹⁸ |
| Transformer (big) | 28.4 | 41.8 | 2.3 × 10¹⁹ |
The transformer big model exceeded all prior ensembles as a single model with less total compute.
Ablation Study Results (Table 3 — Selected Rows)
| Configuration | WMT EN-DE BLEU | Notes |
|---|---|---|
| Full base model (N=6, h=8, d_k=64) | 25.8 | Reference |
| Single head (h=1, d_k=512) | 24.9 | −0.9 BLEU |
| 16 heads (h=16, d_k=32) | 25.8 | Matches base |
| Learned positional encoding | 25.7 | Nearly identical to sinusoidal |
| No dropout | 24.6 | −1.2 BLEU |
| d_k = 16 (vs 64) | 25.1 | −0.7 BLEU |
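The sinusoidal encoding tested in the ablations is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A short NumPy sketch (function name illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1) position indices
    i = np.arange(0, d_model, 2)[None, :]      # even embedding dimensions
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # sines on even dims
    pe[:, 1::2] = np.cos(angle)                # cosines on odd dims
    return pe

pe = sinusoidal_positional_encoding(50, 512)
print(pe.shape)  # (50, 512)
```

The fixed sinusoids performed essentially the same as learned embeddings, so the authors kept them for their ability to extrapolate to sequence lengths beyond those seen in training.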
Training Configuration
| Hyperparameter | Base Model | Big Model |
|---|---|---|
| Optimizer | Adam | Adam |
| β₁ | 0.9 | 0.9 |
| β₂ | 0.98 | 0.98 |
| ε | 10⁻⁹ | 10⁻⁹ |
| Warmup steps | 4,000 | 4,000 |
| Learning rate formula | d_model^{−0.5} · min(step^{−0.5}, step · warmup_steps^{−1.5}) | same |
| Dropout | 0.1 | 0.3 |
| Label smoothing | 0.1 | 0.1 |
| Training steps | 100,000 | 300,000 |
The warmup schedule increases the learning rate linearly for the first warmup_steps steps, then decays it proportionally to the inverse square root of the step number — a specific choice validated by the authors.
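The schedule fits in a few lines; this sketch uses the base-model defaults from the table (function name illustrative):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Ramps linearly up to step 4,000, then decays as 1/sqrt(step):
peak = transformer_lr(4000)      # ~7.0e-4 for the base model
assert transformer_lr(2000) < peak
assert transformer_lr(100_000) < peak
```

The min() selects the linear ramp (step · warmup_steps^{−1.5}) before step 4,000 and the inverse-square-root decay (step^{−0.5}) afterward; the two branches meet exactly at step = warmup_steps.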
Related Pages
See transformer-architecture for a detailed walkthrough of the model dimensions, scaling-laws for how the principles established here were extended to larger models, and pre-training for how the transformer paradigm was extended to self-supervised training.
Sources
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
- Wu et al. (2016) — Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv
- Bahdanau et al. (2015) — Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015
Frequently Asked Questions
What was the key innovation of 'Attention Is All You Need'?
The paper eliminated recurrence and convolutions entirely, building a sequence-to-sequence model using only attention mechanisms. This enabled full parallelization during training (unlike RNNs which process tokens sequentially), dramatically reducing training time. The multi-head self-attention mechanism allowed each position to directly attend to all other positions in O(1) operations, solving the long-range dependency problem that plagued LSTMs.
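The "multi-head" part is a reshape: the d_model-wide representation is split into h subspaces of width d_k = d_model / h, and attention runs independently in each. A shape-only NumPy sketch of that split (helper name illustrative; learned projection matrices omitted):

```python
import numpy as np

def split_heads(x, h):
    """Reshape (seq, d_model) -> (h, seq, d_k) so each head attends in its own d_k-dim subspace."""
    seq, d_model = x.shape
    d_k = d_model // h  # base model: 512 // 8 = 64
    return x.reshape(seq, h, d_k).transpose(1, 0, 2)

x = np.zeros((10, 512))       # 10 positions, d_model = 512
heads = split_heads(x, h=8)   # base model uses h = 8 heads
print(heads.shape)            # (8, 10, 64)
```

Because h · d_k = d_model, the total compute matches single-head attention over the full width, while each head can specialize in a different relation between positions.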
How did the transformer compare to prior LSTM-based systems?
The transformer big model achieved 28.4 BLEU on WMT EN-DE, compared to the prior best ensemble model (GNMT+RL) at 26.30 BLEU — a 2.1 BLEU improvement. More significantly, it achieved this in 3.5 days of training (2.3×10¹⁹ FLOPs) whereas GNMT required weeks. The base transformer (27.3 BLEU, 12 hours, 3.3×10¹⁸ FLOPs) already outperformed most prior single models.
What architectural choices were validated in the paper's ablations?
Table 3 of the paper systematically ablated: number of attention heads (optimal 8), key dimension d_k (smaller hurts more than larger), dropout (0.1 optimal), positional encoding type (learned vs sinusoidal equivalent), and residual dropout. The ablations confirmed that multiple attention heads and the specific scaling of d_k are critical design choices, not arbitrary hyperparameters.