Attention Is All You Need: The Transformer Paper — Key Results and Impact
Vaswani et al. (NeurIPS 2017) introduced the transformer architecture. The 213M-parameter big model achieved 28.4 BLEU on WMT EN-DE, surpassing all prior models including ensembles, after 3.5 days of training on 8 P100 GPUs; the 65M-parameter base model reached 27.3 BLEU in just 12 hours.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| WMT EN-DE BLEU (transformer big) | 28.4 | BLEU | State of the art at publication; surpassed all prior single-model and ensemble results |
| WMT EN-FR BLEU (transformer big) | 41.8 | BLEU | Trained on 36M sentence pairs; outperformed all prior models |
| Base model training time | 12 | hours | 100K steps on 8 × NVIDIA P100 GPUs; big model trained for 3.5 days |
| Training cost (base model) | 3.3 × 10¹⁸ | FLOPs | Big model: 2.3 × 10¹⁹ FLOPs; dramatically less than prior LSTM-based systems |
| Previous SOTA (GNMT+RL ensemble) | 26.30 | BLEU | Wu et al. (2016) Google NMT ensemble on EN-DE; the transformer exceeded this with a single model |
| Paper citation count | 130,000+ | citations | As of 2025; one of the most-cited machine learning papers |
“Attention Is All You Need” by Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin (NeurIPS 2017) introduced the transformer architecture and demonstrated that recurrence could be entirely eliminated from sequence modeling. The paper is foundational to all subsequent large language model development.
Core Contribution
Prior state-of-the-art neural machine translation used encoder-decoder architectures with LSTMs augmented by attention (Bahdanau et al., 2015; Wu et al., 2016). These models processed tokens sequentially — each hidden state depended on the previous — preventing training parallelization.
The transformer replaced all recurrent components with self-attention, enabling:
- Full parallelization across sequence positions during training
- Direct long-range connections between any two positions in O(1) steps
- Significantly faster training — 3.5 days vs weeks for comparable LSTM systems
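The mechanism enabling this parallelism is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V (Eq. 1 of the paper). A minimal NumPy sketch (function name illustrative, no masking or learned projections) shows how every position's output is computed in one matrix product rather than a sequential scan:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Eq. 1 of the paper)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_q, seq_k) similarity matrix, all pairs at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # each output position mixes information from every position

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 64))                 # 5 positions, d_k = 64
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(out.shape)                             # (5, 64)
```

Because the whole sequence is processed as one matrix multiplication, no timestep waits on the previous one, which is exactly what RNNs could not avoid.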
Benchmark Results (Table 2 — Vaswani et al.)
| Model | WMT EN-DE BLEU | WMT EN-FR BLEU | Training Cost (FLOPs) |
|---|---|---|---|
| GNMT+RL (ensemble, 2016) | 26.30 | 41.16 | ~10²⁰ |
| ConvS2S (ensemble, 2017) | 26.36 | 41.29 | — |
| Transformer (base) | 27.3 | 38.1 | 3.3 × 10¹⁸ |
| Transformer (big) | 28.4 | 41.8 | 2.3 × 10¹⁹ |
The transformer big model exceeded all prior ensembles as a single model with less total compute.
Ablation Study Results (Table 3 — Selected Rows)
| Configuration | WMT EN-DE BLEU | Notes |
|---|---|---|
| Full base model (N=6, h=8, d_k=64) | 25.8 | Reference |
| Single head (h=1, d_k=512) | 24.9 | −0.9 BLEU |
| 16 heads (h=16, d_k=32) | 25.8 | Matches base |
| Learned positional encoding | 25.7 | Nearly identical to sinusoidal |
| No dropout | 24.6 | −1.2 BLEU |
| d_k = 16 (vs 64) | 25.1 | −0.7 BLEU |
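The sinusoidal encoding tested in the ablations is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A short NumPy sketch (function name illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1) position indices
    i = np.arange(0, d_model, 2)[None, :]      # even embedding dimensions
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # sines on even dims
    pe[:, 1::2] = np.cos(angle)                # cosines on odd dims
    return pe

pe = sinusoidal_positional_encoding(50, 512)
print(pe.shape)  # (50, 512)
```

The fixed sinusoids performed essentially the same as learned embeddings, so the authors kept them for their ability to extrapolate to sequence lengths beyond those seen in training.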
Training Configuration
| Hyperparameter | Base Model | Big Model |
|---|---|---|
| Optimizer | Adam | Adam |
| β₁ | 0.9 | 0.9 |
| β₂ | 0.98 | 0.98 |
| ε | 10⁻⁹ | 10⁻⁹ |
| Warmup steps | 4,000 | 4,000 |
| Learning rate formula | d_model^{−0.5} · min(step^{−0.5}, step · warmup_steps^{−1.5}) | same |
| Dropout | 0.1 | 0.3 |
| Label smoothing | 0.1 | 0.1 |
| Training steps | 100,000 | 300,000 |
The warmup schedule increases the learning rate linearly for the first warmup_steps steps, then decays it proportionally to the inverse square root of the step number — a specific choice validated by the authors.
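The schedule fits in a few lines; this sketch uses the base-model defaults from the table (function name illustrative):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Ramps linearly up to step 4,000, then decays as 1/sqrt(step):
peak = transformer_lr(4000)      # ~7.0e-4 for the base model
assert transformer_lr(2000) < peak
assert transformer_lr(100_000) < peak
```

The min() selects the linear ramp (step · warmup_steps^{−1.5}) before step 4,000 and the inverse-square-root decay (step^{−0.5}) afterward; the two branches meet exactly at step = warmup_steps.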
Related Pages
See transformer-architecture for a detailed walkthrough of the model dimensions, scaling-laws for how the principles established here were extended to larger models, and pre-training for how the transformer paradigm was extended to self-supervised training.
Sources
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
- Wu et al. (2016) — Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv
- Bahdanau et al. (2015) — Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015
Frequently Asked Questions
What was the key innovation of 'Attention Is All You Need'?
The paper eliminated recurrence and convolutions entirely, building a sequence-to-sequence model using only attention mechanisms. This enabled full parallelization during training (unlike RNNs which process tokens sequentially), dramatically reducing training time. The multi-head self-attention mechanism allowed each position to directly attend to all other positions in O(1) operations, solving the long-range dependency problem that plagued LSTMs.
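The "multi-head" part is a reshape: the d_model-wide representation is split into h subspaces of width d_k = d_model / h, and attention runs independently in each. A shape-only NumPy sketch of that split (helper name illustrative; learned projection matrices omitted):

```python
import numpy as np

def split_heads(x, h):
    """Reshape (seq, d_model) -> (h, seq, d_k) so each head attends in its own d_k-dim subspace."""
    seq, d_model = x.shape
    d_k = d_model // h  # base model: 512 // 8 = 64
    return x.reshape(seq, h, d_k).transpose(1, 0, 2)

x = np.zeros((10, 512))       # 10 positions, d_model = 512
heads = split_heads(x, h=8)   # base model uses h = 8 heads
print(heads.shape)            # (8, 10, 64)
```

Because h · d_k = d_model, the total compute matches single-head attention over the full width, while each head can specialize in a different relation between positions.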
How did the transformer compare to prior LSTM-based systems?
The transformer big model achieved 28.4 BLEU on WMT EN-DE, compared to the prior best ensemble model (GNMT+RL) at 26.30 BLEU — a 2.1 BLEU improvement. More significantly, it achieved this in 3.5 days of training (2.3×10¹⁹ FLOPs) whereas GNMT required weeks. The base transformer (27.3 BLEU, 12 hours, 3.3×10¹⁸ FLOPs) already outperformed most prior single models.
What architectural choices were validated in the paper's ablations?
Table 3 of the paper systematically ablated: number of attention heads (optimal 8), key dimension d_k (smaller hurts more than larger), dropout (0.1 optimal), positional encoding type (learned vs sinusoidal equivalent), and residual dropout. The ablations confirmed that multiple attention heads and the specific scaling of d_k are critical design choices, not arbitrary hyperparameters.