Attention Heads: Specialization, Pruning, and What Different Heads Learn
Trained transformer attention heads specialize: positional heads track adjacent tokens, syntactic heads model grammatical dependencies, and semantic heads capture coreference; Voita et al. (2019) pruned 48 of 96 heads across a full encoder-decoder model with <0.1 BLEU loss.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Total heads in base transformer encoder | 48 | heads | 8 heads × 6 encoder layers = 48 encoder attention heads |
| Heads prunable with <0.1 BLEU loss | ~48 of 96 | heads | Voita et al. (2019): 50% of heads prunable in 6-layer encoder-decoder; only a few are critical |
| Critical head types identified | 4 | categories | Positional, syntactic, rare-word, and semantic heads — each with measurable behavior |
| Michel et al. single-head performance | BLEU drops 17.7% | relative | Reducing all layers to 1 head on WMT EN-DE; most individual head removals cost <0.5 BLEU |
| Head importance score (gradient-based) | I_h = E[\|∂L/∂A_h\|] | — | Importance approximated by expected absolute gradient of the loss w.r.t. head h's attention weights |
The transformer’s multi-head attention mechanism runs h parallel attention operations. While designed to allow the model to attend to different subspaces, empirical research has revealed that individual heads specialize into functionally distinct roles in trained models — and that many heads contribute minimally to final performance.
Types of Attention Heads
Voita et al. (2019) analyzed trained WMT EN-RU transformer models and found four functional head categories:
| Head Type | Behavior | Proportion |
|---|---|---|
| Positional | Attends to fixed relative position (e.g., immediately preceding token) | ~15–20% |
| Syntactic | Tracks specific grammatical dependency relations | ~10–15% |
| Rare-word | Concentrates attention on low-frequency tokens | ~10% |
| Semantic | Captures coreference, entity consistency, semantic relatedness | ~10–15% |
| Background/redundant | No clear pattern; distributes broadly | ~50% |
The majority of heads fall into the “background” category — they do not show interpretable specialization patterns and are the primary targets for pruning.
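As a sketch of how such specialization can be detected automatically, the snippet below scores a head as "positional" by the fraction of attention mass it places on a fixed relative offset — a simple proxy for the criterion used in the paper. The function name, toy matrices, and threshold are illustrative, not taken from Voita et al.:

```python
import numpy as np

def positional_score(attn, offset=-1):
    """Fraction of attention mass a head places on a fixed relative
    offset (offset=-1 means the immediately preceding token).

    attn: (seq_len, seq_len) row-stochastic attention matrix for one head.
    """
    n = attn.shape[0]
    mass, count = 0.0, 0
    for q in range(n):
        k = q + offset
        if 0 <= k < n:          # skip queries where the offset falls off the edge
            mass += attn[q, k]
            count += 1
    return mass / count

# A head that always attends to the previous token scores ~1.0.
n = 6
prev_token_head = np.zeros((n, n))
prev_token_head[0, 0] = 1.0      # first token has no predecessor; attend to self
for q in range(1, n):
    prev_token_head[q, q - 1] = 1.0

# A head with no positional preference scores ~1/seq_len.
uniform_head = np.full((n, n), 1.0 / n)

print(positional_score(prev_token_head))  # → 1.0
print(positional_score(uniform_head))     # → ~0.167
```

Analogous diagnostics (e.g., overlap of the head's max-attention targets with dependency arcs, or attention mass on low-frequency tokens) can flag syntactic and rare-word heads.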
Head Pruning Results
Michel et al. (2019) tested progressive head removal across layers on WMT EN-DE (NeurIPS 2019):
| Heads Retained | BLEU | Notes |
|---|---|---|
| All 48 encoder heads | 28.0 | Baseline |
| 32 heads (33% pruned) | 27.9 | Negligible loss |
| 16 heads (67% pruned) | 27.4 | −0.6 BLEU |
| 8 heads (83% pruned) | 26.2 | −1.8 BLEU |
| 6 heads (1 per layer) | 23.1 | −4.9 BLEU (−17.7% relative) |
The practical takeaway: only a small subset of heads is truly essential. Retaining the eight most important heads (ranked by gradient-based importance) costs only 1.8 BLEU while eliminating 83% of attention heads.
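Mechanically, pruning a head at inference amounts to zeroing its contribution before the heads are concatenated and passed to the output projection. A minimal numpy sketch, with hypothetical names (`combine_heads`, `keep_mask`) standing in for a real implementation:

```python
import numpy as np

def combine_heads(head_outputs, keep_mask):
    """Concatenate per-head outputs, zeroing out pruned heads.

    head_outputs: (n_heads, seq_len, d_head)
    keep_mask:    (n_heads,), 1.0 = keep, 0.0 = pruned
    """
    gated = head_outputs * keep_mask[:, None, None]
    # Concatenate along the feature dimension — the shape the output
    # projection W_O expects; pruned heads contribute all-zero columns.
    return np.concatenate(list(gated), axis=-1)

rng = np.random.default_rng(0)
heads = rng.normal(size=(8, 4, 16))   # 8 heads, seq_len 4, d_head 16

keep = np.ones(8)
keep[[2, 5]] = 0.0                    # prune heads 2 and 5

out = combine_heads(heads, keep)
print(out.shape)                      # → (4, 128)
```

Because the zeroed columns still flow through W_O, this masking reproduces the ablation setup; an actual speedup additionally requires slicing the pruned heads' parameters out of the projection matrices.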
Head Importance Scoring
Gradient-based importance for head h in layer l:
I_{h,l} = E_{x∼data}[ |∂L/∂A_{h,l}| ]
where A_{h,l} is the attention weight matrix for head h in layer l. Heads with near-zero gradients contribute little to the loss — removing them has minimal effect. This metric can be computed on a validation set in a single forward-backward pass.
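The finite-difference sketch below illustrates the idea on a toy model, in the spirit of the gate-based formulation of Michel et al. (2019): each head's contribution is multiplied by a gate ξ_h, and a head's importance is the magnitude of the loss's sensitivity to its gate. The toy "model" and all names here are illustrative, not the papers' code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a model: the loss depends on a gated sum of fixed
# per-head contributions (xi = 1.0 means the head is fully active).
head_contrib = rng.normal(size=8)
target = 2.0

def loss(gates):
    pred = np.dot(gates, head_contrib)   # gated sum of head outputs
    return (pred - target) ** 2

def head_importance(gates, eps=1e-5):
    """Finite-difference estimate of |dL/d xi_h| per head gate —
    a numerical proxy for the gradient-based importance score."""
    base = loss(gates)
    scores = np.empty_like(gates)
    for h in range(len(gates)):
        bumped = gates.copy()
        bumped[h] += eps
        scores[h] = abs((loss(bumped) - base) / eps)
    return scores

scores = head_importance(np.ones(8))
order = np.argsort(scores)   # least important heads first: prune from the front
print(order)
```

In a real model, the same score comes from autograd in one forward-backward pass over a validation batch, with no per-head perturbation loop.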
Cross-Layer Patterns
Different layers develop different head specialization profiles. Early encoder layers tend to contain more positional heads; middle layers develop syntactic heads; later layers develop more semantic heads. This layered functional organization mirrors theories of hierarchical representation in neural networks.
Related Pages
See multi-head-attention for the parameter structure that enables this specialization, and self-attention-mechanism for the core computation each head performs.
Sources
- Voita et al. (2019) — Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting. ACL 2019
- Michel et al. (2019) — Are Sixteen Heads Really Better than One? NeurIPS 2019
- Vig & Belinkov (2019) — Analyzing the Structure of Attention in a Transformer Language Model. BlackboxNLP 2019
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
Frequently Asked Questions
Do all attention heads learn the same thing?
No. Research has documented systematic specialization. Voita et al. (2019) identified at least four distinct head types in trained models: positional heads (attend to fixed offsets like +1 or -1 from the current token), syntactic heads (track specific grammatical dependencies like subject-verb), rare-word heads (focus attention on low-frequency tokens), and semantic heads (capture coreference, entity type, and semantic similarity).
What fraction of attention heads are actually necessary?
Surprisingly few. Michel et al. (2019) showed that on WMT EN-DE translation, 20/48 encoder heads can be removed with less than 1% BLEU degradation. Voita et al. (2019) pruned 48 of 96 heads across the full encoder-decoder model (50%) with under 0.1 BLEU loss. The remaining critical heads show high gradient-based importance scores and specific attention patterns.
How are important attention heads identified?
The most common method is gradient-based importance scoring: I_h = E[|∂L/∂A_h|], the expected absolute gradient of the loss with respect to the attention weights of head h. Heads with low importance can be 'pruned' by masking them out. At inference time, pruned heads are replaced by a uniform or zero attention distribution, reducing computation while preserving most model quality.