Attention Heads: Specialization, Pruning, and What Different Heads Learn

Category: architecture · Updated: 2026-02-27

Trained transformer attention heads specialize: positional heads track adjacent tokens, syntactic heads model grammatical dependencies, and semantic heads capture coreference. Voita et al. (2019) pruned 48 of 96 heads in a full encoder-decoder model with under 0.1 BLEU loss.

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Total heads in base transformer encoder | 48 | heads | 8 heads × 6 encoder layers = 48 encoder attention heads |
| Heads prunable with <0.1 BLEU loss | ~48 of 96 | heads | Voita et al. (2019): 50% of heads prunable in the 6-layer encoder-decoder; only a few are critical |
| Critical head types identified | 4 | categories | Positional, syntactic, rare-word, and semantic heads, each with measurable behavior |
| Michel et al. single-head performance | −17.7% | BLEU, relative | Reducing all layers to 1 head on WMT EN-DE; most individual head removals cost <0.5 BLEU |
| Head importance score (gradient-based) | I_h = E[\|∂L/∂A_h\|] | — | Importance approximated by the expected absolute gradient of the loss w.r.t. the head's attention weights |

The transformer’s multi-head attention mechanism runs h parallel attention operations. While designed to allow the model to attend to different subspaces, empirical research has revealed that individual heads specialize into functionally distinct roles in trained models — and that many heads contribute minimally to final performance.
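The parallel-subspace structure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model: the projection weights are random placeholders, and shapes are chosen small for readability.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, n_heads, rng):
    """Run n_heads parallel scaled-dot-product attentions over x.

    x: (seq_len, d_model); each head operates in a d_model // n_heads subspace.
    Projection weights are random placeholders standing in for learned ones.
    """
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    outputs = []
    for _ in range(n_heads):
        # Per-head projections (hypothetical random weights, not trained).
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_head))   # (seq_len, seq_len) attention weights
        outputs.append(A @ V)                    # this head's (seq_len, d_head) output
    return np.concatenate(outputs, axis=-1)      # concat back to (seq_len, d_model)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))
out = multi_head_attention(x, n_heads=8, rng=rng)
print(out.shape)  # (5, 16)
```

Each head sees the full input but produces its own attention pattern A, which is where the specialization studied below becomes observable.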

Types of Attention Heads

Voita et al. (2019) analyzed trained WMT EN-RU transformer models and found four functional head categories:

| Head Type | Behavior | Proportion |
|---|---|---|
| Positional | Attends to a fixed relative position (e.g., the immediately preceding token) | ~15–20% |
| Syntactic | Tracks specific grammatical dependency relations | ~10–15% |
| Rare-word | Concentrates attention on low-frequency tokens | ~10% |
| Semantic | Captures coreference, entity consistency, semantic relatedness | ~10–15% |
| Background/redundant | No clear pattern; distributes attention broadly | ~50% |

The majority of heads fall into the “background” category — they do not show interpretable specialization patterns and are the primary targets for pruning.

Head Pruning Results

Michel et al. (2019) tested progressive head removal across layers on WMT EN-DE (NeurIPS 2019):

| Heads Retained | BLEU | Notes |
|---|---|---|
| All 48 encoder heads | 28.0 | Baseline |
| 32 heads (33% pruned) | 27.9 | Negligible loss |
| 16 heads (67% pruned) | 27.4 | −0.6 BLEU |
| 8 heads (83% pruned) | 26.2 | −1.8 BLEU |
| 6 heads (random single per layer) | 23.1 | −4.9 BLEU |
| 1 head across all layers | 23.1 | −4.9 BLEU |

The practical takeaway: only a small subset of heads is truly essential. Retaining the top 8 most important heads (by gradient score) costs only 1.8 BLEU while eliminating 83% of attention heads.
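The top-k selection step can be sketched as follows. This assumes per-head importance scores are already computed (e.g., by the gradient metric in the next section); the layer/head shapes and the toy scores are illustrative, not from either paper.

```python
import numpy as np

def prune_heads(importance, k):
    """Return a boolean keep-mask retaining the k highest-importance heads.

    importance: (n_layers, n_heads) array of per-head scores, e.g. the
    gradient-based I_{h,l}; heads outside the global top-k are masked out.
    """
    flat = importance.ravel()
    keep = np.zeros_like(flat, dtype=bool)
    keep[np.argsort(flat)[-k:]] = True   # global top-k across all layers
    return keep.reshape(importance.shape)

# Toy scores for a 6-layer, 8-head encoder (48 heads total).
rng = np.random.default_rng(1)
scores = rng.random((6, 8))
mask = prune_heads(scores, k=8)          # keep only the 8 most important heads
print(mask.sum())  # 8
```

At inference time, a masked head's output is simply zeroed (or replaced by a uniform distribution) before the output projection, which is what makes pruning cheap to apply.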

Head Importance Scoring

Gradient-based importance for head h in layer l:

I_{h,l} = E_{x∼data}[ |∂L/∂A_{h,l}| ]

where A_{h,l} is the attention weight matrix for head h in layer l. Heads with near-zero gradients contribute little to the loss — removing them has minimal effect. This metric can be computed on a validation set in a single forward-backward pass.
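Computing I_{h,l} exactly requires autograd for ∂L/∂A. As an autograd-free proxy with the same goal (ranking heads by their effect on the loss), the following NumPy sketch ablates one head at a time and records the loss increase. The shapes, the random tensors, and the toy MSE loss are all hypothetical stand-ins, not the metric from the papers.

```python
import numpy as np

def head_ablation_importance(head_outputs, W_o, target):
    """Ablation proxy for head importance: loss increase when a head is zeroed.

    head_outputs: (n_heads, seq_len, d_head) per-head attention outputs
    W_o:          (n_heads * d_head, d_model) output projection
    target:       (seq_len, d_model) regression target for a toy MSE loss
    """
    n_heads, seq_len, d_head = head_outputs.shape

    def loss(mask):
        # Zero masked heads, concatenate, project, compare to the target.
        kept = head_outputs * mask[:, None, None]
        y = kept.transpose(1, 0, 2).reshape(seq_len, -1) @ W_o
        return np.mean((y - target) ** 2)

    base = loss(np.ones(n_heads))
    importance = np.empty(n_heads)
    for h in range(n_heads):
        mask = np.ones(n_heads)
        mask[h] = 0.0                           # ablate head h only
        importance[h] = loss(mask) - base       # importance = loss increase
    return importance

rng = np.random.default_rng(2)
outs = rng.standard_normal((8, 5, 4))
W_o = rng.standard_normal((32, 16))
tgt = rng.standard_normal((5, 16))
scores = head_ablation_importance(outs, W_o, tgt)
print(scores.shape)  # (8,)
```

The gradient-based score is cheaper in practice (one forward-backward pass versus n_heads forward passes), which is why it is the standard choice on large models.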

Cross-Layer Patterns

Different layers develop different head specialization profiles. Early encoder layers tend to contain more positional heads; middle layers develop syntactic heads; later layers develop more semantic heads. This layered functional organization mirrors theories of hierarchical representation in neural networks.

See multi-head-attention for the parameter structure that enables this specialization, and self-attention-mechanism for the core computation each head performs.



Frequently Asked Questions

Do all attention heads learn the same thing?

No. Research has documented systematic specialization. Voita et al. (2019) identified at least four distinct head types in trained models: positional heads (attend to fixed offsets like +1 or -1 from the current token), syntactic heads (track specific grammatical dependencies like subject-verb), rare-word heads (focus attention on low-frequency tokens), and semantic heads (capture coreference, entity type, and semantic similarity).

What fraction of attention heads are actually necessary?

Surprisingly few. Michel et al. (2019) showed that on WMT EN-DE translation, 20/48 encoder heads can be removed with less than 1% BLEU degradation. Voita et al. (2019) pruned 48 of 96 heads across the full encoder-decoder model (50%) with under 0.1 BLEU loss. The remaining critical heads show high gradient-based importance scores and specific attention patterns.

How are important attention heads identified?

The most common method is gradient-based importance scoring: I_h = E[|∂L/∂A_h|], the expected absolute gradient of the loss with respect to the attention weights of head h. Heads with low importance can be 'pruned' by masking them out. At inference time, pruned heads are replaced by a uniform or zero attention distribution, reducing computation while preserving most model quality.
