Multi-Head Attention: Projection Matrices, Parameter Count, and Head Ablations

Category: architecture · Updated: 2026-02-27

Multi-head attention uses h=8 heads with d_k=64 each; the base transformer's attention block contains ~1.05M parameters; ablations show 8 heads achieves 25.8 BLEU vs 24.9 for a single head (Vaswani et al., 2017).

Key Data Points
Measure | Value | Unit | Notes
Number of heads (base model) | 8 | heads | d_k = d_v = d_model/h = 512/8 = 64
d_k per head | 64 | dimensions | Each of the 8 heads projects into a 64-dimensional subspace
Parameters per W_Q / W_K / W_V | 512 × 64 = 32,768 | parameters | Per head; all 8 heads together: 3 × 8 × 32,768 = 786,432
W_O projection parameters | 512 × 512 = 262,144 | parameters | Output projection mapping the concatenated 512-dim vector back to d_model = 512
Total attention block parameters | 1,048,576 | parameters | 786,432 (input projections) + 262,144 (output projection)
BLEU, 1 head vs 8 heads | 24.9 vs 25.8 | BLEU | WMT EN-DE; single head is 0.9 BLEU worse; ablation from Table 3, row A
BLEU, 16 heads at d_model = 512 | 25.1 | BLEU | 0.7 BLEU below the 8-head optimum; too many narrow heads hurt performance

Multi-head attention wraps the scaled dot-product attention mechanism by running h parallel attention functions on learned linear projections of the inputs, then concatenating and reprojecting the results. Proposed by Vaswani et al. in “Attention Is All You Need” (NeurIPS 2017), it allows the model to attend simultaneously to information from different representation subspaces.

The Formula

MultiHead(Q, K, V) = Concat(head₁, …, headₕ) · W_O

where headᵢ = Attention(Q·W_Qᵢ, K·W_Kᵢ, V·W_Vᵢ)

The projection matrices for each head i are:

  • W_Qᵢ ∈ ℝ^{d_model × d_k} = ℝ^{512 × 64}
  • W_Kᵢ ∈ ℝ^{d_model × d_k} = ℝ^{512 × 64}
  • W_Vᵢ ∈ ℝ^{d_model × d_v} = ℝ^{512 × 64}
  • W_O ∈ ℝ^{h·d_v × d_model} = ℝ^{512 × 512}

Parameter Count Per Attention Block

Component | Shape | Parameters
W_Q (all heads) | 8 × (512 × 64) | 262,144
W_K (all heads) | 8 × (512 × 64) | 262,144
W_V (all heads) | 8 × (512 × 64) | 262,144
W_O (output projection) | 512 × 512 | 262,144
Total | | 1,048,576
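The arithmetic behind the table can be checked in a few lines:

```python
d_model, h = 512, 8
d_k = d_v = d_model // h                 # 64 dimensions per head

input_proj = 3 * h * d_model * d_k       # W_Q, W_K, W_V across all 8 heads
output_proj = (h * d_v) * d_model        # W_O: 512 x 512
total = input_proj + output_proj

print(input_proj, output_proj, total)    # 786432 262144 1048576
```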

Base vs Big Model Head Configuration

Hyperparameter | Base Model | Big Model
d_model | 512 | 1024
Heads (h) | 8 | 16
d_k = d_v | 64 | 64
Encoder layers | 6 | 6
Decoder layers | 6 | 6
Dropout | 0.1 | 0.3
Total parameters | 65M | 213M
WMT EN-DE BLEU | 27.3 | 28.4

Note: the big model uses 16 heads but keeps d_k=64 by doubling d_model to 1024. This means more distinct subspaces rather than wider projections per head.
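Because d_k = d_model / h, this invariant is easy to verify for both configurations:

```python
# Head width d_k = d_model / h for the two published configurations
configs = {"base": (512, 8), "big": (1024, 16)}
for name, (d_model, h) in configs.items():
    d_k = d_model // h
    print(f"{name}: h={h}, d_k={d_k}")   # d_k is 64 in both cases
```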

Head Count Ablation (Table 3, Row A — Vaswani et al.)

The following results hold d_model=512 fixed while varying the number of heads, keeping total computation constant by adjusting d_k accordingly:

Heads | d_k | WMT EN-DE BLEU
1 | 512 | 24.9
4 | 128 | 25.5
8 | 64 | 25.8
16 | 32 | 25.1
32 | 16 | 25.4

The 8-head configuration is optimal. Single-head attention is 0.9 BLEU worse, and configurations with many narrow heads also underperform, likely because 32 or fewer dimensions per head are insufficient to learn meaningful projections.

What Different Heads Learn

Research by Voita et al. (2019) at ACL found that in a trained model, most attention heads are prunable with minimal performance loss, but a small set of specialized heads perform distinct functions: positional heads attend to adjacent tokens, syntactic heads track specific grammatical dependencies, and rare-word heads focus on low-frequency tokens. This functional specialization is what multiple heads enable.

See self-attention-mechanism for the dot-product attention formula inside each head, transformer-architecture for how this block sits within the full model, and feed-forward-layers for the other major parameter block in each transformer layer.


Frequently Asked Questions

Why use multiple attention heads instead of one large attention operation?

Multiple heads allow the model to jointly attend to information from different representation subspaces at different positions. A single head averages all this information, losing the ability to specialize. With h=8 heads, each head can learn to track different syntactic or semantic relationships simultaneously.

How does multi-head attention keep the total computation constant?

Each head operates on d_k = d_model/h dimensions, so the per-head computation is reduced proportionally. Running h heads at d_k = 64 each involves the same total floating-point operations as a single head at d_k = 512, while enabling richer, parallel representations.
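As a rough sanity check, the dominant multiply-add counts match exactly whether the budget is spent on one wide head or h narrow ones (counting only the input projections and the QK^T score computation; softmax and other terms are ignored):

```python
seq_len, d_model, h = 128, 512, 8
d_k = d_model // h

# Query/key/value projection cost: identical either way
proj_wide   = 3 * seq_len * d_model * d_model   # one head at d_k = 512
proj_narrow = 3 * h * seq_len * d_model * d_k   # 8 heads at d_k = 64
assert proj_wide == proj_narrow

# QK^T attention-score cost: also identical
scores_wide   = seq_len * seq_len * d_model
scores_narrow = h * seq_len * seq_len * d_k
assert scores_wide == scores_narrow
```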

What did ablation studies show about the optimal number of heads?

Vaswani et al. (2017) found in Table 3 that 8 heads achieves 25.8 BLEU on WMT EN-DE; single-head attention scores 24.9 BLEU (−0.9), 4 heads scores 25.5 BLEU, 16 heads scores 25.1 BLEU, and 32 heads scores 25.4 BLEU. Performance degrades at both extremes, suggesting 8 heads is the practical optimum for the base model.
