Scaled Dot-Product Attention: Formula, Complexity, and the √d_k Scaling Factor
Scaled dot-product attention computes softmax(Q·Kᵀ/√d_k)·V, dividing the dot products by √d_k (√64 = 8 in the base model) to keep the softmax out of its near-zero-gradient regime; time complexity is O(n²·d), quadratic in sequence length n (Vaswani et al., 2017).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Attention formula | softmax(Q·Kᵀ/√d_k)·V | — | Scaled dot-product; scaling by √d_k prevents softmax saturation |
| d_k (key/query dimension) | 64 | dimensions | d_model/h = 512/8 = 64; √64 = 8 is the scaling divisor |
| Time complexity | O(n²·d) | — | Quadratic in sequence length n; bottleneck for long sequences |
| Memory complexity | O(n²) | — | Attention matrix is n×n; a 1024-token sequence needs ~4 MB per head at float32 |
| Sequential operations | O(1) | — | All token interactions computed in parallel, vs O(n) for RNNs |
| Max path length | O(1) | — | Any token pair connected in one operation; RNNs require O(n) steps |
Scaled dot-product attention is the core computational primitive of the transformer. Introduced by Vaswani et al. in “Attention Is All You Need” (NeurIPS 2017), it computes a weighted combination of value vectors, where the weights come from the compatibility between query and key vectors.
The Formula
The attention function maps a set of queries Q, keys K, and values V to an output:
Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V
The queries, keys, and values are matrices: Q and K have dimension n×d_k, and V has dimension n×d_v. In the base transformer model, d_k = d_v = 64 and d_model = 512.
The divisor √d_k (√64 = 8 in the base model) is the scaling factor. Without it, the dot products in Q·Kᵀ grow in magnitude as d_k increases, driving the softmax into saturated regions where its gradients are vanishingly small.
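The saturation effect is easy to see numerically: comparing softmax outputs on unscaled versus scaled scores shows the unscaled version collapsing nearly all its mass onto one key. A minimal NumPy sketch (the number of keys and the random seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64

# A query and 10 keys with zero-mean, unit-variance components.
q = rng.standard_normal(d_k)
K = rng.standard_normal((10, d_k))

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

unscaled = softmax(q @ K.T)                # logits have std ~ sqrt(64) = 8
scaled = softmax(q @ K.T / np.sqrt(d_k))   # logits have std ~ 1

# The unscaled distribution is far more peaked; a near-one-hot
# softmax output means near-zero gradients through the softmax.
print(unscaled.max(), scaled.max())
```

Because multiplying non-constant logits by a factor greater than 1 always sharpens the softmax, `unscaled.max()` exceeds `scaled.max()` for any draw.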
Complexity Comparison
| Operation | Time Complexity | Memory Complexity | Sequential Ops | Max Path Length |
|---|---|---|---|---|
| Self-Attention | O(n²·d) | O(n²) | O(1) | O(1) |
| Recurrent (RNN) | O(n·d²) | O(d) | O(n) | O(n) |
| Convolutional (k=kernel) | O(k·n·d²) | O(k·n·d) | O(1) | O(log_k n) |
| Self-Attn (restricted) | O(r·n·d) | O(r·n) | O(1) | O(n/r) |
The quadratic memory in n is the defining constraint: a 4096-token sequence requires 256× more memory than a 256-token sequence in the attention matrix alone.
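The 256× figure follows directly from the quadratic growth of the n×n matrix. A back-of-envelope helper (illustrative; float32 = 4 bytes per entry):

```python
def attn_matrix_bytes(n, bytes_per_float=4):
    """Memory for one head's n x n attention matrix at float32."""
    return n * n * bytes_per_float

print(attn_matrix_bytes(256))    # 262144 bytes (~0.25 MB)
print(attn_matrix_bytes(4096))   # 67108864 bytes (64 MB)
print(attn_matrix_bytes(4096) // attn_matrix_bytes(256))  # 256
```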
Why Dot Products Scale with √d_k
Consider d_k independent random variables with mean 0 and variance 1. Their dot product has variance d_k. The standard deviation — and thus the typical scale of the input to softmax — grows as √d_k. Dividing by √d_k normalizes this variance back to 1, keeping gradients well-behaved throughout training.
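The variance argument can be checked empirically by sampling many query/key pairs and measuring the spread of their dot products (a sketch; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

for d_k in (16, 64, 256):
    # 20,000 independent query/key pairs with unit-variance components.
    q = rng.standard_normal((20_000, d_k))
    k = rng.standard_normal((20_000, d_k))
    dots = (q * k).sum(axis=1)
    # Empirical std of the raw dot product is ~sqrt(d_k);
    # after dividing by sqrt(d_k) it returns to ~1.
    print(d_k, dots.std(), (dots / np.sqrt(d_k)).std())
```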
Computation Breakdown
For the base model’s multi-head attention block with n=512 tokens, h=8 heads, d_k=64:
| Step | Shape | Operations |
|---|---|---|
| Q = X·W_Q | (512, 64) | 512×512×64 = 16.8M multiplications |
| K = X·W_K | (512, 64) | 512×512×64 = 16.8M multiplications |
| V = X·W_V | (512, 64) | 512×512×64 = 16.8M multiplications |
| Attention scores Q·Kᵀ | (512, 512) | 512×512×64 = 16.8M multiplications |
| Softmax + V weighting | (512, 64) | 512×512×64 = 16.8M multiplications |
Each head performs roughly 84M multiply-accumulate operations; across 8 heads, that is ~671M per attention sublayer, excluding the output projection W_O.
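Putting the steps together, here is a minimal single-head sketch of this computation in NumPy. The shapes follow the base-model figures; the weight matrices are randomly initialized for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 512, 512, 64

X = rng.standard_normal((n, d_model))
# Illustrative random projections (scaled to keep activations moderate).
W_Q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
W_K = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
W_V = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V             # each (512, 64)
scores = Q @ K.T / np.sqrt(d_k)                 # (512, 512)

# Row-wise softmax over the key axis.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

output = weights @ V                            # (512, 64)
print(output.shape)  # (512, 64)
```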
Dot-Product vs Additive Attention
The dot-product approach is faster and more space-efficient than additive attention (Bahdanau et al., 2015), which scores compatibility with a small feed-forward network using a learnable parameter vector v: score(q, k) = vᵀ·tanh(W_q·q + W_k·k). Without scaling, additive attention outperforms dot-product attention for large d_k; with the √d_k scaling, dot-product attention matches its quality while benefiting from highly optimized matrix-multiplication routines.
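For comparison, here is a sketch of both scoring functions for a single query/key pair. The names W_q, W_k, v and the attention hidden size d_a are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
d_k, d_a = 64, 128  # d_a: assumed hidden size of the additive scorer

W_q = rng.standard_normal((d_a, d_k))
W_k = rng.standard_normal((d_a, d_k))
v = rng.standard_normal(d_a)

def additive_score(q, k):
    # Feed-forward compatibility: v^T tanh(W_q q + W_k k)
    return v @ np.tanh(W_q @ q + W_k @ k)

def dot_score(q, k):
    # Scaled dot product: a single multiply-accumulate pass, no tanh
    return (q @ k) / np.sqrt(d_k)

q, k = rng.standard_normal(d_k), rng.standard_normal(d_k)
print(additive_score(q, k), dot_score(q, k))
```

The additive form needs two matrix-vector products and a tanh per query/key pair, while the dot-product form reduces to a single batched matrix multiplication over all pairs.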
For how this attention function fits into the full model, see encoder-decoder-architecture; for the multi-head wrapper around it, see multi-head-attention. The positional-encoding page explains how token positions are injected before attention is computed. Extensions that reduce the O(n²) cost include sparse attention patterns (Child et al., 2019).
Sources
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
- Bahdanau et al. (2015) — Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015
- Child et al. (2019) — Generating Long Sequences with Sparse Transformers. arXiv 2019
Frequently Asked Questions
Why divide by √d_k in scaled dot-product attention?
For large d_k, the dot products Q·Kᵀ grow in magnitude proportional to √d_k, pushing the softmax into regions of very small gradients. Dividing by √d_k keeps the inputs to softmax in a moderate range regardless of dimensionality, preventing near-zero gradients during training.
Why is self-attention O(n²) and why does that matter?
Computing the full n×n attention matrix between all pairs of tokens requires O(n²) operations and O(n²) memory. For a sequence of 1024 tokens with d_model=512, the attention matrix alone consumes roughly 4MB per head at float32. This quadratic scaling is the primary constraint on context length for standard transformers.
How does self-attention compare to RNNs for long-range dependencies?
Self-attention connects any two tokens in O(1) operations regardless of their distance in the sequence. Recurrent networks require O(n) sequential steps for a signal to travel between distant tokens, making them vulnerable to vanishing gradients over long dependencies. This is one of the key advantages of the transformer.