Residual Connections: Skip Connections, Gradient Flow, and Deep Network Training
Residual connections compute output = x + Sublayer(x), providing a gradient highway that bypasses each sublayer; He et al. (2016) showed they enable training of networks over 1,000 layers deep without vanishing gradients.
| Measure | Value | Notes |
|---|---|---|
| Residual formula | y = x + F(x) | F(x) = sublayer function (attention or FFN); x = identity shortcut |
| Gradient through a block | ∂L/∂x = ∂L/∂y · (1 + ∂F/∂x) | The shortcut's '1' term gives gradients a direct path to earlier layers |
| Demonstrated ResNet depth | 1,000+ layers | He et al. (2016) trained residual nets over 1,000 layers deep; plain nets of comparable depth fail to train |
| Transformer depth with residuals | 6 encoder + 6 decoder layers | Every sublayer is wrapped in a residual connection; the same scheme stabilizes 100+ layer variants |
| Dropout | rate 0.1 | Applied to the sublayer output before addition: y = x + Dropout(Sublayer(x)) |
Residual connections, introduced by He et al. in “Deep Residual Learning for Image Recognition” (CVPR 2016), revolutionized deep learning by enabling training of networks with hundreds or thousands of layers. The transformer adopts residual connections at every sublayer in both encoder and decoder.
The Core Idea
Instead of learning a direct mapping H(x) through each sublayer, residual connections reformulate the learning problem as learning the residual F(x) = H(x) − x:
y = x + F(x, {W_i})
where F represents the sublayer’s transformation (multi-head attention or feed-forward network), and x is the unchanged input passed through directly. The identity shortcut requires no parameters and adds negligible computation.
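Under these definitions, a residual block is only a few lines. The sketch below uses a hypothetical stand-in for the sublayer F (a fixed elementwise scaling, purely for illustration; in a transformer F would be multi-head attention or the FFN):

```python
# Minimal sketch of a residual block: y = x + F(x).
# sublayer_F is a hypothetical stand-in sublayer for illustration only.

def sublayer_F(x):
    """Stand-in sublayer: a small elementwise transformation."""
    return [0.1 * v for v in x]

def residual_block(x):
    """Identity shortcut plus the learned residual: y = x + F(x)."""
    fx = sublayer_F(x)
    return [xi + fi for xi, fi in zip(x, fx)]

x = [1.0, 2.0, 3.0]
y = residual_block(x)  # ~[1.1, 2.2, 3.3]
```

Note that the shortcut itself contributes no parameters: only F is learned, and the addition is elementwise, which requires the sublayer to preserve the input dimensionality.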
Gradient Analysis
The key mathematical property: for a loss L, the gradient with respect to the input to a residual block is:
∂L/∂x = ∂L/∂y · (1 + ∂F/∂x)
| Network Type | Gradient at Layer l from Layer L | Risk |
|---|---|---|
| Plain (no residual) | ∏ᵢ ∂hᵢ/∂hᵢ₋₁ | Vanishes when ‖∂hᵢ/∂hᵢ₋₁‖ < 1 across many layers |
| Residual | ∏ᵢ (1 + ∂Fᵢ/∂hᵢ₋₁) | Expanding the product always leaves a pure identity term, so a direct gradient path survives |
The ‘1’ term from the shortcut ensures that even if ∂F/∂x is very small, gradients still propagate. He et al. demonstrated that residual networks with 1,000+ layers train to convergence, while plain networks of comparable depth suffer the degradation problem: training error rises as depth increases.
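A toy scalar calculation makes the contrast concrete. Assuming (purely for illustration) that each sublayer's local derivative dF/dx is a small constant, the end-to-end gradient factor through 50 plain layers collapses, while through 50 residual blocks it does not:

```python
# Toy illustration with scalar layers: compare the end-to-end gradient
# factor through 50 plain layers vs 50 residual blocks when each
# sublayer's local derivative dF/dx is a small constant (assumed 0.01).

depth = 50
dF_dx = 0.01  # small local derivative of each sublayer

plain = 1.0
residual = 1.0
for _ in range(depth):
    plain *= dF_dx           # plain net: product of Jacobians -> vanishes
    residual *= 1 + dF_dx    # residual net: each factor is (1 + dF/dx) >= 1

print(plain)     # ~1e-100, effectively zero
print(residual)  # ~1.64, the gradient survives
```

The plain product shrinks by a factor of 100 per layer, while every residual factor is at least 1, so the chained gradient can never be driven to zero by small sublayer derivatives.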
Depth Scaling Comparison
| Model | Depth | Residual Connections | Training Success |
|---|---|---|---|
| AlexNet (2012) | 8 layers | No | Yes (shallow) |
| VGG-16 (2014) | 16 layers | No | Marginal |
| Plain-34 (2016) | 34 layers | No | Higher train error than 18-layer |
| ResNet-34 (2016) | 34 layers | Yes | Lower error than 18-layer |
| ResNet-1000 (2016) | 1,000 layers | Yes | Converges normally |
Transformer Implementation
In the transformer, each encoder and decoder sublayer is wrapped with a residual connection and layer normalization. In the post-norm formulation (original paper):
y = LayerNorm(x + Dropout(Sublayer(x)))
In the pre-norm formulation (most modern models):
y = x + Dropout(Sublayer(LayerNorm(x)))
The dropout (p=0.1 in the base model) is applied to the sublayer output before addition, providing regularization during training without disrupting the identity path.
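The two wrappings can be sketched as follows (a minimal illustration with dropout omitted, i.e. inference mode, and a gain/bias-free LayerNorm; `sublayer` stands in for any attention or FFN function):

```python
# Sketch of post-norm vs pre-norm residual wrapping (dropout omitted).
import math

def layer_norm(x, eps=1e-5):
    """Minimal LayerNorm without learned gain/bias, for illustration."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm(x, sublayer):
    """Original transformer: y = LayerNorm(x + Sublayer(x))."""
    fx = sublayer(x)
    return layer_norm([xi + fi for xi, fi in zip(x, fx)])

def pre_norm(x, sublayer):
    """Modern variant: y = x + Sublayer(LayerNorm(x)).
    The identity path is left completely untouched by normalization."""
    fx = sublayer(layer_norm(x))
    return [xi + fi for xi, fi in zip(x, fx)]
```

The key difference is visible in the code: in post-norm, the normalization sits on the identity path itself, whereas in pre-norm the shortcut carries x through unchanged, which is one reason pre-norm tends to train more stably at large depth.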
Related Pages
See layer-normalization for how normalization interacts with residuals, feed-forward-layers for the FFN sublayer wrapped by residuals, and gradient-descent for the optimization mechanics that residuals facilitate.
Sources
- He et al. (2016) — Deep Residual Learning for Image Recognition. CVPR 2016
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
- Veit et al. (2016) — Residual Networks Behave Like Ensembles of Relatively Shallow Networks. NeurIPS 2016
Frequently Asked Questions
Why do residual connections prevent vanishing gradients?
In a network without residuals, gradients must pass through every layer's Jacobian; if those Jacobians consistently shrink the signal (norms below 1), the gradient decays exponentially with depth. With residuals, the gradient path is ∂L/∂x = ∂L/∂(x+F) · (1 + ∂F/∂x). The '1' term ensures the upstream gradient ∂L/∂(x+F) reaches x directly and undiminished, regardless of how small ∂F/∂x becomes.
How do residual connections affect model capacity?
Residual connections do not reduce model capacity — F(x) still has all the same parameters. However, they change what the network learns: instead of learning the desired mapping H(x) directly, F learns the residual H(x) − x. If the optimal mapping is close to the identity, F only needs to produce a small correction, which is easier to learn than the full transformation.
What is the ensemble interpretation of residual networks?
Veit et al. (2016) showed that a residual network with n blocks can be understood as an ensemble of 2ⁿ paths of varying lengths. Most gradient during training flows through short paths (2–3 layers), while long paths contribute exponentially less. This explains why removing or damaging a single layer in a residual network causes only a small accuracy drop — the other paths compensate.
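The path-counting argument can be checked numerically. The sketch below assumes, purely for illustration, that each traversed sublayer scales the gradient by a constant factor of 0.5:

```python
# Path-counting view of a residual network (after Veit et al. 2016):
# a stack of n residual blocks unrolls into 2**n paths, and the number of
# paths that traverse exactly k sublayers is C(n, k).
from math import comb

n = 10                       # residual blocks
total_paths = 2 ** n         # 1024 distinct paths
counts = [comb(n, k) for k in range(n + 1)]
assert sum(counts) == total_paths

# Assume each traversed sublayer scales the gradient by ~0.5; then the
# gradient mass carried by paths of length k is C(n, k) * 0.5**k.
mass = [comb(n, k) * 0.5 ** k for k in range(n + 1)]
peak = max(range(n + 1), key=lambda k: mass[k])
print(peak)  # -> 3: gradient mass peaks at short (~3-layer) paths
```

Even though path counts peak at length n/2, the per-layer attenuation shifts the gradient mass toward short paths, matching the 2–3 layer finding quoted above.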