Residual Connections: Skip Connections, Gradient Flow, and Deep Network Training

Category: architecture · Updated: 2026-02-27

Residual connections compute output = x + Sublayer(x), giving gradients a highway that bypasses each sublayer; He et al. (2016) showed they make it possible to train 1,000-layer networks without vanishing gradients.

Key Data Points
| Measure | Value | Unit | Notes |
| --- | --- | --- | --- |
| Residual formula | y = x + F(x) | — | F(x) = sublayer function (attention or FFN); x = identity shortcut |
| Gradient flow advantage | ∂L/∂x = ∂L/∂y · (1 + ∂F/∂x) | — | The identity term gives the gradient a direct, unattenuated path to earlier layers |
| Original ResNet depth | 1,000 | layers | He et al. (2016) successfully trained 1,000-layer residual nets; impossible without skip connections |
| Transformer depth with residuals | 6 + 6 | encoder + decoder layers | Each sublayer is wrapped in a residual; enables stable training of 100+ layer variants |
| Dropout | 0.1 | rate | Applied to the sublayer output before addition: y = x + Dropout(Sublayer(x)) |

Residual connections, introduced by He et al. in “Deep Residual Learning for Image Recognition” (CVPR 2016), revolutionized deep learning by enabling training of networks with hundreds or thousands of layers. The transformer adopts residual connections at every sublayer in both encoder and decoder.

The Core Idea

Instead of learning a direct mapping H(x) through each sublayer, residual connections reformulate the learning problem as learning the residual F(x) = H(x) − x:

y = x + F(x, {W_i})

where F represents the sublayer’s transformation (multi-head attention or feed-forward network), and x is the unchanged input passed through directly. The identity shortcut requires no parameters and adds negligible computation.
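
A minimal PyTorch sketch of this wrapper (the class name, the feed-forward choice for F, and the dimensions are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the shortcut is parameter-free; only F has weights."""
    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer  # F: attention, FFN, or any shape-preserving map

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(x)  # identity shortcut plus learned residual

# Example: wrapping a small feed-forward sublayer (dimensions are arbitrary).
block = ResidualBlock(nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512)))
y = block(torch.randn(8, 512))  # output shape matches the input: (8, 512)
```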

Gradient Analysis

The key mathematical property: for a loss L, the gradient with respect to the input to a residual block is:

∂L/∂x = ∂L/∂y · (1 + ∂F/∂x)

| Network Type | Gradient at Layer l from Layer L | Risk |
| --- | --- | --- |
| Plain (no residual) | ∏ᵢ ∂hᵢ/∂hᵢ₋₁ | Vanishes exponentially if ‖∂hᵢ/∂hᵢ₋₁‖ < 1 throughout the stack |
| Residual | 1 + ∂(Σᵢ F(hᵢ))/∂hₗ | Identity term guarantees a direct, non-vanishing gradient path |

The ‘1’ term from the shortcut ensures that even if ∂F/∂x is very small, gradients still propagate. He et al. demonstrated that residual networks with 1,000+ layers converge normally, while plain networks of comparable depth suffer from degradation: their training error actually rises as depth grows.
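
The effect is easy to check empirically. The sketch below, using an arbitrarily chosen 50-layer stack of linear+tanh sublayers at default initialization, compares the gradient norm that reaches the input with and without the shortcut:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 50, 64

def input_grad_norm(use_residual: bool) -> float:
    """Gradient norm at the input of a deep stack of linear+tanh sublayers."""
    layers = [nn.Sequential(nn.Linear(dim, dim), nn.Tanh()) for _ in range(depth)]
    x = torch.randn(1, dim, requires_grad=True)
    h = x
    for f in layers:
        h = h + f(h) if use_residual else f(h)  # y = x + F(x) vs. y = F(x)
    h.sum().backward()
    return x.grad.norm().item()

print("plain   :", input_grad_norm(False))  # typically decays toward zero at this depth
print("residual:", input_grad_norm(True))   # stays well away from zero
```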

Depth Scaling Comparison

| Model | Depth | Residual Connections | Training Success |
| --- | --- | --- | --- |
| AlexNet (2012) | 8 layers | No | Yes (shallow) |
| VGG-16 (2014) | 16 layers | No | Marginal |
| Plain-34 (2016) | 34 layers | No | Higher training error than the 18-layer plain net |
| ResNet-34 (2016) | 34 layers | Yes | Lower error than the 18-layer ResNet |
| ResNet-1000 (2016) | 1,000 layers | Yes | Converges normally |

Transformer Implementation

In the transformer, each encoder and decoder sublayer is wrapped with a residual connection and layer normalization. In the post-norm formulation (original paper):

y = LayerNorm(x + Dropout(Sublayer(x)))

In the pre-norm formulation (most modern models):

y = x + Dropout(Sublayer(LayerNorm(x)))

The dropout (p=0.1 in the base model) is applied to the sublayer output before addition, providing regularization during training without disrupting the identity path.
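
Both formulations reduce to small wrappers around the sublayer. The PyTorch sketch below assumes a shape-preserving sublayer and the base-model values d_model = 512, p = 0.1; the class names are illustrative:

```python
import torch
import torch.nn as nn

class PostNormResidual(nn.Module):
    """Original transformer: y = LayerNorm(x + Dropout(Sublayer(x)))."""
    def __init__(self, sublayer: nn.Module, d_model: int, p_drop: float = 0.1):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(p_drop)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.drop(self.sublayer(x)))

class PreNormResidual(nn.Module):
    """Pre-norm variant: y = x + Dropout(Sublayer(LayerNorm(x))); the shortcut is never normalized."""
    def __init__(self, sublayer: nn.Module, d_model: int, p_drop: float = 0.1):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(p_drop)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.drop(self.sublayer(self.norm(x)))

# Usage with an FFN sublayer; shapes are (batch, seq, d_model) and are preserved.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
out = PreNormResidual(ffn, d_model=512)(torch.randn(2, 10, 512))
```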

See layer-normalization for how normalization interacts with residuals, feed-forward-layers for the FFN sublayer wrapped by residuals, and gradient-descent for the optimization mechanics that residuals facilitate.




Frequently Asked Questions

Why do residual connections prevent vanishing gradients?

In a network without residuals, gradients must pass through every layer's Jacobian. If those Jacobians consistently have norms below 1, the gradient shrinks exponentially with depth. With residuals, the gradient path is ∂L/∂x = ∂L/∂(x+F) · (1 + ∂F/∂x). The '1' term ensures the upstream gradient always reaches earlier layers along a direct, unattenuated path, regardless of what ∂F/∂x does.

How do residual connections affect model capacity?

Residual connections do not reduce model capacity — F(x) still has all the same parameters. However, they change what the network learns: instead of learning the desired mapping H(x) directly, F learns the residual H(x) − x. If the optimal mapping is close to the identity, F only needs to produce a small correction, which is easier to learn than the full transformation.

What is the ensemble interpretation of residual networks?

Veit et al. (2016) showed that a residual network with n blocks can be understood as an ensemble of 2ⁿ paths of varying lengths. Most gradient during training flows through short paths (2–3 layers), while long paths contribute exponentially less. This explains why removing or damaging a single layer in a residual network causes only a small accuracy drop — the other paths compensate.
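
A small numerical sketch of this unraveled view, assuming linear, bias-free residual functions (a simplification chosen so the 2ⁿ-path expansion is exact rather than approximate):

```python
import itertools
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, n_blocks = 16, 3
blocks = [nn.Linear(dim, dim, bias=False) for _ in range(n_blocks)]  # each F_i
x = torch.randn(1, dim)

# Sequential view: h <- h + F_i(h), block by block.
h = x
for f in blocks:
    h = h + f(h)

# Ensemble view: one path per subset of blocks (apply the chosen F_i in order,
# identity everywhere else), then sum all 2^n paths.
paths = torch.zeros_like(x)
for subset in itertools.product([False, True], repeat=n_blocks):
    p = x
    for f, taken in zip(blocks, subset):
        if taken:
            p = f(p)
    paths = paths + p

print(torch.allclose(h, paths, atol=1e-5))  # True: 2^3 = 8 paths reproduce the output
```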
