Neural Network Fundamentals: Universal Approximation, Depth vs Width, and Activation Functions
Universal approximation theorem: a 2-layer network with sufficiently many hidden units can approximate any continuous function on compact subsets of ℝ^d to arbitrary precision (Cybenko, 1989). Depth still matters: deep networks can represent some functions exponentially more compactly than any shallow network.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Universal approximation theorem | ∃ 2-layer net that ε-approximates any f ∈ C([0,1]^d) | — | Cybenko (1989); requires potentially exponential width; depth reduces this |
| Depth efficiency (exponential) | Exponential | functions representable | Depth-L network can represent functions requiring exponential width in a depth-1 network |
| ReLU activation | f(x) = max(0, x) | — | Simple; mitigates vanishing gradients; yields piecewise-linear functions |
| GeLU at x=0 | 0 | output | GeLU(x) = x·Φ(x) where Φ is the Gaussian CDF, so GeLU(0) = 0·Φ(0) = 0; smooth approximation to ReLU |
| He initialization for ReLU | W ~ N(0, 2/n_in) | variance | He et al. (2015): variance 2/n_in preserves activation variance under ReLU |
Neural networks are parameterized compositions of linear transformations and nonlinear activation functions. Understanding their fundamental mathematical properties — expressivity, trainability, and initialization — is essential context for understanding why modern architectures like transformers are designed the way they are.
The Universal Approximation Theorem
Theorem (Cybenko, 1989): Let φ: ℝ → ℝ be any continuous sigmoidal function. For any continuous function f: [0,1]^d → ℝ and ε > 0, there exists a finite linear combination F(x) = Σᵢ αᵢ φ(wᵢᵀx + bᵢ) such that:
|F(x) − f(x)| < ε for all x ∈ [0,1]^d
This establishes that 2-layer networks are universal approximators. The catch: the number of hidden units required may grow exponentially in the input dimension d, and the theorem guarantees only existence, not that training will find such a network.
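The existence claim can be made concrete by construction. The sketch below (assuming NumPy; it uses ReLU units rather than Cybenko's sigmoidal units, relying on the later generalizations of the theorem) builds a 2-layer network F(x) = f(0) + Σᵢ αᵢ·ReLU(x − bᵢ) that linearly interpolates a target function on [0, 1], so the error shrinks as the width grows:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def two_layer_approx(f, n_units, xs):
    """Piecewise-linear interpolant of f on [0, 1], written as a 2-layer
    ReLU network: F(x) = f(0) + sum_i alpha_i * relu(x - b_i)."""
    knots = np.linspace(0.0, 1.0, n_units + 1)
    slopes = np.diff(f(knots)) / np.diff(knots)              # slope of each segment
    alphas = np.concatenate(([slopes[0]], np.diff(slopes)))  # slope changes at knots
    hidden = relu(xs[:, None] - knots[:-1][None, :])         # (len(xs), n_units)
    return f(knots[0]) + hidden @ alphas

f = lambda x: np.sin(2.0 * np.pi * x)
xs = np.linspace(0.0, 1.0, 2001)
for n in (10, 50):
    err = np.max(np.abs(two_layer_approx(f, n, xs) - f(xs)))
    print(n, err)  # error shrinks roughly as 1/n^2 for this smooth target
```

The target function, grid, and widths are arbitrary illustration choices; the same construction works for any continuous f on [0, 1].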
Depth vs Width Efficiency
| Representation | Shallow Network Width Required | Deep Network Width Required |
|---|---|---|
| Parity function on n bits | 2^{n-1} units | O(n) units (layered) |
| Polynomial degree d | Exponential in d | O(d) units per layer |
| XOR on n variables (equivalent to n-bit parity) | 2^{n-1} units | O(n) units, O(log n) layers |
Depth enables exponentially compact representation of hierarchically structured functions.
Activation Functions
| Activation | Formula | Derivative | Properties |
|---|---|---|---|
| Sigmoid | σ(x) = 1/(1+e^{-x}) | σ(1−σ) | Saturates; vanishing gradients |
| Tanh | tanh(x) = (e^x−e^{-x})/(e^x+e^{-x}) | 1−tanh² | Zero-centered; still saturates |
| ReLU | max(0, x) | 𝟙[x>0] | No saturation for x>0; “dying ReLU” for x<0 |
| Leaky ReLU | max(αx, x), α=0.01 | 𝟙[x>0] + α·𝟙[x≤0] | Small gradient for x<0 |
| GeLU | x·Φ(x) | Φ(x) + x·φ(x) | Smooth; used in modern transformers |
| SiLU/Swish | x·σ(x) | σ(x)(1+x(1−σ(x))) | Smooth; non-monotonic |
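The formulas and derivatives in the table can be verified numerically. A minimal stdlib-only sketch for the two smooth activations, checking each closed-form derivative against a central finite difference:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gelu(x):
    # GeLU(x) = x * Phi(x), with Phi the standard Gaussian CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x):
    # d/dx GeLU = Phi(x) + x * phi(x), with phi the Gaussian PDF
    Phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return Phi + x * phi

def silu(x):
    return x * sigmoid(x)

def silu_grad(x):
    # d/dx SiLU = sigma(x) * (1 + x * (1 - sigma(x)))
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))

# Central finite differences should match the closed forms closely.
h = 1e-6
for fn, grad in ((gelu, gelu_grad), (silu, silu_grad)):
    for x in (-2.0, -0.5, 0.0, 1.3):
        numeric = (fn(x + h) - fn(x - h)) / (2.0 * h)
        assert abs(numeric - grad(x)) < 1e-5

print(gelu(0.0))  # 0.0 — GeLU passes through the origin
```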
Weight Initialization
Proper initialization prevents gradient explosion and vanishing at the start of training:
| Initialization | Formula | Designed For |
|---|---|---|
| Xavier/Glorot | W ~ U(−√(6/(n_in+n_out)), √(6/(n_in+n_out))) | Sigmoid, tanh |
| He / Kaiming | W ~ N(0, 2/n_in), i.e. std √(2/n_in) | ReLU |
| LeCun | W ~ N(0, 1/n_in) | SELU |
| Fixed small std | W ~ N(0, 0.02²), i.e. std 0.02 | Transformers (embedding, output) |
Transformers commonly use a modified initialization: embedding and output weights drawn from N(0, 0.02²) (standard deviation 0.02), with other weights scaled by a factor on the order of d_model^{-0.5}.
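The ReLU-specific factor of 2 in He initialization can be checked empirically. The sketch below (assuming NumPy; the width 512, depth 20, and seed are arbitrary choices) pushes a random vector through a stack of ReLU layers: with weight variance 2/n_in the activation scale stays O(1), while the sigmoid-era 1/n_in scaling lets ReLU halve the signal's second moment at every layer.

```python
import numpy as np

def forward_rms(weight_var_scale, depth=20, n=512, seed=0):
    """Push a random vector through `depth` ReLU layers whose weights have
    variance weight_var_scale / n_in, and return the RMS of the output."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(n)
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(weight_var_scale / n), size=(n, n))
        h = np.maximum(0.0, W @ h)  # ReLU keeps only half the pre-activation mass
    return np.sqrt(np.mean(h ** 2))

print(forward_rms(2.0))  # He scaling: activation RMS stays O(1)
print(forward_rms(1.0))  # 1/n_in scaling: signal decays ~2x in variance per layer
```

After 20 layers the 1/n_in variant has shrunk by roughly 2^10 in RMS, which is exactly the regime where gradients vanish at the start of training.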
Related Pages
See feed-forward-layers for how activation functions appear in transformer FFN layers, residual-connections for how depth is made trainable via skip connections, and backpropagation for gradient computation through these activation functions.
Sources
- Cybenko (1989) — Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals and Systems
- Hornik (1991) — Approximation Capabilities of Multilayer Feedforward Networks. Neural Networks
- He et al. (2015) — Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015
Frequently Asked Questions
What does the universal approximation theorem actually guarantee?
The theorem (Cybenko 1989 for sigmoidal activations; Hornik 1991 and later work for broader activation classes) proves that a feedforward network with a single hidden layer can approximate any continuous function on a compact domain to arbitrary precision, given sufficient hidden units. The key limitation: it does not guarantee learning — only existence. The required width can be exponential in the input dimension, and gradient descent may not find the optimal solution.
Why does depth matter if shallow networks are universal approximators?
While shallow (2-layer) networks are universal approximators in theory, they may require exponentially many hidden units to represent certain functions. Deep networks can represent the same functions with exponentially fewer total neurons by composing simpler functions. In practice, depth enables hierarchical feature learning — early layers detect simple patterns; later layers combine them into complex abstractions. This is why all successful large language models use dozens to hundreds of layers.
Why did ReLU replace sigmoid and tanh as the standard activation?
Sigmoid and tanh saturate at large positive and negative values, producing near-zero gradients (vanishing gradient problem). ReLU = max(0,x) has gradient exactly 1 for positive inputs and 0 for negative inputs. The constant gradient for positive inputs prevents vanishing gradients in deep networks. ReLU also computes faster than sigmoid/tanh and produces sparse activations (many neurons are exactly zero), which has a useful regularization effect.