Neural Network Fundamentals: Universal Approximation, Depth vs Width, and Activation Functions
Universal approximation theorem: a 2-layer network with sufficiently many hidden units can approximate any continuous function on compact subsets of ℝ^d to arbitrary precision (Cybenko, 1989). Depth still matters: deep networks can represent some functions exponentially more compactly than any shallow network.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Universal approximation theorem | ∃ 2-layer net that ε-approximates any f ∈ C([0,1]^d) | — | Cybenko (1989); requires potentially exponential width; depth reduces this |
| Depth efficiency (exponential) | Exponential | functions representable | Depth-L network can represent functions requiring exponential width in a depth-1 network |
| ReLU activation | f(x) = max(0, x) | — | Simple; mitigates vanishing gradients; yields piecewise-linear functions |
| GeLU at x=0 | 0 | output | GeLU(x) = x·Φ(x) where Φ is the Gaussian CDF, so GeLU(0) = 0·Φ(0) = 0; smooth approximation to ReLU |
| He initialization for ReLU | W ~ N(0, 2/n_in) | variance | He et al. (2015): variance 2/n_in preserves activation variance under ReLU |
Neural networks are parameterized compositions of linear transformations and nonlinear activation functions. Understanding their fundamental mathematical properties — expressivity, trainability, and initialization — is essential context for understanding why modern architectures like transformers are designed the way they are.
The Universal Approximation Theorem
Theorem (Cybenko, 1989): Let φ: ℝ → ℝ be any continuous sigmoidal function. For any continuous function f: [0,1]^d → ℝ and ε > 0, there exists a finite linear combination F(x) = Σᵢ αᵢ φ(wᵢᵀx + bᵢ) such that:
|F(x) − f(x)| < ε for all x ∈ [0,1]^d
This establishes that 2-layer networks are universal approximators. The catch: the number of hidden units required may grow exponentially in the input dimension d, and the theorem guarantees only existence, not that training will find such a network.
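The existence claim can be made concrete by construction. The sketch below (assuming NumPy; it uses ReLU units rather than Cybenko's sigmoidal units, relying on the later generalizations of the theorem) builds a 2-layer network F(x) = f(0) + Σᵢ αᵢ·ReLU(x − bᵢ) that linearly interpolates a target function on [0, 1], so the error shrinks as the width grows:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def two_layer_approx(f, n_units, xs):
    """Piecewise-linear interpolant of f on [0, 1], written as a 2-layer
    ReLU network: F(x) = f(0) + sum_i alpha_i * relu(x - b_i)."""
    knots = np.linspace(0.0, 1.0, n_units + 1)
    slopes = np.diff(f(knots)) / np.diff(knots)              # slope of each segment
    alphas = np.concatenate(([slopes[0]], np.diff(slopes)))  # slope changes at knots
    hidden = relu(xs[:, None] - knots[:-1][None, :])         # (len(xs), n_units)
    return f(knots[0]) + hidden @ alphas

f = lambda x: np.sin(2.0 * np.pi * x)
xs = np.linspace(0.0, 1.0, 2001)
for n in (10, 50):
    err = np.max(np.abs(two_layer_approx(f, n, xs) - f(xs)))
    print(n, err)  # error shrinks roughly as 1/n^2 for this smooth target
```

The target function, grid, and widths are arbitrary illustration choices; the same construction works for any continuous f on [0, 1].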
Depth vs Width Efficiency
| Representation | Shallow Network Width Required | Deep Network Width Required |
|---|---|---|
| Parity function on n bits | 2^{n-1} units | O(n) units (layered) |
| Polynomial degree d | Exponential in d | O(d) units per layer |
| XOR on n variables (equivalent to n-bit parity) | 2^{n-1} units | O(n) units, O(log n) layers |
Depth enables exponentially compact representation of hierarchically structured functions.
Activation Functions
| Activation | Formula | Derivative | Properties |
|---|---|---|---|
| Sigmoid | σ(x) = 1/(1+e^{-x}) | σ(1−σ) | Saturates; vanishing gradients |
| Tanh | tanh(x) = (e^x−e^{-x})/(e^x+e^{-x}) | 1−tanh² | Zero-centered; still saturates |
| ReLU | max(0, x) | 𝟙[x>0] | No saturation for x>0; “dying ReLU” for x<0 |
| Leaky ReLU | max(αx, x), α=0.01 | 𝟙[x>0] + α·𝟙[x≤0] | Small gradient for x<0 |
| GeLU | x·Φ(x) | Φ(x) + x·φ(x) | Smooth; used in modern transformers |
| SiLU/Swish | x·σ(x) | σ(x)(1+x(1−σ(x))) | Smooth; non-monotonic |
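The formulas and derivatives in the table can be verified numerically. A minimal stdlib-only sketch for the two smooth activations, checking each closed-form derivative against a central finite difference:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gelu(x):
    # GeLU(x) = x * Phi(x), with Phi the standard Gaussian CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x):
    # d/dx GeLU = Phi(x) + x * phi(x), with phi the Gaussian PDF
    Phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return Phi + x * phi

def silu(x):
    return x * sigmoid(x)

def silu_grad(x):
    # d/dx SiLU = sigma(x) * (1 + x * (1 - sigma(x)))
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))

# Central finite differences should match the closed forms closely.
h = 1e-6
for fn, grad in ((gelu, gelu_grad), (silu, silu_grad)):
    for x in (-2.0, -0.5, 0.0, 1.3):
        numeric = (fn(x + h) - fn(x - h)) / (2.0 * h)
        assert abs(numeric - grad(x)) < 1e-5

print(gelu(0.0))  # 0.0 — GeLU passes through the origin
```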
Weight Initialization
Proper initialization prevents gradient explosion and vanishing at the start of training:
| Initialization | Formula | Designed For |
|---|---|---|
| Xavier/Glorot | W ~ U(−√(6/(n_in+n_out)), √(6/(n_in+n_out))) | Sigmoid, tanh |
| He / Kaiming | W ~ N(0, 2/n_in), i.e. std √(2/n_in) | ReLU |
| LeCun | W ~ N(0, 1/n_in) | SELU |
| Fixed small std | W ~ N(0, 0.02²), i.e. std 0.02 | Transformers (embedding, output) |
Transformers commonly use a modified initialization: embedding and output weights drawn from N(0, 0.02²) (standard deviation 0.02), with other weights scaled by a factor on the order of d_model^{-0.5}.
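The ReLU-specific factor of 2 in He initialization can be checked empirically. The sketch below (assuming NumPy; the width 512, depth 20, and seed are arbitrary choices) pushes a random vector through a stack of ReLU layers: with weight variance 2/n_in the activation scale stays O(1), while the sigmoid-era 1/n_in scaling lets ReLU halve the signal's second moment at every layer.

```python
import numpy as np

def forward_rms(weight_var_scale, depth=20, n=512, seed=0):
    """Push a random vector through `depth` ReLU layers whose weights have
    variance weight_var_scale / n_in, and return the RMS of the output."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(n)
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(weight_var_scale / n), size=(n, n))
        h = np.maximum(0.0, W @ h)  # ReLU keeps only half the pre-activation mass
    return np.sqrt(np.mean(h ** 2))

print(forward_rms(2.0))  # He scaling: activation RMS stays O(1)
print(forward_rms(1.0))  # 1/n_in scaling: signal decays ~2x in variance per layer
```

After 20 layers the 1/n_in variant has shrunk by roughly 2^10 in RMS, which is exactly the regime where gradients vanish at the start of training.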
Related Pages
See feed-forward-layers for how activation functions appear in transformer FFN layers, residual-connections for how depth is made trainable via skip connections, and backpropagation for gradient computation through these activation functions.
Sources
- Cybenko (1989) — Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals and Systems
- Hornik (1991) — Approximation Capabilities of Multilayer Feedforward Networks. Neural Networks
- He et al. (2015) — Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015
Frequently Asked Questions
What does the universal approximation theorem actually guarantee?
The theorem (Cybenko 1989 for sigmoidal activations; Hornik 1991 and later work for broader activation classes) proves that a feedforward network with a single hidden layer can approximate any continuous function on a compact domain to arbitrary precision, given sufficient hidden units. The key limitation: it does not guarantee learning — only existence. The required width can be exponential in the input dimension, and gradient descent may not find the optimal solution.
Why does depth matter if shallow networks are universal approximators?
While shallow (2-layer) networks are universal approximators in theory, they may require exponentially many hidden units to represent certain functions. Deep networks can represent the same functions with exponentially fewer total neurons by composing simpler functions. In practice, depth enables hierarchical feature learning — early layers detect simple patterns; later layers combine them into complex abstractions. This is why all successful large language models use dozens to hundreds of layers.
Why did ReLU replace sigmoid and tanh as the standard activation?
Sigmoid and tanh saturate at large positive and negative values, producing near-zero gradients (vanishing gradient problem). ReLU = max(0,x) has gradient exactly 1 for positive inputs and 0 for negative inputs. The constant gradient for positive inputs prevents vanishing gradients in deep networks. ReLU also computes faster than sigmoid/tanh and produces sparse activations (many neurons are exactly zero), which has a useful regularization effect.