Position-Wise Feed-Forward Layers: FFN Formula, Parameter Budget, and GeLU vs ReLU
Each transformer FFN sublayer computes max(0, xW₁ + b₁)W₂ + b₂ with d_ff = 2048, i.e. 4× d_model = 512; the 12 FFN sublayers hold ~25.2M parameters (~39% of the base model's 65M), roughly two-thirds of each encoder layer's budget; GeLU outperforms ReLU on NLP benchmarks (Hendrycks & Gimpel, 2016).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| FFN formula | FFN(x) = max(0, xW₁ + b₁)W₂ + b₂ | — | ReLU activation; GeLU variant: FFN(x) = GELU(xW₁ + b₁)W₂ + b₂ |
| d_model (input/output dimension) | 512 | dimensions | FFN input and output match d_model for residual connections |
| d_ff (inner dimension) | 2048 | dimensions | 4× d_model; chosen empirically; expands and then compresses the representation |
| W₁ parameters (per layer) | 512 × 2048 + 2048 = 1,050,624 | parameters | Weights + biases for the expansion layer |
| W₂ parameters (per layer) | 2048 × 512 + 512 = 1,049,088 | parameters | Weights + biases for the compression layer |
| Total FFN parameters per layer | 2,099,712 | parameters | ~2.1M per encoder or decoder layer; vs ~1.05M for the attention block |
| FFN share of base model parameters | ~39% | percent | 12 FFN sublayers × ~2.1M ≈ 25.2M of ~65M; within each encoder layer, ~2/3 of parameters (2.1M FFN vs ~1.05M self-attention) |
| GeLU vs ReLU (CIFAR-10 error) | 7.89% vs 8.16% | error rate | GeLU achieves lower error; Hendrycks & Gimpel (2016) Table 1 |
The position-wise feed-forward network (FFN) is the second major sublayer in every transformer encoder and decoder layer. Applied independently to each token position after the attention sublayer, it provides the non-linear capacity that multi-head attention — which applies only linear transformations to value vectors — cannot supply alone.
The Formula
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
where:
- x ∈ ℝ^{d_model} is the d_model=512 dimensional input for one token position
- W₁ ∈ ℝ^{d_model × d_ff} = ℝ^{512 × 2048} — expands the representation
- W₂ ∈ ℝ^{d_ff × d_model} = ℝ^{2048 × 512} — compresses back to d_model
- max(0, ·) is ReLU; the same sublayer with GeLU is GELU(xW₁ + b₁)W₂ + b₂
The FFN is applied identically and independently to each of the n token positions; it does not mix information across positions. As Vaswani et al. note, it can equivalently be described as two convolutions with kernel size 1 applied along the sequence.
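A minimal NumPy sketch of the sublayer, using the base-model dimensions from the text; the weights are randomly initialized purely for illustration. The final check demonstrates the "position-wise" property: feeding positions one at a time gives the same result as feeding the whole sequence.

```python
import numpy as np

# Base-model dimensions from the text; weights are random, for illustration only.
d_model, d_ff, n_positions = 512, 2048, 10

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.02, (d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(0, 0.02, (d_ff, d_model))
b2 = np.zeros(d_model)

def ffn(x):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied row-by-row (per position)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(n_positions, d_model))
y = ffn(x)
assert y.shape == (n_positions, d_model)  # output matches d_model for the residual

# Position-wise: processing each position alone reproduces the batched result,
# confirming the FFN mixes no information across positions.
y_rowwise = np.stack([ffn(x[i:i + 1])[0] for i in range(n_positions)])
assert np.allclose(y, y_rowwise)
```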
Parameter Breakdown Across Architectures
| Hyperparameter | Base Model | Big Model | Modern 4× rule |
|---|---|---|---|
| d_model | 512 | 1024 | varies |
| d_ff | 2048 | 4096 | 4 × d_model |
| d_ff / d_model ratio | 4× | 4× | 4× |
| W₁ weights (per FFN) | 1,048,576 | 4,194,304 | |
| W₂ weights (per FFN) | 1,048,576 | 4,194,304 | |
| Biases b₁ + b₂ (per FFN) | 2,560 | 5,120 | |
| Total per FFN sublayer | ~2.1M | ~8.4M | |
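The totals in the table can be reproduced with a few lines of arithmetic (weights plus biases for both projections):

```python
def ffn_params(d_model, d_ff):
    """Parameter count of one FFN sublayer: W1, b1, W2, b2."""
    expansion = d_model * d_ff + d_ff      # W1 weights + b1 biases
    compression = d_ff * d_model + d_model  # W2 weights + b2 biases
    return expansion + compression

base = ffn_params(512, 2048)
big = ffn_params(1024, 4096)
print(base)  # 2099712  (~2.1M)
print(big)   # 8393728  (~8.4M)
```

Note that the big model's bias count is 4096 + 1024 = 5,120, which the function above includes.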
Where Do the Parameters Go? (Base Model, 65M Total)
| Component | Layers | Params per layer | Total |
|---|---|---|---|
| Token embeddings (vocab=37,000) | — | — | ~18.9M |
| Self-attention blocks (6 enc + 6 dec) | 12 | ~1.05M | ~12.6M |
| Cross-attention blocks (decoder) | 6 | ~1.05M | ~6.3M |
| FFN sublayers (6 enc + 6 dec) | 12 | ~2.1M | ~25.2M |
| LayerNorms and remaining parameters | — | — | ~2M |
| Total | — | — | ~65M |
FFN sublayers account for roughly 39% of the base model’s parameters on their own; together with the self-attention (~19%) and decoder cross-attention (~10%) blocks, the 6+6 transformer layers hold roughly two-thirds of all parameters. Within each encoder layer, about two-thirds of the parameters sit in the FFN (~2.1M vs ~1.05M for self-attention).
GeLU vs ReLU
The original transformer uses ReLU. Later architectures switched to GeLU (Hendrycks & Gimpel, 2016), which is defined as:
GeLU(x) = x · Φ(x)
where Φ(x) is the standard Gaussian CDF. Unlike ReLU, GeLU applies a smooth, probabilistic gate that decreases output for negative inputs rather than zeroing them entirely.
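The exact form can be computed with the error function, since Φ(x) = ½(1 + erf(x/√2)). The small example below contrasts the two activations on a negative input; ReLU zeroes it, while GeLU passes a small negative value through.

```python
import math

def gelu(x):
    # Exact GeLU: x * Phi(x), with Phi the standard normal CDF via erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

print(relu(-0.5))            # 0.0        -- hard threshold
print(round(gelu(-0.5), 4))  # -0.1543    -- smooth, non-zero for negatives
print(round(gelu(2.0), 4))   # 1.9545     -- approaches identity for large x
```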
| Activation | CIFAR-10 error | CIFAR-100 error | Characteristic |
|---|---|---|---|
| ReLU | 8.16% | 21.77% | Hard threshold at 0; sparse activations |
| ELU | 8.41% | 22.98% | Smooth for negative inputs |
| GeLU | 7.89% | 20.74% | Smooth gate; weights by magnitude |
Shazeer (2020) further extended this with Gated Linear Units (GLU), where the FFN becomes:
FFN_GLU(x) = (xW₁ ⊙ σ(xW_gate)) W₂
This variant, its GeLU-gated form (GEGLU), and the Swish-gated SwiGLU are widely used in modern architectures for improved quality and training stability.
Related Pages
See multi-head-attention for the other parameter-dense sublayer in each layer, self-attention-mechanism for the attention formula, and transformer-architecture for how FFN and attention sublayers are composed with residual connections and layer normalization.
Sources
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
- Hendrycks & Gimpel (2016) — Gaussian Error Linear Units (GELUs). arXiv 2016
- Shazeer (2020) — GLU Variants Improve Transformer. arXiv 2020
Frequently Asked Questions
Why is d_ff set to 4× d_model in the original transformer?
The 4× ratio (d_ff=2048 for d_model=512) was chosen empirically by Vaswani et al. It provides sufficient capacity for the FFN to perform complex non-linear transformations of each token's representation after attention. In practice, d_ff ratios from 2.67× to 8× are used across modern architectures, with the 4× ratio remaining a common default.
What is the role of the FFN layer if attention already mixes token information?
Multi-head attention mixes information across token positions, but applies only a linear transformation to each token's value vector. The position-wise FFN applies an independent non-linear transformation to each token's representation individually. It is thought to act as a key-value memory (Geva et al., 2021), storing factual associations learned during training.
Why does GeLU outperform ReLU in transformer architectures?
GeLU (x·Φ(x), where Φ is the standard normal CDF) is a smooth function that weights inputs by their magnitude rather than applying a hard threshold at zero. This smoother activation landscape tends to produce better-conditioned gradients during training on language tasks. Hendrycks & Gimpel (2016) showed consistent improvements over ReLU across NLP, vision, and speech benchmarks.