Position-Wise Feed-Forward Layers: FFN Formula, Parameter Budget, and GeLU vs ReLU
Each transformer FFN sublayer computes max(0, xW₁ + b₁)W₂ + b₂ with d_ff = 2048, i.e. 4× d_model = 512; the 12 FFN sublayers hold ~25.2M parameters (~39% of the base model's 65M), roughly two-thirds of each encoder layer's budget; GeLU outperforms ReLU on NLP benchmarks (Hendrycks & Gimpel, 2016).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| FFN formula | FFN(x) = max(0, xW₁ + b₁)W₂ + b₂ | — | ReLU activation; GeLU variant: FFN(x) = GELU(xW₁ + b₁)W₂ + b₂ |
| d_model (input/output dimension) | 512 | dimensions | FFN input and output match d_model for residual connections |
| d_ff (inner dimension) | 2048 | dimensions | 4× d_model; chosen empirically; expands and then compresses the representation |
| W₁ parameters (per layer) | 512 × 2048 + 2048 = 1,050,624 | parameters | Weights + biases for the expansion layer |
| W₂ parameters (per layer) | 2048 × 512 + 512 = 1,049,088 | parameters | Weights + biases for the compression layer |
| Total FFN parameters per layer | 2,099,712 | parameters | ~2.1M per encoder or decoder layer; vs ~1.05M for the attention block |
| FFN share of base model parameters | ~39% | percent | 12 FFN sublayers × ~2.1M ≈ 25.2M of ~65M; within each encoder layer, ~2/3 of parameters (2.1M FFN vs ~1.05M self-attention) |
| GeLU vs ReLU (CIFAR-10 error) | 7.89% vs 8.16% | error rate | GeLU achieves lower error; Hendrycks & Gimpel (2016) Table 1 |
The position-wise feed-forward network (FFN) is the second major sublayer in every transformer encoder and decoder layer. Applied independently to each token position after the attention sublayer, it provides the non-linear capacity that multi-head attention — which applies only linear transformations to value vectors — cannot supply alone.
The Formula
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
where:
- x ∈ ℝ^{d_model} is the d_model=512 dimensional input for one token position
- W₁ ∈ ℝ^{d_model × d_ff} = ℝ^{512 × 2048} — expands the representation
- W₂ ∈ ℝ^{d_ff × d_model} = ℝ^{2048 × 512} — compresses back to d_model
- max(0, ·) is ReLU; the same sublayer with GeLU is GELU(xW₁ + b₁)W₂ + b₂
The FFN is applied identically and independently to each of the n token positions; it does not mix information across positions. As Vaswani et al. note, it can equivalently be described as two convolutions with kernel size 1 applied along the sequence.
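A minimal NumPy sketch of the sublayer, using the base-model dimensions from the text; the weights are randomly initialized purely for illustration. The final check demonstrates the "position-wise" property: feeding positions one at a time gives the same result as feeding the whole sequence.

```python
import numpy as np

# Base-model dimensions from the text; weights are random, for illustration only.
d_model, d_ff, n_positions = 512, 2048, 10

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.02, (d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(0, 0.02, (d_ff, d_model))
b2 = np.zeros(d_model)

def ffn(x):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied row-by-row (per position)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(n_positions, d_model))
y = ffn(x)
assert y.shape == (n_positions, d_model)  # output matches d_model for the residual

# Position-wise: processing each position alone reproduces the batched result,
# confirming the FFN mixes no information across positions.
y_rowwise = np.stack([ffn(x[i:i + 1])[0] for i in range(n_positions)])
assert np.allclose(y, y_rowwise)
```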
Parameter Breakdown Across Architectures
| Hyperparameter | Base Model | Big Model | Modern 4× rule |
|---|---|---|---|
| d_model | 512 | 1024 | varies |
| d_ff | 2048 | 4096 | 4 × d_model |
| d_ff / d_model ratio | 4× | 4× | 4× |
| W₁ weights (per FFN) | 1,048,576 | 4,194,304 | |
| W₂ weights (per FFN) | 1,048,576 | 4,194,304 | |
| Biases b₁ + b₂ (per FFN) | 2,560 | 5,120 | |
| Total per FFN sublayer | ~2.1M | ~8.4M | |
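The totals in the table can be reproduced with a few lines of arithmetic (weights plus biases for both projections):

```python
def ffn_params(d_model, d_ff):
    """Parameter count of one FFN sublayer: W1, b1, W2, b2."""
    expansion = d_model * d_ff + d_ff      # W1 weights + b1 biases
    compression = d_ff * d_model + d_model  # W2 weights + b2 biases
    return expansion + compression

base = ffn_params(512, 2048)
big = ffn_params(1024, 4096)
print(base)  # 2099712  (~2.1M)
print(big)   # 8393728  (~8.4M)
```

Note that the big model's bias count is 4096 + 1024 = 5,120, which the function above includes.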
Where Do the Parameters Go? (Base Model, 65M Total)
| Component | Layers | Params per layer | Total |
|---|---|---|---|
| Token embeddings (vocab=37,000) | — | — | ~18.9M |
| Self-attention blocks (6 enc + 6 dec) | 12 | ~1.05M | ~12.6M |
| Cross-attention blocks (decoder) | 6 | ~1.05M | ~6.3M |
| FFN sublayers (6 enc + 6 dec) | 12 | ~2.1M | ~25.2M |
| LayerNorms and remaining parameters | — | — | ~2M |
| Total | — | — | ~65M |
FFN sublayers account for roughly 39% of the base model’s parameters on their own; together with the self-attention (~19%) and decoder cross-attention (~10%) blocks, the 6+6 transformer layers hold roughly two-thirds of all parameters. Within each encoder layer, about two-thirds of the parameters sit in the FFN (~2.1M vs ~1.05M for self-attention).
GeLU vs ReLU
The original transformer uses ReLU. Later architectures switched to GeLU (Hendrycks & Gimpel, 2016), which is defined as:
GeLU(x) = x · Φ(x)
where Φ(x) is the standard Gaussian CDF. Unlike ReLU, GeLU applies a smooth, probabilistic gate that decreases output for negative inputs rather than zeroing them entirely.
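The exact form can be computed with the error function, since Φ(x) = ½(1 + erf(x/√2)). The small example below contrasts the two activations on a negative input; ReLU zeroes it, while GeLU passes a small negative value through.

```python
import math

def gelu(x):
    # Exact GeLU: x * Phi(x), with Phi the standard normal CDF via erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

print(relu(-0.5))            # 0.0        -- hard threshold
print(round(gelu(-0.5), 4))  # -0.1543    -- smooth, non-zero for negatives
print(round(gelu(2.0), 4))   # 1.9545     -- approaches identity for large x
```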
| Activation | CIFAR-10 error | CIFAR-100 error | Characteristic |
|---|---|---|---|
| ReLU | 8.16% | 21.77% | Hard threshold at 0; sparse activations |
| ELU | 8.41% | 22.98% | Smooth for negative inputs |
| GeLU | 7.89% | 20.74% | Smooth gate; weights by magnitude |
Shazeer (2020) further extended this with Gated Linear Units (GLU), where the FFN becomes:
FFN_GLU(x) = (xW₁ ⊙ σ(xW_gate)) W₂
This variant, its GeLU-gated form (GEGLU), and the Swish-gated SwiGLU are widely used in modern architectures for improved quality and training stability.
Related Pages
See multi-head-attention for the other parameter-dense sublayer in each layer, self-attention-mechanism for the attention formula, and transformer-architecture for how FFN and attention sublayers are composed with residual connections and layer normalization.
Sources
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
- Hendrycks & Gimpel (2016) — Gaussian Error Linear Units (GELUs). arXiv 2016
- Shazeer (2020) — GLU Variants Improve Transformer. arXiv 2020
Frequently Asked Questions
Why is d_ff set to 4× d_model in the original transformer?
The 4× ratio (d_ff=2048 for d_model=512) was chosen empirically by Vaswani et al. It provides sufficient capacity for the FFN to perform complex non-linear transformations of each token's representation after attention. In practice, d_ff ratios from 2.67× to 8× are used across modern architectures, with the 4× ratio remaining a common default.
What is the role of the FFN layer if attention already mixes token information?
Multi-head attention mixes information across token positions, but applies only a linear transformation to each token's value vector. The position-wise FFN applies an independent non-linear transformation to each token's representation individually. It is thought to act as a key-value memory (Geva et al., 2021), storing factual associations learned during training.
Why does GeLU outperform ReLU in transformer architectures?
GeLU (x·Φ(x), where Φ is the standard normal CDF) is a smooth function that weights inputs by their magnitude rather than applying a hard threshold at zero. This smoother activation landscape tends to produce better-conditioned gradients during training on language tasks. Hendrycks & Gimpel (2016) showed consistent improvements over ReLU across NLP, vision, and speech benchmarks.