Mixture of Experts: Sparse Gating, Switch Transformer, and Efficient Scaling

Category: representation Updated: 2026-02-27

Sparse MoE routes each token to the top-k of N expert FFN layers. Switch Transformer (Fedus et al., 2022) uses k=1 routing to scale to 1.6T total parameters while activating only ~7B per token, achieving a 7× pre-training speedup over the dense T5-Base baseline at the same compute budget.

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Switch Transformer total parameters | 1.6 | trillion | Fedus et al. (2022); each token activates ~7B parameters via top-1 routing |
| Switch Transformer parameters activated per token | ~7 | billion | Only 1 expert per MoE layer activated; ~0.4% of total parameters per token |
| Pre-training speedup (Switch-Base vs dense T5-Base) | 7 | × | Same compute budget; Switch reaches equivalent perplexity 7× faster in steps |
| Typical top-k routing | k = 1 or 2 | experts per token | k=1 (Switch); k=2 (GShard, most other MoE); k>2 shows diminishing returns |
| Expert capacity factor | 1.0–1.5 | | Maximum tokens per expert = capacity_factor × (tokens / n_experts); overflow tokens skip the MoE layer |

Mixture of Experts (MoE) is a conditional computation technique that scales model capacity without proportionally scaling per-token compute. By activating only a fraction of model parameters for each input token, MoE layers enable enormously large models to be trained on the same compute budget as smaller dense models.

Architecture

A standard transformer FFN layer has fixed parameters applied to every token. An MoE layer replaces it with:

  1. N expert FFN networks: E₁, E₂, …, E_N — each identical in structure to a standard FFN
  2. A gating network G(x) that produces routing weights over experts
  3. Top-k selection: for each token, activate the k experts with the highest gate values

The output of the MoE layer for token x:

MoE(x) = Σᵢ∈top-k G(x)ᵢ · Eᵢ(x)
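The routing rule above can be sketched in a few lines of NumPy. The names (`W_gate`, `experts`) and the single-token interface are illustrative assumptions; real implementations batch tokens, and some variants renormalize the top-k gate values before mixing.

```python
import numpy as np

def moe_forward(x, W_gate, experts, k=2):
    """Sparse MoE layer output for a single token x (illustrative sketch).

    x:       (d,) token representation
    W_gate:  (d, N) gating network weights
    experts: list of N callables, each mapping (d,) -> (d,)
    """
    logits = x @ W_gate                      # (N,) router logits
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                     # softmax gate values G(x)
    top_k = np.argsort(probs)[-k:]           # indices of the k largest gates
    # MoE(x) = sum over selected experts of G(x)_i * E_i(x)
    return sum(probs[i] * experts[i](x) for i in top_k)

rng = np.random.default_rng(0)
d, N = 8, 4
experts = [(lambda W: (lambda x: x @ W))(rng.normal(size=(d, d)))
           for _ in range(N)]               # toy linear "experts"
y = moe_forward(rng.normal(size=d), rng.normal(size=(d, N)), experts, k=2)
```

With k=1 this reduces to Switch-style routing: a single expert's output scaled by its gate probability.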

Switch Transformer vs Dense Baselines

Fedus et al. (2022) benchmarked Switch Transformer (k=1 routing) against dense T5 models on C4 pre-training:

| Model | Parameters | Active Params/Token | Steps to -1.90 NLL | Relative Speed |
|---|---|---|---|---|
| T5-Base (dense) | 223M | 223M | 500K | |
| T5-Large (dense) | 739M | 739M | 500K | |
| T5-11B (dense) | 11B | 11B | 500K | |
| Switch-Base (128 experts) | 7.4B | ~223M | 71K | 7× vs T5-Base |
| Switch-XXL (128 experts) | 395B | ~4.7B | 250K | |
| Switch-C (2048 experts) | 1.6T | ~7B | | 4× vs T5-11B |

Routing Strategies

| Method | k | Load Balancing | Key Paper |
|---|---|---|---|
| Sparsely-Gated MoE | 2 | Auxiliary loss + noise | Shazeer et al. (2017) |
| Switch Transformer | 1 | Auxiliary loss | Fedus et al. (2022) |
| GShard | 2 | Local group dispatch | Lepikhin et al. (2021) |
| Expert Choice | k per expert | Naturally balanced | Zhou et al. (2022) |

Expert Capacity and Token Overflow

Expert capacity defines the maximum number of tokens routed to each expert:

capacity = capacity_factor × (total_tokens / n_experts)

With capacity_factor=1.25 (a 25% buffer above uniform load), most tokens are processed normally. Overflow tokens (those exceeding an expert's capacity) skip the MoE layer and pass through via the residual connection. This caps the work any single expert can receive, preventing overloaded experts from bottlenecking training, at the cost of some tokens bypassing the MoE layer.
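The capacity rule can be simulated directly. This sketch assumes top-1 routing and first-come token priority within a batch (real systems decide priority differently, e.g. by router probability):

```python
import numpy as np

def dispatch_with_capacity(expert_ids, n_experts, capacity_factor=1.25):
    """Mark which tokens an expert actually processes (illustrative sketch).

    expert_ids: (T,) routed expert index per token (top-1 routing)
    Returns a boolean mask: True where the token is processed, False where
    it overflows capacity and falls back to the residual path.
    """
    T = len(expert_ids)
    capacity = int(capacity_factor * T / n_experts)
    counts = np.zeros(n_experts, dtype=int)
    keep = np.zeros(T, dtype=bool)
    for t, e in enumerate(expert_ids):       # earlier tokens win ties
        if counts[e] < capacity:
            counts[e] += 1
            keep[t] = True
    return keep

ids = np.array([0, 0, 0, 1, 1, 2, 3, 0])     # expert 0 is oversubscribed
mask = dispatch_with_capacity(ids, n_experts=4, capacity_factor=1.0)
```

Here capacity is int(1.0 × 8 / 4) = 2 tokens per expert, so the third and fourth tokens routed to expert 0 overflow and take the residual path.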

See feed-forward-layers for the dense FFN that MoE replaces, and scaling-laws for how MoE interacts with compute-optimal training principles.

Frequently Asked Questions

How does sparse MoE differ from a standard transformer FFN layer?

A standard transformer FFN applies the same learned weight matrix to every token. A sparse MoE layer has N expert FFN networks (each identical in structure to a standard FFN) and a gating network that routes each token to the top-k experts. Only the selected experts' parameters are used for each token. If N=64 experts and k=2, each token activates 2/64 ≈ 3% of the expert parameters, giving the model large total capacity while keeping per-token compute nearly the same as a single expert.

What is load balancing in MoE models and why does it matter?

Load balancing ensures tokens are distributed roughly evenly across experts. Without it, the gating network tends to collapse, repeatedly sending most tokens to the same few experts and leaving the others undertrained. Switch Transformer addresses this with an auxiliary load-balancing loss, L_aux = α · N · Σᵢ f_i · p_i, where N is the number of experts, f_i is the fraction of tokens routed to expert i, and p_i is the average gating probability assigned to expert i. The loss is minimized when routing is uniform, encouraging balanced expert utilization during training.
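The auxiliary loss is a few lines to compute. A sketch (function name and α value are illustrative; Switch Transformer uses α around 0.01):

```python
import numpy as np

def switch_aux_loss(router_probs, expert_ids, n_experts, alpha=0.01):
    """Auxiliary load-balancing loss, L_aux = alpha * N * sum_i f_i * p_i.

    router_probs: (T, N) softmax gate probabilities per token
    expert_ids:   (T,) top-1 expert chosen per token
    """
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(expert_ids, minlength=n_experts) / len(expert_ids)
    # p_i: mean router probability assigned to expert i
    p = router_probs.mean(axis=0)
    return alpha * n_experts * np.dot(f, p)

T, N = 8, 4
probs = np.full((T, N), 1 / N)               # perfectly uniform router
ids = np.arange(T) % N                       # perfectly balanced dispatch
loss = switch_aux_loss(probs, ids, N, alpha=0.01)   # minimum value: alpha
```

Under uniform routing f_i = p_i = 1/N, so the loss bottoms out at α; any imbalance pushes it higher, giving the router a gradient toward even dispatch.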

How are MoE models trained in practice?

MoE layers are distributed across multiple accelerators, with each device hosting a subset of experts. During the forward pass, tokens are dispatched across devices to their assigned experts (all-to-all communication), processed locally, then sent back (second all-to-all). The computational cost per token is similar to a standard FFN, but model capacity is multiplied by N. GShard (Lepikhin et al.) and Switch Transformer demonstrate this approach at trillion-parameter scale.
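The dispatch/combine pattern can be simulated on a single device. This sketch (names are illustrative) groups tokens by destination expert, the role of the first all-to-all, runs each expert on its batch, then scatters results back to original token order, the role of the second all-to-all:

```python
import numpy as np

def dispatch_combine(x, expert_ids, experts):
    """Simulate MoE dispatch -> expert compute -> combine (sketch).

    x:          (T, d) token representations
    expert_ids: (T,) destination expert per token
    experts:    list of callables mapping (n, d) -> (n, d)
    """
    y = np.empty_like(x)
    for e, expert in enumerate(experts):
        idx = np.where(expert_ids == e)[0]   # gather tokens for expert e
        if len(idx):
            y[idx] = expert(x[idx])          # batched expert computation
            # writing through idx scatters outputs back to token order
    return y

x = np.arange(6, dtype=float).reshape(6, 1)
ids = np.array([0, 1, 0, 1, 0, 1])
experts = [lambda v: v * 2, lambda v: v + 10]   # two toy experts
y = dispatch_combine(x, ids, experts)
```

In a real distributed setup the gather and scatter steps are all-to-all collectives across devices, but the data movement is the same.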
