Inference vs Training Compute: FLOPs per Token vs Total Training Cost

Category: inference · Updated: 2026-02-27

Training FLOPs ≈ 6·N·D for dense transformers (N parameters, D tokens); inference ≈ 2·N FLOPs per token. A 70B model requires ~1.4×10¹¹ FLOPs per generated token versus ~5.9×10²³ total training FLOPs (Chinchilla-optimal), so the training budget equals ~4.2 trillion inference tokens.

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Training FLOPs approximation | C_train ≈ 6·N·D | FLOPs | N = parameters, D = tokens; factor 6 = 2 (forward) + 4 (backward + optimizer step) |
| Inference FLOPs per token (with KV cache) | C_infer ≈ 2·N | FLOPs/token | One multiply-add (2 FLOPs) per parameter per token; no backward pass; past tokens cached, not recomputed |
| Training-equivalent inference tokens | C_train / C_infer = 3·D | inference tokens | 6·N·D training FLOPs ÷ 2·N FLOPs per inference token = 3·D tokens |
| 70B Chinchilla training FLOPs | ~5.9×10²³ | FLOPs | 70B params × 1.4T tokens × 6 ≈ 5.9×10²³; ~22,000 A100-days at peak FP16 throughput |
| 70B inference FLOPs per token | ~1.4×10¹¹ | FLOPs | 2 × 70×10⁹ = 1.4×10¹¹; compute-bound ceiling on an A100 (312 TFLOPS) is ~2,200 tokens/s, though single-request decode runs far below this (memory-bandwidth bound) |

Understanding the compute split between training and inference is fundamental to reasoning about large language model economics, energy costs, and deployment strategies. Training and inference have structurally different cost profiles: training is a one-time fixed cost, while inference cost scales linearly with deployment usage.

FLOPs Breakdown: Training

For a dense transformer with N parameters trained on D tokens:

C_train ≈ 6 · N · D

The factor 6 decomposes as:

  • 2·N per token: forward pass (compute activations, loss)
  • ~4·N per token: backward pass (compute gradients via backpropagation; one pass for ∂L/∂a, one for ∂L/∂W)
  • Optimizer step (Adam: update first and second moment estimates per parameter); this is O(N) per step, small enough to be folded into the factor of 6
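As a quick numeric check, the approximation can be evaluated directly (a minimal sketch; the parameter/token pairings are the Chinchilla-optimal ones used elsewhere on this page):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate dense-transformer training cost: C_train ≈ 6·N·D."""
    return 6 * n_params * n_tokens

# Chinchilla-optimal pairings (~20 training tokens per parameter)
for n, d in [(1e9, 20e9), (7e9, 140e9), (70e9, 1.4e12)]:
    print(f"N = {n:.0e} params, D = {d:.0e} tokens: {train_flops(n, d):.1e} FLOPs")
```

For the 70B case this reproduces the ~5.9×10²³ figure quoted above.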

FLOPs Breakdown: Inference

For a single generated token (assuming full KV cache):

C_infer ≈ 2 · N

The factor 2 reflects one multiply and one add per parameter: each generated token performs a multiply-accumulate with every weight in the attention projections (Q, K, V, output) and the feed-forward layers. No backward pass; no optimizer state maintained.
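The per-token cost and the training-equivalence ratio follow directly (a sketch; `n` and `d` here are the 70B Chinchilla figures):

```python
def infer_flops_per_token(n_params: float) -> float:
    """Decode cost with a full KV cache: C_infer ≈ 2·N FLOPs per token."""
    return 2 * n_params

n, d = 70e9, 1.4e12                      # 70B params, Chinchilla-optimal tokens
per_token = infer_flops_per_token(n)     # 1.4e11 FLOPs per generated token
equiv = (6 * n * d) / per_token          # training ÷ per-token cost = 3·D = 4.2e12
print(per_token, equiv)
```

Note that N cancels in the ratio: the training-equivalent token count depends only on D, which is why the last column of the scale table is always 3·D.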

Compute at Scale

| Model Size (N) | Training Tokens (D) | Total Training (FLOPs) | Inference/Token (FLOPs) | Training ≡ Inference Tokens (3·D) |
|---|---|---|---|---|
| 1B | 20B (Chinchilla) | 1.2×10²⁰ | 2×10⁹ | ~6×10¹⁰ |
| 7B | 140B (Chinchilla) | 5.9×10²¹ | 1.4×10¹⁰ | ~4.2×10¹¹ |
| 70B | 1.4T (Chinchilla) | 5.9×10²³ | 1.4×10¹¹ | ~4.2×10¹² |

Batch Size and Inference Efficiency

| Batch Size | Arithmetic Intensity | Throughput | Bottleneck |
|---|---|---|---|
| 1 (latency-optimized) | ~1 FLOP/byte | Low tokens/s | Memory bandwidth |
| 32 | ~32 FLOP/byte | Moderate | Mixed |
| 512+ | High | Near-peak FLOP/s | Compute |

At batch size 1, decode throughput sits roughly two orders of magnitude below GPU peak compute due to memory bandwidth constraints. Batching requests amortizes the cost of loading the weights across all tokens generated in a step.
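The batching effect can be sketched with a simple roofline model. This is an illustration, not a benchmark: the 312 TFLOPS peak and ~2 TB/s HBM bandwidth are assumed A100-class figures, and the model ignores KV-cache and activation traffic:

```python
PEAK_FLOPS = 312e12    # assumed A100-class FP16 tensor-core peak, FLOP/s
HBM_BW = 2e12          # assumed HBM bandwidth, bytes/s (~A100 80GB)
N = 70e9               # model parameters
WEIGHT_BYTES = 2 * N   # FP16: 2 bytes per parameter

def decode_tokens_per_sec(batch: int) -> float:
    """Roofline estimate of decode throughput: each step loads the weights
    once and performs 2*N FLOPs per sequence in the batch."""
    compute_time = (2 * N * batch) / PEAK_FLOPS  # seconds if compute-bound
    memory_time = WEIGHT_BYTES / HBM_BW          # seconds if bandwidth-bound
    return batch / max(compute_time, memory_time)

print(decode_tokens_per_sec(1))    # ~14 tokens/s: bandwidth-bound
print(decode_tokens_per_sec(512))  # ~2,200 tokens/s: near the compute ceiling
```

Under these assumptions, throughput grows almost linearly with batch size until the compute-bound line is reached, after which larger batches only add latency.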

Prefill vs Decode Phases

Inference has two distinct phases:

| Phase | Input | FLOPs | Parallelism |
|---|---|---|---|
| Prefill (prompt processing) | All input tokens at once | 2·N·T_prompt | Full parallelism across tokens |
| Decode (token generation) | One token per step | 2·N per token | Sequential |

The prefill phase computes at high arithmetic intensity and is GPU compute-bound. The decode phase generates sequentially and is memory-bandwidth bound unless batched.
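A sketch of how a single request's FLOPs split across the two phases (the 1,000-token prompt and 200 generated tokens are illustrative values, not figures from this page):

```python
def request_flops(n_params: float, prompt_toks: int, gen_toks: int) -> tuple[float, float]:
    """Split one request into prefill FLOPs (all prompt tokens processed in
    one parallel pass) and decode FLOPs (one sequential step per output token)."""
    prefill = 2 * n_params * prompt_toks   # 2*N*T_prompt
    decode = 2 * n_params * gen_toks       # 2*N per generated token
    return prefill, decode

# 70B model, 1,000-token prompt, 200 generated tokens
prefill, decode = request_flops(70e9, prompt_toks=1000, gen_toks=200)
print(f"prefill {prefill:.1e} FLOPs, decode {decode:.1e} FLOPs")
```

Although prefill here does 5× the FLOPs of decode, it typically takes far less wall-clock time, because it runs at high arithmetic intensity while decode is bandwidth-bound.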

See compute-flops for detailed FLOP counting methodology, kv-cache for how caching eliminates redundant prefill recomputation, and quantization for reducing the memory bottleneck in decode.


Frequently Asked Questions

Why is the backward pass approximately 2× more expensive than the forward pass?

The forward pass computes activations and the loss in one traversal of the network, costing ≈ 2·N FLOPs per token. Backpropagation must apply the chain rule in reverse: one pass computes gradients of the loss w.r.t. activations (∂L/∂a), another computes gradients w.r.t. weights (∂L/∂W), so the backward pass costs roughly twice the forward pass, ≈ 4·N FLOPs per token. The Adam optimizer step adds a comparatively small O(N) update of the first and second moment estimates per parameter. Backward plus optimizer therefore contribute ≈ 4·N on top of the forward pass's 2·N, giving the training total of ≈ 6·N FLOPs per token, i.e., C_train ≈ 6·N·D.

Why is inference memory-bandwidth bound at small batch sizes?

For a single request (batch size 1), generating one token requires loading all N model parameters from GPU HBM to compute just 2·N FLOPs. The arithmetic intensity is 2·N FLOPs ÷ 2·N bytes (FP16) = 1 FLOP/byte, far below the A100's compute-to-memory ratio (roughly 150–200 FLOP/byte, depending on the variant). The GPU spends most of its time waiting on memory transfers rather than computing. Larger batch sizes improve arithmetic intensity by amortizing the weight loads over multiple simultaneous requests.
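The same arithmetic as a function of batch size (a sketch; FP16 weights are assumed, so 2 bytes per parameter, and the ~150 FLOP/byte machine balance is an assumed A100-class figure):

```python
def arithmetic_intensity(batch: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weight traffic in one decode step:
    (2*N*batch FLOPs) / (N*bytes_per_param bytes); N cancels out."""
    return 2 * batch / bytes_per_param

MACHINE_BALANCE = 150  # assumed FLOP/byte ratio of an A100-class GPU
print(arithmetic_intensity(1))    # 1.0 FLOP/byte at batch 1: memory-bound
print(arithmetic_intensity(256))  # 256.0, above the machine balance: compute-bound
```

Because N cancels, this crossover batch size is roughly model-size independent; quantizing weights (fewer bytes per parameter) raises the intensity at every batch size.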
