Compute FLOPs: Counting Training and Inference Operations for Language Models
Training FLOPs ≈ 6·N·D for dense transformers (N parameters, D tokens); inference costs ≈ 2·N FLOPs per token; an A100 GPU delivers 312 TFLOPS (BF16), making GPT-3 training require ~10⁴ A100-days.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Training FLOPs formula | C ≈ 6 · N · D | FLOPs | N = parameters, D = training tokens; factor 6 = 2 (forward) + 4 (backward) |
| Inference FLOPs per token | C_inf ≈ 2 · N | FLOPs/token | One forward pass at inference; no backward pass needed |
| GPT-3 training FLOPs (175B, 300B tokens) | 3.14 × 10²³ | FLOPs | 6 × 175B × 300B = 3.15 × 10²³; commonly cited as 3.14 × 10²³, consistent with the Kaplan formula |
| NVIDIA A100 BF16 peak | 312 | TFLOPS | Tensor core peak; actual utilization typically 30–70% of theoretical peak |
| GPT-3 equivalent A100-days | ~11,600 | A100-days | 3.14×10²³ FLOPs / (312×10¹² FLOPs/s × 86,400 s/day) ≈ 11,600 at theoretical peak; real MFU of 30–50% raises this proportionally |
Understanding FLOP (floating-point operation) counts is essential for estimating training costs, comparing model efficiency, and reasoning about compute budgets. The key insight is that both training and inference FLOPs scale linearly with the number of parameters.
The 6·N·D Training Formula
For a dense transformer with N parameters trained on D tokens:
C_train ≈ 6 · N · D FLOPs
The factor 6 decomposes as:
- 2·N·D: forward pass (2 FLOPs per weight per token — one multiply, one add)
- 4·N·D: backward pass (gradient w.r.t. weights + gradient w.r.t. inputs, each ≈ 2·N·D)
This approximation ignores attention FLOPs (smaller than FFN for typical sequence lengths) and embedding operations (negligible at scale).
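The 6·N·D estimate can be sketched as a one-line function (the names here are illustrative, not from any particular library):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer.

    Uses C ~= 6 * N * D: 2 FLOPs per parameter per token for the
    forward pass, 4 for the backward pass. Ignores attention and
    embedding FLOPs, per the approximation above.
    """
    return 6 * n_params * n_tokens

# GPT-3: 175B parameters, 300B training tokens
c = train_flops(175e9, 300e9)
print(f"{c:.2e}")  # 3.15e+23
```

Plugging in the GPT-3 numbers recovers the ~3×10²³ figure in the table above.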
FLOPs Per Layer (Base Transformer, d_model=512, seq_len=512)
| Operation | FLOPs |
|---|---|
| Q, K, V projections (3 × d_model²) | 3 × 2 × 512² × 512 = 805M |
| Attention (Q·Kᵀ, scale, softmax, ·V) | 4 × 512² × 64 × 8 = 537M |
| Output projection (d_model²) | 2 × 512² × 512 = 268M |
| FFN (2 × d_model × d_ff) | 2 × 2 × 512 × 2048 × 512 = 2.1B |
| Total per layer | ~3.8B FLOPs/layer |
For 6 layers: ~22.5B FLOPs per forward pass for a 512-token sequence.
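The per-layer accounting above can be reproduced with a short script (multiply + add counted as 2 FLOPs; d_model = 512, d_ff = 2048, seq_len = 512, matching the base configuration):

```python
def layer_flops(d_model: int = 512, d_ff: int = 2048, seq_len: int = 512) -> int:
    """Forward-pass FLOPs for one transformer layer."""
    qkv = 3 * 2 * d_model**2 * seq_len      # Q, K, V projections
    attn = 4 * seq_len**2 * d_model         # Q·K^T and attn·V (2 matmuls, 2 FLOPs each)
    out_proj = 2 * d_model**2 * seq_len     # output projection
    ffn = 2 * 2 * d_model * d_ff * seq_len  # two FFN linear layers
    return qkv + attn + out_proj + ffn

total = layer_flops()
print(f"per layer: {total / 1e9:.2f}B, 6 layers: {6 * total / 1e9:.1f}B")
# per layer: 3.76B, 6 layers: 22.5B
```

Note that at seq_len = 512 the attention matmuls (537M) are well under the FFN cost (2.1B), which is why the 6·N·D approximation can drop them.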
Training Cost Comparison
| Model | Parameters | Tokens | Training FLOPs | Approx A100-days |
|---|---|---|---|---|
| BERT-base | 110M | 13.7B | ~9.1 × 10¹⁸ | ~0.3 |
| GPT-2 1.5B | 1.5B | 40B | ~3.6 × 10²⁰ | ~13 |
| GPT-3 | 175B | 300B | 3.14 × 10²³ | ~11,600 |
| Chinchilla | 70B | 1.4T | 5.88 × 10²³ | ~21,800 |
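The A100-days column can be derived from total FLOPs and the 312 TFLOPS BF16 peak; a small helper (illustrative, with an optional MFU factor since real runs never hit peak) might look like:

```python
PEAK_A100_BF16 = 312e12  # FLOPs/s, dense tensor-core peak
SECONDS_PER_DAY = 86_400

def a100_days(total_flops: float, mfu: float = 1.0) -> float:
    """GPU-days on A100s at the given model FLOPs utilization (1.0 = peak)."""
    return total_flops / (PEAK_A100_BF16 * SECONDS_PER_DAY * mfu)

print(f"GPT-3 at peak:    {a100_days(3.15e23):,.0f} A100-days")
print(f"GPT-3 at 45% MFU: {a100_days(3.15e23, mfu=0.45):,.0f} A100-days")
```

At a realistic 45% MFU the peak-rate estimate roughly doubles, which is why the table values should be read as lower bounds.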
Hardware Peak FLOPs
| GPU | BF16 Tensor FLOPs | FP16 Tensor FLOPs | HBM Bandwidth |
|---|---|---|---|
| NVIDIA V100 | — (no BF16 tensor cores) | 125 TFLOPS | 900 GB/s |
| NVIDIA A100 (80 GB) | 312 TFLOPS | 312 TFLOPS | 2,039 GB/s |
| NVIDIA H100 (SXM) | 989 TFLOPS | 989 TFLOPS | 3,350 GB/s |
Related Pages
See inference-vs-training-compute for the training/inference FLOP ratio, scaling-laws for how FLOPs determine optimal model size and token count, and chinchilla-scaling for compute-optimal training allocation.
Sources
- Kaplan et al. (2020) — Scaling Laws for Neural Language Models. arXiv
- Hoffmann et al. (2022) — Training Compute-Optimal Large Language Models. NeurIPS 2022
- Patterson et al. (2021) — Carbon Emissions and Large Neural Network Training. arXiv
Frequently Asked Questions
Why is the backward pass approximately 2× the forward pass in FLOPs?
The forward pass through each layer requires one matrix multiply per linear transformation. The backward pass requires two matrix multiplies per layer: one to compute gradients with respect to the inputs (∂L/∂x), and one to compute gradients with respect to the weights (∂L/∂W). So the backward pass is roughly 2× the forward pass, making the total per training token ≈ 3× the forward pass: 2 + 4 = 6 FLOPs per parameter per token, which is exactly the factor of 6 in C ≈ 6·N·D. The optimizer update adds only a handful of FLOPs per parameter per step (not per token), so it is negligible at scale.
How is hardware efficiency measured in practice?
Hardware efficiency (model FLOPs utilization, MFU) = (achieved FLOPs) / (peak FLOPs). In practice, large model training achieves 30–70% MFU due to communication overhead (gradient synchronization across GPUs), memory bandwidth bottlenecks (loading weights from HBM), and kernel launch latency. Well-optimized training runs for large models typically achieve 40–50% MFU on A100 clusters.
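MFU can be estimated directly from observed training throughput using the 6·N·D formula. A minimal sketch (the model size and throughput numbers below are hypothetical, chosen only to land in the typical 40–50% range):

```python
def mfu(n_params: float, tokens_per_second: float, n_gpus: int,
        peak_flops_per_gpu: float = 312e12) -> float:
    """Model FLOPs utilization: achieved training FLOP/s over aggregate peak.

    Achieved FLOP/s ~= 6 * N * (tokens/s), from C ~= 6*N*D per token.
    """
    achieved = 6 * n_params * tokens_per_second
    return achieved / (n_gpus * peak_flops_per_gpu)

# Hypothetical run: 13B-parameter model, 230k tokens/s on 128 A100s
print(f"MFU: {mfu(13e9, 230_000, 128):.1%}")  # MFU: 44.9%
```

This is the same accounting used in practice: measure sustained tokens/s, convert to model FLOP/s via 6·N, and divide by the cluster's theoretical peak.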
What is the FLOPs difference between training and inference?
Training costs ≈ 6·N·D FLOPs total (forward + backward; the optimizer step is negligible). Inference costs ≈ 2·N FLOPs per generated token (forward pass only). For GPT-3: training ≈ 3.14×10²³ FLOPs total; generating 1,000 tokens at inference ≈ 2 × 175B × 1,000 = 3.5×10¹⁴ FLOPs. The full training run costs ~10⁹ times more FLOPs than generating a single 1,000-token response. See also [inference-vs-training-compute](/ai/inference-vs-training-compute).
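The GPT-3 ratio worked out above is a two-line calculation:

```python
N = 175e9           # GPT-3 parameters
D = 300e9           # training tokens
gen_tokens = 1_000  # tokens generated in one inference response

train = 6 * N * D            # forward + backward, whole training run
infer = 2 * N * gen_tokens   # forward pass only
print(f"train: {train:.2e}, infer: {infer:.2e}, ratio: {train / infer:.1e}")
# train: 3.15e+23, infer: 3.50e+14, ratio: 9.0e+08
```

The ratio of ~9×10⁸ is why training dominates one-off costs, even though aggregate inference across millions of requests can eventually exceed it.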