Compute FLOPs: Counting Training and Inference Operations for Language Models
Training FLOPs ≈ 6·N·D for dense transformers (N parameters, D tokens); inference costs ≈ 2·N FLOPs per token; an A100 GPU delivers 312 TFLOPS (BF16), making GPT-3 training require ~10⁴ A100-days.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Training FLOPs formula | C ≈ 6 · N · D | FLOPs | N = parameters, D = training tokens; factor 6 = 2 (forward) + 4 (backward) |
| Inference FLOPs per token | C_inf ≈ 2 · N | FLOPs/token | One forward pass at inference; no backward pass needed |
| GPT-3 training FLOPs (175B, 300B tokens) | 3.14 × 10²³ | FLOPs | 6 × 175B × 300B = 3.15 × 10²³; commonly cited as 3.14 × 10²³, consistent with the Kaplan formula |
| NVIDIA A100 BF16 peak | 312 | TFLOPS | Tensor core peak; actual utilization typically 30–70% of theoretical peak |
| GPT-3 equivalent A100-days | ~11,600 | A100-days | 3.14×10²³ FLOPs / (312×10¹² FLOPs/s × 86,400 s/day) ≈ 11,600 at theoretical peak; real MFU of 30–50% raises this proportionally |
Understanding FLOP (floating-point operation) counts is essential for estimating training costs, comparing model efficiency, and reasoning about compute budgets. The key insight is that both training and inference FLOPs scale linearly with the number of parameters.
The 6·N·D Training Formula
For a dense transformer with N parameters trained on D tokens:
C_train ≈ 6 · N · D FLOPs
The factor 6 decomposes as:
- 2·N·D: forward pass (2 FLOPs per weight per token — one multiply, one add)
- 4·N·D: backward pass (gradient w.r.t. weights + gradient w.r.t. inputs, each ≈ 2·N·D)
This approximation ignores attention FLOPs (smaller than FFN for typical sequence lengths) and embedding operations (negligible at scale).
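The 6·N·D estimate can be sketched as a one-line function (the names here are illustrative, not from any particular library):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer.

    Uses C ~= 6 * N * D: 2 FLOPs per parameter per token for the
    forward pass, 4 for the backward pass. Ignores attention and
    embedding FLOPs, per the approximation above.
    """
    return 6 * n_params * n_tokens

# GPT-3: 175B parameters, 300B training tokens
c = train_flops(175e9, 300e9)
print(f"{c:.2e}")  # 3.15e+23
```

Plugging in the GPT-3 numbers recovers the ~3×10²³ figure in the table above.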
FLOPs Per Layer (Base Transformer, d_model=512, seq_len=512)
| Operation | FLOPs |
|---|---|
| Q, K, V projections (3 × d_model²) | 3 × 2 × 512² × 512 = 805M |
| Attention (Q·Kᵀ, scale, softmax, ·V) | 4 × 512² × 64 × 8 = 537M |
| Output projection (d_model²) | 2 × 512² × 512 = 268M |
| FFN (2 × d_model × d_ff) | 2 × 2 × 512 × 2048 × 512 = 2.1B |
| Total per layer | ~3.8B FLOPs/layer |
For 6 layers: ~22.5B FLOPs per forward pass for a 512-token sequence.
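The per-layer accounting above can be reproduced with a short script (multiply + add counted as 2 FLOPs; d_model = 512, d_ff = 2048, seq_len = 512, matching the base configuration):

```python
def layer_flops(d_model: int = 512, d_ff: int = 2048, seq_len: int = 512) -> int:
    """Forward-pass FLOPs for one transformer layer."""
    qkv = 3 * 2 * d_model**2 * seq_len      # Q, K, V projections
    attn = 4 * seq_len**2 * d_model         # Q·K^T and attn·V (2 matmuls, 2 FLOPs each)
    out_proj = 2 * d_model**2 * seq_len     # output projection
    ffn = 2 * 2 * d_model * d_ff * seq_len  # two FFN linear layers
    return qkv + attn + out_proj + ffn

total = layer_flops()
print(f"per layer: {total / 1e9:.2f}B, 6 layers: {6 * total / 1e9:.1f}B")
# per layer: 3.76B, 6 layers: 22.5B
```

Note that at seq_len = 512 the attention matmuls (537M) are well under the FFN cost (2.1B), which is why the 6·N·D approximation can drop them.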
Training Cost Comparison
| Model | Parameters | Tokens | Training FLOPs | Approx A100-days |
|---|---|---|---|---|
| BERT-base | 110M | 13.7B | ~9.1 × 10¹⁸ | ~0.3 |
| GPT-2 1.5B | 1.5B | 40B | ~3.6 × 10²⁰ | ~13 |
| GPT-3 | 175B | 300B | 3.14 × 10²³ | ~11,600 |
| Chinchilla | 70B | 1.4T | 5.88 × 10²³ | ~21,800 |
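The A100-days column can be derived from total FLOPs and the 312 TFLOPS BF16 peak; a small helper (illustrative, with an optional MFU factor since real runs never hit peak) might look like:

```python
PEAK_A100_BF16 = 312e12  # FLOPs/s, dense tensor-core peak
SECONDS_PER_DAY = 86_400

def a100_days(total_flops: float, mfu: float = 1.0) -> float:
    """GPU-days on A100s at the given model FLOPs utilization (1.0 = peak)."""
    return total_flops / (PEAK_A100_BF16 * SECONDS_PER_DAY * mfu)

print(f"GPT-3 at peak:    {a100_days(3.15e23):,.0f} A100-days")
print(f"GPT-3 at 45% MFU: {a100_days(3.15e23, mfu=0.45):,.0f} A100-days")
```

At a realistic 45% MFU the peak-rate estimate roughly doubles, which is why the table values should be read as lower bounds.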
Hardware Peak FLOPs
| GPU | BF16 Tensor FLOPs | FP16 Tensor FLOPs | HBM Bandwidth |
|---|---|---|---|
| NVIDIA V100 | — (no BF16 tensor cores) | 125 TFLOPS | 900 GB/s |
| NVIDIA A100 (80 GB) | 312 TFLOPS | 312 TFLOPS | 2,039 GB/s |
| NVIDIA H100 (SXM) | 989 TFLOPS | 989 TFLOPS | 3,350 GB/s |
Related Pages
See inference-vs-training-compute for the training/inference FLOP ratio, scaling-laws for how FLOPs determine optimal model size and token count, and chinchilla-scaling for compute-optimal training allocation.
Sources
- Kaplan et al. (2020) — Scaling Laws for Neural Language Models. arXiv
- Hoffmann et al. (2022) — Training Compute-Optimal Large Language Models. NeurIPS 2022
- Patterson et al. (2021) — Carbon Emissions and Large Neural Network Training. arXiv
Frequently Asked Questions
Why is the backward pass approximately 2× the forward pass in FLOPs?
The forward pass through each layer requires one matrix multiply per linear transformation. The backward pass requires two matrix multiplies per layer: one to compute gradients with respect to the inputs (∂L/∂x), and one to compute gradients with respect to the weights (∂L/∂W). So the backward pass is roughly 2× the forward pass, making the total per training token ≈ 3× the forward pass: 2 + 4 = 6 FLOPs per parameter per token, which is exactly the factor of 6 in C ≈ 6·N·D. The optimizer update adds only a handful of FLOPs per parameter per step (not per token), so it is negligible at scale.
How is hardware efficiency measured in practice?
Hardware efficiency (model FLOPs utilization, MFU) = (achieved FLOPs) / (peak FLOPs). In practice, large model training achieves 30–70% MFU due to communication overhead (gradient synchronization across GPUs), memory bandwidth bottlenecks (loading weights from HBM), and kernel launch latency. Well-optimized training runs for large models typically achieve 40–50% MFU on A100 clusters.
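MFU can be estimated directly from observed training throughput using the 6·N·D formula. A minimal sketch (the model size and throughput numbers below are hypothetical, chosen only to land in the typical 40–50% range):

```python
def mfu(n_params: float, tokens_per_second: float, n_gpus: int,
        peak_flops_per_gpu: float = 312e12) -> float:
    """Model FLOPs utilization: achieved training FLOP/s over aggregate peak.

    Achieved FLOP/s ~= 6 * N * (tokens/s), from C ~= 6*N*D per token.
    """
    achieved = 6 * n_params * tokens_per_second
    return achieved / (n_gpus * peak_flops_per_gpu)

# Hypothetical run: 13B-parameter model, 230k tokens/s on 128 A100s
print(f"MFU: {mfu(13e9, 230_000, 128):.1%}")  # MFU: 44.9%
```

This is the same accounting used in practice: measure sustained tokens/s, convert to model FLOP/s via 6·N, and divide by the cluster's theoretical peak.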
What is the FLOPs difference between training and inference?
Training costs ≈ 6·N·D FLOPs total (forward + backward; the optimizer step is negligible). Inference costs ≈ 2·N FLOPs per generated token (forward pass only). For GPT-3: training ≈ 3.14×10²³ FLOPs total; generating 1,000 tokens at inference ≈ 2 × 175B × 1,000 = 3.5×10¹⁴ FLOPs. The full training run costs ~10⁹ times more FLOPs than generating a single 1,000-token response. See also [inference-vs-training-compute](/ai/inference-vs-training-compute).
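The GPT-3 ratio worked out above is a two-line calculation:

```python
N = 175e9           # GPT-3 parameters
D = 300e9           # training tokens
gen_tokens = 1_000  # tokens generated in one inference response

train = 6 * N * D            # forward + backward, whole training run
infer = 2 * N * gen_tokens   # forward pass only
print(f"train: {train:.2e}, infer: {infer:.2e}, ratio: {train / infer:.1e}")
# train: 3.15e+23, infer: 3.50e+14, ratio: 9.0e+08
```

The ratio of ~9×10⁸ is why training dominates one-off costs, even though aggregate inference across millions of requests can eventually exceed it.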