Compute FLOPs: Counting Training and Inference Operations for Language Models

Category: training · Updated: 2026-02-27

Training FLOPs ≈ 6·N·D for dense transformers (N parameters, D tokens); inference costs ≈ 2·N FLOPs per token. An NVIDIA A100 delivers 312 TFLOPS of BF16 tensor-core peak, making GPT-3 training equivalent to ~10⁴ A100-days at theoretical peak.

Key Data Points

| Measure | Value | Unit | Notes |
| --- | --- | --- | --- |
| Training FLOPs formula | C ≈ 6 · N · D | FLOPs | N = parameters, D = training tokens; factor 6 = 2 (forward) + 4 (backward) |
| Inference FLOPs per token | C_inf ≈ 2 · N | FLOPs/token | One forward pass at inference; no backward pass needed |
| GPT-3 training FLOPs (175B params, 300B tokens) | 3.14 × 10²³ | FLOPs | 6 × 175B × 300B ≈ 3.15 × 10²³; consistent with the Kaplan formula |
| NVIDIA A100 BF16 peak | 312 | TFLOPS | Tensor-core peak; achieved utilization is typically 30–70% of this |
| GPT-3 equivalent A100-days | ~11,600 | A100-days | 3.14 × 10²³ FLOPs / (312 × 10¹² FLOPs/s × 86,400 s/day) ≈ 11,600, assuming theoretical peak throughput |

Understanding FLOP (floating-point operation) counts is essential for estimating training costs, comparing model efficiency, and reasoning about compute budgets. The key insight is that both training and inference FLOPs scale linearly with the number of parameters.

The 6·N·D Training Formula

For a dense transformer with N parameters trained on D tokens:

C_train ≈ 6 · N · D FLOPs

The factor 6 decomposes as:

  • 2·N·D: forward pass (2 FLOPs per weight per token — one multiply, one add)
  • 4·N·D: backward pass (gradient w.r.t. weights + gradient w.r.t. inputs, each ≈ 2·N·D)

This approximation ignores attention FLOPs (smaller than FFN for typical sequence lengths) and embedding operations (negligible at scale).
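As a sanity check, the formula can be evaluated directly; `train_flops` below is an illustrative helper, not part of any library:

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """C ≈ 6·N·D: 2 FLOPs/param/token forward + 4 backward."""
    return 6 * n_params * n_tokens

# GPT-3 scale: N = 175B parameters, D = 300B tokens
print(f"{train_flops(175e9, 300e9):.2e}")  # → 3.15e+23
```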

FLOPs Per Layer (base Transformer: d_model = 512, seq_len = 512, 8 heads, d_ff = 2048)

| Operation | FLOPs |
| --- | --- |
| Q, K, V projections (3 × d_model²) | 3 × 2 × 512² × 512 = 805M |
| Attention (Q·Kᵀ, scale, softmax, ·V) | 4 × 512² × 64 × 8 = 537M |
| Output projection (d_model²) | 2 × 512² × 512 = 268M |
| FFN (2 × d_model × d_ff) | 2 × 2 × 512 × 2048 × 512 = 2.1B |
| Total per layer | ~3.7B FLOPs |

For 6 layers: ~22B FLOPs per forward pass over a 512-token sequence.
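The per-layer arithmetic can be reproduced with a short script (function and parameter names here are illustrative, assuming 2 FLOPs per multiply–accumulate):

```python
def per_layer_flops(d_model: int, seq_len: int, n_heads: int, d_ff: int) -> dict:
    """Forward-pass FLOPs for one transformer layer (2 FLOPs per MAC)."""
    d_k = d_model // n_heads
    return {
        "qkv": 3 * 2 * seq_len * d_model**2,      # Q, K, V projections
        "attn": 4 * seq_len**2 * d_k * n_heads,   # Q·Kᵀ and attn·V matmuls
        "out": 2 * seq_len * d_model**2,          # output projection
        "ffn": 2 * 2 * seq_len * d_model * d_ff,  # two FFN matmuls
    }

f = per_layer_flops(d_model=512, seq_len=512, n_heads=8, d_ff=2048)
total = sum(f.values())                           # ≈ 3.76e9 per layer
print(f"6 layers: {6 * total / 1e9:.1f}B FLOPs")  # → 6 layers: 22.5B FLOPs
```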

Training Cost Comparison

| Model | Parameters | Tokens | Training FLOPs | Approx. A100-days (at peak) |
| --- | --- | --- | --- | --- |
| BERT-base | 110M | 13.7B | ~9.1 × 10¹⁸ | ~0.3 |
| GPT-2 1.5B | 1.5B | 40B | ~3.6 × 10²⁰ | ~13 |
| GPT-3 | 175B | 300B | 3.14 × 10²³ | ~11,600 |
| Chinchilla | 70B | 1.4T | 5.88 × 10²³ | ~21,800 |
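The A100-days figures follow from dividing training FLOPs by one GPU-day of peak throughput. A sketch (the 40% MFU scenario is an assumption for illustration, not a measured value):

```python
A100_PEAK = 312e12  # BF16 tensor-core FLOPs/s
DAY = 86_400        # seconds per day

def a100_days(flops: float, mfu: float = 1.0) -> float:
    """GPU-days needed at the given model-FLOPs utilization (1.0 = peak)."""
    return flops / (A100_PEAK * DAY * mfu)

gpt3 = 6 * 175e9 * 300e9  # ≈ 3.15e23 FLOPs
print(f"{a100_days(gpt3):,.0f} at peak, {a100_days(gpt3, 0.4):,.0f} at 40% MFU")
```

Real runs never hit peak, so the rightmost column understates wall-clock cost by roughly the inverse of the achieved MFU.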

Hardware Peak FLOPs

| GPU | BF16 Tensor Peak | FP16 Tensor Peak | HBM Bandwidth |
| --- | --- | --- | --- |
| NVIDIA V100 | — (no BF16 tensor cores) | 125 TFLOPS | 900 GB/s |
| NVIDIA A100 | 312 TFLOPS | 312 TFLOPS | ~2,000 GB/s |
| NVIDIA H100 (SXM) | 989 TFLOPS | 989 TFLOPS | 3,350 GB/s |

See inference-vs-training-compute for the training/inference FLOP ratio, scaling-laws for how FLOPs determine optimal model size and token count, and chinchilla-scaling for compute-optimal training allocation.


Frequently Asked Questions

Why is the backward pass approximately 2× the forward pass in FLOPs?

The forward pass requires one matrix multiply per linear transformation, costing ≈ 2·N FLOPs per token. The backward pass requires two matrix multiplies per layer: one to compute gradients with respect to the inputs (∂L/∂x) and one to compute gradients with respect to the weights (∂L/∂W), so it costs roughly 2× the forward pass (≈ 4·N per token). Forward plus backward is therefore 3× the forward pass, which is exactly the factor 6 in C ≈ 6·N·D. The optimizer step adds only O(N) FLOPs per update, independent of the number of tokens processed, and is negligible at scale.

How is hardware efficiency measured in practice?

Hardware efficiency (model FLOPs utilization, MFU) = (achieved FLOPs) / (peak FLOPs). In practice, large model training achieves 30–70% MFU due to communication overhead (gradient synchronization across GPUs), memory bandwidth bottlenecks (loading weights from HBM), and kernel launch latency. Well-optimized training runs for large models typically achieve 40–50% MFU on A100 clusters.
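A back-of-envelope MFU calculation; the throughput and cluster-size numbers below are purely hypothetical:

```python
def mfu(tokens_per_s: float, n_params: float, n_gpus: int,
        peak: float = 312e12) -> float:
    """Achieved training FLOPs/s (6·N·tokens/s) over aggregate peak FLOPs/s."""
    return (6 * n_params * tokens_per_s) / (n_gpus * peak)

# Hypothetical run: 70B model, 1,024 A100s, 250k tokens/s measured throughput
print(f"MFU = {mfu(250_000, 70e9, 1024):.1%}")  # → MFU = 32.9%
```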

What is the FLOPs difference between training and inference?

Training costs ≈ 6·N·D FLOPs total (forward + backward + optimizer). Inference costs ≈ 2·N FLOPs per generated token (forward pass only). For GPT-3: training ≈ 3.14×10²³ FLOPs total; generating 1,000 tokens at inference ≈ 2 × 175B × 1,000 = 3.5×10¹⁴ FLOPs. The full training run costs ~10⁹ times more FLOPs than generating a single 1,000-token response. See also [inference-vs-training-compute](/ai/inference-vs-training-compute).
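The ratio quoted above can be checked in a few lines:

```python
train = 6 * 175e9 * 300e9   # full GPT-3 training run, FLOPs
infer = 2 * 175e9 * 1_000   # one 1,000-token generation, FLOPs
print(f"ratio ≈ {train / infer:.0e}")  # → ratio ≈ 9e+08
```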
