Chain-of-Thought Prompting: Intermediate Reasoning Steps Improve Multi-Step Accuracy
Wei et al. (NeurIPS 2022): adding step-by-step reasoning to 8-shot examples raised PaLM 540B GSM8K accuracy 18% → 57%; Kojima et al. (2022): zero-shot CoT 'Let's think step by step' raised MultiArith 17.7% → 78.7%; self-consistency (Wang et al., 2022) adds +17 percentage points via majority vote.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| GSM8K: standard vs CoT (PaLM 540B) | 18% → 57% | % accuracy | Wei et al. (2022): 8-shot standard prompting vs 8-shot chain-of-thought; +39 percentage points |
| MultiArith: zero-shot CoT (540B) | 17.7% → 78.7% | % accuracy | Kojima et al. (2022): zero-shot standard vs 'Let's think step by step'; +61 percentage points |
| Self-consistency gain on GSM8K | 57% → 74% (k=40 samples) | % accuracy | Wang et al. (2022): majority vote over 40 CoT samples; PaLM 540B; +17 percentage points |
| Scale threshold for CoT benefit | ~100B parameters | parameters | Wei et al. (2022): CoT benefits only emerge reliably above ~100B parameters; smaller models show no gain or regression |
Chain-of-thought (CoT) prompting augments few-shot examples with intermediate reasoning steps before the final answer. Rather than (question → answer) exemplars, CoT provides (question → step-by-step reasoning → answer) exemplars, causing the model to generate its own reasoning trace when answering new questions.
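The exemplar format can be made concrete with a short sketch. The prompt-assembly helper below is our own illustration (the function name and exemplar wording are not from the paper, though the tennis-ball problem follows Wei et al.'s exemplar style):

```python
# Sketch: assembling a few-shot CoT prompt from
# (question, reasoning, answer) triples. Illustrative only.

def build_cot_prompt(exemplars, question):
    """Format exemplars as Q/A pairs whose answers show the reasoning."""
    parts = []
    for q, steps, a in exemplars:
        parts.append(f"Q: {q}\nA: {steps} The answer is {a}.")
    # The new question ends with a bare "A:" so the model generates
    # its own reasoning trace before the final answer.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

exemplars = [
    ("Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
     "How many balls does he have now?",
     "Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
     "5 + 6 = 11.",
     "11"),
]
prompt = build_cot_prompt(
    exemplars, "A farm has 3 pens of 4 hens each. How many hens in total?")
print(prompt)
```

In practice eight such exemplars are used (hence "8-shot"); the single exemplar here keeps the sketch short.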
Original Wei et al. (2022) Results
Wei et al. evaluated 8-shot CoT prompting with explicit reasoning steps across arithmetic, commonsense, and symbolic reasoning tasks using PaLM 540B.
| Dataset | Standard 8-shot | CoT 8-shot | Gain |
|---|---|---|---|
| GSM8K (math) | 18.0% | 57.0% | +39 pts |
| MAWPS (math) | 73.0% | 93.0% | +20 pts |
| StrategyQA (commonsense) | 82.0% | 84.0% | +2 pts |
| Letter Concatenation | 67.0% | 93.0% | +26 pts |
Zero-Shot CoT: Kojima et al. (2022)
The zero-shot variant appends “Let’s think step by step” to the prompt with no exemplars at all:
Standard: “Q: [question] A:”
Zero-shot CoT: “Q: [question] A: Let’s think step by step.”
| Dataset | Zero-shot | Zero-shot CoT | Gain |
|---|---|---|---|
| MultiArith | 17.7% | 78.7% | +61 pts |
| GSM8K | 10.4% | 40.7% | +30 pts |
| AddSub | 69.6% | 74.7% | +5 pts |
| AQuA-RAT | 22.4% | 33.5% | +11 pts |
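Kojima et al. run zero-shot CoT in two stages: the first call elicits a reasoning trace, and a second call conditions on that trace to extract a clean final answer. A minimal sketch of that pipeline, where `generate` is a stand-in for any LLM completion call (the exact answer-extraction phrasing varies by task in the paper):

```python
# Two-stage zero-shot CoT, per Kojima et al. (2022).
# `generate` is a hypothetical stand-in for an LLM completion call.

REASONING_TRIGGER = "Let's think step by step."
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"

def zero_shot_cot(question, generate):
    # Stage 1: elicit a free-form reasoning trace.
    prompt1 = f"Q: {question}\nA: {REASONING_TRIGGER}"
    reasoning = generate(prompt1)
    # Stage 2: extract the final answer, conditioned on the trace.
    prompt2 = f"{prompt1} {reasoning}\n{ANSWER_TRIGGER}"
    return generate(prompt2).strip()

# Usage with a stubbed model in place of a real LLM:
fake = lambda p: ("6" if ANSWER_TRIGGER in p
                  else "There are 2 groups of 3, so 2 * 3 = 6.")
print(zero_shot_cot("What is 2 groups of 3?", fake))  # -> 6
```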
Why CoT Works: The Scratchpad Mechanism
Chain-of-thought converts a single multi-step prediction into a sequence of simpler next-token predictions, where each step conditions on previous reasoning steps. The generated text serves as “working memory”: the model stores intermediate results in the output stream rather than relying on internal representations to hold them across many attention layers.
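The "working memory" idea can be made concrete with a toy analogy (our own illustration, not from any of the cited papers): each step performs one simple sub-computation and writes its result into the output stream, where later steps can read it back instead of holding it internally.

```python
# Toy illustration of the scratchpad mechanism: computing (a + b) * c
# as two simple steps, each conditioning on the previous step's
# written-out result rather than on hidden internal state.

def solve_with_scratchpad(a, b, c):
    trace = []
    s = a + b                               # step 1: one local operation
    trace.append(f"{a} + {b} = {s}")        # result stored in the output
    p = s * c                               # step 2: reads step 1's result
    trace.append(f"{s} * {c} = {p}")
    trace.append(f"The answer is {p}.")
    return "\n".join(trace)

print(solve_with_scratchpad(4, 5, 3))
```

Each line of the trace is an easy next-token prediction given the lines above it, which is the sense in which CoT replaces one hard multi-step prediction with a sequence of simpler ones.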
Self-Consistency Sampling (Wang et al., 2022)
Generate k CoT paths independently; take majority vote on final answers:
| Paths sampled (k) | GSM8K Accuracy (PaLM 540B) | Compute Cost |
|---|---|---|
| 1 | 57.0% | 1× |
| 10 | 66.9% | 10× |
| 40 | 74.4% | 40× |
Self-consistency trades inference compute for accuracy — highly effective when the inference budget allows multiple samples.
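The voting procedure itself is simple. A minimal sketch (not the authors' code), where `sample_cot` stands in for one temperature-sampled CoT call that returns the parsed final answer of a single reasoning path:

```python
from collections import Counter

# Self-consistency sketch: sample k reasoning paths independently,
# parse each final answer, and return the majority vote.

def self_consistent_answer(question, sample_cot, k=40):
    """`sample_cot` is a hypothetical stand-in for one temperature-sampled
    LLM call that returns a single path's parsed final answer."""
    answers = [sample_cot(question) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Usage with a stubbed sampler: correct paths converge on "11",
# while flawed paths scatter across different wrong answers.
_fake_answers = iter(["11", "10", "11", "12", "11", "11", "9", "11"])
result = self_consistent_answer("...", lambda q: next(_fake_answers), k=8)
print(result)  # -> 11
```

The stub mirrors the paper's intuition: correct reasoning paths tend to agree on one answer, so the mode of the answer distribution is more reliable than any single sample.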
Scale Dependency
| Model Size | CoT Benefit on GSM8K |
|---|---|
| ~350M | Negative (CoT hurts vs standard) |
| ~8B | Minimal / negligible |
| ~62B | Small positive |
| ~540B | Large (+39 percentage points) |
Related Pages
See prompt-engineering for broader technique comparisons, emergent-capabilities for why CoT benefits emerge sharply at scale, and tool-use-function-calling for how reasoning traces guide tool selection in agentic settings.
Sources
- Wei et al. (2022) — Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022
- Kojima et al. (2022) — Large Language Models are Zero-Shot Reasoners. NeurIPS 2022
- Wang et al. (2022) — Self-Consistency Improves Chain of Thought Reasoning. ICLR 2023
Frequently Asked Questions
Why does chain-of-thought prompting only work at large scale?
Wei et al. (2022) tested CoT across models from ~350M to 540B parameters. Below ~100B parameters, CoT consistently equaled or underperformed standard prompting — models generated plausible-looking but incorrect reasoning chains. Above ~100B parameters, CoT reliably improved accuracy. The explanation: reasoning chains require the model to perform compositional operations (arithmetic, logical deduction) in the generated text. This requires sufficient capacity to both generate coherent language and correctly execute the intermediate computations.
What is self-consistency and how does it improve on basic CoT?
Basic CoT generates one reasoning chain and takes its final answer. Self-consistency (Wang et al., 2022) generates k diverse reasoning paths using temperature sampling and takes a majority vote over the final answers. The intuition: there are multiple valid reasoning paths to a correct answer, but incorrect reasoning produces more varied wrong answers. On GSM8K, self-consistency with k=40 adds +17 percentage points over single-path CoT (74% vs 57%), at the cost of 40× more inference compute.