Chain-of-Thought Prompting: Intermediate Reasoning Steps Improve Multi-Step Accuracy
Wei et al. (NeurIPS 2022): adding step-by-step reasoning to 8-shot examples raised PaLM 540B GSM8K accuracy 18% → 57%; Kojima et al. (2022): zero-shot CoT 'Let's think step by step' raised MultiArith 17.7% → 78.7%; self-consistency (Wang et al., 2022) adds +17 percentage points via majority vote.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| GSM8K: standard vs CoT (PaLM 540B) | 18% → 57% | % accuracy | Wei et al. (2022): 8-shot standard prompting vs 8-shot chain-of-thought; +39 percentage points |
| MultiArith: zero-shot CoT (540B) | 17.7% → 78.7% | % accuracy | Kojima et al. (2022): zero-shot standard vs 'Let's think step by step'; +61 percentage points |
| Self-consistency gain on GSM8K | 57% → 74% (k=40 samples) | % accuracy | Wang et al. (2022): majority vote over 40 CoT samples; PaLM 540B; +17 percentage points |
| Scale threshold for CoT benefit | ~100B parameters | parameters | Wei et al. (2022): CoT benefits only emerge reliably above ~100B parameters; smaller models show no gain or regression |
Chain-of-thought (CoT) prompting augments few-shot examples with intermediate reasoning steps before the final answer. Rather than (question → answer) exemplars, CoT provides (question → step-by-step reasoning → answer) exemplars, causing the model to generate its own reasoning trace when answering new questions.
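The exemplar format can be made concrete with a short sketch. The prompt-assembly helper below is our own illustration (the function name and exemplar wording are not from the paper, though the tennis-ball problem follows Wei et al.'s exemplar style):

```python
# Sketch: assembling a few-shot CoT prompt from
# (question, reasoning, answer) triples. Illustrative only.

def build_cot_prompt(exemplars, question):
    """Format exemplars as Q/A pairs whose answers show the reasoning."""
    parts = []
    for q, steps, a in exemplars:
        parts.append(f"Q: {q}\nA: {steps} The answer is {a}.")
    # The new question ends with a bare "A:" so the model generates
    # its own reasoning trace before the final answer.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

exemplars = [
    ("Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
     "How many balls does he have now?",
     "Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
     "5 + 6 = 11.",
     "11"),
]
prompt = build_cot_prompt(
    exemplars, "A farm has 3 pens of 4 hens each. How many hens in total?")
print(prompt)
```

In practice eight such exemplars are used (hence "8-shot"); the single exemplar here keeps the sketch short.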
Original Wei et al. (2022) Results
Wei et al. evaluated 8-shot CoT prompting with explicit reasoning steps across arithmetic, commonsense, and symbolic reasoning tasks using PaLM 540B.
| Dataset | Standard 8-shot | CoT 8-shot | Gain |
|---|---|---|---|
| GSM8K (math) | 18.0% | 57.0% | +39 pts |
| MAWPS (math) | 73.0% | 93.0% | +20 pts |
| StrategyQA (commonsense) | 82.0% | 84.0% | +2 pts |
| Letter Concatenation | 67.0% | 93.0% | +26 pts |
Zero-Shot CoT: Kojima et al. (2022)
The zero-shot variant appends “Let’s think step by step” to the prompt with no exemplars at all:
Standard: “Q: [question] A:”
Zero-shot CoT: “Q: [question] A: Let’s think step by step.”
| Dataset | Zero-shot | Zero-shot CoT | Gain |
|---|---|---|---|
| MultiArith | 17.7% | 78.7% | +61 pts |
| GSM8K | 10.4% | 40.7% | +30 pts |
| AddSub | 69.6% | 74.7% | +5 pts |
| AQuA-RAT | 22.4% | 33.5% | +11 pts |
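Kojima et al. run zero-shot CoT in two stages: the first call elicits a reasoning trace, and a second call conditions on that trace to extract a clean final answer. A minimal sketch of that pipeline, where `generate` is a stand-in for any LLM completion call (the exact answer-extraction phrasing varies by task in the paper):

```python
# Two-stage zero-shot CoT, per Kojima et al. (2022).
# `generate` is a hypothetical stand-in for an LLM completion call.

REASONING_TRIGGER = "Let's think step by step."
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"

def zero_shot_cot(question, generate):
    # Stage 1: elicit a free-form reasoning trace.
    prompt1 = f"Q: {question}\nA: {REASONING_TRIGGER}"
    reasoning = generate(prompt1)
    # Stage 2: extract the final answer, conditioned on the trace.
    prompt2 = f"{prompt1} {reasoning}\n{ANSWER_TRIGGER}"
    return generate(prompt2).strip()

# Usage with a stubbed model in place of a real LLM:
fake = lambda p: ("6" if ANSWER_TRIGGER in p
                  else "There are 2 groups of 3, so 2 * 3 = 6.")
print(zero_shot_cot("What is 2 groups of 3?", fake))  # -> 6
```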
Why CoT Works: The Scratchpad Mechanism
Chain-of-thought converts a single multi-step prediction into a sequence of simpler next-token predictions, where each step conditions on previous reasoning steps. The generated text serves as “working memory”: the model stores intermediate results in the output stream rather than relying on internal representations to hold them across many attention layers.
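The "working memory" idea can be made concrete with a toy analogy (our own illustration, not from any of the cited papers): each step performs one simple sub-computation and writes its result into the output stream, where later steps can read it back instead of holding it internally.

```python
# Toy illustration of the scratchpad mechanism: computing (a + b) * c
# as two simple steps, each conditioning on the previous step's
# written-out result rather than on hidden internal state.

def solve_with_scratchpad(a, b, c):
    trace = []
    s = a + b                               # step 1: one local operation
    trace.append(f"{a} + {b} = {s}")        # result stored in the output
    p = s * c                               # step 2: reads step 1's result
    trace.append(f"{s} * {c} = {p}")
    trace.append(f"The answer is {p}.")
    return "\n".join(trace)

print(solve_with_scratchpad(4, 5, 3))
```

Each line of the trace is an easy next-token prediction given the lines above it, which is the sense in which CoT replaces one hard multi-step prediction with a sequence of simpler ones.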
Self-Consistency Sampling (Wang et al., 2022)
Generate k CoT paths independently; take majority vote on final answers:
| Paths sampled (k) | GSM8K Accuracy (PaLM 540B) | Compute Cost |
|---|---|---|
| 1 | 57.0% | 1× |
| 10 | 66.9% | 10× |
| 40 | 74.4% | 40× |
Self-consistency trades inference compute for accuracy — highly effective when the inference budget allows multiple samples.
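The voting procedure itself is simple. A minimal sketch (not the authors' code), where `sample_cot` stands in for one temperature-sampled CoT call that returns the parsed final answer of a single reasoning path:

```python
from collections import Counter

# Self-consistency sketch: sample k reasoning paths independently,
# parse each final answer, and return the majority vote.

def self_consistent_answer(question, sample_cot, k=40):
    """`sample_cot` is a hypothetical stand-in for one temperature-sampled
    LLM call that returns a single path's parsed final answer."""
    answers = [sample_cot(question) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Usage with a stubbed sampler: correct paths converge on "11",
# while flawed paths scatter across different wrong answers.
_fake_answers = iter(["11", "10", "11", "12", "11", "11", "9", "11"])
result = self_consistent_answer("...", lambda q: next(_fake_answers), k=8)
print(result)  # -> 11
```

The stub mirrors the paper's intuition: correct reasoning paths tend to agree on one answer, so the mode of the answer distribution is more reliable than any single sample.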
Scale Dependency
| Model Size | CoT Benefit on GSM8K |
|---|---|
| ~350M | Negative (CoT hurts vs standard) |
| ~8B | Minimal / negligible |
| ~62B | Small positive |
| ~540B | Large (+39 percentage points) |
Related Pages
See prompt-engineering for broader technique comparisons, emergent-capabilities for why CoT benefits emerge sharply at scale, and tool-use-function-calling for how reasoning traces guide tool selection in agentic settings.
Sources
- Wei et al. (2022) — Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022
- Kojima et al. (2022) — Large Language Models are Zero-Shot Reasoners. NeurIPS 2022
- Wang et al. (2022) — Self-Consistency Improves Chain of Thought Reasoning. ICLR 2023
Frequently Asked Questions
Why does chain-of-thought prompting only work at large scale?
Wei et al. (2022) tested CoT across models from ~350M to 540B parameters. Below ~100B parameters, CoT consistently equaled or underperformed standard prompting — models generated plausible-looking but incorrect reasoning chains. Above ~100B parameters, CoT reliably improved accuracy. The explanation: reasoning chains require the model to perform compositional operations (arithmetic, logical deduction) in the generated text. This requires sufficient capacity to both generate coherent language and correctly execute the intermediate computations.
What is self-consistency and how does it improve on basic CoT?
Basic CoT generates one reasoning chain and takes its final answer. Self-consistency (Wang et al., 2022) generates k diverse reasoning paths using temperature sampling and takes a majority vote over the final answers. The intuition: there are multiple valid reasoning paths to a correct answer, but incorrect reasoning produces more varied wrong answers. On GSM8K, self-consistency with k=40 adds +17 percentage points over single-path CoT (74% vs 57%), at the cost of 40× more inference compute.