In-Context Learning: Task Adaptation from Prompt Examples Without Weight Updates
Brown et al. (NeurIPS 2020) showed with GPT-3 that k-shot ICL works from prompt examples alone, without weight updates: 32-shot GPT-3 reaches 79.3 on SuperGLUE versus 88.9 for fine-tuned BERT. Min et al. (EMNLP 2022) found that randomly flipping demonstration labels drops accuracy by only ~10%, indicating that format and input distribution matter more than correct labels.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| GPT-3 32-shot SuperGLUE score | 79.3 | points | Brown et al. (2020): GPT-3 175B 32-shot; fine-tuned BERT-large achieves 88.9 — 9.6-point gap |
| GPT-3 1-shot TriviaQA | 68.0% | Exact Match | Brown et al. (2020): 0-shot = 64.3%; fine-tuned T5 = 50.1%; ICL surpasses fine-tuned T5 |
| ICL emergent parameter threshold | ~1B parameters | parameters | Brown et al. (2020): meaningful ICL gains appear above ~1B parameters; minimal below |
| Label-flip impact on ICL accuracy | ~10% drop | % accuracy | Min et al. (2022): randomly flipping all demonstration labels drops accuracy only ~10%, not ~50% |
In-context learning (ICL) is the ability of large language models to adapt to new tasks by processing demonstrations in the input prompt — without any gradient updates to model weights. GPT-3 (Brown et al., 2020) demonstrated at scale that a single pre-trained model can perform hundreds of different tasks depending solely on how the prompt is structured.
How In-Context Learning Works
A k-shot ICL prompt provides k labeled examples followed by the test input:
Input: The food was delicious. → Sentiment: Positive
Input: The service was terrible. → Sentiment: Negative
Input: The room was clean. → Sentiment: [MODEL PREDICTS]
The model generates predictions conditioned on all preceding context, using attention over the examples to identify the task structure and expected output format.
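The prompt assembly above can be sketched in a few lines. This is an illustrative helper, not part of any library; the `build_prompt` name and the `Input:`/`Sentiment:` template are assumptions matching the example format shown above:

```python
# Sketch: assembling a k-shot ICL prompt from labeled demonstrations.
# The model (not shown) would be asked to continue this string.

def build_prompt(demos, test_input):
    """Format k labeled demos plus the unlabeled test input as one prompt."""
    lines = [f"Input: {text} → Sentiment: {label}" for text, label in demos]
    # The final line is left incomplete: the model predicts the label.
    lines.append(f"Input: {test_input} → Sentiment:")
    return "\n".join(lines)

demos = [
    ("The food was delicious.", "Positive"),
    ("The service was terrible.", "Negative"),
]
prompt = build_prompt(demos, "The room was clean.")
print(prompt)
```

The resulting string is exactly the 2-shot prompt shown above, ending mid-pattern so that the model's most likely continuation is a label from the demonstrated label space.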
GPT-3 ICL Performance (Brown et al., 2020)
| Task | 0-shot | 1-shot | Few-shot | Fine-tuned SOTA |
|---|---|---|---|---|
| TriviaQA | 64.3% | 68.0% | 71.2% | ~75% |
| WebQuestions | 14.4% | 25.3% | 41.5% | 41.7% |
| CoQA (F1) | 81.5% | 84.0% | 85.0% | ~90% |
| SuperGLUE | ~71 | ~75 | 79.3 | 88.9 |
Scaling and ICL Ability
| Model Size | SuperGLUE (few-shot) | ICL Benefit |
|---|---|---|
| 350M | ~52 | Minimal |
| 1.3B | ~58 | Small |
| 6.7B | ~66 | Moderate |
| 13B | ~69 | Clear |
| 175B | 79.3 | Strong |
The Bayesian Interpretation (Xie et al., 2021)
Xie et al. model ICL as implicit Bayesian inference:
- Pre-trained LM has prior P(concept) over task concepts from training data structure
- k demonstrations are Bayesian evidence updating this prior: P(concept | demos)
- Generation conditions on the posterior over task concepts
This explains two key observations:
- ICL improves with more examples (more evidence)
- Correct labels matter less than format (demonstrations identify the concept, not its mapping)
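A toy numerical version of this update makes the mechanism concrete. All numbers here are assumed for illustration (they are not from Xie et al.): a prior over two candidate task concepts is multiplied by the likelihood each concept assigns to each demonstration, then normalized:

```python
# Toy sketch of ICL as implicit Bayesian inference: the posterior over
# task concepts concentrates as demonstrations (evidence) accumulate.

prior = {"sentiment": 0.5, "topic": 0.5}  # P(concept), assumed uniform

# Assumed per-demo likelihoods: how plausible each demo looks under each concept.
demo_likelihoods = [
    {"sentiment": 0.9, "topic": 0.2},
    {"sentiment": 0.8, "topic": 0.3},
]

posterior = dict(prior)
for lik in demo_likelihoods:  # multiply in each demonstration's evidence
    posterior = {c: posterior[c] * lik[c] for c in posterior}

z = sum(posterior.values())   # normalize to get P(concept | demos)
posterior = {c: p / z for c, p in posterior.items()}

print(posterior)  # mass concentrates on "sentiment"
```

With these assumed numbers the posterior on "sentiment" rises from 0.5 to about 0.92 after two demos, mirroring the observation that more examples sharpen task identification.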
What ICL Does and Does Not Learn
| Component of Demonstration | Impact on Accuracy |
|---|---|
| Input-output format | High impact; model must match output structure |
| Label space (set of possible outputs) | High impact |
| Input distribution (what examples look like) | Moderate impact |
| Correct input-label mappings | Low impact (~10% drop when all labels flipped) |
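The label-flipping probe behind the last row can be sketched as follows. The helper names are hypothetical, and the model-evaluation loop is omitted; this only shows how demonstrations keep their format, label space, and input distribution while every gold label is replaced:

```python
# Sketch of a Min et al. (2022)-style probe: flip every demonstration
# label while leaving everything else about the prompt intact.

def flip(label):
    """Swap a binary sentiment label for the wrong one."""
    return "Negative" if label == "Positive" else "Positive"

def flipped_demos(demos):
    """Return demos with each gold label replaced by its flip."""
    return [(text, flip(label)) for text, label in demos]

demos = [("Great movie.", "Positive"), ("Waste of time.", "Negative")]
print(flipped_demos(demos))
```

Feeding the flipped demos into the prompt-building step above and comparing accuracy against unflipped demos is the essence of the probe: a model relying on correct mappings would collapse toward ~50%, whereas the observed drop is only ~10%.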
Related Pages
See few-shot-learning for k-shot performance benchmarks across tasks, chain-of-thought for reasoning-trace augmentation that dramatically improves ICL on math, and emergent-capabilities for why ICL only emerges at large scale.
Sources
- Brown et al. (2020) — Language Models are Few-Shot Learners (GPT-3). NeurIPS 2020
- Xie et al. (2021) — An Explanation of In-Context Learning as Implicit Bayesian Inference. ICLR 2022
- Min et al. (2022) — Rethinking the Role of Demonstrations for ICL. EMNLP 2022
Frequently Asked Questions
Does in-context learning actually learn from labeled examples, or does it retrieve task patterns?
Min et al. (2022) found that randomly flipping all demonstration labels reduces accuracy by only ~10% (not ~50% as would be expected if correct labels were essential). This suggests ICL primarily identifies which task format to apply — the input format, output format, label space, and data distribution — rather than learning from individual labeled examples. Xie et al. (2021) formalize this as Bayesian inference: demonstrations are evidence that updates a prior over task concepts encoded during pre-training.
What is the difference between in-context learning and fine-tuning?
Fine-tuning updates model weights via gradient descent on task-specific data, permanently adapting the model. In-context learning freezes all weights — adaptation occurs entirely through the attention mechanism processing the prompt. Fine-tuning typically achieves 10–20 points higher accuracy on benchmark comparisons but requires separate weights per task, training compute, and labeled data. ICL is instant, requires no training, and handles many tasks from a single model, but it is limited by context window length and provides noisier adaptation.