Few-Shot Learning: Language Model Task Performance from k In-Context Demonstrations

Category: agents-applications Updated: 2026-02-27

Brown et al. (NeurIPS 2020): GPT-3 175B 32-shot SuperGLUE = 79.3 vs fine-tuned BERT 88.9; Zhao et al. (ICML 2021): different orderings of same k examples produce up to ±15% accuracy variance; calibrating against neutral-input priors reduces order sensitivity.

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| GPT-3 32-shot SuperGLUE | 79.3 | points | Brown et al. (2020): 32 examples in context; fine-tuned BERT-large = 88.9; 9.6-point gap |
| Few-shot vs fine-tuning accuracy gap | 10–20 | % accuracy | Consistent gap across NLP benchmarks; fine-tuning remains more accurate for most tasks |
| Example order sensitivity | up to ±15 | % accuracy variance | Zhao et al. (2021): same k examples in different orders produce large accuracy swings on classification |
| Standard k values benchmarked | k = 0, 1, 10, 32 | shots | Brown et al. (2020): 0-shot, 1-shot, and "few-shot" (context window limit) are standard conditions |

Few-shot learning in language models refers to task performance given only k labeled demonstrations in the input prompt, with no gradient updates. The GPT-3 paper (Brown et al., 2020) established the standard evaluation protocol: benchmark 0-shot, 1-shot, and up to 32-shot (or context-window-limited) performance across diverse tasks.

Standard Benchmark Conditions

| Condition | Prompt Examples | Weight Update | Notes |
|---|---|---|---|
| Zero-shot | 0 | No | Task instruction only |
| One-shot | 1 | No | Single input-output demonstration |
| Few-shot | 2–32 (context-limited) | No | Typically 10–32 in GPT-3 paper |
| Fine-tuned | 0 at inference | Yes | Trained on k examples before deployment |
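The conditions above differ only in how many demonstrations are packed into the prompt. A minimal sketch of that prompt assembly, where `build_prompt`, `format_example`, and the template strings are illustrative choices rather than the paper's exact format:

```python
# Hypothetical k-shot prompt builder; the "Input:/Label:" template is an
# illustrative convention, not the one used by Brown et al. (2020).

def format_example(text, label=None):
    """Render one demonstration; omit the label for the query example."""
    out = f"Input: {text}\nLabel:"
    if label is not None:
        out += f" {label}"
    return out

def build_prompt(instruction, demos, query):
    """k = len(demos); an empty demos list reproduces the zero-shot condition."""
    parts = [instruction]
    parts += [format_example(text, label) for text, label in demos]
    parts.append(format_example(query))  # query carries no label yet
    return "\n\n".join(parts)

prompt = build_prompt(
    "Classify the sentiment as positive or negative.",
    [("Great movie!", "positive"), ("Terrible plot.", "negative")],  # k = 2
    "I loved the soundtrack.",
)
```

The model then continues the text after the final `Label:`, so no weights change between tasks.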

GPT-3 Few-Shot vs Fine-Tuning (Brown et al., 2020)

| Benchmark | GPT-3 0-shot | GPT-3 few-shot | Fine-tuned SOTA |
|---|---|---|---|
| SuperGLUE | ~71 | 79.3 | 88.9 (BERT-large) |
| SQuAD v2 (F1) | ~69 | 89.2 | 91.1 |
| TriviaQA | 64.3% | 71.2% | ~75% |
| HellaSwag | 78.9% | 79.3% | 86.5% (ALBERT) |
| NaturalQuestions | 14.6% | 29.9% | ~50% (T5) |

Scaling: Few-Shot Accuracy vs Model Size

| Model Size | SuperGLUE (few-shot) | Incremental Gain |
|---|---|---|
| 1.3B | ~58 | — |
| 6.7B | ~66 | +8 |
| 13B | ~69 | +3 |
| 175B | 79.3 | +10.3 |

The largest gains occur at the extremes: from small to medium scale (capacity for basic task understanding) and from large to very large scale (multi-step compositional reasoning).

Prompt Calibration (Zhao et al., 2021)

Language models exhibit two systematic biases in few-shot classification:

| Bias | Cause | Calibration Fix |
|---|---|---|
| Recency bias | Last example in context gets higher attention weight | Average accuracy over multiple orderings |
| Majority-label bias | Pre-training prior favors common label strings | Divide probabilities by neutral-input priors |

Calibration procedure: compute the model’s predicted probabilities for each label when the input is a neutral string (“N/A”). Use these as priors: p̃(y|x) = p(y|x) / p(y|“N/A”). This substantially reduces order sensitivity and improves calibration.
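The formula above can be sketched in a few lines. This is a simplified rendering of Zhao et al.'s contextual calibration, assuming `p_cond` holds the model's label probabilities for the real input and `p_neutral` the probabilities the same prompt yields when the input is the content-free string "N/A":

```python
import numpy as np

def calibrate(p_cond, p_neutral):
    """p~(y|x) = p(y|x) / p(y|"N/A"), renormalized over the label set."""
    scores = np.asarray(p_cond) / np.asarray(p_neutral)
    return scores / scores.sum()

# Illustrative numbers: on the neutral input the model already leans
# "positive", so a mild raw preference for "positive" is spurious.
p_neutral = np.array([0.7, 0.3])  # priors for ["positive", "negative"] on "N/A"
p_cond    = np.array([0.6, 0.4])  # raw prediction for the actual input
calibrated = calibrate(p_cond, p_neutral)  # majority-label bias divided out
```

Dividing by the neutral-input prior and renormalizing is equivalent to Zhao et al.'s diagonal reweighting of the output distribution; here the calibrated prediction flips to "negative" because the raw preference for "positive" was weaker than the prior.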

Few-Shot vs Fine-Tuning: Decision Factors

| Factor | Few-Shot Preferred | Fine-Tuning Preferred |
|---|---|---|
| Labeled data available | <100 examples | >1000 examples |
| Task stability | Transient / experimental | Stable, production use |
| Model count | Single model, many tasks | Separate model per task acceptable |
| Accuracy requirement | Tolerant of 10–20% gap | Gap is critical |
| Inference cost | Standard | Can afford extra compute |

See in-context-learning for the theoretical account of why few-shot prompting works, prompt-engineering for techniques to reduce order sensitivity, and fine-tuning for when weight updates outperform prompting.



Frequently Asked Questions

Why is few-shot performance sensitive to example order?

Zhao et al. (2021) found accuracy swings of up to ±15% from reordering the same k examples. The cause is a recency bias: tokens near the end of the context receive higher attention weight, making the last few examples disproportionately influential. The model is also biased toward label frequencies matching what it saw during pre-training. Calibration — dividing output probabilities by priors computed on neutral inputs — significantly reduces both recency bias and majority-label bias.
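The ordering-average mitigation mentioned above can be sketched directly. This is a hedged illustration, not Zhao et al.'s procedure: `score_prompt` stands in for one forward pass of a real model, and `average_over_orderings` and `max_perms` are hypothetical names:

```python
import itertools
import random

# Illustrative mitigation for recency bias: average label scores over
# several orderings of the same k demonstrations, so no single ordering's
# last example dominates. Sampling caps cost when k! is large.

def average_over_orderings(demos, query, score_prompt, max_perms=24):
    """Return mean label scores over (a sample of) demonstration orderings."""
    perms = list(itertools.permutations(demos))
    random.shuffle(perms)          # sample orderings when k! exceeds the cap
    perms = perms[:max_perms]
    totals = {}
    for order in perms:
        for label, p in score_prompt(list(order), query).items():
            totals[label] = totals.get(label, 0.0) + p
    return {label: s / len(perms) for label, s in totals.items()}
```

Each call to `score_prompt` would rebuild the k-shot prompt with the demonstrations in that order and read off the label probabilities; the k-fold increase in inference cost is the price of the reduced variance.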

When should few-shot prompting be preferred over fine-tuning?

Few-shot prompting is preferable when: (1) labeled data is very scarce (<100 examples) — insufficient to fine-tune reliably; (2) tasks are transient or low-priority; (3) a single deployed model must handle many different tasks; (4) rapid prototyping without retraining is needed. Fine-tuning is preferable when accuracy is critical, data is available (>1000 examples), the task is stable, and the 10–20% accuracy gap over few-shot matters for the application.
