Instruction Tuning: Zero-Shot Generalization via Multi-Task Fine-Tuning
Wei et al. (2022) FLAN: instruction-tuning on 62 tasks improves zero-shot performance on 25 of 25 held-out tasks; 137B FLAN outperforms GPT-3 175B zero-shot on 20 of 25 tasks.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| FLAN task count (original) | 62 | tasks | Wei et al. (2022); tasks grouped into 12 clusters; held-out tasks tested zero-shot |
| FLAN zero-shot improvement rate | 25/25 | held-out tasks | FLAN 137B outperforms untuned 137B on all 25 held-out tasks zero-shot |
| FLAN-T5 (Flan-PaLM) task count | 1,836 | fine-tuning tasks | Chung et al. (2022): scaling to 1836 tasks further improves zero-shot and few-shot performance |
| T0 training tasks | 171 | datasets | Sanh et al. (2022) T0: trained on 171 prompted datasets; zero-shot on 4 held-out SuperGLUE tasks |
| Flan-PaLM MMLU improvement | +4.2% | accuracy (5-shot) | Chung et al.: Flan-PaLM 540B vs PaLM 540B on MMLU 5-shot, 73.5% vs 69.3%; instruction tuning improves few-shot performance too |
Instruction tuning (also called instruction fine-tuning or multi-task prompted training) is a post-pretraining phase that dramatically improves a language model’s ability to follow novel instructions zero-shot. By training on diverse instruction-formatted tasks, models learn a general skill of interpreting and executing natural language directives.
The Core Setup
An instruction-tuned model is trained on examples of the form:
[Instruction]: Translate the following English sentence to French.
[Input]: The cat sat on the mat.
[Output]: Le chat était assis sur le tapis.
Tasks are reformatted from existing datasets (NLI, reading comprehension, summarization, translation, commonsense reasoning, etc.) into instruction form. The model is then trained with standard cross-entropy loss on the output tokens.
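As a minimal sketch of this setup (not the actual FLAN pipeline), the formatting step and the output-only loss mask might look as follows; tokenization is faked with whitespace splitting for illustration:

```python
# Sketch: format one dataset example as an instruction-tuning record and
# mark which tokens receive cross-entropy loss (output tokens only).

def format_example(instruction, input_text, output_text):
    """Assemble the instruction-formatted training string."""
    prompt = f"[Instruction]: {instruction}\n[Input]: {input_text}\n[Output]: "
    return prompt, prompt + output_text

def loss_mask(prompt, full_text):
    """Loss mask is 0 over prompt tokens and 1 over output tokens, so the
    cross-entropy loss is computed on the target completion only."""
    prompt_len = len(prompt.split())
    tokens = full_text.split()
    return tokens, [0] * prompt_len + [1] * (len(tokens) - prompt_len)

prompt, full = format_example(
    "Translate the following English sentence to French.",
    "The cat sat on the mat.",
    "Le chat était assis sur le tapis.",
)
tokens, mask = loss_mask(prompt, full)
assert sum(mask) == 7  # loss applies only to the 7 output tokens
```

In a real implementation the mask would be applied over model token IDs (e.g. via an ignore index in the loss), but the principle is the same: the prompt conditions the model, and only the output is supervised.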
FLAN: Finetuned Language Models Are Zero-Shot Learners
Wei et al. (2022) applied instruction tuning to a 137B parameter pretrained language model using 62 tasks grouped into 12 clusters. They evaluated zero-shot on held-out tasks not seen during tuning:
| Model | Zero-shot avg (held-out) | Few-shot avg (held-out) |
|---|---|---|
| GPT-3 175B (no tuning) | baseline | baseline |
| LaMDA-PT 137B (no tuning) | lower | lower |
| FLAN 137B | +29.8% vs LaMDA-PT | +16.7% |
FLAN 137B outperforms GPT-3 175B zero-shot on 20 of 25 held-out tasks — despite using fewer parameters — demonstrating that instruction tuning is more efficient than raw scale for zero-shot generalization.
Cluster Held-Out Evaluation Design
A key methodological contribution: tasks are grouped into semantic clusters, and entire clusters are held out at test time (not just individual tasks). This prevents data contamination from tasks that differ only superficially.
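The cluster-held-out split can be sketched as follows (the cluster assignments below are illustrative; the real FLAN grouping has 12 clusters over 62 tasks):

```python
# Sketch: hold out an ENTIRE task cluster at evaluation time, so no
# superficially different sibling task can leak into training.

TASK_CLUSTERS = {
    "nli": ["anli", "snli", "rte"],
    "commonsense": ["hellaswag", "piqa"],
    "translation": ["wmt16_en_de", "wmt14_en_fr"],
    "summarization": ["cnn_dailymail", "xsum"],
}

def held_out_split(clusters, eval_cluster):
    """Train on every cluster except eval_cluster; test on eval_cluster."""
    train = [t for c, tasks in clusters.items() if c != eval_cluster
             for t in tasks]
    test = clusters[eval_cluster]
    return train, test

train_tasks, test_tasks = held_out_split(TASK_CLUSTERS, "nli")
assert not set(train_tasks) & set(test_tasks)  # no overlap by construction
```

Holding out the whole cluster (rather than one dataset) is what makes the zero-shot claim credible: a model that saw SNLI during training would not be a fair zero-shot test subject for ANLI.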
| Held-out cluster | Example tasks | FLAN result |
|---|---|---|
| Reading comprehension | Natural Questions, TriviaQA | +15% over untuned |
| Commonsense | HellaSwag, PiQA | +18% over untuned |
| Closed-book QA | Natural Questions (closed-book) | +22% over untuned |
| NLI | ANLI, SNLI | +25% over untuned |
| Coreference | Winogrande | +12% over untuned |
Scaling: Flan-T5 and Flan-PaLM
Chung et al. (2022) extended instruction tuning to 1,836 tasks and applied it to models up to 540B parameters, finding continued improvement with both model scale and task count:
| Model | Tasks | MMLU 5-shot | BBH 3-shot |
|---|---|---|---|
| PaLM 62B | 0 | 52.9% | 35.2% |
| Flan-PaLM 62B | 1836 | 59.6% | 45.9% |
| PaLM 540B | 0 | 69.3% | 52.3% |
| Flan-PaLM 540B | 1836 | 73.5% | 66.3% |
Instruction tuning improves both zero-shot and few-shot performance at all tested scales.
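A practical question these large mixtures raise is how to combine tasks of wildly different sizes. A common recipe, described in the FLAN line of work as examples-proportional mixing with a per-task cap, can be sketched as below (the cap value is illustrative, not the published setting):

```python
# Sketch: examples-proportional mixing with a per-task cap, so huge
# datasets cannot drown out small ones in the multi-task training stream.

def mixing_weights(task_sizes, cap=3000):
    """Sample each task proportionally to min(size, cap)."""
    capped = {t: min(n, cap) for t, n in task_sizes.items()}
    total = sum(capped.values())
    return {t: n / total for t, n in capped.items()}

weights = mixing_weights({"nli_big": 500_000, "qa_small": 1_000, "sum_mid": 3_500})
assert abs(sum(weights.values()) - 1.0) < 1e-9
assert weights["nli_big"] == weights["sum_mid"]  # both datasets hit the cap
```

Without the cap, a 500k-example dataset would dominate sampling 500:1 over a 1k-example dataset; with it, small tasks retain a meaningful share of each training batch.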
Comparison to Alternatives
| Method | Data required | Human labeling | Generalizes zero-shot |
|---|---|---|---|
| Standard fine-tuning | Task-specific pairs | No (uses existing datasets) | No |
| Instruction tuning | Multi-task instruction pairs | Minimal (repurposes datasets) | Yes |
| RLHF | Preference comparisons | Yes (human raters) | Improves alignment |
| Prompt engineering | No training | No | Limited |
Related Pages
See fine-tuning for the general fine-tuning framework, rlhf for the reinforcement learning from human feedback method that builds on instruction tuning, and alignment-problem for why zero-shot generalization is central to the alignment challenge.
Sources
- Wei et al. (2022) — Finetuned Language Models are Zero-Shot Learners. ICLR 2022
- Sanh et al. (2022) — Multitask Prompted Training Enables Zero-Shot Task Generalization. ICLR 2022
- Chung et al. (2022) — Scaling Instruction-Finetuned Language Models. arXiv
Frequently Asked Questions
What is the difference between instruction tuning and standard fine-tuning?
Standard fine-tuning adapts a model to a single task by continuing training on that task's labeled data. Instruction tuning trains on a large collection of tasks formatted as natural language instructions, explicitly optimizing for generalization. The key result (Wei et al., 2022): models fine-tuned on 62 instruction-formatted tasks generalize zero-shot to held-out tasks, while the same model without instruction tuning does not generalize. Instruction tuning teaches the model to 'follow instructions' as a meta-skill.
Why does instruction tuning only help larger models?
Wei et al. (2022) showed that instruction tuning actually degrades held-out performance for the smaller models they tested (roughly 8B parameters and below), with gains appearing only at the larger scales tested (68B and 137B). For small models, instruction tuning on many tasks causes interference — the model's capacity goes to memorizing task-specific patterns rather than learning generalizable instruction-following. At sufficient scale, the model has enough capacity to abstract the meta-skill of following natural language instructions.
Why does prompt/instruction format matter?
Sanh et al. (2022) showed that prompting format significantly affects zero-shot transfer. They trained T0 on 171 datasets each with multiple human-written prompt templates, forcing the model to be robust to natural variation in how instructions are expressed. This prompt-variety training improved generalization compared to training on a single canonical format per task.
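This prompt-variety idea can be sketched as follows; the templates below are invented for illustration, not taken from the actual P3 prompt collection:

```python
# Sketch of T0-style prompt variety: one dataset, several human-written
# templates, each rendering the same underlying example differently.

NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis?",
    "Suppose {premise} Can we infer that {hypothesis}?",
    '{premise} Question: is "{hypothesis}" true, false, or neither?',
]

def render_all(example, templates):
    """Render one example under every template; during training a template
    is typically sampled per example, so the model sees varied phrasings."""
    return [t.format(**example) for t in templates]

ex = {"premise": "A dog is running.", "hypothesis": "An animal is moving."}
prompts = render_all(ex, NLI_TEMPLATES)
assert len(set(prompts)) == 3  # three distinct phrasings of one example
```

Training across such variation forces the model to key on the semantics of the instruction rather than on one canonical surface form.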