Instruction Tuning: Zero-Shot Generalization via Multi-Task Fine-Tuning

Category: alignment Updated: 2026-02-27

Wei et al. (2022) FLAN: instruction-tuning on 62 tasks improves zero-shot performance on 25 of 25 held-out tasks; 137B FLAN outperforms GPT-3 175B zero-shot on 20 of 25 tasks.

Key Data Points
| Measure | Value | Unit | Notes |
| --- | --- | --- | --- |
| FLAN task count (original) | 62 | tasks | Wei et al. (2022); tasks grouped into 12 clusters; held-out tasks tested zero-shot |
| FLAN zero-shot improvement rate | 25/25 | held-out tasks | FLAN 137B outperforms untuned 137B on all 25 held-out task clusters zero-shot |
| Flan-T5 / Flan-PaLM task count | 1,836 | fine-tuning tasks | Chung et al. (2022): scaling to 1,836 tasks further improves zero-shot and few-shot performance |
| T0 training tasks | 171 | datasets | Sanh et al. (2022) T0: trained on 171 prompted datasets; zero-shot on 4 held-out SuperGLUE tasks |
| Flan-PaLM MMLU improvement | +4.2 | points (accuracy) | Chung et al.: Flan-PaLM 540B 73.5% vs PaLM 540B 69.3% on MMLU 5-shot; instruction tuning improves few-shot too |

Instruction tuning (also called instruction fine-tuning or multi-task prompted training) is a post-pretraining phase that dramatically improves a language model’s ability to follow novel instructions zero-shot. By training on diverse instruction-formatted tasks, models learn a general skill of interpreting and executing natural language directives.

The Core Setup

An instruction-tuned model is trained on examples of the form:

[Instruction]: Translate the following English sentence to French.
[Input]: The cat sat on the mat.
[Output]: Le chat était assis sur le tapis.

Tasks are reformatted from existing datasets (NLI, reading comprehension, summarization, translation, commonsense reasoning, etc.) into instruction form. The model is then trained with standard cross-entropy loss on the output tokens.
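The formatting-and-loss-masking step above can be sketched in a few lines. This is an illustrative toy (the whitespace tokenizer, template string, and `IGNORE_INDEX` convention are stand-ins, not FLAN's actual pipeline); real setups use a subword tokenizer and set prompt-token labels to the ignore index so cross-entropy is computed only on output tokens.

```python
# Illustrative sketch: format an instruction example and build labels so
# that cross-entropy loss applies only to the output tokens.
# Toy tokenizer and template; not the actual FLAN implementation.

IGNORE_INDEX = -100  # conventional "ignore" label in cross-entropy losses

def tokenize(text):
    """Toy whitespace tokenizer; real setups use a subword tokenizer."""
    return text.split()

def fmt_example(instruction, input_text, output_text):
    prompt = f"[Instruction]: {instruction}\n[Input]: {input_text}\n[Output]:"
    prompt_toks = tokenize(prompt)
    output_toks = tokenize(output_text)
    tokens = prompt_toks + output_toks
    # Mask the prompt: only output positions contribute to the loss.
    labels = [IGNORE_INDEX] * len(prompt_toks) + output_toks
    return tokens, labels

tokens, labels = fmt_example(
    "Translate the following English sentence to French.",
    "The cat sat on the mat.",
    "Le chat était assis sur le tapis.",
)
assert len(tokens) == len(labels)          # one label per token
assert labels.count(IGNORE_INDEX) == 16    # all 16 prompt tokens are masked
```

Training then proceeds with standard next-token cross-entropy over a multi-task mixture of such examples.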

FLAN: Finetuned Language Models Are Zero-Shot Learners

Wei et al. (2022) applied instruction tuning to a 137B parameter pretrained language model using 62 tasks grouped into 12 clusters. They evaluated zero-shot on held-out tasks not seen during tuning:

| Model | Zero-shot avg (held-out) | Few-shot avg (held-out) |
| --- | --- | --- |
| GPT-3 175B (no tuning) | baseline | baseline |
| LaMDA-PT 137B (no tuning) | lower | lower |
| FLAN 137B | +29.8% vs LaMDA-PT | +16.7% |

FLAN 137B outperforms GPT-3 175B zero-shot on 20 of 25 held-out task clusters — despite using fewer parameters — demonstrating that instruction tuning is more efficient than raw scale for zero-shot generalization.

Cluster Held-Out Evaluation Design

A key methodological contribution: tasks are grouped into semantic clusters, and entire clusters are held out at test time (not just individual tasks). This prevents data contamination from tasks that differ only superficially.
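The cluster-level holdout can be sketched as a simple split utility. The cluster names and task lists below are a toy subset for illustration, not FLAN's full 12-cluster grouping:

```python
# Illustrative sketch of cluster-level holdout: to evaluate one cluster
# zero-shot, train on every task from all *other* clusters, so nothing
# in the held-out cluster (even superficially similar tasks) leaks into
# the training mixture.

CLUSTERS = {  # toy subset of task clusters
    "nli": ["anli", "snli"],
    "commonsense": ["hellaswag", "piqa"],
    "coreference": ["winogrande"],
}

def holdout_split(clusters, held_out):
    train = [task for name, tasks in clusters.items()
             if name != held_out
             for task in tasks]
    test = list(clusters[held_out])
    return train, test

train_tasks, test_tasks = holdout_split(CLUSTERS, "nli")
# The entire NLI cluster is excluded from training, not just one dataset.
assert "anli" not in train_tasks and "snli" not in train_tasks
```

Holding out whole clusters is a stricter test than holding out individual datasets, since near-duplicate tasks within a cluster can otherwise inflate "zero-shot" scores.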

| Held-out cluster | Example tasks | FLAN result |
| --- | --- | --- |
| Reading comprehension | Natural Questions, TriviaQA | +15% over untuned |
| Commonsense | HellaSwag, PIQA | +18% over untuned |
| Closed-book QA | Natural Questions (closed-book) | +22% over untuned |
| NLI | ANLI, SNLI | +25% over untuned |
| Coreference | Winogrande | +12% over untuned |

Scaling: Flan-T5 and Flan-PaLM

Chung et al. (2022) extended instruction tuning to 1,836 tasks and applied it to models up to 540B parameters, finding continued improvement with both model scale and task count:

| Model | Fine-tuning tasks | MMLU 5-shot | BBH 3-shot |
| --- | --- | --- | --- |
| PaLM 62B | 0 | 52.9% | 35.2% |
| Flan-PaLM 62B | 1,836 | 59.6% | 45.9% |
| PaLM 540B | 0 | 69.3% | 52.3% |
| Flan-PaLM 540B | 1,836 | 73.5% | 66.3% |

Instruction tuning improves both zero-shot and few-shot performance at all tested scales.

Comparison to Alternatives

| Method | Data required | Human labeling | Generalizes zero-shot |
| --- | --- | --- | --- |
| Standard fine-tuning | Task-specific pairs | No (uses existing datasets) | No |
| Instruction tuning | Multi-task instruction pairs | Minimal (repurposes datasets) | Yes |
| RLHF | Preference comparisons | Yes (human raters) | Improves alignment |
| Prompt engineering | No training | No | Limited |

See fine-tuning for the general fine-tuning framework, rlhf for the reinforcement learning from human feedback method that builds on instruction tuning, and alignment-problem for why zero-shot generalization is central to the alignment challenge.



Frequently Asked Questions

What is the difference between instruction tuning and standard fine-tuning?

Standard fine-tuning adapts a model to a single task by continuing training on that task's labeled data. Instruction tuning trains on a large collection of tasks formatted as natural language instructions, explicitly optimizing for generalization. The key result (Wei et al., 2022): models fine-tuned on 62 instruction-formatted tasks generalize zero-shot to held-out tasks, while the same model without instruction tuning does not. Instruction tuning teaches the model to 'follow instructions' as a meta-skill.

Why does instruction tuning only help larger models?

Wei et al. (2022) showed in a scaling ablation that instruction tuning actually degrades held-out performance for the smaller models they tested (roughly 8B parameters and below), with benefits appearing only at the largest scales (68B and 137B). For small models, instruction tuning on many tasks causes interference: the model memorizes task-specific patterns rather than learning generalizable instruction-following. At larger scales, the model has enough capacity to abstract the meta-skill of following natural language instructions.

How much does prompt/instruction format matter?

Sanh et al. (2022) showed that prompting format significantly affects zero-shot transfer. They trained T0 on 171 datasets each with multiple human-written prompt templates, forcing the model to be robust to natural variation in how instructions are expressed. This prompt-variety training improved generalization compared to training on a single canonical format per task.
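The prompt-variety idea can be sketched by rendering one example through several templates. The template strings below are invented for illustration (T0's actual templates came from a pool of human-written prompts):

```python
# Illustrative sketch of T0-style prompt variety: render the same NLI
# example through several human-written templates, so training exposes
# the model to varied phrasings of one underlying task.
# These templates are made up for illustration.

TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis?",
    '{premise} Based on that, is it true that "{hypothesis}"?',
    "Suppose {premise} Can we infer that {hypothesis}?",
]

def render_all(example, templates=TEMPLATES):
    """Return one instruction-formatted string per template."""
    return [t.format(**example) for t in templates]

variants = render_all({
    "premise": "A dog is running.",
    "hypothesis": "An animal is moving.",
})
assert len(variants) == 3  # one training example per template
```

Each rendered variant becomes its own training example, so the model cannot latch onto a single canonical wording.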
