Instruction Tuning: Zero-Shot Generalization via Multi-Task Fine-Tuning
Wei et al. (2022) FLAN: instruction-tuning on 62 tasks improves zero-shot performance on 25 of 25 held-out tasks; 137B FLAN outperforms GPT-3 175B zero-shot on 20 of 25 tasks.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| FLAN task count (original) | 62 | tasks | Wei et al. (2022); tasks grouped into 12 clusters; held-out tasks tested zero-shot |
| FLAN zero-shot improvement rate | 25/25 | held-out tasks | FLAN 137B outperforms untuned 137B on all 25 held-out tasks zero-shot |
| FLAN-T5 (Flan-PaLM) task count | 1,836 | fine-tuning tasks | Chung et al. (2022): scaling to 1836 tasks further improves zero-shot and few-shot performance |
| T0 training tasks | 171 | datasets | Sanh et al. (2022) T0: trained on 171 prompted datasets; zero-shot on 4 held-out SuperGLUE tasks |
| Flan-PaLM MMLU improvement | +4.2% | accuracy (5-shot) | Chung et al.: Flan-PaLM 540B vs PaLM 540B on MMLU 5-shot, 73.5% vs 69.3%; instruction tuning improves few-shot performance too |
Instruction tuning (also called instruction fine-tuning or multi-task prompted training) is a post-pretraining phase that dramatically improves a language model’s ability to follow novel instructions zero-shot. By training on diverse instruction-formatted tasks, models learn a general skill of interpreting and executing natural language directives.
The Core Setup
An instruction-tuned model is trained on examples of the form:
[Instruction]: Translate the following English sentence to French.
[Input]: The cat sat on the mat.
[Output]: Le chat était assis sur le tapis.
Tasks are reformatted from existing datasets (NLI, reading comprehension, summarization, translation, commonsense reasoning, etc.) into instruction form. The model is then trained with standard cross-entropy loss on the output tokens.
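As a minimal sketch of this setup (not the actual FLAN pipeline), the formatting step and the output-only loss mask might look as follows; tokenization is faked with whitespace splitting for illustration:

```python
# Sketch: format one dataset example as an instruction-tuning record and
# mark which tokens receive cross-entropy loss (output tokens only).

def format_example(instruction, input_text, output_text):
    """Assemble the instruction-formatted training string."""
    prompt = f"[Instruction]: {instruction}\n[Input]: {input_text}\n[Output]: "
    return prompt, prompt + output_text

def loss_mask(prompt, full_text):
    """Loss mask is 0 over prompt tokens and 1 over output tokens, so the
    cross-entropy loss is computed on the target completion only."""
    prompt_len = len(prompt.split())
    tokens = full_text.split()
    return tokens, [0] * prompt_len + [1] * (len(tokens) - prompt_len)

prompt, full = format_example(
    "Translate the following English sentence to French.",
    "The cat sat on the mat.",
    "Le chat était assis sur le tapis.",
)
tokens, mask = loss_mask(prompt, full)
assert sum(mask) == 7  # loss applies only to the 7 output tokens
```

In a real implementation the mask would be applied over model token IDs (e.g. via an ignore index in the loss), but the principle is the same: the prompt conditions the model, and only the output is supervised.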
FLAN: Finetuned Language Models Are Zero-Shot Learners
Wei et al. (2022) applied instruction tuning to a 137B parameter pretrained language model using 62 tasks grouped into 12 clusters. They evaluated zero-shot on held-out tasks not seen during tuning:
| Model | Zero-shot avg (held-out) | Few-shot avg (held-out) |
|---|---|---|
| GPT-3 175B (no tuning) | baseline | baseline |
| LaMDA-PT 137B (no tuning) | lower | lower |
| FLAN 137B | +29.8% vs LaMDA-PT | +16.7% |
FLAN 137B outperforms GPT-3 175B zero-shot on 20 of 25 held-out tasks — despite using fewer parameters — demonstrating that instruction tuning is more efficient than raw scale for zero-shot generalization.
Cluster Held-Out Evaluation Design
A key methodological contribution: tasks are grouped into semantic clusters, and entire clusters are held out at test time (not just individual tasks). This prevents data contamination from tasks that differ only superficially.
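The cluster-held-out split can be sketched as follows (the cluster assignments below are illustrative; the real FLAN grouping has 12 clusters over 62 tasks):

```python
# Sketch: hold out an ENTIRE task cluster at evaluation time, so no
# superficially different sibling task can leak into training.

TASK_CLUSTERS = {
    "nli": ["anli", "snli", "rte"],
    "commonsense": ["hellaswag", "piqa"],
    "translation": ["wmt16_en_de", "wmt14_en_fr"],
    "summarization": ["cnn_dailymail", "xsum"],
}

def held_out_split(clusters, eval_cluster):
    """Train on every cluster except eval_cluster; test on eval_cluster."""
    train = [t for c, tasks in clusters.items() if c != eval_cluster
             for t in tasks]
    test = clusters[eval_cluster]
    return train, test

train_tasks, test_tasks = held_out_split(TASK_CLUSTERS, "nli")
assert not set(train_tasks) & set(test_tasks)  # no overlap by construction
```

Holding out the whole cluster (rather than one dataset) is what makes the zero-shot claim credible: a model that saw SNLI during training would not be a fair zero-shot test subject for ANLI.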
| Held-out cluster | Example tasks | FLAN result |
|---|---|---|
| Reading comprehension | Natural Questions, TriviaQA | +15% over untuned |
| Commonsense | HellaSwag, PiQA | +18% over untuned |
| Closed-book QA | Natural Questions (closed-book) | +22% over untuned |
| NLI | ANLI, SNLI | +25% over untuned |
| Coreference | Winogrande | +12% over untuned |
Scaling: Flan-T5 and Flan-PaLM
Chung et al. (2022) extended instruction tuning to 1,836 tasks and applied it to models up to 540B parameters, finding continued improvement with both model scale and task count:
| Model | Tasks | MMLU 5-shot | BBH 3-shot |
|---|---|---|---|
| PaLM 62B | 0 | 52.9% | 35.2% |
| Flan-PaLM 62B | 1836 | 59.6% | 45.9% |
| PaLM 540B | 0 | 69.3% | 52.3% |
| Flan-PaLM 540B | 1836 | 73.5% | 66.3% |
Instruction tuning improves both zero-shot and few-shot performance at all tested scales.
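A practical question these large mixtures raise is how to combine tasks of wildly different sizes. A common recipe, described in the FLAN line of work as examples-proportional mixing with a per-task cap, can be sketched as below (the cap value is illustrative, not the published setting):

```python
# Sketch: examples-proportional mixing with a per-task cap, so huge
# datasets cannot drown out small ones in the multi-task training stream.

def mixing_weights(task_sizes, cap=3000):
    """Sample each task proportionally to min(size, cap)."""
    capped = {t: min(n, cap) for t, n in task_sizes.items()}
    total = sum(capped.values())
    return {t: n / total for t, n in capped.items()}

weights = mixing_weights({"nli_big": 500_000, "qa_small": 1_000, "sum_mid": 3_500})
assert abs(sum(weights.values()) - 1.0) < 1e-9
assert weights["nli_big"] == weights["sum_mid"]  # both datasets hit the cap
```

Without the cap, a 500k-example dataset would dominate sampling 500:1 over a 1k-example dataset; with it, small tasks retain a meaningful share of each training batch.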
Comparison to Alternatives
| Method | Data required | Human labeling | Generalizes zero-shot |
|---|---|---|---|
| Standard fine-tuning | Task-specific pairs | No (uses existing datasets) | No |
| Instruction tuning | Multi-task instruction pairs | Minimal (repurposes datasets) | Yes |
| RLHF | Preference comparisons | Yes (human raters) | Improves alignment |
| Prompt engineering | No training | No | Limited |
Related Pages
See fine-tuning for the general fine-tuning framework, rlhf for the reinforcement learning from human feedback method that builds on instruction tuning, and alignment-problem for why zero-shot generalization is central to the alignment challenge.
Sources
- Wei et al. (2022) — Finetuned Language Models are Zero-Shot Learners. ICLR 2022
- Sanh et al. (2022) — Multitask Prompted Training Enables Zero-Shot Task Generalization. ICLR 2022
- Chung et al. (2022) — Scaling Instruction-Finetuned Language Models. arXiv
Frequently Asked Questions
What is the difference between instruction tuning and standard fine-tuning?
Standard fine-tuning adapts a model to a single task by continuing training on that task's labeled data. Instruction tuning trains on a large collection of tasks formatted as natural language instructions, explicitly optimizing for generalization. The key result (Wei et al., 2022): models fine-tuned on 62 instruction-formatted tasks generalize zero-shot to held-out tasks, while the same model without instruction tuning does not generalize. Instruction tuning teaches the model to 'follow instructions' as a meta-skill.
Why does instruction tuning only help larger models?
Wei et al. (2022) showed that instruction tuning actually degrades held-out performance for the smaller models they tested (roughly 8B parameters and below), with gains appearing only at the larger scales tested (68B and 137B). For small models, instruction tuning on many tasks causes interference — the model's capacity goes to memorizing task-specific patterns rather than learning generalizable instruction-following. At sufficient scale, the model has enough capacity to abstract the meta-skill of following natural language instructions.
Why does prompt/instruction format matter?
Sanh et al. (2022) showed that prompting format significantly affects zero-shot transfer. They trained T0 on 171 datasets each with multiple human-written prompt templates, forcing the model to be robust to natural variation in how instructions are expressed. This prompt-variety training improved generalization compared to training on a single canonical format per task.
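This prompt-variety idea can be sketched as follows; the templates below are invented for illustration, not taken from the actual P3 prompt collection:

```python
# Sketch of T0-style prompt variety: one dataset, several human-written
# templates, each rendering the same underlying example differently.

NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis?",
    "Suppose {premise} Can we infer that {hypothesis}?",
    '{premise} Question: is "{hypothesis}" true, false, or neither?',
]

def render_all(example, templates):
    """Render one example under every template; during training a template
    is typically sampled per example, so the model sees varied phrasings."""
    return [t.format(**example) for t in templates]

ex = {"premise": "A dog is running.", "hypothesis": "An animal is moving."}
prompts = render_all(ex, NLI_TEMPLATES)
assert len(set(prompts)) == 3  # three distinct phrasings of one example
```

Training across such variation forces the model to key on the semantics of the instruction rather than on one canonical surface form.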