Fine-Tuning Language Models: Full, Adapter, and Parameter-Efficient Methods
Howard & Ruder's ULMFiT (2018) established pretraining + fine-tuning as the dominant NLP paradigm; parameter-efficient fine-tuning (PEFT) methods such as LoRA and adapters come within roughly 1% of full fine-tuning quality while updating <1% of parameters.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| ULMFiT classification error reduction | 18–24% | error reduction | Howard & Ruder (2018): across 6 text classification datasets vs training from scratch |
| Adapter parameter overhead | 0.5–3.6% | additional parameters | Houlsby et al. (2019): bottleneck adapters within 0.4% of full fine-tuning on GLUE |
| Prompt tuning parity threshold | ~10B | model parameters | Lester et al. (2021): prompt tuning matches fine-tuning only at ≥10B scale |
| ULMFiT training data reduction | 100× | less labeled data | Howard & Ruder: pretraining enables competitive performance with 100× less task-labeled data |
| Catastrophic forgetting | Present without mitigation | — | Full fine-tuning on a new task degrades pretraining knowledge; mitigate with lower LR, freezing |
Fine-tuning is the dominant paradigm for adapting large pretrained language models to specific tasks. Rather than training task-specific models from scratch, fine-tuning continues gradient-based optimization on task data starting from pretrained weights, leveraging the representations learned during pretraining.
The Pretraining + Fine-Tuning Paradigm
Howard & Ruder (2018) formalized the pretraining + fine-tuning recipe for NLP:
- Pretrain a language model on large general-domain text (WikiText-103 in ULMFiT; web-scale corpora for modern LLMs)
- Language model fine-tuning: continue training on task-domain text without labels (optional domain adaptation step)
- Classifier fine-tuning: train a task-specific head on labeled task data
The result: 18–24% error reduction across 6 text classification datasets versus training from scratch, with 100× less labeled data required.
Fine-Tuning Spectrum
| Method | Params updated | Inference cost | Memory | Isolation |
|---|---|---|---|---|
| Full fine-tuning | 100% | Same as base | High (all optimizer states) | None |
| Layer freezing (top-N layers) | 10–30% | Same as base | Medium | Partial |
| Adapter (Houlsby et al.) | 0.5–3.6% | Slight overhead | Low | High |
| LoRA (Hu et al.) | 0.01–1% | Same (after merge) | Low | Medium |
| Prefix tuning (Li & Liang) | <0.1% | Slight overhead | Very low | Medium |
| Prompt tuning (Lester et al.) | <0.01% | Same | Minimal | Low |
Adapter Layers (Houlsby et al., 2019)
Adapters insert small bottleneck modules within each transformer layer. Each adapter is:
Adapter(h) = h + W_up · ReLU(W_down · h), with W_down ∈ ℝ^{d×r} and W_up ∈ ℝ^{r×d}
where r ≪ d is the bottleneck dimension (typically 64 for d = 1024). Only the adapter weights (and, in Houlsby et al.'s setup, the layer-norm parameters) are trained; all other parameters are frozen.
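The bottleneck-plus-residual structure above can be sketched in a few lines of NumPy. This is an illustrative forward pass only (shapes and the zero-initialized up-projection are assumptions consistent with the near-identity initialization described by Houlsby et al., not the paper's exact code):

```python
import numpy as np

def adapter_forward(h, W_down, W_up):
    """Bottleneck adapter: down-project, ReLU, up-project, residual add.
    h: (batch, d); W_down: (d, r); W_up: (r, d). Only these two matrices
    would be trained; the surrounding transformer stays frozen."""
    z = np.maximum(h @ W_down, 0.0)    # ReLU in the r-dimensional bottleneck
    return h + z @ W_up                # residual connection back to width d

d, r = 1024, 64                        # r << d, as in the text
rng = np.random.default_rng(0)
h = rng.standard_normal((2, d))
W_down = 0.01 * rng.standard_normal((d, r))
W_up = np.zeros((r, d))                # zero-init up-projection: adapter starts as a no-op
out = adapter_forward(h, W_down, W_up)
```

With the up-projection initialized to zero, the adapter is exactly the identity at the start of training, so inserting it does not perturb the pretrained model's behavior.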
| Adapter bottleneck r | GLUE avg | Additional params |
|---|---|---|
| Full fine-tuning | 80.0 | +0% |
| r = 256 | 79.9 | +3.6% |
| r = 64 | 79.6 | +0.9% |
| r = 8 | 79.2 | +0.1% |
At r = 8, adapters stay within 0.8 GLUE points of full fine-tuning while adding only 0.1% additional parameters.
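As a back-of-envelope check on the overhead column, adapter size scales linearly in r. The sketch below assumes BERT-large-like dimensions and two adapters per transformer layer; the percentages Houlsby et al. report also count other trained parameters (e.g. layer norms, task heads), so these rough figures will not match the table exactly:

```python
# Back-of-envelope adapter overhead (illustrative dims; not the paper's exact accounting)
d, L, base_params = 1024, 24, 340_000_000   # assumed BERT-large-like hidden size, layers, total params

def adapter_overhead(r, adapters_per_layer=2):
    # each adapter: d*r (down-projection) + r*d (up-projection) + biases (r + d)
    per_adapter = 2 * d * r + r + d
    return adapters_per_layer * L * per_adapter / base_params

overheads = {r: adapter_overhead(r) for r in (8, 64, 256)}
for r, frac in overheads.items():
    print(f"r={r:3d}: +{frac:.2%} of base parameters")
```

The key qualitative point survives the rough accounting: overhead grows linearly in r, and even the largest bottleneck stays a small fraction of the base model.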
ULMFiT Discriminative Learning Rates
Howard & Ruder’s key insight for full fine-tuning: use different learning rates per layer, decreasing for earlier (more general) layers:
η_l = η_{L} / 2.6^{L−l}
where L is the total number of layers and l indexes the current layer. Layers near the output are updated most aggressively, while lower layers, which encode more general features, receive smaller updates and so preserve their representations.
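The schedule is simple enough to compute directly. A minimal sketch of the per-layer rates implied by η_l = η_L / 2.6^{L−l} (function name and layer indexing are mine, not ULMFiT's code):

```python
# Discriminative learning rates: eta_l = eta_top / 2.6 ** (L - l)
def layer_lrs(eta_top, num_layers, decay=2.6):
    """Per-layer LRs; layer 1 is closest to the input, layer num_layers is the top."""
    return {l: eta_top / decay ** (num_layers - l) for l in range(1, num_layers + 1)}

lrs = layer_lrs(eta_top=0.01, num_layers=4)
# the top layer trains at the full rate; each layer below is divided by 2.6
```

Each step down the stack divides the rate by 2.6, so with even a handful of layers the lowest layers move orders of magnitude more slowly than the head.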
Prompt Tuning (Lester et al., 2021)
Prompt tuning prepends a small number of learned “soft prompt” tokens to the input. Only these prefix embeddings are trained; the entire model is frozen:
| Scale | Prompt tuning accuracy | Model fine-tuning accuracy |
|---|---|---|
| 770M params | 89.5% | 91.3% |
| 3B params | 91.4% | 92.5% |
| 11B params | 92.7% | 92.5% |
At 11B scale, prompt tuning matches (and here slightly exceeds) model fine-tuning — but it underperforms at smaller scales. This strong scale dependence limits practical applicability.
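Mechanically, prompt tuning just prepends trainable embeddings to the frozen model's input sequence. A toy sketch with assumed shapes (not tied to any specific library or to Lester et al.'s implementation):

```python
import numpy as np

# Toy soft-prompt construction: only `soft_prompt` would receive gradients
d, n_prompt, seq_len = 16, 5, 10
rng = np.random.default_rng(0)
soft_prompt = rng.standard_normal((n_prompt, d))    # the ONLY trainable tensor
token_embeds = rng.standard_normal((seq_len, d))    # frozen embeddings of the real input

# the frozen model consumes the concatenation; its own weights never change
model_input = np.concatenate([soft_prompt, token_embeds], axis=0)
```

Because the trainable state is just n_prompt × d numbers, dozens of task-specific prompts can be stored for a single frozen model at negligible cost.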
Practical Considerations
| Decision | Recommendation |
|---|---|
| Learning rate | 10×–100× lower than pretraining LR |
| Warm-up | 6% of training steps |
| Batch size | Larger than pretraining to stabilize gradients |
| Epochs | 1–3 epochs; more risks overfitting |
| Regularization | Weight decay 0.1, dropout 0.1 |
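The warm-up recommendation above can be turned into a concrete schedule. This sketch pairs the 6% linear warm-up with a linear decay to zero — the decay shape is my assumption, since the table does not specify one:

```python
# Linear warm-up over the first 6% of steps, then (assumed) linear decay to zero
def lr_at(step, total_steps, peak_lr, warmup_frac=0.06):
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)   # ramp up from 0
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

total, peak = 1000, 2e-5   # peak LR chosen 10x-100x below a typical pretraining LR
```

Plugging in a few steps confirms the shape: zero at the start, the peak exactly at the end of warm-up, then a steady decline.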
Related Pages
See lora-fine-tuning for the parameter-efficient LoRA method, instruction-tuning for multi-task instruction fine-tuning, and pre-training for the pretraining stage that fine-tuning builds upon.
Sources
- Howard & Ruder (2018) — Universal Language Model Fine-Tuning for Text Classification. ACL 2018
- Houlsby et al. (2019) — Parameter-Efficient Transfer Learning for NLP. ICML 2019
- Lester et al. (2021) — The Power of Scale for Parameter-Efficient Prompt Tuning. EMNLP 2021
Frequently Asked Questions
What is catastrophic forgetting in fine-tuning and how is it mitigated?
Catastrophic forgetting occurs when fine-tuning on a new task overwrites the pretrained weights, degrading general capabilities. Mitigations include: (1) using a much lower learning rate during fine-tuning than pretraining (Howard & Ruder use discriminative learning rates, dividing the LR by 2.6 for each layer closer to the input); (2) gradual unfreezing — fine-tuning only the top layers first, then progressively unfreezing deeper layers; (3) PEFT methods (LoRA, adapters) that freeze pretrained weights entirely; (4) multi-task training that keeps the model exposed to diverse tasks.
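The gradual-unfreezing schedule in mitigation (2) is easy to express as a function from epoch to trainable layer groups. A minimal sketch (one group unfrozen per epoch, as in ULMFiT; function name and indexing are mine):

```python
# Gradual unfreezing: thaw one layer group per epoch, starting from the top
def trainable_layers(epoch, num_layers):
    """Layer indices trainable at a given epoch (layer num_layers is the top)."""
    k = min(epoch, num_layers)                       # number of unfrozen top groups
    return list(range(num_layers - k + 1, num_layers + 1))
```

Early epochs adapt only the task-specific top of the network; lower, more general layers thaw last, which limits how far they can drift from their pretrained values.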
When should you use full fine-tuning vs LoRA vs adapters vs prompt tuning?
Full fine-tuning: when you have sufficient compute, a large labeled dataset, and need maximum task performance. LoRA: when you need to maintain multiple task-specific variants of a model, or when memory is limited — same inference cost after weight merging. Adapters (Houlsby): when you need strict parameter isolation between tasks (adapters are modular and can be swapped). Prompt tuning: only at very large scale (≥10B parameters); simpler but underperforms at smaller scales. In practice, LoRA has largely superseded adapters due to lower overhead and no inference latency.