Fine-Tuning Language Models: Full, Adapter, and Parameter-Efficient Methods
Howard & Ruder's ULMFiT (2018) established pretraining + fine-tuning as the dominant NLP paradigm; parameter-efficient fine-tuning (PEFT) methods such as LoRA and adapters come within roughly 1% of full fine-tuning quality while updating <1% of parameters.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| ULMFiT classification error reduction | 18–24% | error reduction | Howard & Ruder (2018): across 6 text classification datasets vs training from scratch |
| Adapter parameter overhead | 0.5–3.6% | additional parameters | Houlsby et al. (2019): bottleneck adapters within 0.4% of full fine-tuning on GLUE |
| Prompt tuning parity threshold | ~10B | model parameters | Lester et al. (2021): prompt tuning matches fine-tuning only at ≥10B scale |
| ULMFiT training data reduction | 100× | less labeled data | Howard & Ruder: pretraining enables competitive performance with 100× less task-labeled data |
| Catastrophic forgetting | Present without mitigation | — | Full fine-tuning on a new task degrades pretraining knowledge; mitigate with lower LR, freezing |
Fine-tuning is the dominant paradigm for adapting large pretrained language models to specific tasks. Rather than training task-specific models from scratch, fine-tuning continues gradient-based optimization on task data starting from pretrained weights, leveraging the representations learned during pretraining.
The Pretraining + Fine-Tuning Paradigm
Howard & Ruder (2018) formalized the pretraining + fine-tuning recipe for NLP:
- Pretrain a language model on large general-domain text (WikiText-103 in ULMFiT; web-scale corpora for modern LLMs)
- Language model fine-tuning: continue training on task-domain text without labels (optional domain adaptation step)
- Classifier fine-tuning: train a task-specific head on labeled task data
The result: 18–24% error reduction across 6 text classification datasets versus training from scratch, with 100× less labeled data required.
Fine-Tuning Spectrum
| Method | Params updated | Inference cost | Memory | Isolation |
|---|---|---|---|---|
| Full fine-tuning | 100% | Same as base | High (all optimizer states) | None |
| Layer freezing (top-N layers) | 10–30% | Same as base | Medium | Partial |
| Adapter (Houlsby et al.) | 0.5–3.6% | Slight overhead | Low | High |
| LoRA (Hu et al.) | 0.01–1% | Same (after merge) | Low | Medium |
| Prefix tuning (Li & Liang) | <0.1% | Slight overhead | Very low | Medium |
| Prompt tuning (Lester et al.) | <0.01% | Same | Minimal | Low |
Adapter Layers (Houlsby et al., 2019)
Adapters insert small bottleneck modules within each transformer layer. Each adapter is:
Adapter(h) = h + W_up · ReLU(W_down · h), with W_down ∈ ℝ^{d×r} and W_up ∈ ℝ^{r×d}
where r ≪ d is the bottleneck dimension (typically 64 for d = 1024). Only the adapter weights (and, in Houlsby et al.'s setup, the layer-norm parameters) are trained; all other parameters are frozen.
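The bottleneck-plus-residual structure above can be sketched in a few lines of NumPy. This is an illustrative forward pass only (shapes and the zero-initialized up-projection are assumptions consistent with the near-identity initialization described by Houlsby et al., not the paper's exact code):

```python
import numpy as np

def adapter_forward(h, W_down, W_up):
    """Bottleneck adapter: down-project, ReLU, up-project, residual add.
    h: (batch, d); W_down: (d, r); W_up: (r, d). Only these two matrices
    would be trained; the surrounding transformer stays frozen."""
    z = np.maximum(h @ W_down, 0.0)    # ReLU in the r-dimensional bottleneck
    return h + z @ W_up                # residual connection back to width d

d, r = 1024, 64                        # r << d, as in the text
rng = np.random.default_rng(0)
h = rng.standard_normal((2, d))
W_down = 0.01 * rng.standard_normal((d, r))
W_up = np.zeros((r, d))                # zero-init up-projection: adapter starts as a no-op
out = adapter_forward(h, W_down, W_up)
```

With the up-projection initialized to zero, the adapter is exactly the identity at the start of training, so inserting it does not perturb the pretrained model's behavior.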
| Adapter bottleneck r | GLUE avg | Additional params |
|---|---|---|
| Full fine-tuning | 80.0 | +0% |
| r = 256 | 79.9 | +3.6% |
| r = 64 | 79.6 | +0.9% |
| r = 8 | 79.2 | +0.1% |
At r = 8, adapters stay within 0.8 GLUE points of full fine-tuning while adding only 0.1% additional parameters.
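As a back-of-envelope check on the overhead column, adapter size scales linearly in r. The sketch below assumes BERT-large-like dimensions and two adapters per transformer layer; the percentages Houlsby et al. report also count other trained parameters (e.g. layer norms, task heads), so these rough figures will not match the table exactly:

```python
# Back-of-envelope adapter overhead (illustrative dims; not the paper's exact accounting)
d, L, base_params = 1024, 24, 340_000_000   # assumed BERT-large-like hidden size, layers, total params

def adapter_overhead(r, adapters_per_layer=2):
    # each adapter: d*r (down-projection) + r*d (up-projection) + biases (r + d)
    per_adapter = 2 * d * r + r + d
    return adapters_per_layer * L * per_adapter / base_params

overheads = {r: adapter_overhead(r) for r in (8, 64, 256)}
for r, frac in overheads.items():
    print(f"r={r:3d}: +{frac:.2%} of base parameters")
```

The key qualitative point survives the rough accounting: overhead grows linearly in r, and even the largest bottleneck stays a small fraction of the base model.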
ULMFiT Discriminative Learning Rates
Howard & Ruder’s key insight for full fine-tuning: use different learning rates per layer, decreasing for earlier (more general) layers:
η_l = η_{L} / 2.6^{L−l}
where L is the total number of layers and l indexes the current layer. Layers near the output are updated most aggressively, while lower layers, which encode more general features, receive smaller updates and so preserve their representations.
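The schedule is simple enough to compute directly. A minimal sketch of the per-layer rates implied by η_l = η_L / 2.6^{L−l} (function name and layer indexing are mine, not ULMFiT's code):

```python
# Discriminative learning rates: eta_l = eta_top / 2.6 ** (L - l)
def layer_lrs(eta_top, num_layers, decay=2.6):
    """Per-layer LRs; layer 1 is closest to the input, layer num_layers is the top."""
    return {l: eta_top / decay ** (num_layers - l) for l in range(1, num_layers + 1)}

lrs = layer_lrs(eta_top=0.01, num_layers=4)
# the top layer trains at the full rate; each layer below is divided by 2.6
```

Each step down the stack divides the rate by 2.6, so with even a handful of layers the lowest layers move orders of magnitude more slowly than the head.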
Prompt Tuning (Lester et al., 2021)
Prompt tuning prepends a small number of learned “soft prompt” tokens to the input. Only these prefix embeddings are trained; the entire model is frozen:
| Scale | Prompt tuning accuracy | Model fine-tuning accuracy |
|---|---|---|
| 770M params | 89.5% | 91.3% |
| 3B params | 91.4% | 92.5% |
| 11B params | 92.7% | 92.5% |
At 11B scale, prompt tuning matches (and here slightly exceeds) model fine-tuning — but it underperforms at smaller scales. This strong scale dependence limits practical applicability.
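Mechanically, prompt tuning just prepends trainable embeddings to the frozen model's input sequence. A toy sketch with assumed shapes (not tied to any specific library or to Lester et al.'s implementation):

```python
import numpy as np

# Toy soft-prompt construction: only `soft_prompt` would receive gradients
d, n_prompt, seq_len = 16, 5, 10
rng = np.random.default_rng(0)
soft_prompt = rng.standard_normal((n_prompt, d))    # the ONLY trainable tensor
token_embeds = rng.standard_normal((seq_len, d))    # frozen embeddings of the real input

# the frozen model consumes the concatenation; its own weights never change
model_input = np.concatenate([soft_prompt, token_embeds], axis=0)
```

Because the trainable state is just n_prompt × d numbers, dozens of task-specific prompts can be stored for a single frozen model at negligible cost.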
Practical Considerations
| Decision | Recommendation |
|---|---|
| Learning rate | 10×–100× lower than pretraining LR |
| Warm-up | 6% of training steps |
| Batch size | Larger than pretraining to stabilize gradients |
| Epochs | 1–3 epochs; more risks overfitting |
| Regularization | Weight decay 0.1, dropout 0.1 |
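The warm-up recommendation above can be turned into a concrete schedule. This sketch pairs the 6% linear warm-up with a linear decay to zero — the decay shape is my assumption, since the table does not specify one:

```python
# Linear warm-up over the first 6% of steps, then (assumed) linear decay to zero
def lr_at(step, total_steps, peak_lr, warmup_frac=0.06):
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)   # ramp up from 0
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

total, peak = 1000, 2e-5   # peak LR chosen 10x-100x below a typical pretraining LR
```

Plugging in a few steps confirms the shape: zero at the start, the peak exactly at the end of warm-up, then a steady decline.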
Related Pages
See lora-fine-tuning for the parameter-efficient LoRA method, instruction-tuning for multi-task instruction fine-tuning, and pre-training for the pretraining stage that fine-tuning builds upon.
Sources
- Howard & Ruder (2018) — Universal Language Model Fine-Tuning for Text Classification. ACL 2018
- Houlsby et al. (2019) — Parameter-Efficient Transfer Learning for NLP. ICML 2019
- Lester et al. (2021) — The Power of Scale for Parameter-Efficient Prompt Tuning. EMNLP 2021
Frequently Asked Questions
What is catastrophic forgetting in fine-tuning and how is it mitigated?
Catastrophic forgetting occurs when fine-tuning on a new task overwrites the pretrained weights, degrading general capabilities. Mitigations include: (1) using a much lower learning rate during fine-tuning than pretraining (Howard & Ruder use discriminative learning rates, dividing the LR by 2.6 for each layer closer to the input); (2) gradual unfreezing — fine-tuning only the top layers first, then progressively unfreezing deeper layers; (3) PEFT methods (LoRA, adapters) that freeze pretrained weights entirely; (4) multi-task training that keeps the model exposed to diverse tasks.
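The gradual-unfreezing schedule in mitigation (2) is easy to express as a function from epoch to trainable layer groups. A minimal sketch (one group unfrozen per epoch, as in ULMFiT; function name and indexing are mine):

```python
# Gradual unfreezing: thaw one layer group per epoch, starting from the top
def trainable_layers(epoch, num_layers):
    """Layer indices trainable at a given epoch (layer num_layers is the top)."""
    k = min(epoch, num_layers)                       # number of unfrozen top groups
    return list(range(num_layers - k + 1, num_layers + 1))
```

Early epochs adapt only the task-specific top of the network; lower, more general layers thaw last, which limits how far they can drift from their pretrained values.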
When should you use full fine-tuning vs LoRA vs adapters vs prompt tuning?
Full fine-tuning: when you have sufficient compute, a large labeled dataset, and need maximum task performance. LoRA: when you need to maintain multiple task-specific variants of a model, or when memory is limited — same inference cost after weight merging. Adapters (Houlsby): when you need strict parameter isolation between tasks (adapters are modular and can be swapped). Prompt tuning: only at very large scale (≥10B parameters); simpler but underperforms at smaller scales. In practice, LoRA has largely superseded adapters due to lower overhead and no inference latency.