Fine-Tuning Language Models: Full, Adapter, and Parameter-Efficient Methods

Category: alignment Updated: 2026-02-27

ULMFiT (Howard & Ruder, 2018) established pretraining + fine-tuning as the dominant NLP paradigm; PEFT methods (LoRA, adapters) achieve within 1% of full fine-tuning quality while updating <1% of parameters.

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| ULMFiT classification error reduction | 18–24% | error reduction | Howard & Ruder (2018): across 6 text classification datasets vs training from scratch |
| Adapter parameter overhead | 0.5–3.6% | additional parameters | Houlsby et al. (2019): bottleneck adapters within 0.4% of full fine-tuning on GLUE |
| Prompt tuning parity threshold | ~10B | model parameters | Lester et al. (2021): prompt tuning matches fine-tuning only at ≥10B scale |
| ULMFiT training data reduction | 100× | less labeled data | Howard & Ruder: pretraining enables competitive performance with 100× less task-labeled data |
| Catastrophic forgetting | Present without mitigation | n/a | Full fine-tuning on a new task degrades performance on pretraining knowledge; use lower LR, freezing |

Fine-tuning is the dominant paradigm for adapting large pretrained language models to specific tasks. Rather than training task-specific models from scratch, fine-tuning continues gradient-based optimization on task data starting from pretrained weights, leveraging the representations learned during pretraining.

The Pretraining + Fine-Tuning Paradigm

Howard & Ruder (2018) formalized the pretraining + fine-tuning recipe for NLP:

  1. Pretrain a language model on large general-domain text (WikiText-103 in ULMFiT; web-scale corpora for modern LLMs)
  2. Language model fine-tuning: continue training on task-domain text without labels (optional domain adaptation step)
  3. Classifier fine-tuning: train a task-specific head on labeled task data

The result: 18–24% error reduction across 6 text classification datasets versus training from scratch, with 100× less labeled data required.

Fine-Tuning Spectrum

| Method | Params updated | Inference cost | Memory | Isolation |
|---|---|---|---|---|
| Full fine-tuning | 100% | Same as base | High (all optimizer states) | None |
| Layer freezing (top-N layers) | 10–30% | Same as base | Medium | Partial |
| Adapter (Houlsby et al.) | 0.5–3.6% | Slight overhead | Low | High |
| LoRA (Hu et al.) | 0.01–1% | Same (after merge) | Low | Medium |
| Prefix tuning (Li & Liang) | <0.1% | Slight overhead | Very low | Medium |
| Prompt tuning (Lester et al.) | <0.01% | Same | Minimal | Low |
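One row worth unpacking is LoRA's "same (after merge)" inference cost. A minimal NumPy sketch (dimensions, scales, and initialization below are illustrative assumptions, not taken from Hu et al.) shows why: the low-rank update BA can be folded into the frozen weight once after training, so serving uses a single dense matmul with no extra path:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 1024, 8                            # hidden size, low rank (r << d); illustrative

W = rng.standard_normal((d, d))           # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01    # stand-ins for trained LoRA factors
B = rng.standard_normal((d, r)) * 0.01

x = rng.standard_normal(d)

# During training: frozen weight plus a trainable low-rank side path
y_train = W @ x + B @ (A @ x)

# After training: merge the update once; inference is one matmul
W_merged = W + B @ A
y_merged = W_merged @ x

assert np.allclose(y_train, y_merged)
# Trainable params: 2*d*r = 16,384 vs d*d = 1,048,576 for the full weight
```

The merge is exact up to floating-point reassociation, which is why the table lists no inference overhead for LoRA.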

Adapter Layers (Houlsby et al., 2019)

Adapters insert small bottleneck modules within each transformer layer. Each adapter is:

h → LayerNorm → W_down ∈ ℝ^{d×r} → ReLU → W_up ∈ ℝ^{r×d} → + h

Where r ≪ d (bottleneck dimension, typically 64 for d=1024). Only adapter weights are trained; all other parameters are frozen.
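The adapter forward pass above can be sketched in a few lines of NumPy (the dimensions and the near-identity initialization, with W_up starting at zero so the adapter begins as a no-op, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 1024, 64                           # hidden size, bottleneck dimension (r << d)

W_down = rng.standard_normal((d, r)) * 0.01  # trainable down-projection
W_up = np.zeros((r, d))                   # zero init: adapter starts as identity

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def adapter(h):
    z = layer_norm(h)                     # normalize, per the formula above
    z = np.maximum(z @ W_down, 0.0)       # down-project to r dims + ReLU
    return h + z @ W_up                   # up-project back to d dims + residual

h = rng.standard_normal(d)
out = adapter(h)
assert out.shape == (d,)
assert np.allclose(out, h)                # W_up = 0, so the residual dominates
```

Per adapter, the trainable parameter count is roughly 2·d·r (about 131K here), which is where the sub-percent overhead in the table comes from.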

| Adapter bottleneck r | GLUE avg | Additional params |
|---|---|---|
| Full fine-tuning | 80.0 | +0% |
| r = 256 | 79.9 | +3.6% |
| r = 64 | 79.6 | +0.9% |
| r = 8 | 79.2 | +0.1% |

Within 0.8 GLUE points of full fine-tuning using 0.1% additional parameters.

ULMFiT Discriminative Learning Rates

Howard & Ruder’s key insight for full fine-tuning: use different learning rates per layer, decreasing for earlier (more general) layers:

η_l = η_{L} / 2.6^{L−l}

Where L is the total number of layers and l is the current layer index. Layers near the output are updated most aggressively, while lower layers, which encode more general features, receive progressively smaller updates.
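The schedule is simple to compute directly. In this sketch the layer count and top-layer learning rate are illustrative choices, not values from the paper:

```python
L = 12                 # total layers (e.g., a 12-layer transformer; illustrative)
eta_top = 1e-4         # learning rate for the top layer (illustrative)

# eta_l = eta_L / 2.6 ** (L - l): each layer toward the input gets LR / 2.6
lrs = {l: eta_top / 2.6 ** (L - l) for l in range(1, L + 1)}

assert lrs[L] == eta_top                         # top layer uses the full LR
assert abs(lrs[L - 1] * 2.6 - eta_top) < 1e-12   # one layer down: divided by 2.6
assert lrs[1] < lrs[L]                           # lowest layer updated most gently
```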

Prompt Tuning (Lester et al., 2021)

Prompt tuning prepends a small number of learned “soft prompt” tokens to the input. Only these prefix embeddings are trained; the entire model is frozen:

| Scale | Prompt tuning accuracy | Model fine-tuning accuracy |
|---|---|---|
| 770M params | 89.5% | 91.3% |
| 3B params | 91.4% | 92.5% |
| 11B params | 92.7% | 92.5% |

At 11B scale, prompt tuning reaches parity with model fine-tuning (92.7% vs 92.5%), but it falls clearly short at smaller scales. This strong scale dependence limits practical applicability below roughly 10B parameters.
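Mechanically, prompt tuning just concatenates a small trainable matrix in front of the frozen input embeddings. A minimal NumPy sketch, with all sizes hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d, k = 1000, 512, 20        # vocab size, embed dim, prompt length (illustrative)

embed = rng.standard_normal((vocab, d))          # frozen token-embedding table
soft_prompt = rng.standard_normal((k, d)) * 0.5  # the ONLY trainable parameters

token_ids = np.array([3, 17, 42])  # a hypothetical tokenized input
x = embed[token_ids]               # (seq_len, d) frozen input embeddings

# The frozen model sees the soft prompt prepended to the real input
model_input = np.concatenate([soft_prompt, x], axis=0)

assert model_input.shape == (k + len(token_ids), d)
trainable_frac = k / vocab         # here 2% of the embedding table's params alone
```

Only `soft_prompt` receives gradients; everything else, including the embedding table, stays frozen, which is why the parameter overhead in the spectrum table is <0.01%.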

Practical Considerations

| Decision | Recommendation |
|---|---|
| Learning rate | 10×–100× lower than the pretraining LR |
| Warm-up | 6% of training steps |
| Batch size | Larger than pretraining to stabilize gradients |
| Epochs | 1–3 epochs; more risks overfitting |
| Regularization | Weight decay 0.1, dropout 0.1 |
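The warm-up recommendation corresponds to a schedule like the following sketch; the peak LR and the linear decay after warm-up are illustrative choices, not prescribed by the table:

```python
def lr_at(step, total_steps, peak_lr=2e-5, warmup_frac=0.06):
    """Linear warm-up over the first 6% of steps, then linear decay to zero.
    peak_lr and the decay shape are illustrative assumptions."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

total = 1000
assert lr_at(0, total) == 0.0                    # start from zero
assert abs(lr_at(60, total) - 2e-5) < 1e-12      # peak at the end of 6% warm-up
assert lr_at(total, total) == 0.0                # decayed to zero at the end
```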

See lora-fine-tuning for the parameter-efficient LoRA method, instruction-tuning for multi-task instruction fine-tuning, and pre-training for the pretraining stage that fine-tuning builds upon.


Frequently Asked Questions

What is catastrophic forgetting in fine-tuning and how is it mitigated?

Catastrophic forgetting occurs when fine-tuning on a new task overwrites the pretrained weights, degrading general capabilities. Mitigations include: (1) using a much lower learning rate during fine-tuning than pretraining (Howard & Ruder use discriminative learning rates, dividing the LR by 2.6 for each layer closer to the input); (2) gradual unfreezing — fine-tuning only the top layers first, then progressively unfreezing deeper layers; (3) PEFT methods (LoRA, adapters) that freeze pretrained weights entirely; (4) multi-task training that keeps the model exposed to diverse tasks.
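Gradual unfreezing, mitigation (2), amounts to a simple schedule over which layers are trainable. The helper below is a hypothetical sketch, not ULMFiT's actual implementation:

```python
def gradual_unfreeze_schedule(num_layers, epoch):
    """Layers trainable at a given epoch (1-indexed; layer num_layers is the top).
    Epoch 1 trains only the top layer; each later epoch unfreezes one layer below."""
    first_trainable = max(1, num_layers - epoch + 1)
    return list(range(first_trainable, num_layers + 1))

assert gradual_unfreeze_schedule(12, 1) == [12]            # top layer only
assert gradual_unfreeze_schedule(12, 3) == [10, 11, 12]    # two more unfrozen
assert gradual_unfreeze_schedule(12, 12) == list(range(1, 13))  # fully unfrozen
```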

When should you use full fine-tuning vs LoRA vs adapters vs prompt tuning?

Full fine-tuning: when you have sufficient compute, a large labeled dataset, and need maximum task performance. LoRA: when you need to maintain multiple task-specific variants of a model, or when memory is limited — same inference cost after weight merging. Adapters (Houlsby): when you need strict parameter isolation between tasks (adapters are modular and can be swapped). Prompt tuning: only at very large scale (≥10B parameters); simpler but underperforms at smaller scales. In practice, LoRA has largely superseded adapters due to lower overhead and no inference latency.
