LoRA: Low-Rank Adaptation for Parameter-Efficient Fine-Tuning
LoRA (Hu et al., 2021): rank-4 decomposition ΔW=BA reduces trainable parameters to 0.01% of full model while matching full fine-tuning BLEU on E2E NLG; no inference latency added after weight merging.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| LoRA trainable parameters (rank-4) | ~0.01% | of full model | Hu et al.: 4.7M trainable vs 175B total for GPT-3 scale model at rank 4 |
| Rank used in Hu et al. experiments | 4–8 | rank r | Ranks 4 and 8 match or exceed full fine-tuning; very small r suffices for most tasks |
| E2E NLG BLEU — LoRA vs full fine-tuning | 68.6 vs 68.2 | BLEU | LoRA (rank 4) slightly outperforms full fine-tuning on E2E NLG benchmark (Hu et al. Table 4) |
| Memory reduction (LoRA vs full fine-tune) | 3× | GPU memory | No optimizer states for frozen weights; full fine-tuning stores Adam states for all params |
| QLoRA quantization | 4-bit NormalFloat | quantization | Dettmers et al.: 4-bit quantized base model + LoRA adapters; 65B model fits on single GPU |
LoRA (Low-Rank Adaptation) addresses the computational challenge of fine-tuning large pretrained models: full fine-tuning requires optimizer states, gradients, and weight copies for every parameter — scaling prohibitively with model size. LoRA reparameterizes weight updates as products of small matrices, reducing trainable parameters by orders of magnitude while retaining task performance.
The Core Reparameterization
For a pretrained weight matrix W₀ ∈ ℝ^{d×k}, full fine-tuning learns a dense update ΔW ∈ ℝ^{d×k}. LoRA instead constrains:
ΔW = B · A, where B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, rank r ≪ min(d, k)
The forward pass becomes:
h = W₀x + ΔWx = W₀x + BAx
- W₀ is frozen (no gradient computed)
- Only A and B are trained
- A is initialized from N(0, σ²); B is initialized to zero (so ΔW = 0 at start)
- Scaling factor α/r is applied (α is a hyperparameter, typically equal to r)
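The reparameterization and initialization rules above can be sketched in a few lines of NumPy (sizes are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 32, 4, 4      # illustrative sizes; alpha = r, as is common

W0 = rng.normal(size=(d, k))             # frozen pretrained weight (no gradients)
A = rng.normal(scale=0.02, size=(r, k))  # trained; Gaussian init
B = np.zeros((d, r))                     # trained; zero init, so ΔW = BA = 0 at start

def lora_forward(x):
    """h = W0 x + (alpha / r) * B A x, with W0 frozen."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(k,))
# Before any training, the adapted model is exactly the base model:
assert np.allclose(lora_forward(x), W0 @ x)
```

Note that computing `B @ (A @ x)` instead of `(B @ A) @ x` avoids ever materializing the d×k update during training.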
Parameter Efficiency
For a weight matrix of size d=4096, k=4096 (typical attention projection in a large model):
| Method | Trainable params (per matrix) | vs Full Fine-Tune |
|---|---|---|
| Full fine-tuning | 4096 × 4096 = 16.7M | 1× |
| LoRA rank 64 | (4096+4096) × 64 = 524K | 3.1% |
| LoRA rank 8 | (4096+4096) × 8 = 65.5K | 0.39% |
| LoRA rank 4 | (4096+4096) × 4 = 32.8K | 0.20% |
| LoRA rank 1 | (4096+4096) × 1 = 8.2K | 0.05% |
For a 175B parameter model, LoRA at rank 4 applied to attention Q/V matrices reduces trainable parameters from 175B to ~4.7M — a reduction of ~37,000×.
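The per-matrix counts in the table reduce to the one-line formula r·(d + k); a quick check for the 4096×4096 projection above:

```python
def lora_params(d, k, r):
    # B is d×r and A is r×k, so the adapter adds r*(d + k) trainable parameters
    return r * (d + k)

d = k = 4096
full = d * k                      # 16,777,216 ≈ 16.7M for full fine-tuning
for r in (64, 8, 4, 1):
    p = lora_params(d, k, r)
    print(f"rank {r:>2}: {p:>7} params = {100 * p / full:.2f}% of full")
```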
Benchmark Results (Hu et al., 2021)
| Method | E2E BLEU | WikiSQL Acc | SAMSum R-1 | Trainable params |
|---|---|---|---|---|
| Full fine-tune | 68.2 | 74.0% | 50.3 | 175B |
| Adapter (Houlsby) | 66.3 | 73.2% | 49.8 | +0.3% |
| Prefix tuning | 67.0 | 73.9% | 49.8 | +0.1% |
| LoRA (rank 4) | 68.6 | 73.8% | 50.8 | 0.01% |
LoRA matches or slightly exceeds full fine-tuning on all three benchmarks while using a fraction of the trainable parameters.
Rank Sensitivity Analysis
| Rank r | E2E BLEU | WikiSQL Acc | Behavior |
|---|---|---|---|
| 1 | 68.0 | 73.5% | Near-optimal; lowest cost |
| 2 | 68.4 | 73.7% | Marginal improvement |
| 4 | 68.6 | 73.8% | Sweet spot |
| 8 | 68.5 | 73.9% | Plateau |
| 64 | 68.5 | 74.0% | No benefit over r=4 |
The empirical result that r=4 nearly saturates performance supports the low intrinsic dimensionality hypothesis of Aghajanyan et al. (2021).
QLoRA: Quantization + LoRA
Dettmers et al. (2023) combined LoRA with 4-bit quantization (NF4 — Normal Float 4, optimized for normally distributed weights):
| Method | GPU memory (65B model) | Performance vs 16-bit |
|---|---|---|
| 16-bit full fine-tune | ~780 GB (not feasible on ≤8 GPUs) | 100% |
| 16-bit LoRA | ~200 GB | ~99% |
| QLoRA (4-bit NF4 + LoRA) | ~48 GB (1× A100 80GB) | ~99% |
QLoRA makes it possible to fine-tune a 65B-parameter model on a single 48 GB GPU, enabling instruction tuning and alignment fine-tuning at dramatically lower hardware cost.
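A toy version of NF4 blockwise quantization can be sketched in NumPy. This is a sketch of the idea only — real implementations pack two 4-bit indices per byte and double-quantize the scales; the code values are the published bitsandbytes table, rounded here to four decimals:

```python
import numpy as np

# 16 NF4 code values: approximately the quantiles of a standard normal,
# rescaled to [-1, 1] (bitsandbytes table, rounded to four decimals).
NF4 = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def nf4_quantize(w, block=64):
    """Blockwise absmax quantization: one 4-bit index per weight, one scale per block."""
    blocks = w.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True)   # per-block absmax scale
    idx = np.argmin(np.abs(blocks[..., None] / scale[..., None] - NF4), axis=-1)
    return idx.astype(np.uint8), scale

def nf4_dequantize(idx, scale, shape):
    return (NF4[idx] * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 64)).astype(np.float32)       # toy "pretrained" weight
idx, scale = nf4_quantize(w.ravel())
w_hat = nf4_dequantize(idx, scale, w.shape)
# 16 levels matched to a normal distribution keep the round-trip error small:
assert np.mean(np.abs(w_hat - w)) < 0.1
```

In QLoRA the base weights live in this 4-bit form and are dequantized on the fly for the forward pass, while the LoRA matrices A and B remain in higher precision and receive all the gradients.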
Related Pages
See fine-tuning for the general fine-tuning framework, instruction-tuning for the instruction-following paradigm LoRA is commonly applied to, and knowledge-distillation for an alternative approach to creating smaller, more efficient models.
Sources
- Hu et al. (2021) — LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022
- Dettmers et al. (2023) — QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023
- Aghajanyan et al. (2021) — Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. ACL 2021
Frequently Asked Questions
Why does low-rank adaptation work — doesn't the model need to change all its weights?
Aghajanyan et al. (2021) showed that fine-tuning has a low intrinsic dimensionality: models trained for downstream tasks converge to solutions that can be expressed as perturbations in a very low-dimensional subspace of weight space. LoRA exploits this by restricting weight updates to rank-r matrices. Even with r=1 or r=4, the model can capture the task-specific signal because the pretrained weights already encode most of the required knowledge; only a small directional update is needed.
How is LoRA merged for inference — does it add compute?
At inference time, LoRA weights are merged into the frozen weights: W' = W₀ + (α/r)·BA. This is a single O(d·k) matrix addition performed once before serving. After merging, the model has the same architecture and computational cost as the original — no adapter layers, no extra forward-pass branches, no latency penalty. This is a key advantage over PEFT methods that leave adapter modules in the computation graph.
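A quick NumPy check that merging is exact (toy shapes; B is random here to stand in for a trained adapter):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r, alpha = 64, 32, 4, 4

W0 = rng.normal(size=(d, k))             # frozen base weight
A = rng.normal(scale=0.02, size=(r, k))
B = rng.normal(scale=0.02, size=(d, r))  # nonzero, as after training

W_merged = W0 + (alpha / r) * (B @ A)    # one-time merge before serving

x = rng.normal(size=(k,))
unmerged = W0 @ x + (alpha / r) * (B @ (A @ x))
# Identical outputs, but serving needs only one matmul against W_merged:
assert np.allclose(W_merged @ x, unmerged)
```

The merge is also reversible (subtract (α/r)·BA to recover W₀), which is what allows swapping different task adapters in and out of one base model.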
Which weight matrices should LoRA be applied to?
Hu et al. (2021) compared applying LoRA to different subsets of the attention weight matrices under a fixed trainable-parameter budget. Adapting all four attention matrices (Q, K, V, output projection) at a low rank, or only the query and value matrices at a correspondingly higher rank, achieves the best results; spending the entire budget on a single matrix type at high rank performs worse. The feedforward layers can also be adapted but empirically contribute less per parameter. Most practitioners apply LoRA to at least the query and value projections.
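The fixed-budget trade-off can be made concrete with a one-line count: halving the rank while doubling the number of adapted matrices leaves the parameter total unchanged. The `lora_params` helper below is illustrative; d = 12288 is the GPT-3 175B hidden size.

```python
def lora_params(d, r, n_matrices):
    # each adapted d×d matrix adds r*(d + d) = 2*d*r trainable parameters
    return n_matrices * 2 * d * r

d = 12288  # GPT-3 175B hidden dimension
# Same budget: all four attention matrices at r=2 vs. only Q and V at r=4
assert lora_params(d, r=2, n_matrices=4) == lora_params(d, r=4, n_matrices=2)
```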