LoRA: Low-Rank Adaptation for Parameter-Efficient Fine-Tuning
LoRA (Hu et al., 2021): rank-4 decomposition ΔW=BA reduces trainable parameters to 0.01% of full model while matching full fine-tuning BLEU on E2E NLG; no inference latency added after weight merging.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| LoRA trainable parameters (rank-4) | ~0.01% | of full model | Hu et al.: 4.7M trainable vs 175B total for GPT-3 scale model at rank 4 |
| Rank used in Hu et al. experiments | 4–8 | rank r | Ranks 4 and 8 match or exceed full fine-tuning; very small r suffices for most tasks |
| E2E NLG BLEU — LoRA vs full fine-tuning | 68.6 vs 68.2 | BLEU | LoRA (rank 4) slightly outperforms full fine-tuning on E2E NLG benchmark (Hu et al. Table 4) |
| Memory reduction (LoRA vs full fine-tune) | 3× | GPU memory | No optimizer states for frozen weights; full fine-tuning stores Adam states for all params |
| QLoRA quantization | 4-bit NormalFloat | quantization | Dettmers et al.: 4-bit quantized base model + LoRA adapters; 65B model fits on single GPU |
LoRA (Low-Rank Adaptation) addresses the computational challenge of fine-tuning large pretrained models: full fine-tuning requires optimizer states, gradients, and weight copies for every parameter — scaling prohibitively with model size. LoRA reparameterizes weight updates as products of small matrices, reducing trainable parameters by orders of magnitude while retaining task performance.
The Core Reparameterization
For a pretrained weight matrix W₀ ∈ ℝ^{d×k}, full fine-tuning learns a dense update ΔW ∈ ℝ^{d×k}. LoRA instead constrains:
ΔW = B · A, where B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, rank r ≪ min(d, k)
The forward pass becomes:
h = W₀x + ΔWx = W₀x + BAx
- W₀ is frozen (no gradient computed)
- Only A and B are trained
- A is initialized from N(0, σ²); B is initialized to zero (so ΔW = 0 at start)
- Scaling factor α/r is applied (α is a hyperparameter, typically equal to r)
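The reparameterization and initialization rules above can be sketched in a few lines of NumPy (sizes are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 32, 4, 4      # illustrative sizes; alpha = r, as is common

W0 = rng.normal(size=(d, k))             # frozen pretrained weight (no gradients)
A = rng.normal(scale=0.02, size=(r, k))  # trained; Gaussian init
B = np.zeros((d, r))                     # trained; zero init, so ΔW = BA = 0 at start

def lora_forward(x):
    """h = W0 x + (alpha / r) * B A x, with W0 frozen."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(k,))
# Before any training, the adapted model is exactly the base model:
assert np.allclose(lora_forward(x), W0 @ x)
```

Note that computing `B @ (A @ x)` instead of `(B @ A) @ x` avoids ever materializing the d×k update during training.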
Parameter Efficiency
For a weight matrix of size d=4096, k=4096 (typical attention projection in a large model):
| Method | Trainable params (per matrix) | vs Full Fine-Tune |
|---|---|---|
| Full fine-tuning | 4096 × 4096 = 16.7M | 1× |
| LoRA rank 64 | (4096+4096) × 64 = 524K | 3.1% |
| LoRA rank 8 | (4096+4096) × 8 = 65.5K | 0.39% |
| LoRA rank 4 | (4096+4096) × 4 = 32.8K | 0.20% |
| LoRA rank 1 | (4096+4096) × 1 = 8.2K | 0.05% |
For a 175B parameter model, LoRA at rank 4 applied to attention Q/V matrices reduces trainable parameters from 175B to ~4.7M — a reduction of ~37,000×.
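The per-matrix counts in the table reduce to the one-line formula r·(d + k); a quick check for the 4096×4096 projection above:

```python
def lora_params(d, k, r):
    # B is d×r and A is r×k, so the adapter adds r*(d + k) trainable parameters
    return r * (d + k)

d = k = 4096
full = d * k                      # 16,777,216 ≈ 16.7M for full fine-tuning
for r in (64, 8, 4, 1):
    p = lora_params(d, k, r)
    print(f"rank {r:>2}: {p:>7} params = {100 * p / full:.2f}% of full")
```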
Benchmark Results (Hu et al., 2021)
| Method | E2E BLEU | WikiSQL Acc | SAMSum R-1 | Trainable params |
|---|---|---|---|---|
| Full fine-tune | 68.2 | 74.0% | 50.3 | 175B |
| Adapter (Houlsby) | 66.3 | 73.2% | 49.8 | +0.3% |
| Prefix tuning | 67.0 | 73.9% | 49.8 | +0.1% |
| LoRA (rank 4) | 68.6 | 73.8% | 50.8 | 0.01% |
LoRA matches or slightly exceeds full fine-tuning on all three benchmarks while using a fraction of the trainable parameters.
Rank Sensitivity Analysis
| Rank r | E2E BLEU | WikiSQL Acc | Behavior |
|---|---|---|---|
| 1 | 68.0 | 73.5% | Near-optimal; lowest cost |
| 2 | 68.4 | 73.7% | Marginal improvement |
| 4 | 68.6 | 73.8% | Sweet spot |
| 8 | 68.5 | 73.9% | Plateau |
| 64 | 68.5 | 74.0% | No benefit over r=4 |
The empirical result that r=4 nearly saturates performance supports the low intrinsic dimensionality hypothesis of Aghajanyan et al. (2021).
QLoRA: Quantization + LoRA
Dettmers et al. (2023) combined LoRA with 4-bit quantization (NF4 — Normal Float 4, optimized for normally distributed weights):
| Method | GPU memory (65B model) | Performance vs 16-bit |
|---|---|---|
| 16-bit full fine-tune | ~780 GB (not feasible on ≤8 GPUs) | 100% |
| 16-bit LoRA | ~200 GB | ~99% |
| QLoRA (4-bit NF4 + LoRA) | ~48 GB (1× A100 80GB) | ~99% |
QLoRA makes it possible to fine-tune a 65B-parameter model on a single 48 GB GPU, enabling instruction tuning and alignment fine-tuning at dramatically lower hardware cost.
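A toy version of NF4 blockwise quantization can be sketched in NumPy. This is a sketch of the idea only — real implementations pack two 4-bit indices per byte and double-quantize the scales; the code values are the published bitsandbytes table, rounded here to four decimals:

```python
import numpy as np

# 16 NF4 code values: approximately the quantiles of a standard normal,
# rescaled to [-1, 1] (bitsandbytes table, rounded to four decimals).
NF4 = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def nf4_quantize(w, block=64):
    """Blockwise absmax quantization: one 4-bit index per weight, one scale per block."""
    blocks = w.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True)   # per-block absmax scale
    idx = np.argmin(np.abs(blocks[..., None] / scale[..., None] - NF4), axis=-1)
    return idx.astype(np.uint8), scale

def nf4_dequantize(idx, scale, shape):
    return (NF4[idx] * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 64)).astype(np.float32)       # toy "pretrained" weight
idx, scale = nf4_quantize(w.ravel())
w_hat = nf4_dequantize(idx, scale, w.shape)
# 16 levels matched to a normal distribution keep the round-trip error small:
assert np.mean(np.abs(w_hat - w)) < 0.1
```

In QLoRA the base weights live in this 4-bit form and are dequantized on the fly for the forward pass, while the LoRA matrices A and B remain in higher precision and receive all the gradients.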
Related Pages
See fine-tuning for the general fine-tuning framework, instruction-tuning for the instruction-following paradigm LoRA is commonly applied to, and knowledge-distillation for an alternative approach to creating smaller, more efficient models.
Sources
- Hu et al. (2021) — LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022
- Dettmers et al. (2023) — QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023
- Aghajanyan et al. (2021) — Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. ACL 2021
Frequently Asked Questions
Why does low-rank adaptation work — doesn't the model need to change all its weights?
Aghajanyan et al. (2021) showed that fine-tuning has a low intrinsic dimensionality: models trained for downstream tasks converge to solutions that can be expressed as perturbations in a very low-dimensional subspace of weight space. LoRA exploits this by restricting weight updates to rank-r matrices. Even with r=1 or r=4, the model can capture the task-specific signal because the pretrained weights already encode most of the required knowledge; only a small directional update is needed.
How is LoRA merged for inference — does it add compute?
At inference time, LoRA weights are merged into the frozen weights: W' = W₀ + (α/r)·BA. This is a single O(d·k) matrix addition performed once before serving. After merging, the model has the same architecture and computational cost as the original — no adapter layers, no extra forward-pass branches, no latency penalty. This is a key advantage over PEFT methods that leave adapter modules in the computation graph.
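A quick NumPy check that merging is exact (toy shapes; B is random here to stand in for a trained adapter):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r, alpha = 64, 32, 4, 4

W0 = rng.normal(size=(d, k))             # frozen base weight
A = rng.normal(scale=0.02, size=(r, k))
B = rng.normal(scale=0.02, size=(d, r))  # nonzero, as after training

W_merged = W0 + (alpha / r) * (B @ A)    # one-time merge before serving

x = rng.normal(size=(k,))
unmerged = W0 @ x + (alpha / r) * (B @ (A @ x))
# Identical outputs, but serving needs only one matmul against W_merged:
assert np.allclose(W_merged @ x, unmerged)
```

The merge is also reversible (subtract (α/r)·BA to recover W₀), which is what allows swapping different task adapters in and out of one base model.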
Which weight matrices should LoRA be applied to?
Hu et al. (2021) compared applying LoRA to different subsets of the attention weight matrices under a fixed trainable-parameter budget. Adapting all four attention matrices (Q, K, V, output projection) at a low rank, or only the query and value matrices at a correspondingly higher rank, achieves the best results; spending the entire budget on a single matrix type at high rank performs worse. The feedforward layers can also be adapted but empirically contribute less per parameter. Most practitioners apply LoRA to at least the query and value projections.
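The fixed-budget trade-off can be made concrete with a one-line count: halving the rank while doubling the number of adapted matrices leaves the parameter total unchanged. The `lora_params` helper below is illustrative; d = 12288 is the GPT-3 175B hidden size.

```python
def lora_params(d, r, n_matrices):
    # each adapted d×d matrix adds r*(d + d) = 2*d*r trainable parameters
    return n_matrices * 2 * d * r

d = 12288  # GPT-3 175B hidden dimension
# Same budget: all four attention matrices at r=2 vs. only Q and V at r=4
assert lora_params(d, r=2, n_matrices=4) == lora_params(d, r=4, n_matrices=2)
```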