Paper breakdown

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu et al. · 2021 · ICLR 2022

Freezes the pretrained weight matrix W and learns a low-rank update BA at fine-tune time, where rank(BA) is typically 4-32 vs. thousands. Cuts trainable parameters by three to four orders of magnitude with no inference-time penalty.

Overview

Hu et al. (2021) observed that the weight updates produced by full fine-tuning of a pretrained transformer are approximately low rank. They proposed: instead of learning $W \to W + \Delta W$ for the full $\Delta W \in \mathbb{R}^{d \times k}$, fix $W$ and learn $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. For typical attention projections in a 7B-parameter model, $r$ between 4 and 16 captures most of the fine-tune signal at $\sim 10^{-4}$ the parameter count.

The architectural implication is small. The training implication is large: optimizer state for $A$ and $B$ fits on a consumer GPU even when $W$ does not. The deployment implication is also small in the right direction: at inference time the merged weight $W + BA$ is a single matrix, so there is no extra forward-pass cost; the variant is functionally equivalent to a full fine-tune with the rank constraint.

LoRA became, alongside QLoRA (Dettmers et al., 2023), the standard recipe for adapting open-weight foundation models. By 2024 most public fine-tuning recipes are LoRA-flavored; HuggingFace's PEFT library is built around it. The paper is also a clean example of an empirical observation (low-rank fine-tune updates) translating directly into a method.

Mathematical Contributions

The decomposition

For a pretrained linear layer with weights $W_0 \in \mathbb{R}^{d \times k}$, full fine-tuning replaces $W_0$ with $W_0 + \Delta W$ where every entry of $\Delta W$ is trainable, requiring $dk$ parameters. LoRA factors $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, requiring $r(d + k)$ parameters. The forward pass is:

$$h = (W_0 + \Delta W)\,x = W_0 x + B A x$$

with $W_0$ frozen and only $A, B$ trained. For attention layers in a 7B-parameter LLaMA, taking $r = 8$ gives roughly $0.1\%$ of the original parameters as trainable.
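The arithmetic behind those counts, sketched for a single projection matrix (the $4096 \times 4096$ dimensions are an illustrative assumption, roughly LLaMA-7B-scale attention):

```python
# Trainable-parameter count for one d x k projection: full fine-tune vs. LoRA.
d = k = 4096          # illustrative: a LLaMA-7B-scale attention projection
r = 8

full = d * k          # 16,777,216 trainable entries in delta W
lora = r * (d + k)    # 65,536 trainable entries across B and A
print(f"{lora / full:.2%}")  # ~0.39% of this single matrix
```

The headline $\sim 0.1\%$ figure is smaller still because only $W_q$ and $W_v$ are adapted while every other weight in the model stays frozen.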

Initialization

The paper initializes $A$ from a small random Gaussian and $B = 0$. Then $\Delta W = BA = 0$ at the start, so the network is exactly equivalent to the pretrained model at step 0. This is the LoRA equivalent of "start from a clean fine-tune" and is critical: a nonzero initialization of both factors would perturb the model before training begins.
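A minimal check of the step-0 equivalence under the stated init conventions (the $0.01$ Gaussian scale below is an arbitrary stand-in, not the paper's exact value):

```python
import torch

d, k, r = 64, 64, 8
W0 = torch.randn(d, k)            # frozen pretrained weight
A = torch.randn(r, k) * 0.01      # small random Gaussian
B = torch.zeros(d, r)             # zeros, so BA = 0 at step 0

x = torch.randn(k)
# B @ (A @ x) is an all-zero vector, so the output is exactly W0 @ x.
assert torch.equal(W0 @ x + B @ (A @ x), W0 @ x)
```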

Scaling factor

The full update is written:

$$h = W_0 x + \frac{\alpha}{r}\, B A x$$

where $\alpha$ is a fixed scalar (typically chosen so that $\alpha / r$ is constant as $r$ varies). The factor $\alpha/r$ decouples the effective learning rate from the rank, so changing $r$ does not require re-tuning the optimizer. In practice $\alpha = 16, r = 8$ is a common default; $\alpha = 2r$ is also common.
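Putting the decomposition, initialization, and scaling together, a minimal PyTorch sketch of a LoRA-wrapped linear layer (not the reference implementation; dropout and weight merging are omitted):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus trainable low-rank update (alpha/r) * B A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze W0 (and bias)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # small Gaussian init
        self.B = nn.Parameter(torch.zeros(d, r))         # zero init: no-op at step 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha/r) B A x
        return self.base(x) + self.scaling * ((x @ self.A.T) @ self.B.T)
```

Swapping this in for, say, an attention projection (`layer.q_proj = LoRALinear(layer.q_proj)`, module path illustrative) is the entire architectural change.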

Where to apply it

The paper applies LoRA to the query and value projections $W_q, W_v$ in attention layers and reports that adapting only those two is sufficient to match full fine-tune performance on the GLUE and generation benchmarks it evaluates. Adding LoRA to $W_k, W_o$ gives marginal improvement, and the paper leaves the MLP layers frozen entirely. The core experimental claim is that the attention projections are the right adaptation surface; the MLP is closer to a feature library that fine-tuning rarely needs to overwrite.

This claim does not survive cleanly into the 2023+ era of large-scale instruction-tuning and tool-use fine-tuning, where LoRA on all linear layers (including MLP) is the more common configuration. The original ablation was at small scale and on academic benchmarks.
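In Hugging Face's PEFT library the difference between the two settings is a single field. A sketch (the `q_proj`/`v_proj` module names match LLaMA-family models and vary by architecture; the checkpoint name is illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # the paper's W_q, W_v setting
    # target_modules="all-linear",        # the more common modern setting (recent PEFT versions)
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```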

The intrinsic-rank hypothesis

The paper measures the singular values of $\Delta W$ from a full fine-tune and finds that they decay rapidly: the top 8 singular values capture $> 90\%$ of the Frobenius norm. This motivates the rank constraint as a structural prior, not just a parameter budget. It credits Aghajanyan, Zettlemoyer, and Gupta (2021) for the prior observation that fine-tuning operates in a low intrinsic dimension of the parameter space.
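The measurement is easy to reproduce given a full fine-tune: take $\Delta W = W_{\text{ft}} - W_0$ for a layer and look at how fast the squared singular values accumulate. A sketch (the random tensor is a stand-in and will not show the low-rank decay that real fine-tune updates do):

```python
import torch

delta_w = torch.randn(1024, 1024)  # stand-in for W_ft - W_0 from a real checkpoint
s = torch.linalg.svdvals(delta_w)  # singular values, descending

# Fraction of the squared Frobenius norm captured by the top-k singular values.
energy = (s**2).cumsum(0) / (s**2).sum()
print(energy[7].item())  # the paper-style claim: > 0.9 at k = 8 for real updates
```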

The hypothesis is a useful prior but not a theorem. There are tasks (notably some math reasoning and out-of-distribution adaptation) where higher rank or full fine-tune is measurably better. LoRA is a strong default, not a universal substitute.

The merge at inference

After training, $W = W_0 + \frac{\alpha}{r} BA$ is a single matrix and replaces $W_0$ in the deployed model. The forward pass at inference has no extra layer, no extra memory load, no extra FLOPs: it is bit-for-bit identical to a model with $W$ as its only weight. This is the property that distinguishes LoRA from prefix tuning, prompt tuning, adapters, and IA³, which all add inference-time cost.
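A toy demonstration of the merge, with random factors standing in for trained ones:

```python
import torch

d, k, r, alpha = 256, 256, 8, 16.0
W0 = torch.randn(d, k)
A = torch.randn(r, k) * 0.01
B = torch.randn(d, r) * 0.01      # pretend-trained factors

W_merged = W0 + (alpha / r) * (B @ A)   # one-time merge before deployment

x = torch.randn(k)
adapter_out = W0 @ x + (alpha / r) * (B @ (A @ x))  # training-time forward
merged_out = W_merged @ x                           # deployed forward: one matmul
print(torch.allclose(adapter_out, merged_out, atol=1e-5))  # same output up to float error
```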

Memory savings during training

The dominant memory cost in fine-tuning a large model is optimizer state: Adam stores two fp32 moments (first and second) per trainable parameter, so 8 bytes per param. A 7B model fully fine-tuned needs ~56 GB of optimizer state alone. With LoRA at $r = 8$ on attention layers, the trainable count drops to $\sim 10^7$ and the optimizer state to ~80 MB. The forward-pass activations still go through the full $W_0$, but those are recomputable via activation checkpointing.
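The arithmetic behind those numbers (the $10^7$ LoRA count is the approximate figure from above, not an exact census):

```python
full_params = 7e9    # 7B base model, all trainable under full fine-tuning
lora_params = 1e7    # ~0.1% trainable under LoRA (r = 8 on W_q, W_v)
adam_bytes = 8       # two fp32 moments per trainable parameter

print(full_params * adam_bytes / 1e9)  # ~56 GB of optimizer state
print(lora_params * adam_bytes / 1e6)  # ~80 MB of optimizer state
```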

QLoRA (Dettmers et al., 2023) extends this by quantizing $W_0$ to 4-bit while keeping $A, B$ at fp16 or bf16. The combination fits a 65B-model fine-tune on a single 48 GB GPU.
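A sketch of the QLoRA stack with `transformers` + `bitsandbytes` + PEFT (checkpoint name and target-module list are illustrative; exact flags depend on library versions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # QLoRA's NormalFloat4 for the frozen W0
    bnb_4bit_compute_dtype=torch.bfloat16,  # A, B and activations stay in bf16
)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             quantization_config=bnb)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # model-dependent names
))
```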

Why It Matters Now

LoRA changed who can fine-tune. Before LoRA, adapting a 13B-parameter model required a multi-GPU node with ~150 GB of GPU memory; after LoRA + QLoRA, the same task fits on one consumer 24 GB card. This shifted research and product development away from the largest labs and made the open-weight ecosystem (LLaMA, Mistral, Qwen, DeepSeek, etc.) usable for downstream applications.

The structural prior, that fine-tune updates are low rank, has held up across a wide range of tasks but is not universal. For tasks that genuinely require new capabilities (long-horizon math, tool use, in-context retrieval over very long contexts), full fine-tune or higher-rank LoRA outperforms small-$r$ LoRA. The 2024 picture is that LoRA at $r \in [16, 64]$ on all linear layers (attention and MLP) is the practical default; the original $r \in [4, 8]$ on only $W_q, W_v$ is no longer the recommended setting for instruction tuning.

The paper also planted the question of whether the pretrained model itself operates near a low-dimensional manifold of useful directions, which the fine-tune just tilts along. Some interpretability work (Anthropic's superposition line, sparse-autoencoder feature decomposition) is a related angle on the same observation: the active dimension of the model is much smaller than the parameter count.

References

Canonical:

  • Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR. arXiv:2106.09685.

Direct precursors:

  • Aghajanyan, A., Zettlemoyer, L., & Gupta, S. (2021). "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning." ACL. arXiv:2012.13255. The empirical observation that fine-tuning operates in low intrinsic dimension.
  • Houlsby, N. et al. (2019). "Parameter-Efficient Transfer Learning for NLP." ICML. arXiv:1902.00751. Adapter modules — the prior PEFT method.
  • Li, X. L., & Liang, P. (2021). "Prefix-Tuning: Optimizing Continuous Prompts for Generation." ACL. arXiv:2101.00190. Trainable virtual prompt tokens.

Direct descendants:

  • Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." NeurIPS. arXiv:2305.14314. 4-bit quantized base + LoRA adapter; fits 65B fine-tune on one GPU.
  • Liu, S.-Y. et al. (2024). "DoRA: Weight-Decomposed Low-Rank Adaptation." ICML. arXiv:2402.09353. Decomposes $W$ into magnitude and direction; LoRAs only the direction.
  • Hayou, S., Ghosh, N., & Yu, B. (2024). "LoRA+: Efficient Low Rank Adaptation of Large Models." ICML. arXiv:2402.12354. Sets a different learning rate for $A$ and $B$.

Empirical evaluations:

  • Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., & Raffel, C. (2022). "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning." NeurIPS. arXiv:2205.05638. Compares PEFT methods including LoRA at scale.
  • Biderman, D. et al. (2024). "LoRA Learns Less and Forgets Less." arXiv:2405.09673. LoRA preserves base capabilities better than full fine-tune.

Standard textbook:

  • Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press. Chapter 9.5 — fine-tuning and adapters.

Last reviewed: May 5, 2026