Paper breakdown

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu et al. · 2021 · ICLR 2022

Freezes the pretrained weight matrix W and learns a low-rank update BA at fine-tune time, where rank(BA) is typically 4-32 vs. thousands. Cuts trainable parameters by three to four orders of magnitude with no inference-time penalty.

Overview

Hu et al. (2021) observed that the weight updates produced by full fine-tuning of a pretrained transformer are approximately low rank. They proposed: instead of learning $W \to W + \Delta W$ for the full $\Delta W \in \mathbb{R}^{d \times k}$, fix $W$ and learn $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. For typical attention projections in a 7B-parameter model, $r$ between 4 and 16 captures most of the fine-tune signal at $\sim 10^{-4}$ the parameter count.

The architectural implication is small. The training implication is large: optimizer state for $A$ and $B$ fits on a consumer GPU even when $W$ does not. The deployment implication is also small in the right direction: at inference time the merged weight $W + BA$ is a single matrix, so there is no extra forward-pass cost; the variant is functionally equivalent to a full fine-tune with the rank constraint.

LoRA became, alongside QLoRA (Dettmers et al., 2023), the standard recipe for adapting open-weight foundation models. By 2024 most public fine-tuning recipes are LoRA-flavored; HuggingFace's PEFT library is built around it. The paper is also a clean example of an empirical observation (low-rank fine-tune updates) translating directly into a method.

Mathematical Contributions

The decomposition

For a pretrained linear layer with weights $W_0 \in \mathbb{R}^{d \times k}$, full fine-tuning replaces $W_0$ with $W_0 + \Delta W$ where every entry of $\Delta W$ is trainable, requiring $dk$ parameters. LoRA factors $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, requiring $r(d + k)$ parameters. The forward pass is:

$$h = (W_0 + \Delta W)\,x = W_0 x + B A x$$

with $W_0$ frozen and only $A, B$ trained. For attention layers in a 7B-parameter LLaMA, taking $r = 8$ gives roughly $0.1\%$ of the original parameters as trainable.
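The arithmetic behind those counts, sketched for a single projection matrix (the $4096 \times 4096$ dimensions are an illustrative assumption, roughly LLaMA-7B-scale attention):

```python
# Trainable-parameter count for one d x k projection: full fine-tune vs. LoRA.
d = k = 4096          # illustrative: a LLaMA-7B-scale attention projection
r = 8

full = d * k          # 16,777,216 trainable entries in delta W
lora = r * (d + k)    # 65,536 trainable entries across B and A
print(f"{lora / full:.2%}")  # ~0.39% of this single matrix
```

The headline $\sim 0.1\%$ figure is smaller still because only $W_q$ and $W_v$ are adapted while every other weight in the model stays frozen.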

Initialization

The paper initializes $A$ from a small random Gaussian and $B = 0$. Then $\Delta W = BA = 0$ at the start, so the network is exactly equivalent to the pretrained model at step 0. This is the LoRA equivalent of "start from a clean fine-tune" and is critical: a nonzero initialization of both factors would perturb the model before training begins.
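A minimal check of the step-0 equivalence under the stated init conventions (the $0.01$ Gaussian scale below is an arbitrary stand-in, not the paper's exact value):

```python
import torch

d, k, r = 64, 64, 8
W0 = torch.randn(d, k)            # frozen pretrained weight
A = torch.randn(r, k) * 0.01      # small random Gaussian
B = torch.zeros(d, r)             # zeros, so BA = 0 at step 0

x = torch.randn(k)
# B @ (A @ x) is an all-zero vector, so the output is exactly W0 @ x.
assert torch.equal(W0 @ x + B @ (A @ x), W0 @ x)
```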

Scaling factor

The full update is written:

$$h = W_0 x + \frac{\alpha}{r}\, B A x$$

where $\alpha$ is a fixed scalar (typically chosen so that $\alpha / r$ is constant as $r$ varies). The factor $\alpha/r$ decouples the effective learning rate from the rank, so changing $r$ does not require re-tuning the optimizer. In practice $\alpha = 16, r = 8$ is a common default; $\alpha = 2r$ is also common.
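Putting the decomposition, initialization, and scaling together, a minimal PyTorch sketch of a LoRA-wrapped linear layer (not the reference implementation; dropout and weight merging are omitted):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus trainable low-rank update (alpha/r) * B A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze W0 (and bias)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # small Gaussian init
        self.B = nn.Parameter(torch.zeros(d, r))         # zero init: no-op at step 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha/r) B A x
        return self.base(x) + self.scaling * ((x @ self.A.T) @ self.B.T)
```

Swapping this in for, say, an attention projection (`layer.q_proj = LoRALinear(layer.q_proj)`, module path illustrative) is the entire architectural change.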

Where to apply it

The paper applies LoRA to the query and value projections $W_q, W_v$ in attention layers and reports that adapting only those two is sufficient to match full fine-tune performance on the GLUE and generation benchmarks it evaluates. Adding LoRA to $W_k, W_o$ gives marginal improvement, and the paper leaves the MLP layers frozen entirely. The core experimental claim is that the attention projections are the right adaptation surface; the MLP is closer to a feature library that fine-tuning rarely needs to overwrite.

This claim does not survive cleanly into the 2023+ era of large-scale instruction-tuning and tool-use fine-tuning, where LoRA on all linear layers (including MLP) is the more common configuration. The original ablation was at small scale and on academic benchmarks.
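In Hugging Face's PEFT library the difference between the two settings is a single field. A sketch (the `q_proj`/`v_proj` module names match LLaMA-family models and vary by architecture; the checkpoint name is illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # the paper's W_q, W_v setting
    # target_modules="all-linear",        # the more common modern setting (recent PEFT versions)
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```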

The intrinsic-rank hypothesis

The paper measures the singular values of $\Delta W$ from a full fine-tune and finds that they decay rapidly: the top 8 singular values capture $> 90\%$ of the Frobenius norm. This motivates the rank constraint as a structural prior, not just a parameter budget. It credits Aghajanyan, Zettlemoyer, and Gupta (2021) for the prior observation that fine-tuning operates in a low intrinsic dimension of the parameter space.
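The measurement is easy to reproduce given a full fine-tune: take $\Delta W = W_{\text{ft}} - W_0$ for a layer and look at how fast the squared singular values accumulate. A sketch (the random tensor is a stand-in and will not show the low-rank decay that real fine-tune updates do):

```python
import torch

delta_w = torch.randn(1024, 1024)  # stand-in for W_ft - W_0 from a real checkpoint
s = torch.linalg.svdvals(delta_w)  # singular values, descending

# Fraction of the squared Frobenius norm captured by the top-k singular values.
energy = (s**2).cumsum(0) / (s**2).sum()
print(energy[7].item())  # the paper-style claim: > 0.9 at k = 8 for real updates
```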

The hypothesis is a useful prior but not a theorem. There are tasks (notably some math reasoning and out-of-distribution adaptation) where higher rank or full fine-tune is measurably better. LoRA is a strong default, not a universal substitute.

The merge at inference

After training, $W = W_0 + \frac{\alpha}{r} BA$ is a single matrix and replaces $W_0$ in the deployed model. The forward pass at inference has no extra layer, no extra memory load, no extra FLOPs: it is bit-for-bit identical to a model with $W$ as its only weight. This is the property that distinguishes LoRA from prefix tuning, prompt tuning, adapters, and IA³, which all add inference-time cost.
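A toy demonstration of the merge, with random factors standing in for trained ones:

```python
import torch

d, k, r, alpha = 256, 256, 8, 16.0
W0 = torch.randn(d, k)
A = torch.randn(r, k) * 0.01
B = torch.randn(d, r) * 0.01      # pretend-trained factors

W_merged = W0 + (alpha / r) * (B @ A)   # one-time merge before deployment

x = torch.randn(k)
adapter_out = W0 @ x + (alpha / r) * (B @ (A @ x))  # training-time forward
merged_out = W_merged @ x                           # deployed forward: one matmul
print(torch.allclose(adapter_out, merged_out, atol=1e-5))  # same output up to float error
```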

Memory savings during training

The dominant memory cost in fine-tuning a large model is optimizer state: Adam stores two fp32 moments (first and second) per trainable parameter, so 8 bytes per param. A 7B model fully fine-tuned needs ~56 GB of optimizer state alone. With LoRA at $r = 8$ on attention layers, the trainable count drops to $\sim 10^7$ and the optimizer state to ~80 MB. The forward-pass activations still go through the full $W_0$, but those are recomputable via activation checkpointing.
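The arithmetic behind those numbers (the $10^7$ LoRA count is the approximate figure from above, not an exact census):

```python
full_params = 7e9    # 7B base model, all trainable under full fine-tuning
lora_params = 1e7    # ~0.1% trainable under LoRA (r = 8 on W_q, W_v)
adam_bytes = 8       # two fp32 moments per trainable parameter

print(full_params * adam_bytes / 1e9)  # ~56 GB of optimizer state
print(lora_params * adam_bytes / 1e6)  # ~80 MB of optimizer state
```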

QLoRA (Dettmers et al., 2023) extends this by quantizing $W_0$ to 4-bit while keeping $A, B$ at fp16 or bf16. The combination fits a 65B-model fine-tune on a single 48 GB GPU.
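A sketch of the QLoRA stack with `transformers` + `bitsandbytes` + PEFT (checkpoint name and target-module list are illustrative; exact flags depend on library versions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # QLoRA's NormalFloat4 for the frozen W0
    bnb_4bit_compute_dtype=torch.bfloat16,  # A, B and activations stay in bf16
)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             quantization_config=bnb)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # model-dependent names
))
```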

Why It Matters Now

LoRA changed who can fine-tune. Before LoRA, adapting a 13B-parameter model required a multi-GPU node with ~150 GB of GPU memory; after LoRA + QLoRA, the same task fits on one consumer 24 GB card. This shifted research and product development away from the largest labs and made the open-weight ecosystem (LLaMA, Mistral, Qwen, DeepSeek, etc.) usable for downstream applications.

The structural prior, that fine-tune updates are low rank, has held up across a wide range of tasks but is not universal. For tasks that genuinely require new capabilities (long-horizon math, tool use, in-context retrieval over very long contexts), full fine-tune or higher-rank LoRA outperforms small-$r$ LoRA. The 2024 picture is that LoRA at $r \in [16, 64]$ on all linear layers (attention and MLP) is the practical default; the original $r \in [4, 8]$ on only $W_q, W_v$ is no longer the recommended setting for instruction tuning.

The paper also planted the question of whether the pretrained model itself operates near a low-dimensional manifold of useful directions, which the fine-tune just tilts along. Some interpretability work (Anthropic's superposition line, sparse-autoencoder feature decomposition) is a related angle on the same observation: the active dimension of the model is much smaller than the parameter count.

References

Canonical:

  • Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR. arXiv:2106.09685.

Direct precursors:

  • Aghajanyan, A., Zettlemoyer, L., & Gupta, S. (2021). "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning." ACL. arXiv:2012.13255. The empirical observation that fine-tuning operates in low intrinsic dimension.
  • Houlsby, N. et al. (2019). "Parameter-Efficient Transfer Learning for NLP." ICML. arXiv:1902.00751. Adapter modules — the prior PEFT method.
  • Li, X. L., & Liang, P. (2021). "Prefix-Tuning: Optimizing Continuous Prompts for Generation." ACL. arXiv:2101.00190. Trainable virtual prompt tokens.

Direct descendants:

  • Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." NeurIPS. arXiv:2305.14314. 4-bit quantized base + LoRA adapter; fits 65B fine-tune on one GPU.
  • Liu, S.-Y. et al. (2024). "DoRA: Weight-Decomposed Low-Rank Adaptation." ICML. arXiv:2402.09353. Decomposes $W$ into magnitude and direction; LoRAs only the direction.
  • Hayou, S., Ghosh, N., & Yu, B. (2024). "LoRA+: Efficient Low Rank Adaptation of Large Models." ICML. arXiv:2402.12354. Sets a different learning rate for $A$ and $B$.

Empirical evaluations:

  • Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., & Raffel, C. (2022). "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning." NeurIPS. arXiv:2205.05638. Compares PEFT methods including LoRA at scale.
  • Biderman, D. et al. (2024). "LoRA Learns Less and Forgets Less." arXiv:2405.09673. LoRA preserves base capabilities better than full fine-tune.

Standard textbook:

  • Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press. Chapter 9.5 — fine-tuning and adapters.

Last reviewed: May 5, 2026