KV Cache Optimization
Advanced techniques for managing the KV cache memory bottleneck: paged attention for fragmentation-free allocation, prefix caching for shared prompts, token eviction for long sequences, and quantized KV cache for reduced footprint.
Why This Matters
The KV cache is the dominant memory consumer during LLM inference. For a 70B model serving 128K-token contexts, the KV cache requires approximately 40 GB per sequence. Naive memory management wastes 60-80% of allocated GPU memory through fragmentation and over-allocation. The techniques in this topic collectively reduce memory waste, increase batch sizes, and enable longer contexts. They are the difference between serving 4 concurrent requests and serving 20 on the same hardware.
Mental Model
Think of GPU memory for KV cache like a hard drive for a file system. Naive allocation is like giving every file its maximum possible size upfront. Paged attention is like a file system with block allocation: allocate small blocks on demand, use a page table to track them. Prefix caching is like hard links: multiple files sharing the same data blocks. Token eviction is like an LRU cache eviction policy. Quantization is like file compression.
Paged Attention (vLLM)
Standard KV cache allocation pre-allocates a contiguous memory region for each sequence at the maximum context length. Problems:
- Internal fragmentation: a sequence using 500 of 4096 allocated tokens wastes 88% of its allocation
- External fragmentation: after sequences of different lengths complete, the freed memory has gaps that cannot be used for new large allocations
- No sharing: two sequences with the same system prompt store identical KV cache entries separately
Paged Attention
Paged attention (Kwon et al., 2023) divides the KV cache into fixed-size blocks (pages), each storing key-value pairs for a fixed number of tokens (typically 16 or 32). A block table maps logical token positions to physical memory blocks for each sequence. Blocks are allocated on demand from a shared pool and need not be contiguous.
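The bookkeeping can be sketched in a few lines. This is a minimal illustration with hypothetical class names, not vLLM's actual implementation: a shared pool hands out fixed-size blocks on demand, and each sequence keeps a block table mapping logical block indices to physical block ids.

```python
BLOCK_SIZE = 16  # tokens per block

class BlockPool:
    """Shared pool of fixed-size physical blocks (hypothetical structure)."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

class Sequence:
    def __init__(self, pool):
        self.pool = pool
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new block only when the previous one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.alloc())
        self.num_tokens += 1

    def physical_slot(self, pos):
        # Where token `pos`'s K/V live: (physical block id, offset in block).
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

pool = BlockPool(num_blocks=1024)
seq = Sequence(pool)
for _ in range(500):
    seq.append_token()
print(len(seq.block_table))  # 32 blocks for a 500-token request (ceil(500/16))
```

The attention kernel gathers keys and values through `physical_slot`-style indirection, which is the only change relative to contiguous allocation.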
Paged Attention Achieves Near-Optimal Memory Utilization
Statement
With paged attention and block size $B$, the memory waste per sequence is at most $B - 1$ token slots (the internal fragmentation of the last block). For a sequence of length $L$, the memory utilization ratio is:

$\eta = \dfrac{L}{B \lceil L / B \rceil} \geq \dfrac{L}{L + B - 1}$

For $B \ll L$, utilization approaches 1. In contrast, contiguous pre-allocation at maximum length $L_{\max}$ has utilization $L / L_{\max}$, which can be arbitrarily small.
Kwon et al. (2023) report that vLLM achieves memory utilization above 96% compared to 20-40% for naive pre-allocation in typical serving workloads.
Intuition
The only wasted memory is the unused portion of the last block. With block size 16 and a 1000-token sequence, you allocate 63 blocks (1008 slots), wasting only 8 slots (0.8%). Naive pre-allocation at $L_{\max} = 4096$ would waste 3096 slots (75.6%).
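The arithmetic in this example is easy to check directly (helper names here are illustrative):

```python
import math

def paged_alloc(num_tokens, block_size):
    # Blocks are allocated on demand; waste is only the tail of the last block.
    allocated = math.ceil(num_tokens / block_size) * block_size
    return allocated - num_tokens, num_tokens / allocated  # (waste, utilization)

def naive_alloc(num_tokens, max_len):
    # Contiguous pre-allocation at maximum context length.
    return max_len - num_tokens, num_tokens / max_len

print(paged_alloc(1000, 16))    # waste 8 slots, utilization ~0.992
print(naive_alloc(1000, 4096))  # waste 3096 slots, utilization ~0.244
```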
Why It Matters
The 2-4x throughput improvement reported by vLLM comes almost entirely from this memory efficiency gain. Higher utilization means more sequences can fit in GPU memory simultaneously, which means larger batch sizes and better GPU utilization during the memory-bandwidth-bound decode phase.
Failure Mode
Paged attention adds a layer of indirection (the block table lookup) at each attention computation. This adds a small latency overhead per token, typically under 5% on modern hardware. The overhead increases if the block table becomes very large (extremely long sequences with very small block sizes).
Prefix Caching
Many serving scenarios involve repeated prefixes. Every request to a chatbot shares the same system prompt. Every request in a batch of document QA shares the same document context. Prefix caching exploits this by storing KV cache blocks for common prefixes once and sharing them across requests.
Prefix Caching
Prefix caching identifies requests with shared token prefixes and maps their block tables to the same physical memory blocks for those shared positions. When a request diverges from the prefix, new blocks are allocated for the diverging tokens. This uses copy-on-write semantics: the shared blocks are read-only, and a request only gets its own copy of a block when it needs to modify it (which does not happen for KV cache, since past key-value pairs are immutable).
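A reference-counted block pool makes the sharing concrete. This is a sketch with hypothetical structures (vLLM's internals differ); the key point is that because past K/V entries are immutable, shared blocks never actually need to be copied.

```python
from collections import defaultdict

class SharedPool:
    """Block pool with reference counting for prefix sharing (hypothetical)."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = defaultdict(int)

    def alloc(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        # Another request maps the same physical block, read-only.
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)  # freed only when the last user releases it

def fork_from_prefix(pool, prefix_table):
    # A new request reuses the prefix's blocks instead of recomputing them.
    return [pool.share(b) for b in prefix_table]

pool = SharedPool(num_blocks=256)
prefix = [pool.alloc() for _ in range(128)]   # e.g. 2048-token prefix / 16
req_a = fork_from_prefix(pool, prefix)        # no new blocks allocated
req_b = fork_from_prefix(pool, prefix)        # no new blocks allocated
```

Each forked request then allocates fresh blocks only for tokens past the shared prefix.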
Prefix Sharing Reduces Memory Proportional to Shared Length
Statement
Without prefix caching, the total KV cache memory for $N$ requests of length $L$ is proportional to $N L$. With prefix caching, the shared prefix of length $L_p$ occupies $\lceil L_p / B \rceil$ blocks stored once, and each request adds $\lceil (L - L_p) / B \rceil$ unique blocks. The total memory is:

$M \propto L_p + N (L - L_p)$

The memory savings ratio is approximately:

$1 - \dfrac{L_p + N (L - L_p)}{N L} = \dfrac{(N - 1) L_p}{N L} \approx \dfrac{L_p}{L}$ for large $N$

For large $N$ and substantial $L_p / L$, savings approach $L_p / L$. A shared system prompt of 2000 tokens in a 4000-token context with 100 requests saves approximately 50% of KV cache memory.
Intuition
If all $N$ requests share 50% of their tokens as a common prefix, you store that prefix once instead of $N$ times. The memory for the shared portion drops from $N L_p$ to $L_p$.
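The savings estimate is easy to verify numerically (the helper name is illustrative):

```python
def kv_memory_tokens(n_requests, total_len, prefix_len):
    # Token-slot counts with and without prefix sharing, plus savings ratio.
    naive = n_requests * total_len
    shared = prefix_len + n_requests * (total_len - prefix_len)
    return naive, shared, 1 - shared / naive

# 100 requests, 4000-token contexts sharing a 2000-token system prompt:
print(kv_memory_tokens(100, 4000, 2000))  # savings ~0.495, i.e. roughly 50%
```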
Why It Matters
In production chatbot deployments, system prompts are often 1000-3000 tokens. With thousands of concurrent requests, prefix caching saves terabytes of aggregate KV cache memory across a serving cluster.
Failure Mode
Prefix caching only helps when requests genuinely share a prefix. For workloads with diverse, unique prompts (e.g., code completion on different codebases), the cache hit rate is low and the overhead of tracking potential prefixes may not be worthwhile.
Token Eviction for Long Sequences
For sequences that exceed available memory, or when aggressive memory savings are needed, some tokens can be dropped from the KV cache.
Attention-based eviction. Track the cumulative attention weight each cached token receives. Tokens that consistently receive near-zero attention are unlikely to be needed in the future and can be evicted. H2O (Heavy Hitter Oracle, Zhang et al., 2023) keeps the tokens with highest cumulative attention scores plus a sliding window of recent tokens.
Sliding window attention. Keep only the most recent $w$ tokens in the KV cache plus a small set of "sink" tokens (the first few tokens, which empirically always receive high attention). StreamingLLM (Xiao et al., 2023) uses this approach to enable infinite-length generation with fixed memory.
The trade-off is clear: evicted tokens are permanently lost. If the model later needs information from an evicted token, generation quality degrades.
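A StreamingLLM-style retention policy is a one-liner over positions. The sizes here (4 sink tokens, 1020-token window) are illustrative choices, not prescribed values:

```python
# Keep a few attention-sink tokens from the start plus a sliding window of
# recent tokens; everything in between is evicted (permanently lost).
def keep_positions(seq_len, n_sink=4, window=1020):
    if seq_len <= n_sink + window:
        return list(range(seq_len))  # nothing needs evicting yet
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

kept = keep_positions(100_000)
print(len(kept))  # 1024: cache size is fixed no matter how long generation runs
```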
Quantized KV Cache
The KV cache is typically stored in the same precision as model activations (fp16 or bf16). But the key and value vectors tolerate quantization better than model weights because they are consumed once (during the attention computation) rather than reused across the entire forward pass.
INT8 KV cache. Quantize cached keys and values to 8-bit integers. This halves the memory footprint. Dequantization happens on-the-fly during the attention computation. The accuracy degradation is typically under 0.5% on standard benchmarks.
INT4 KV cache. Quantize to 4-bit integers. This quarters the memory footprint but introduces more noticeable quality degradation, especially for tasks requiring precise recall of information from early in the context. Group quantization (quantizing blocks of values with per-group scale factors) mitigates this.
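Per-group symmetric quantization can be sketched in a few lines of NumPy. Function and parameter names here are illustrative, not a library API; each group of channels gets its own scale so that an outlier in one group does not inflate quantization error everywhere else.

```python
import numpy as np

def quantize_groups(x, group_size=64, bits=8):
    qmax = 2 ** (bits - 1) - 1                     # 127 for INT8, 7 for INT4
    groups = x.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)       # guard empty/zero groups
    q = np.round(groups / scale).astype(np.int8)   # |q| <= qmax by construction
    return q, scale

def dequantize_groups(q, scale, shape):
    # In real systems this happens on the fly inside the attention kernel.
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
keys = rng.standard_normal((128, 128)).astype(np.float32)
q, scale = quantize_groups(keys)                   # 1 byte/element + scales
restored = dequantize_groups(q, scale, keys.shape)
max_err = np.abs(keys - restored).max()            # bounded by scale / 2
```

Passing `bits=4` gives the INT4 variant with the same per-group machinery (a real INT4 path would also pack two values per byte).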
For a 70B model with GQA ($n_{\text{kv}} = 8$ KV heads, $d_{\text{head}} = 128$) serving 128K-token contexts:
- fp16 KV cache: approximately 40 GB per sequence
- INT8 KV cache: approximately 20 GB per sequence
- INT4 KV cache: approximately 10 GB per sequence
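The figures above can be checked with back-of-envelope arithmetic, assuming a Llama-70B-like configuration (80 layers, 8 KV heads, head dim 128):

```python
# KV cache bytes per sequence: 2 (K and V) x layers x kv_heads x head_dim
# x bytes per element x tokens.
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

fp16_gb = kv_cache_bytes(128 * 1024) / 2**30
print(fp16_gb)  # 40.0 GB at fp16; INT8/FP8 halve it, INT4 quarters it
```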
FP8 KV cache. FP8 formats (E4M3 or E5M2, Micikevicius et al., 2022) give the same 2x memory reduction as INT8 but preserve the exponent range, which matters for outlier keys and values. TensorRT-LLM and vLLM both support FP8 KV cache. The halved HBM footprint also halves HBM traffic during the memory-bandwidth-bound decode phase, so FP8 improves throughput in addition to capacity.
Multi-head Latent Attention (MLA)
Multi-head Latent Attention
MLA (DeepSeek-V2, 2024; DeepSeek-V3, 2024) compresses the KV cache by projecting keys and values into a shared low-rank latent $c_t \in \mathbb{R}^{d_c}$ with $d_c \ll n_h d_h$, caching the latent instead of per-head K,V, and reconstructing per-head keys and values via learned up-projections at attention time.
MLA is mathematically distinct from GQA. GQA shares K,V across groups of query heads (head-group sharing). MLA compresses the entire head dimension into a shared latent (rank compression). For DeepSeek-V3, $d_c = 512$ while $n_h d_h = 128 \times 128 = 16384$, a 32x compression of the cached dimension. DeepSeek-V2 reports MLA achieving lower KV cache memory than GQA at matched quality.
The trade-off is extra compute at attention time: per-token up-projections from latent to per-head keys and values. For memory-bandwidth-bound decode, this is a favorable trade (extra FLOPs are cheap relative to HBM traffic).
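The shapes tell the whole story. This sketch uses toy dimensions and random placeholder weights (not trained parameters); the DeepSeek-V3-like sizes (128 heads, head dim 128, latent dim 512) appear only in the final ratio check.

```python
import numpy as np

n_h, d_h, d_c, d_model = 4, 8, 16, 64   # toy sizes for the sketch

rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_c))    # compress hidden state -> latent
W_up_k = rng.standard_normal((d_c, n_h * d_h))  # latent -> per-head keys
W_up_v = rng.standard_normal((d_c, n_h * d_h))  # latent -> per-head values

h = rng.standard_normal(d_model)     # hidden state for one token
c = h @ W_down                       # cache only this: d_c floats per token
k = (c @ W_up_k).reshape(n_h, d_h)   # reconstructed at attention time
v = (c @ W_up_v).reshape(n_h, d_h)   # extra FLOPs, but no extra HBM traffic

print(128 * 128 // 512)  # 32: the cached-dimension compression cited above
```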
Cross-request Caching: SGLang RadixAttention
Prefix caching as described above handles exact prefix matches within a batch or across sequential requests in vLLM. RadixAttention (Zheng et al., 2023, SGLang) generalizes this with a tree-structured prefix cache and LRU eviction: multiple requests sharing any prefix of the prompt tree share KV blocks, not just contiguous start prefixes. This is most valuable for agentic workloads (ReAct-style loops with shared system prompt and tool descriptions, branching on tool call outputs) and few-shot inference (thousands of requests sharing the same few-shot header).
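The core lookup can be illustrated with a toy token-level trie standing in for SGLang's radix tree (which additionally compresses single-child chains and evicts with LRU). In a real system each node would map to the physical KV blocks covering its tokens.

```python
class Node:
    def __init__(self):
        self.children = {}   # token id -> Node

def insert(root, tokens):
    node = root
    for t in tokens:
        node = node.children.setdefault(t, Node())

def match_prefix(root, tokens):
    # Length of the longest cached prefix: those KV entries can be reused.
    node, matched = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node = node.children[t]
        matched += 1
    return matched

root = Node()
insert(root, [1, 2, 3, 4])               # first request's prompt tokens
print(match_prefix(root, [1, 2, 3, 9]))  # 3: reuse KV for the shared prefix
```

Because matching walks the tree, any shared branch of the prompt tree is reused, not just a single fixed system prompt.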
Anthropic's Prompt Caching API (2024) exposes this capability to end users: requests mark cache breakpoints, and the server persists KV blocks across requests up to a TTL.
Context-aware Eviction: SnapKV and MInference
H2O and StreamingLLM evict based on cumulative attention scores or position. More recent work exploits structured sparsity in long-context attention.
SnapKV (Li et al., 2024) uses an observation window of the most recent tokens to identify which earlier positions received concentrated attention from the observation window, then keeps only those KV entries for subsequent decoding. This is lossy like H2O but selects positions using a targeted probe rather than running averages.
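A simplified version of this selection step: score earlier positions by the attention they receive from the observation window, then keep the top-k scored positions plus the window itself. Parameter names are illustrative, and the real method also pools scores over neighboring positions.

```python
import numpy as np

def snapkv_keep(attn, obs=32, k=256):
    # attn: [num_queries, seq_len] attention weights for one head.
    scores = attn[-obs:, :-obs].sum(axis=0)  # votes from the observation window
    topk = np.argsort(scores)[-k:]           # earlier positions worth keeping
    window = np.arange(attn.shape[1] - obs, attn.shape[1])
    return np.sort(np.concatenate([topk, window]))
```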
MInference (Jiang et al., 2024) identifies three structured sparsity patterns in long-context attention maps (A-shape, Vertical-Slash, Block-Sparse) and runs a sparse attention kernel that skips masked positions. Unlike eviction, MInference keeps all KV entries in memory but skips computation on statistically unimportant positions.
CacheBlend (Yao et al., 2024) targets RAG: when a request concatenates retrieved chunks, CacheBlend selectively recomputes KV entries for cross-chunk attention rather than reusing per-chunk caches wholesale, preserving attention fidelity at retrieved-chunk boundaries.
CPU and NVMe Offloading
For contexts or cache footprints that exceed GPU HBM, the KV cache can be tiered: hot blocks on HBM, warm blocks on CPU DRAM, cold blocks on NVMe. vLLM's SwapBlockManager implements HBM-to-CPU swapping block-wise on preemption; DeepSpeed-Inference (ZeRO-Inference) extends this to NVMe. The penalty is PCIe bandwidth (32 GB/s on PCIe 4.0 x16) for CPU-to-GPU and single-GB/s for NVMe-to-GPU, both much slower than HBM. Offloading is therefore valuable for capacity (serving contexts that would not fit at all) but not for throughput on critical-path decode tokens.
Common Confusions
Paged attention is about memory management, not attention computation
Paged attention does not change the mathematical attention operation. The queries, keys, values, and softmax computation are identical. What changes is where the keys and values are stored in GPU memory and how that memory is allocated and tracked. The attention kernel must be modified to handle non-contiguous memory (gathering blocks via the page table), but the mathematical result is the same.
Token eviction is lossy, prefix caching is lossless
Prefix caching is an exact optimization: the shared KV cache entries are identical to what would be computed independently. Token eviction discards information and can degrade generation quality. They solve different problems: prefix caching reduces redundancy across requests; token eviction reduces memory per request.
KV cache quantization is different from weight quantization
Weight quantization compresses model parameters, reducing model size and compute. KV cache quantization compresses the runtime cache, reducing inference memory. They can be applied independently. A model can use INT4 weights with fp16 KV cache, or fp16 weights with INT8 KV cache, or both.
Summary
- Paged attention: allocate KV cache in blocks on demand, like virtual memory pages
- vLLM achieves 96%+ memory utilization vs. 20-40% for naive pre-allocation
- Prefix caching: share KV blocks for common prefixes across requests, saving memory
- Token eviction (H2O, StreamingLLM): drop low-attention tokens, enabling unbounded context with fixed memory
- Quantized KV cache: INT8 halves memory (minimal quality loss), INT4 quarters it (some quality loss)
- These optimizations compose: paged attention, prefix caching, and INT8 quantization can be applied together
Exercises
Problem
A serving system uses paged attention with block size $B = 16$. A request uses 500 tokens. How many blocks are allocated? What is the memory utilization? Compare to naive pre-allocation at maximum length $L_{\max} = 4096$.
Problem
A serving cluster handles 1000 concurrent chatbot requests. Each request has a 2048-token system prompt and an average conversation length of 1024 user/assistant tokens (total 3072 tokens per request). The model uses GQA with $n_{\text{kv}} = 8$ KV heads, $d_{\text{head}} = 128$, and $n_{\text{layers}} = 80$ layers, in fp16. Compare total KV cache memory with and without prefix caching.
References
Canonical:
- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (2023)
Current:
- Zhang et al., "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models" (2023)
- Xiao et al., "Efficient Streaming Language Models with Attention Sinks" (2023). StreamingLLM
- Hooper et al., "KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization" (2024)
- DeepSeek-AI, "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" (2024), arXiv:2405.04434. MLA
- DeepSeek-AI, "DeepSeek-V3 Technical Report" (2024), arXiv:2412.19437
- Zheng et al., "SGLang: Efficient Execution of Structured Language Model Programs" (2023), arXiv:2312.07104. RadixAttention
- Li et al., "SnapKV: LLM Knows What You are Looking for Before Generation" (2024), arXiv:2404.14469
- Jiang et al., "MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention" (2024), arXiv:2407.02490
- Yao et al., "CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion" (2024), arXiv:2405.16444
- Micikevicius et al., "FP8 Formats for Deep Learning" (2022), arXiv:2209.05433
- Anthropic, "Prompt Caching" (2024). Anthropic API documentation
Next Topics
The natural next steps from KV cache optimization:
- Inference systems overview: the full stack of LLM serving optimization
Last reviewed: April 18, 2026