AI Systems Bridge · 35 min

Inference Servers

How production LLM servers turn a decoder model into low-latency streams using continuous batching, paged attention, scheduling, speculation, and quantization.

Why This Matters

A 7B decoder with 32 layers, 32 attention heads, head dimension 128, and an FP16 key-value cache stores $2 \cdot 32 \cdot 32 \cdot 128 \cdot 2 = 524288$ bytes of KV per token. One 8k-token conversation then occupies 4 GiB of KV cache if the model uses full multi-head KV. Grouped-query attention cuts that number, but it does not remove the scheduling problem.
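A small calculation makes the memory pressure concrete. This sketch assumes the model shape above (32 layers, 32 KV heads, head dimension 128, FP16) and reproduces the per-token and per-conversation figures.

#include <stdio.h>

int main(void) {
    // Assumed model shape from the text: full multi-head KV, FP16 entries.
    long long layers = 32, kv_heads = 32, head_dim = 128;
    long long bytes_per_elem = 2;    // FP16
    long long kv_factor = 2;         // one K row and one V row per token

    long long bytes_per_token = kv_factor * layers * kv_heads * head_dim * bytes_per_elem;
    long long tokens = 8192;         // an 8k-token conversation

    printf("KV bytes per token: %lld (%.0f KiB)\n",
           bytes_per_token, bytes_per_token / 1024.0);                        // 524288 (512 KiB)
    printf("KV bytes for 8k tokens: %.2f GiB\n",
           (double)(bytes_per_token * tokens) / (1024.0 * 1024.0 * 1024.0));  // 4.00 GiB
    return 0;
}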

Serving is not a single forward pass. One user sends a 64-token prompt and asks for 20 tokens. Another sends 7k tokens and asks for 1k tokens. A server must choose which prompts enter prefill, which active sequences decode one token, how KV blocks are placed, and when bytes are flushed to the client so tail latency stays bounded.

Core Definitions

Definition

Prefill

Prefill is the phase that processes the prompt tokens and creates the initial KV cache. For a prompt of length $n$, self-attention work scales like $O(n^2)$ inside the attention sublayer, but GPUs can process many prompt positions in parallel.

Definition

Decode

Decode is the autoregressive phase after prefill. Each step appends one token per active sequence, reads the existing KV cache, writes new K and V rows, and samples the next token. Decode often becomes memory-bandwidth limited at small batch sizes.

Definition

Continuous batching

Continuous batching schedules at token granularity. Finished sequences leave immediately, and newly ready sequences enter on the next decoding iteration instead of waiting for the whole batch to end.

Definition

Paged attention

Paged attention stores each sequence's KV cache in fixed-size blocks and uses a block table to map logical token positions to physical blocks. vLLM uses this to reduce internal and external fragmentation in KV memory.

Why Static Batching Wastes Work

Static batching pads every request in a batch to the same prompt length and often reserves the same output length. It matches the GPU's preference for dense rectangular tensors, but request lengths in chat traffic are heterogeneous.

Consider three requests entering together.

Request | Prompt tokens | Generated tokens
A       | 8             | 4
B       | 64            | 12
C       | 1024          | 3

Static prefill pads all prompts to 1024 tokens. The server computes attention masks for a logical tensor of $3 \cdot 1024$ prompt positions even though two rows contain mostly padding. Static decode then keeps all three batch slots until the longest generation ends.

A decode timeline with static batching looks like this.

Step | A    | B    | C
1    | live | live | live
2    | live | live | live
3    | live | live | live
4    | live | live | done
5    | pad  | live | pad
…    | pad  | live | pad
12   | pad  | live | pad

After step 4, two thirds of the batch slots are idle. A dynamic batcher improves admission by waiting a few milliseconds for compatible requests, but once a batch starts, membership may still be fixed.
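The bookkeeping is easy to check. This sketch counts how many slot-steps a static batch wastes for the remaining output lengths above (4, 12, 3); the function name is illustrative, not from any serving framework.

#include <stdio.h>

// Count idle (padded) slot-steps when a static batch runs until
// its longest generation ends. Lengths are remaining output tokens.
static int wasted_slot_steps(const int *lengths, int n) {
    int longest = 0, busy = 0;
    for (int i = 0; i < n; i++) {
        if (lengths[i] > longest) longest = lengths[i];
        busy += lengths[i];            // steps where slot i emits a token
    }
    return n * longest - busy;         // total slot-steps minus useful ones
}

int main(void) {
    int lengths[] = {4, 12, 3};        // requests A, B, C from the table
    printf("wasted slot-steps: %d\n", wasted_slot_steps(lengths, 3));  // 17
    return 0;
}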

Continuous batching, used by Orca-style systems and vLLM, changes the unit of scheduling. At decode step 5, slots from A and C can be assigned to new requests D and E. The GPU kernel still sees a rectangular batch for the current step, but the server's metadata changes every step.

A simplified scheduler loop is short.

while (server_running) {
  admit_ready_prefills();          // prompts become KV blocks
  Batch b = pick_active_decodes(); // one token per live sequence
  run_decode_kernel(b);
  sample_and_append_tokens(b);
  evict_finished_sequences();
  flush_streaming_tokens();
}

The hard part is not the loop. The hard part is making pick_active_decodes keep GPU occupancy high without letting a 20k-token prompt block small requests behind it.

KV Cache Layout and Paged Attention

A decoder layer writes K and V vectors for every token. With 32 layers, 8 KV heads, head dimension 128, and FP16 entries, the KV bytes per token are:

$$2 \cdot 32 \cdot 8 \cdot 128 \cdot 2 = 131072$$

That is 128 KiB per token. With a block size of 16 tokens, one block holds 2 MiB of KV data for a single sequence. Without paging, a server often reserves a contiguous maximum-length region per request. If a request reserves 2048 tokens but ends after 129 tokens, most of the allocation is dead until the request is removed.
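To see how much of that reservation paging recovers, here is a minimal comparison under the numbers above: a contiguous 2048-token reservation versus 16-token blocks allocated on demand for a request that ends after 129 tokens. The 128 KiB-per-token figure comes from the calculation just above.

#include <stdio.h>

int main(void) {
    long kv_bytes_per_token = 131072;   // 128 KiB, from the calculation above
    long block_tokens = 16;             // paged-attention block size
    long reserved_tokens = 2048;        // contiguous max-length reservation
    long used_tokens = 129;             // request ends here

    // Contiguous: the whole reservation is held until the request is removed.
    long contiguous_bytes = reserved_tokens * kv_bytes_per_token;

    // Paged: allocate only the blocks that actually hold tokens.
    long blocks_needed = (used_tokens + block_tokens - 1) / block_tokens;   // 9 blocks
    long paged_bytes = blocks_needed * block_tokens * kv_bytes_per_token;

    printf("contiguous: %ld MiB, paged: %ld MiB\n",
           contiguous_bytes >> 20, paged_bytes >> 20);   // 256 MiB vs 18 MiB
    return 0;
}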

Paged attention replaces the contiguous region with fixed blocks.

#include <stdint.h>

// One row per live sequence.
// block_table[seq][logical_block] gives a physical KV block id.
uint32_t block_table[3][8] = {
  {17, 42, 91,  0,  0,  0, 0, 0},  // seq 0 uses 3 blocks
  { 5,  6,  9, 10, 11, 12, 0, 0},  // seq 1 uses 6 blocks
  {88,  0,  0,  0,  0,  0, 0, 0}   // seq 2 uses 1 block
};

For block size 16, logical token 37 in sequence 0 maps to logical block 2 and offset 5. The physical block id is 91. If each token's KV record is 128 KiB, the byte offset inside that physical block is $5 \cdot 131072 = 655360$ bytes.
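A sketch of that lookup, using the block table above: given a sequence id and a logical token position, it returns the physical block id and the byte offset of that token's KV record inside the block. The flat one-record-per-token layout is a simplification of how real kernels interleave K and V.

#include <stdint.h>
#include <stdio.h>

#define BLOCK_TOKENS 16
#define KV_BYTES_PER_TOKEN 131072   /* 128 KiB, from the calculation above */

static uint32_t block_table[3][8] = {
    {17, 42, 91,  0,  0,  0, 0, 0},
    { 5,  6,  9, 10, 11, 12, 0, 0},
    {88,  0,  0,  0,  0,  0, 0, 0}
};

// Map (sequence, logical token position) to (physical block, byte offset).
static void locate_kv(int seq, int token, uint32_t *phys_block, long *byte_off) {
    int logical_block = token / BLOCK_TOKENS;
    int offset_tokens = token % BLOCK_TOKENS;
    *phys_block = block_table[seq][logical_block];
    *byte_off = (long)offset_tokens * KV_BYTES_PER_TOKEN;
}

int main(void) {
    uint32_t block; long off;
    locate_kv(0, 37, &block, &off);
    printf("physical block %u, byte offset %ld\n", block, off);  // block 91, offset 655360
    return 0;
}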

The byte-level view of one block table entry on a little-endian machine is straightforward. Physical block id 91 is hexadecimal 0x0000005b; stored as four bytes it is:

5b 00 00 00

The GPU attention kernel reads the block table, gathers K and V from physical blocks, and computes attention for the current query. The indirection costs memory reads, but it avoids copying huge KV regions when sequences grow, fork, or finish. Kwon et al. report that this design raises effective KV utilization by eliminating most fragmentation from variable request lengths.

Prefill and Decode Scheduling

Prefill and decode stress different parts of the machine. Prefill for a long prompt has large matrix multiplications and attention over many prompt positions. Decode for a small active batch repeatedly reads model weights and KV cache to produce one token.

A Roofline-style lower bound shows the difference. Suppose a 7B model has 14 GB of FP16 weights. If batch size is 1 during decode, each generated token must stream roughly 14 GB of weights, ignoring KV and activations. On a GPU with 1.5 TB/s memory bandwidth, the weight-read lower bound is:

$$14 \text{ GB} / 1.5 \text{ TB/s} \approx 9.3 \text{ ms}$$

With batch size 32, the same weight read produces 32 tokens, so the weight-read bound per token is about 0.29 ms. Continuous batching exists partly to increase this reuse.
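The bound is easy to evaluate for other batch sizes. A small sketch, assuming the 14 GB weight footprint and 1.5 TB/s bandwidth used above:

#include <stdio.h>

int main(void) {
    double weight_bytes = 14e9;          // FP16 weights of a 7B model
    double bandwidth = 1.5e12;           // bytes per second

    // Weight-read lower bound per generated token, ignoring KV and activations.
    int batch_sizes[] = {1, 8, 32, 128};
    for (int i = 0; i < 4; i++) {
        int b = batch_sizes[i];
        double ms_per_token = weight_bytes / bandwidth / b * 1e3;
        printf("batch %3d: %.2f ms per token\n", b, ms_per_token);
    }
    return 0;
}
// batch 1: 9.33 ms, batch 32: 0.29 ms, matching the numbers above.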

The conflict is that long prefill kernels can delay decode tokens already promised to users. A common policy separates work into queues.

struct Request {
  int prompt_tokens;
  int generated_tokens;
  int max_new_tokens;
  bool has_kv;
};

if (!r.has_kv) {
  prefill_queue.push(r);   // large, parallel, may be chunked
} else {
  decode_queue.push(r);    // latency-sensitive one-token steps
}

Chunked prefill splits a long prompt into pieces, for example 512 tokens at a time, and interleaves decode batches between chunks. Recent serving designs also disaggregate prefill and decode onto different machines. Prefill workers handle large prompt ingestion and write KV state or compact transfer records. Decode workers keep active continuations hot and focus on token latency. This adds network and placement complexity, so it pays most when prompt lengths are long and decode batches would otherwise be interrupted.
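A minimal sketch of chunked prefill scheduling, assuming a 512-token chunk size: the prompt is consumed in pieces and a decode step can run between chunks. The function names are illustrative placeholders, not from vLLM or any other server.

#include <stdio.h>

#define CHUNK_TOKENS 512

// Hypothetical hooks; a real server would launch kernels here.
static void run_prefill_chunk(int start, int len) {
    printf("prefill tokens [%d, %d)\n", start, start + len);
}
static void run_decode_step(void) {
    printf("decode step for active sequences\n");
}

// Interleave decode steps between prefill chunks of a long prompt.
static void chunked_prefill(int prompt_tokens) {
    int done = 0;
    while (done < prompt_tokens) {
        int len = prompt_tokens - done;
        if (len > CHUNK_TOKENS) len = CHUNK_TOKENS;
        run_prefill_chunk(done, len);
        done += len;
        run_decode_step();   // keep promised decode tokens flowing
    }
}

int main(void) {
    chunked_prefill(1300);   // 512 + 512 + 276 tokens
    return 0;
}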

Speculative Decoding

Speculative decoding uses a cheaper draft model to propose $k$ tokens. The target model verifies those proposed tokens in one forward pass over the extended prefix. If the draft agrees with the target distribution under the chosen acceptance test, several tokens are emitted after one target call.

Let target decode cost be 1 unit per verification pass, draft cost be $c$ per token, and expected emitted tokens per verification be $E$. The approximate speedup is:

$$S = E / (1 + k c)$$

For $k = 4$, draft cost $c = 0.1$, and independent acceptance probability $p = 0.8$, a simple emitted-token estimate is:

$$E = 1 + p + p^2 + p^3 + p^4 = 3.3616$$

The speedup estimate is $3.3616 / 1.4 = 2.40$. If $p$ falls to 0.48, then $E = 1.874$ and the speedup drops to 1.34. The draft model must be cheap and close enough to the target on the served distribution.
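The same estimate as a short calculation; the acceptance model (independent per-token acceptance with probability p) is the simplification used above, not the exact acceptance rule from the speculative decoding papers.

#include <stdio.h>

// Expected emitted tokens per verification: E = 1 + p + p^2 + ... + p^k,
// under the independent-acceptance simplification used in the text.
static double expected_tokens(double p, int k) {
    double e = 1.0, term = 1.0;
    for (int i = 0; i < k; i++) {
        term *= p;
        e += term;
    }
    return e;
}

int main(void) {
    int k = 4;
    double c = 0.1;                 // draft cost per token; one target pass costs 1
    double ps[] = {0.80, 0.48};
    for (int i = 0; i < 2; i++) {
        double e = expected_tokens(ps[i], k);
        printf("p=%.2f: E=%.4f, speedup=%.2f\n", ps[i], e, e / (1.0 + k * c));
    }
    return 0;
}
// p=0.80: E=3.3616, speedup=2.40;  p=0.48: E=1.8741, speedup=1.34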

Speculation changes batching. The target pass verifies several positions for each sequence, so the batch is no longer exactly one token per active request. Schedulers often cap speculative length for fairness and for memory predictability.

Quantization and the User-Facing Stream

Serving-time quantization reduces the bytes moved per token. INT8 weights cut weight bandwidth roughly in half compared with FP16. INT4 cuts it to one quarter before scales, zero points, packing overhead, and dequantization work. FP8 KV cache halves KV memory compared with FP16.

A packed INT4 byte contains two 4-bit weights.

#include <stdint.h>

uint8_t pack_int4(int8_t lo, int8_t hi) {
  return (uint8_t)((lo & 0x0f) | ((hi & 0x0f) << 4));
}

// lo = -3 has low nibble 0xd, hi = 6 has high nibble 0x6.
// packed byte is 0x6d.

If a model's FP16 weights occupy 14 GB, INT4 storage is about 3.5 GB plus metadata. If each group of 128 weights stores an FP16 scale, the scale overhead is 2 bytes per 128 weights, or about 3.1 percent relative to the INT4 payload. Quality loss depends on calibration, outlier handling, and which tensors are quantized.
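The storage arithmetic as a sketch: a 4-bit payload for roughly 7B parameters plus one FP16 scale per group of 128 weights.

#include <stdio.h>

int main(void) {
    double    fp16_bytes = 14e9;            // FP16 weight footprint
    long long n_weights  = 7000000000LL;    // ~7B parameters
    long long group      = 128;             // weights per quantization group

    double int4_payload = n_weights * 0.5;                    // 4 bits per weight
    double scale_bytes  = (double)(n_weights / group) * 2.0;  // one FP16 scale per group

    printf("INT4 payload: %.2f GB\n", int4_payload / 1e9);          // ~3.5 GB
    printf("scale overhead: %.2f%% of payload\n",
           100.0 * scale_bytes / int4_payload);                      // ~3.1%
    printf("total vs FP16: %.2f GB vs %.1f GB\n",
           (int4_payload + scale_bytes) / 1e9, fp16_bytes / 1e9);
    return 0;
}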

Users do not see token kernels. They see a stream. Server-Sent Events send UTF-8 frames such as:

data: {"token":"cat"}

The first bytes are:

64 61 74 61 3a 20 7b 22 74 6f 6b 65 6e 22

That is data: {"token" in ASCII. WebSockets use framed binary or text messages instead of the SSE line format. Both let the server flush each token as soon as sampling finishes, which reduces perceived latency even when total generation time is unchanged.
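A small sketch that builds one SSE frame and dumps its leading bytes, confirming the hex shown above. The JSON payload shape is just an example, not a fixed API format.

#include <stdio.h>

int main(void) {
    // One Server-Sent Events frame: "data: " prefix, payload, blank-line terminator.
    char frame[64];
    int n = snprintf(frame, sizeof frame, "data: {\"token\":\"cat\"}\n\n");

    for (int i = 0; i < 14 && i < n; i++)
        printf("%02x ", (unsigned char)frame[i]);   // 64 61 74 61 3a 20 7b 22 74 6f 6b 65 6e 22
    printf("\n");
    return 0;
}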

Key Result

The serving frontier is governed by two invariants.

First, the per-token cost cannot beat the dominant hardware ceiling for the active phase.

$$\text{time per step} \geq \max(\text{FLOPs}/\text{peak FLOP/s}, \text{bytes}/\text{bandwidth})$$

Second, the scheduler trades throughput against tail latency. Larger batches improve weight reuse and reduce cost per token, but waiting to form those batches increases time-to-first-token and p95 latency.

A cost calculation makes this visible. If a GPU costs USD 3 per hour and a server emits 2000 tokens per second, the compute cost per million tokens is:

$$3 / (2000 \cdot 3600) \cdot 1000000 = 0.4167$$

At 500 tokens per second, the same GPU costs USD 1.6667 per million tokens. Quantization, paged attention, and continuous batching all push tokens per second upward. Strict latency targets push the other way by limiting queueing and batch size.
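The same cost arithmetic as a sketch, so other throughput and price points can be plugged in.

#include <stdio.h>

// Compute cost per million generated tokens from GPU hourly price and throughput.
static double usd_per_million_tokens(double usd_per_hour, double tokens_per_sec) {
    double tokens_per_hour = tokens_per_sec * 3600.0;
    return usd_per_hour / tokens_per_hour * 1e6;
}

int main(void) {
    printf("2000 tok/s: $%.4f per 1M tokens\n", usd_per_million_tokens(3.0, 2000.0)); // 0.4167
    printf(" 500 tok/s: $%.4f per 1M tokens\n", usd_per_million_tokens(3.0,  500.0)); // 1.6667
    return 0;
}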

Common Confusions

Watch Out

Confusing dynamic batching with continuous batching

Dynamic batching often means collecting requests for a short window and then running them as a fixed group. Continuous batching changes membership after each decode step. If sequence A finishes at step 4, its slot can be reused at step 5.

Watch Out

Counting prefill tokens and decode tokens as the same unit

One prefill token in a long prompt and one decode token do not have the same latency effect. Prefill is more parallel and can be chunked. Decode is serial along each sequence and determines the visible streaming cadence.

Watch Out

Treating paged attention as smaller KV tensors

Paged attention does not change the KV vector size for a token. It changes allocation and lookup. The savings come from less fragmentation and better sharing of free blocks, not from a smaller attention state.

Exercises

ExerciseCore

Problem

A model has 40 layers, 8 KV heads, head dimension 128, and FP16 KV entries. Compute KV cache bytes per token and the memory needed for a 4096-token request.

ExerciseCore

Problem

Three decode requests have remaining output lengths 2, 5, and 9. A static batch starts with all three requests and cannot admit new work. How many batch slots are wasted across the 9 decode steps after requests finish?

ExerciseAdvanced

Problem

Speculative decoding uses $k = 5$, draft cost $c = 0.08$, and independent acceptance probability $p = 0.7$. Use $E = 1 + p + p^2 + p^3 + p^4 + p^5$. Estimate the speedup.

References

Canonical:

  • Vaswani et al., Attention Is All You Need (2017), §§3.2-3.5, on scaled dot-product attention, multi-head attention, and autoregressive decoding
  • NVIDIA, CUDA C++ Programming Guide (2025), §§5.2, 8.3, 10.2, on occupancy, memory hierarchy, and asynchronous execution
  • Williams, Waterman, and Patterson, Roofline: An Insightful Visual Performance Model for Multicore Architectures (2009), §§1-4, on compute and bandwidth ceilings
  • Kwon et al., vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention (2023), §§2-5, on KV fragmentation, paged attention, and continuous batching
  • Yu et al., Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022), §§3-5, on iteration-level scheduling and selective batching

Accessible:

  • Hugging Face, Text Generation Inference Documentation, on production APIs, streaming, batching, and quantization options
  • NVIDIA, TensorRT-LLM Documentation, on runtime concepts, inflight batching, quantization, and serving workflows
  • Jay Mody, Inferencing Large Language Models, on KV cache size, batching, and latency calculations

Next Topics

  • /computationpath/cuda-execution-model
  • /computationpath/gpu-memory-hierarchy
  • /computationpath/quantization-and-compression
  • /computationpath/distributed-inference
  • /topics/transformer-attention