CPU and Machine Model · 45 min

Cache Hierarchy

Why memory is a pyramid. L1/L2/L3, cache lines, associativity, replacement, write policy, and AMAT as the effective access-time model.

Why This Matters

A register read costs about 1 CPU cycle. A load that misses all cache levels and reaches DRAM can cost about 200 cycles. At 3 GHz, 200 cycles is about 67 ns. A core waiting on one such load has time to execute hundreds of simple integer instructions, if those instructions do not depend on the missing value.

Modern CPUs hide part of this gap with L1, L2, and L3 caches. A typical path is L1 data cache at 32 KB and about 4 cycles, L2 at 256 KB to 1 MB and about 12 cycles, shared L3 at about 40 cycles, then DRAM at about 200 cycles. The programmer still controls the address stream. Stride, layout, tiling, and working-set size decide whether the hardware sees predictable locality or repeated long-latency misses.

Core Definitions

Definition

Cache line

A cache line is the fixed-size block moved between adjacent levels of the memory hierarchy. A common line size is 64 bytes. Loading one byte at address A usually fetches all bytes in the aligned interval from A & ~63 through (A & ~63) + 63.

Definition

Temporal locality

Temporal locality means that an address used now is likely to be used again soon. Reusing a matrix tile while it remains in L1 or L2 is a direct example.

Definition

Spatial locality

Spatial locality means that addresses near a recently used address are likely to be used soon. A linear scan over float x[n] uses spatial locality because each 64-byte line contains 16 adjacent 4-byte floats.

Definition

Associativity

A cache with S sets and k ways stores each memory line in exactly one set and in any of k slots inside that set. Direct-mapped means k = 1. Fully associative means there is one set and every line can occupy any slot.

Address Split and Cache Lines

A byte address is split into tag, index, and offset bits. The offset selects a byte inside a cache line. The index selects a set. The tag distinguishes memory blocks that map to the same set.

For a 32 KB direct-mapped L1 data cache with 64-byte lines, the number of lines is 32768 / 64 = 512. That needs 9 index bits. The line offset needs 6 bits. In a 32-bit address, the remaining 17 bits are the tag.

For address 0x12345678, the split is:

address = 0x12345678
offset  = address & 0x3f              = 0x38
index   = (address >> 6) & 0x1ff      = 0x159
tag     = address >> 15               = 0x2468

The accessed byte lies 56 bytes into the line. The cache checks set 0x159. In a direct-mapped cache, that set has one possible resident line. A hit requires the stored valid bit to be 1 and the stored tag to equal 0x2468.

The byte interval brought into cache is:

line base = 0x12345678 & ~0x3f = 0x12345640
line end  = 0x1234567f

For an 8-way 32 KB cache with the same line size, there are still 512 total lines, but only 512 / 8 = 64 sets. The offset is still 6 bits, the index is now 6 bits, and the tag is 20 bits.

address = 0x12345678
offset  = address & 0x3f             = 0x38
index   = (address >> 6) & 0x3f      = 0x19
tag     = address >> 12              = 0x12345

Set associativity reduces conflict misses. Eight distinct lines with the same set index can coexist in an 8-way set. A ninth line mapping to that set forces replacement.
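
These splits follow mechanically from the cache geometry. Here is a minimal C sketch that reproduces both worked examples; the function and struct names are illustrative, and it assumes the line size and set count are powers of two:

#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t tag, index, offset; } Split;

// Split a 32-bit byte address for a cache with num_sets sets of
// line_bytes-sized lines. Both parameters must be powers of two.
static Split split_address(uint32_t addr, uint32_t line_bytes, uint32_t num_sets) {
    uint32_t offset_bits = __builtin_ctz(line_bytes);   // log2 of line size
    uint32_t index_bits  = __builtin_ctz(num_sets);     // log2 of set count
    Split s;
    s.offset = addr & (line_bytes - 1);
    s.index  = (addr >> offset_bits) & (num_sets - 1);
    s.tag    = addr >> (offset_bits + index_bits);
    return s;
}

int main(void) {
    Split dm = split_address(0x12345678u, 64, 512);  // direct-mapped 32 KB
    Split w8 = split_address(0x12345678u, 64, 64);   // 8-way 32 KB
    printf("direct-mapped: tag=0x%x index=0x%x offset=0x%x\n", dm.tag, dm.index, dm.offset);
    printf("8-way:         tag=0x%x index=0x%x offset=0x%x\n", w8.tag, w8.index, w8.offset);
    return 0;
}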

Mapping Schemes and Replacement

A direct-mapped cache has the simplest lookup. Each line maps to one slot. The hardware compares one tag, so hit time is low, but conflicts are common. If two hot arrays map to the same indices, they can evict each other every iteration.

A fully associative cache has no index conflict because any line can occupy any slot. The hardware must compare the requested tag against every resident tag, so large fully associative caches are costly. Translation lookaside buffers often use high associativity; large data caches usually do not.

A k-way set-associative cache is the common compromise. Each access checks k tags in one selected set. Replacement chooses which way to evict on a miss when all ways are valid.

LRU, least recently used, evicts the line whose last access is oldest. Exact LRU is practical for small k, but its state grows with associativity. Pseudo-LRU stores fewer bits and approximates LRU. Random replacement picks a victim at random. Random can outperform bad pseudo-LRU cases, but it lacks the recency signal that helps ordinary loops.

Here is an LRU trace for a 2-way cache with a single set, using symbolic line-sized addresses. The cache starts empty:

access A: miss, set = [A, -]
access B: miss, set = [A, B]
access A: hit,  recency = A newer than B
access C: miss, evict B under LRU, set = [A, C]
access B: miss, evict A under LRU, set = [B, C]

The last two misses are capacity misses for a 2-line cache under this sequence. With 3 ways, the sequence A, B, A, C, B would have only the first three compulsory misses.
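
The same trace can be replayed in code. This is a toy model of exactly one 2-way set with exact LRU, not a general simulator:

#include <stdio.h>

int main(void) {
    char way[2] = {0, 0};      // resident line tags, 0 means invalid
    int  lru    = 0;           // index of the least recently used way
    const char trace[] = "ABACB";

    for (int i = 0; trace[i]; i++) {
        char t = trace[i];
        if (way[0] == t || way[1] == t) {
            lru = (way[0] == t) ? 1 : 0;   // hit: the other way becomes LRU
            printf("access %c: hit,  set = [%c, %c]\n", t, way[0], way[1]);
        } else {
            way[lru] = t;                  // miss: fill or evict the LRU way
            lru = 1 - lru;                 // the new line becomes MRU
            printf("access %c: miss, set = [%c, %c]\n", t,
                   way[0] ? way[0] : '-', way[1] ? way[1] : '-');
        }
    }
    return 0;
}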

Writes, Dirty Lines, and Allocation

Loads only need to decide whether the requested line is present. Stores also need a policy for updating lower levels.

Write-through updates the cache and the next lower level on every store. This simplifies recovery because lower memory has the current data, but it increases write traffic. Write-back updates only the cache line and marks it dirty. The modified line is written to the next level when it is evicted.

Write-allocate brings a missing line into cache before performing the store. No-write-allocate writes the store to the lower level without filling the cache. Write-back usually pairs with write-allocate. Write-through often pairs with no-write-allocate for streaming stores, though real processors also provide special non-temporal store instructions.

Consider a 64-byte line holding sixteen 4-byte integers. A store to a[3] changes bytes 12 through 15 of the resident line:

line base 0x1000
before: 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 ...
store a[3] = 0x11223344 on a little-endian machine
after:  00 00 00 00  00 00 00 00  00 00 00 00  44 33 22 11 ...
dirty bit = 1 under write-back

The dirty bit is one bit of metadata for the whole line. Evicting this line later writes all 64 bytes downward, more than just the 4 modified bytes.
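
A toy write-back, write-allocate model makes the sequence explicit. This sketch handles a single direct-mapped slot; the names are invented for illustration, and mem stands in for the next level:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    int      valid, dirty;
    uint32_t tag;
    uint8_t  data[64];
} Line;

static uint8_t mem[1 << 16];   // pretend lower level

static void store_word(Line *l, uint32_t addr, uint32_t value) {
    uint32_t base = addr & ~63u, tag = addr >> 6;
    if (!l->valid || l->tag != tag) {               // write miss
        if (l->valid && l->dirty)                   // evict: write back all 64 bytes
            memcpy(&mem[l->tag << 6], l->data, 64);
        memcpy(l->data, &mem[base], 64);            // write-allocate: fetch the line
        l->valid = 1;
        l->tag   = tag;
        l->dirty = 0;
    }
    memcpy(&l->data[addr & 63u], &value, 4);        // update 4 bytes in cache only
    l->dirty = 1;                                   // next level is now stale
}

int main(void) {
    Line l = {0};
    store_word(&l, 0x100c, 0x11223344u);            // a[3], line base 0x1000
    // byte 12 is 0x44 on a little-endian host, matching the diagram above
    printf("dirty=%d byte12=0x%02x\n", l.dirty, l.data[12]);
    return 0;
}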

Coherence adds another layer when multiple cores cache the same line. Protocols such as MESI track whether a line is modified, exclusive, shared, or invalid. The high-level invariant is that cores must not silently read stale data after another core writes the same location. False sharing occurs when two cores write different variables that reside on the same 64-byte line, causing coherence traffic despite no logical data sharing.
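
The usual mitigation for false sharing is to give each hot, per-core variable its own line. A sketch using C11 alignment, assuming a 64-byte line:

#include <stdalign.h>
#include <stdint.h>

// Two counters 8 bytes apart share one 64-byte line, so writes from
// different cores bounce the line between caches.
struct Hot {
    int64_t a;
    int64_t b;
};

// Padding each counter to its own line removes the coherence traffic
// at the cost of extra memory.
struct Padded {
    alignas(64) int64_t a;
    alignas(64) int64_t b;
};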

Locality in Code

A row-major C array stores the elements of each row contiguously, so adjacent column indices are adjacent in memory. For int a[1024][1024], a[i][j] and a[i][j+1] are 4 bytes apart, while a[i+1][j] is 4096 bytes away.

// Good spatial locality for row-major C arrays.
long sum_rows(int n, int a[n][n]) {
    long s = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            s += a[i][j];
    return s;
}

// Poor spatial locality: stride is n * sizeof(int).
long sum_cols(int n, int a[n][n]) {
    long s = 0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            s += a[i][j];
    return s;
}

If n = 1024, the column loop touches one 4-byte integer every 4096 bytes. With 64-byte lines, it uses 1 integer from each fetched line and discards the other 15 before reuse. The row loop uses all 16 integers from a line before moving on.

Tiling changes iteration order so a small working set is reused before eviction. A blocked matrix multiply for square matrices can keep submatrices in cache:

void mm_blocked(int n, int B, double *A, double *Bmat, double *C) {
    for (int ii = 0; ii < n; ii += B)
        for (int kk = 0; kk < n; kk += B)
            for (int jj = 0; jj < n; jj += B)
                for (int i = ii; i < ii + B && i < n; i++)
                    for (int k = kk; k < kk + B && k < n; k++) {
                        double aik = A[i*n + k];
                        for (int j = jj; j < jj + B && j < n; j++)
                            C[i*n + j] += aik * Bmat[k*n + j];
                    }
}

For B = 32, one double-precision tile has 32 · 32 · 8 = 8192 bytes. Three tiles for A, Bmat, and C occupy 24576 bytes, which fits within a 32 KB L1 before accounting for other data and associativity conflicts. The exact best block size depends on cache size, line size, associativity, registers, and vector width.

Struct layout also matters. An array of structs interleaves fields:

struct Particle { float x, y, z, mass; };
struct Particle p[1024];

A loop that sums only x loads y, z, and mass as unused bytes. A struct-of-arrays layout packs used fields:

struct Particles {
    float x[1024], y[1024], z[1024], mass[1024];
};

Now summing x streams through contiguous floats. Each 64-byte line contributes 16 useful values.
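
The difference shows up directly in the inner loops. A minimal sketch of both sums, using the layouts defined above:

// AoS: each 64-byte line holds 4 Particle structs, so only 4 of the
// 16 floats per fetched line are the x values this loop uses.
float sum_x_aos(int n, const struct Particle *p) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += p[i].x;
    return s;
}

// SoA: x[] is contiguous, so all 16 floats per fetched line are used.
float sum_x_soa(int n, const struct Particles *q) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += q->x[i];
    return s;
}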

Prefetch hints can reduce exposed latency when the future address is known:

for (int i = 0; i < n; i++) {
    __builtin_prefetch(&x[i + 64], 0, 1); // read, low temporal locality
    s += x[i] * y[i];
}

The hint is not a contract. It may be ignored, issued too early, or issued too late. Hardware prefetchers already detect many unit-stride streams.

Key Result

Proposition

Average Memory Access Time

Statement

For a one-level cache, the average access time is AMAT = hit_time + miss_rate · miss_penalty. For a hierarchy, the miss penalty of one level is the average access time of the next level.

Intuition

Every access pays the hit-time check. Only misses pay the extra cost. If 5 percent of loads miss L1, then 95 percent stop after the L1 hit time and 5 percent continue downward.

Proof Sketch

Let H be the hit time, m the miss probability, and P the extra cost after a miss. The access time random variable is H on a hit and H + P on a miss. Its expectation is (1 - m)H + m(H + P) = H + mP. Applying the same equation recursively gives the multilevel form.

Why It Matters

Take L1 hit time 4 cycles, L1 miss rate 5 percent, L2 hit time 12 cycles, L2 local miss rate 10 percent, L3 hit time 40 cycles, L3 local miss rate 50 percent, and DRAM penalty 200 cycles. From the L2 perspective, the average cost of going to L3 is 40 + 0.50 · 200 = 140 cycles. The average cost of an L1 miss is then 12 + 0.10 · 140 = 26 cycles. The full AMAT is 4 + 0.05 · 26 = 5.3 cycles. If the L1 miss rate rises to 30 percent, the AMAT becomes 4 + 0.30 · 26 = 11.8 cycles.
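
The recursion is short enough to compute directly. A sketch that reproduces these numbers bottom-up, using local miss rates at every level:

#include <stdio.h>

// amat(level) = hit_time + local_miss_rate * amat(next level)
int main(void) {
    double dram = 200.0;
    double l3 = 40.0 + 0.50 * dram;   // 140.0 cycles
    double l2 = 12.0 + 0.10 * l3;     // 26.0 cycles
    double l1 = 4.0  + 0.05 * l2;     // 5.3 cycles
    printf("L3=%.1f L2=%.1f L1=%.1f\n", l3, l2, l1);
    printf("L1 at 30%% miss rate: %.1f\n", 4.0 + 0.30 * l2);  // 11.8
    return 0;
}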

Failure Mode

Do not mix global and local miss rates. If 5 percent of all references miss L1 and 10 percent of L1 misses also miss L2, then the global L2 miss rate is 0.5 percent of all references, while the local L2 miss rate is 10 percent.

Common Confusions

Watch Out

A larger cache is not always faster

A larger cache often has a longer hit time because it needs more indexing, tag comparison, wiring, and energy per access. L1 is kept small so the common hit case stays near 4 cycles. L3 is larger but slower, so it is a filter before DRAM rather than a substitute for L1.

Watch Out

A cache miss does not always mean DRAM

An L1 miss can hit in L2. An L2 miss can hit in L3. Only an LLC miss reaches memory. Performance counters often report misses at several levels; read their definitions before computing AMAT.

Watch Out

Same cache line is not same variable

False sharing occurs at line granularity. Two independent counters 8 bytes apart can occupy the same 64-byte line. If different cores update them, coherence invalidations occur even though the source code has separate variables.

Exercises

Exercise · Core

Problem

A 64 KB, 4-way set-associative cache uses 64-byte lines and 32-bit byte addresses. Compute the number of sets and the tag, index, and offset bit counts. For address 0x00ABCDEF, compute offset, index, and tag.

Exercise · Core

Problem

Compute AMAT for L1 hit time 4 cycles, L1 miss rate 8 percent, L2 hit time 12 cycles, L2 local miss rate 25 percent, L3 hit time 40 cycles, L3 local miss rate 40 percent, and DRAM penalty 200 cycles.

Exercise · Advanced

Problem

A row-major double a[512][512] is scanned by columns. Assume 64-byte lines and that the cache cannot retain a line until the next use from the same row. How many useful bytes are consumed per fetched line, and what fraction of fetched bytes are useful?

References

Canonical:

  • Hennessy and Patterson, Computer Architecture: A Quantitative Approach (6th ed, 2017), ch. 1, §1.5 — quantitative performance metrics and CPI effects
  • Hennessy and Patterson, Computer Architecture: A Quantitative Approach (6th ed, 2017), ch. 2, §§2.1-2.6 — cache basics, performance, and memory hierarchy design
  • Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective (3rd ed, 2016), ch. 6, §§6.2-6.6 — locality, cache organization, and cache-conscious programming
  • Patterson and Hennessy, Computer Organization and Design: The Hardware/Software Interface (5th ed, 2014), ch. 5, §§5.1-5.8 — cache blocks, misses, write policy, and virtual memory interaction
  • Intel, Intel 64 and IA-32 Architectures Optimization Reference Manual (2024), ch. 2 and ch. 9 — cache hierarchy, prefetching, and memory-ordering performance details

Accessible:

  • Ulrich Drepper, What Every Programmer Should Know About Memory (2007)
  • Agner Fog, The microarchitecture of Intel, AMD and VIA CPUs
  • Wence, Cache Blocking and Matrix Transposition, Durham University performance engineering notes

Next Topics

  • /computationpath/virtual-memory
  • /computationpath/simd-vectorization
  • /computationpath/cpu-performance-counters
  • /computationpath/memory-consistency
  • /topics/asymptotic-analysis