Why This Matters
A call to malloc(24) often consumes 32 or 48 bytes of heap space after alignment and metadata. A server doing 1 million allocations per second can spend more time moving through allocator locks and cache lines than running the application code that asked for memory.
The allocator is the layer between typed C objects and virtual memory pages. It must return aligned chunks, reuse freed chunks, split and coalesce blocks, avoid unbounded fragmentation, and survive contention from many threads.
Core Definitions
Heap Block
A heap block is a contiguous region managed by an allocator. It usually contains a header, an optional footer, and a payload pointer returned to the program. If the returned pointer is p, allocator metadata often lives at addresses just before p.
Alignment
An allocator returns a pointer whose address is a multiple of an alignment such as 8 or 16 bytes. If an object of size n must be rounded up to a multiple of the alignment a, the allocated payload size is ⌈n/a⌉ · a.
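The rounding rule can be sketched as a small helper. The name `round_up` is hypothetical, and the bit trick assumes the alignment is a power of two, as malloc alignments are.

```c
#include <stddef.h>

/* Round n up to the next multiple of align.
   Assumes align is a power of two, as malloc alignments are. */
static size_t round_up(size_t n, size_t align) {
    return (n + align - 1) & ~(align - 1);
}
```

For example, round_up(24, 16) yields 32, the rounded payload used throughout this section.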
Internal Fragmentation
Internal fragmentation is allocated space inside a returned block that the program did not request. If malloc(24) returns a 32-byte payload class, 8 bytes are internal slack before metadata is counted.
External Fragmentation
External fragmentation is free memory that exists but cannot satisfy a request because it is split into pieces. Four free 32-byte blocks cannot satisfy one 96-byte request without coalescing.
The Allocator Interface and Kernel Boundary
The C interface is small:
#include <stdlib.h>
void *p = malloc(24);
if (p == NULL) abort();
free(p);
The allocator must hand out a pointer that the caller can use for any object requiring the platform alignment. On many 64-bit ABIs, malloc returns at least 16-byte aligned pointers. The allocator also must know the size of a block at free(p) time, even though free receives no size argument. That is why production allocators store metadata near the payload or in side tables.
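One way to see why metadata must live near the payload is a toy wrapper over the real malloc that hides the size one aligned slot before the pointer it returns. The `toy_*` names are hypothetical, and the 16-byte slot is an assumption chosen to preserve 16-byte alignment.

```c
#include <stdlib.h>
#include <stdint.h>

/* Sketch: store the requested size one 16-byte slot before the payload,
   so a size-less free() can recover it. Hypothetical toy_* names. */
static void *toy_malloc(size_t n) {
    uint8_t *raw = malloc(n + 16);   /* extra slot for the hidden header */
    if (!raw) return NULL;
    *(size_t *)raw = n;              /* hidden size header */
    return raw + 16;                 /* payload pointer handed to caller */
}

static size_t toy_usable_size(void *p) {
    return *(size_t *)((uint8_t *)p - 16);   /* read the header back */
}

static void toy_free(void *p) {
    if (p) free((uint8_t *)p - 16);          /* step back to the real start */
}
```

Production allocators use the same idea with packed headers or side tables rather than a full word per block.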
Allocators get larger virtual memory regions from the kernel. Historically, sbrk() moves the program break, extending one contiguous heap region.
#include <unistd.h>
#include <stdio.h>
int main(void) {
    void *a = sbrk(0);      /* current program break */
    void *r = sbrk(4096);   /* grow by one page; sbrk returns the old break */
    void *b = sbrk(0);      /* new program break */
    printf("old break %p, returned %p, new break %p\n", a, r, b);
}
A modern allocator also uses mmap() for page-granular mappings, often for large allocations or allocator-controlled arenas.
#include <sys/mman.h>
#include <unistd.h>
/* Map 1 MiB of anonymous, private, read-write pages. */
void *p = mmap(NULL, 1 << 20,
               PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS,
               -1, 0);
if (p == MAP_FAILED) _exit(1);
munmap(p, 1 << 20);
sbrk grows one break-managed region. mmap creates independent mappings that can be returned to the kernel with munmap. Real libc allocators combine both, but the common abstraction is the same: get pages from the kernel, subdivide them into chunks, then reuse chunks in user space without a syscall on every malloc.
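A minimal sketch of the subdivide-in-user-space idea is a bump allocator that carves aligned chunks out of one arena. Here a static buffer stands in for an mmap'd region, and `arena_alloc` is a hypothetical name; a real allocator would also support freeing and reuse.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: carve 16-byte-aligned chunks out of one arena with no syscall
   per call. A static buffer stands in for an mmap'd region. */
static _Alignas(16) uint8_t arena[1 << 16];
static size_t arena_off;

static void *arena_alloc(size_t n) {
    size_t need = (n + 15) & ~(size_t)15;       /* keep 16-byte alignment */
    if (arena_off + need > sizeof arena) return NULL;
    void *p = arena + arena_off;
    arena_off += need;
    return p;
}
```

Two consecutive 24-byte requests land 32 bytes apart, showing the alignment rounding at work.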
Implicit Free Lists and Boundary Tags
An implicit free list scans all heap blocks, using the size field in each header to jump to the next block. A minimal 64-bit layout can pack size and allocation bits into one word because aligned sizes leave low bits unused.
Suppose the block header is 8 bytes, the footer is 8 bytes for free blocks, and sizes are multiples of 16. A 24-byte request rounds to a 32-byte payload. If allocated blocks carry only a header, the total block size is 40, rounded to 48.
| Offset from block start | Bytes | Meaning |
|---|---|---|
| 0 | 8 | header, size 48 plus allocated bit |
| 8 | 32 | user payload, first 24 bytes requested |
| 40 | 8 | padding or allocator-reserved space |
A common header encoding uses the low bit as the allocated flag.
| Header value | Binary low bits | Meaning |
|---|---|---|
| 0x31 | ...0001 | size 48, allocated |
| 0x30 | ...0000 | size 48, free |
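The encoding in the table can be captured by small pack/unpack helpers. The names are hypothetical, and the mask assumes sizes are multiples of 16, so the low four bits are free for flags.

```c
#include <stdint.h>

/* Pack size and allocated flag into one header word. Sizes are
   multiples of 16, so the low 4 bits are unused; bit 0 is the flag. */
static uint64_t hdr_pack(uint64_t size, int allocated) {
    return size | (allocated ? 1u : 0u);
}
static uint64_t hdr_size(uint64_t h)  { return h & ~(uint64_t)0xf; }
static int      hdr_alloc(uint64_t h) { return (int)(h & 1); }
```

hdr_pack(48, 1) produces 0x31 and hdr_pack(48, 0) produces 0x30, matching the table.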
Boundary tags add a footer containing the same size and status, usually for free blocks. When free runs, it can inspect the previous block footer and next block header to coalesce adjacent free blocks in constant time.
Consider three consecutive blocks:
| Block | Address range | Header | State |
|---|---|---|---|
| A | 0x1000..0x102f | 0x31 | allocated, 48 bytes |
| B | 0x1030..0x106f | 0x41 | allocated, 64 bytes |
| C | 0x1070..0x10bf | 0x50 | free, 80 bytes |
If free(B) runs, B becomes free. Its footer at 0x1068 stores 0x40. The allocator sees C is free by reading C's header at 0x1070, merges B and C, and writes a 144-byte free block:
| Field | Address | Value |
|---|---|---|
| merged header | 0x1030 | 0x90 |
| merged footer | 0x10b8 | 0x90 |
The next implicit-list traversal jumps from 0x1030 to 0x10c0. Without the footer, finding the previous block would require a scan from the heap start or a side data structure.
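The merge above can be sketched against a flat byte array standing in for the heap. This is a simplified model that only merges with a free successor (a full allocator would also check the predecessor via its footer), and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 8-byte header/footer words over a byte-array heap; bit 0 = allocated. */
static uint64_t word(const uint8_t *heap, size_t off) {
    uint64_t w; memcpy(&w, heap + off, 8); return w;
}
static void put(uint8_t *heap, size_t off, uint64_t w) {
    memcpy(heap + off, &w, 8);
}

/* Free the block whose header sits at `off`, merging with a free
   successor found via this block's size field. */
static void free_and_merge_next(uint8_t *heap, size_t off) {
    uint64_t size = word(heap, off) & ~(uint64_t)0xf;
    uint64_t next = word(heap, off + size);
    if ((next & 1) == 0)                    /* next block is free */
        size += next & ~(uint64_t)0xf;
    put(heap, off, size);                   /* merged header, free */
    put(heap, off + size - 8, size);        /* merged footer */
}
```

Replaying the B/C example (64-byte allocated B, 80-byte free C) yields the 0x90 header and footer shown in the table.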
Explicit Free Lists and Segregated Fits
An explicit free list stores pointers inside the payload area of free blocks. Allocated blocks do not pay those pointer fields. A doubly linked free block with 8-byte pointers might look like this:
| Offset | Bytes | Meaning |
|---|---|---|
| 0 | 8 | header |
| 8 | 8 | prev free pointer |
| 16 | 8 | next free pointer |
| 24 | ... | unused free payload |
| end minus 8 | 8 | footer |
A simple insert-at-front free path:
typedef struct freeblk {
    size_t header;              /* size and status bits */
    struct freeblk *prev;       /* previous free block */
    struct freeblk *next;       /* next free block */
} freeblk;

static freeblk *free_head;

/* Push a freshly freed block onto the front of the list. */
static void insert_free(freeblk *b) {
    b->prev = NULL;
    b->next = free_head;
    if (free_head) free_head->prev = b;
    free_head = b;
}
Singly linked lists are smaller but make removal from the middle slower unless the allocator tracks the previous pointer during search. Doubly linked lists cost one extra word per free block and allow constant-time unlink once a block is known.
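Constant-time unlink on the doubly linked layout looks like the following sketch; `unlink_free` is a hypothetical helper that also fixes up the list head when the first block is removed.

```c
#include <stddef.h>

/* Doubly linked free-list node matching the layout above. */
typedef struct node {
    size_t header;
    struct node *prev, *next;
} node;

/* Constant-time unlink once the block pointer is known. */
static void unlink_free(node **head, node *b) {
    if (b->prev) b->prev->next = b->next;
    else         *head = b->next;          /* b was the first block */
    if (b->next) b->next->prev = b->prev;
}
```

No search is needed: both neighbors are reachable from the block itself, which is the payoff of the extra pointer word.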
Search policy changes fragmentation and CPU cost. First-fit scans from the start and takes the first block large enough. Next-fit resumes from the previous search position. Best-fit scans more blocks and takes the smallest adequate block.
With free block sizes [32, 80, 48, 128] and a 40-byte rounded request:
| Policy | Chosen block | Remainder |
|---|---|---|
| first-fit | 80 | 40 |
| best-fit | 48 | 8 |
| next-fit from index 3 | 128 | 88 |
Best-fit reduces immediate leftover space here, but an 8-byte remainder may be too small to become a valid free block. First-fit often costs fewer loads because it stops earlier. Next-fit can avoid repeatedly scanning the same hot prefix, but it can also skip an adequate block and increase fragmentation.
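First-fit and best-fit can be sketched over an array of free-block sizes standing in for a list walk; the choices in the table above fall out directly.

```c
#include <stddef.h>

/* Return the index of the chosen free block, or -1 if none fits. */
static int first_fit(const size_t *sizes, int n, size_t need) {
    for (int i = 0; i < n; i++)
        if (sizes[i] >= need) return i;     /* stop at the first match */
    return -1;
}

static int best_fit(const size_t *sizes, int n, size_t need) {
    int best = -1;
    for (int i = 0; i < n; i++)             /* scan everything */
        if (sizes[i] >= need && (best < 0 || sizes[i] < sizes[best]))
            best = i;
    return best;
}
```

With sizes [32, 80, 48, 128] and a 40-byte request, first-fit picks index 1 (the 80-byte block) and best-fit picks index 2 (the 48-byte block).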
Segregated fits keep multiple free lists by size class, for example 16, 32, 48, 64, 80, 96, 128, then powers of two. A 40-byte rounded request searches the 48-byte bin first. If it is empty, the allocator searches a larger bin and splits a block if the remainder can form a valid free block.
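A size-class mapping in this spirit might look like the following sketch. It uses every multiple of 16 up to 128 and then powers of two, a simplification of the example bin list, and the name is hypothetical.

```c
#include <stddef.h>

/* Map a request size to its size class: multiples of 16 up to 128,
   then powers of two. Simplified model of segregated-fit bins. */
static size_t size_class(size_t n) {
    if (n <= 128)
        return (n + 15) & ~(size_t)15;      /* 16, 32, ..., 128 */
    size_t c = 256;
    while (c < n) c <<= 1;                  /* 256, 512, 1024, ... */
    return c;
}
```

A 40-byte request maps to the 48-byte class, so the allocator probes that bin first.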
Fragmentation With Numbers
Assume headers are 8 bytes, allocated payloads round to 16 bytes, and no footer is stored for allocated blocks.
Three requests have user sizes 1, 24, and 33 bytes.
| User request | Rounded payload | Block size | Internal slack |
|---|---|---|---|
| 1 | 16 | 24 | 15 |
| 24 | 32 | 40 | 8 |
| 33 | 48 | 56 | 15 |
User data totals 58 bytes. Allocator block space totals 120 bytes. Payload slack totals 38 bytes, and headers total 24 bytes. The utilization over allocator-managed block space is 58 / 120 ≈ 0.48.
External fragmentation appears after frees. Suppose the heap contains:
| Block | Size | State |
|---|---|---|
| A | 64 | free |
| B | 64 | allocated |
| C | 64 | free |
| D | 64 | allocated |
Total free memory is 128 bytes, but a 96-byte request fails unless the allocator obtains more memory from the kernel. If B is freed, coalescing A, B, and C creates one 192-byte free block and the request can be satisfied.
A useful allocator metric is peak utilization:

U_k = (max over i ≤ k of P_i) / H_k

Here P_i is live payload bytes after operation i, and H_k is the heap size obtained from the kernel by step k. A trace with peak live payload 600 KiB and heap size 1 MiB has U = 600 / 1024 ≈ 0.59.
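Peak utilization can be computed over a trace of live-payload samples. A sketch, returning per-mille values to stay in integer arithmetic (the function name is hypothetical):

```c
#include <stddef.h>

/* Peak utilization over k samples of live payload against the heap
   size obtained from the kernel, scaled by 1000 to stay integral. */
static size_t peak_util_permille(const size_t *live, int k, size_t heap) {
    size_t peak = 0;
    for (int i = 0; i < k; i++)
        if (live[i] > peak) peak = live[i];     /* max over i <= k */
    return heap ? peak * 1000 / heap : 0;
}
```

With a peak of 600 KiB against a 1024 KiB heap, this reports 585 per mille, matching the roughly 0.59 figure.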
ptmalloc2, jemalloc, tcmalloc, and Slabs
glibc's default allocator descends from ptmalloc2, itself derived from Doug Lea's allocator. It uses chunks with metadata, bins for free chunks, and multiple arenas. Per-thread arenas reduce one global lock bottleneck, but memory freed in one arena is not freely reusable by another arena without allocator-specific transfer paths. Under many threads, arena-local free memory can raise total resident memory even when total live payload is steady.
jemalloc and tcmalloc use size classes and thread-local or CPU-local caches. A small allocation such as 24 bytes maps to a class, often 32 bytes. The fast path pops one object from a per-thread cache without taking a central heap lock. When the cache is empty, it refills a batch from a central structure. When the cache grows too large, it returns a batch.
A simplified 32-byte size-class run over one 4096-byte page:
| Item | Bytes |
|---|---|
| page size | 4096 |
| run header bitmap | 64 |
| usable object area | 4032 |
| objects of 32 bytes | 126 |
The allocator can track which slots are live with a bitmap. Slot 5 starts at page + 64 + 5 * 32, so its offset is 224. Freeing a slot clears bit 5. Reallocating pops the next zero bit.
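The bitmap bookkeeping can be sketched for the 126-slot run above. Two 64-bit words cover all slots, and the helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* 126-slot run tracked with two 64-bit bitmap words.
   Bit i set means slot i is live. */
static int bitmap_alloc(uint64_t bits[2]) {
    for (int w = 0; w < 2; w++) {
        if (~bits[w]) {                         /* some zero bit exists */
            for (int b = 0; b < 64; b++) {
                int slot = w * 64 + b;
                if (slot >= 126) return -1;     /* only 126 valid slots */
                if (!(bits[w] >> b & 1)) {
                    bits[w] |= (uint64_t)1 << b;
                    return slot;                /* lowest free slot */
                }
            }
        }
    }
    return -1;                                  /* run is full */
}

static void bitmap_free(uint64_t bits[2], int slot) {
    bits[slot / 64] &= ~((uint64_t)1 << (slot % 64));
}

static size_t slot_offset(int slot) {
    return 64 + (size_t)slot * 32;              /* header bitmap, then objects */
}
```

Freeing slot 0 and allocating again reuses the lowest zero bit, and slot 5 sits at offset 224 as computed in the text.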
tcmalloc's classic design separates thread caches, central free lists, and page heap spans. jemalloc uses arenas, runs or extents, bins, and per-thread caches. Details differ, but the invariant is shared: small objects are grouped by size class, and hot paths avoid one process-wide lock.
Kernel slab allocators solve a related but narrower problem. The kernel repeatedly allocates fixed-size objects such as inodes, dentries, task structures, and socket buffers. A slab cache stores already constructed objects of one type, often with per-CPU magazines. For a 192-byte object in an 8192-byte slab with a 128-byte slab header, at most ⌊(8192 − 128) / 192⌋ = 42 objects fit, leaving 0 bytes because 42 · 192 = 8064 = 8192 − 128.
Key Result
General-purpose allocation is a constrained online packing problem. The allocator sees a sequence of requests and frees without knowing the future.
No online allocator can always choose the globally best placement for future requests. Two tensions matter most:
- A best-fit placement that minimizes leftover bytes now can create unusable splinters later.
- A first-fit placement that is cheap now can leave large holes at the end unused until coalescing.
For production servers, the design target is not only low fragmentation. It is the combination of bounded metadata cost, predictable fast paths, cache locality, and low lock contention. That is why jemalloc and tcmalloc make small allocations look like indexed operations on size-class caches rather than linked-list searches over the whole heap.
Common Confusions
Freeing memory does not always return pages to the OS
free(p) returns the block to the allocator. The allocator may keep its pages mapped for later allocations. A process's resident set size can stay high after a large phase ends, especially when free chunks are split across arenas or cannot form whole pages.
Double free is not the same bug as use-after-free
A double free calls free twice on the same live allocation identity. A use-after-free reads or writes through a pointer after the allocation has ended. A program can have one without the other.
Best-fit does not mean lowest real memory use
Best-fit can create small remainders that cannot hold allocator metadata and minimum payload. If the minimum free block is 32 bytes, splitting 80 bytes for a 56-byte block leaves 24 bytes, which must be absorbed as internal waste.
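The split decision reduces to one comparison. A sketch with hypothetical names, where returning 0 means the whole block is handed out and the remainder is absorbed as internal waste:

```c
#include <stddef.h>

/* Leftover usable as a new free block after splitting `block` for a
   `need`-byte allocation, or 0 when the whole block must be used. */
static size_t split_remainder(size_t block, size_t need, size_t min_free) {
    if (block < need) return 0;         /* cannot satisfy at all */
    size_t rem = block - need;
    return rem >= min_free ? rem : 0;   /* too small to stand alone */
}
```

Splitting 80 bytes for a 56-byte block with a 32-byte minimum leaves a 24-byte remainder that cannot stand alone, so the caller absorbs it.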
An off-by-one heap write can corrupt metadata
Writing one byte beyond a payload can change the low byte of the next chunk header. In allocators that store size and flags there, this can turn a valid size into a false size or alter the allocation bit.
Exercises
Problem
An allocator uses 8-byte headers, 8-byte footers only for free blocks, and 16-byte payload alignment. What block size does malloc(25) consume when allocated? How many bytes are internal slack, not counting the header?
Problem
A heap has four consecutive blocks with sizes and states: 32 allocated, 48 free, 64 free, 80 allocated. Headers and footers encode only the total block size plus an allocated bit. After coalescing all adjacent free blocks, list the resulting block sequence.
Problem
For free list sizes [96, 40, 72, 128], a 64-byte request arrives. The minimum splittable remainder is 32 bytes. Which block is chosen by first-fit and best-fit, and what free block remains after splitting?
References
Canonical:
- Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau, Operating Systems: Three Easy Pieces (2018), ch. 13-23, virtual memory mechanisms, address spaces, paging, and free-space management
- Randal E. Bryant and David R. O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd ed. (2016), ch. 9, dynamic memory allocation and implicit free lists
- Donald E. Knuth, The Art of Computer Programming, Volume 1: Fundamental Algorithms, 3rd ed. (1997), §2.5, storage allocation and boundary-tag methods
- Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles, "Dynamic Storage Allocation: A Survey and Critical Review" (1995), survey sections on placement, splitting, coalescing, and fragmentation
- Marshall K. McKusick and George V. Neville-Neil, The Design and Implementation of the FreeBSD Operating System, 2nd ed. (2014), ch. 7, kernel memory allocation and UMA slab allocation
- Daniel P. Bovet and Marco Cesati, Understanding the Linux Kernel, 3rd ed. (2005), ch. 8, slab allocator and kernel memory management
Accessible:
- Doug Lea, "A Memory Allocator", notes on dlmalloc design and boundary tags
- Jason Evans, "A Scalable Concurrent malloc(3) Implementation for FreeBSD" (2006), jemalloc design paper
- Sanjay Ghemawat and Paul Menage, "TCMalloc: Thread-Caching Malloc", public design notes
Next Topics
- /computationpath/virtual-memory
- /computationpath/cache-locality
- /computationpath/memory-safety
- /computationpath/concurrency-locks