
Memory · 35 min

Memory Allocation (Malloc, Free, Jemalloc)

How malloc and free build variable-size heap allocation on sbrk and mmap, from boundary tags to jemalloc and tcmalloc size classes.

Why This Matters

A call to malloc(24) often consumes 32 or 48 bytes of heap space after alignment and metadata. A server doing 1 million allocations per second can spend more time moving through allocator locks and cache lines than running the application code that asked for memory.

The allocator is the layer between typed C objects and virtual memory pages. It must return aligned chunks, reuse freed chunks, split and coalesce blocks, avoid unbounded fragmentation, and survive contention from many threads.

Core Definitions

Definition

Heap Block

A heap block is a contiguous region managed by an allocator. It usually contains a header, an optional footer, and a payload pointer returned to the program. If the returned pointer is p, allocator metadata often lives at addresses before p.

Definition

Alignment

An allocator returns a pointer whose address is a multiple of an alignment such as 8 or 16 bytes. If an object of size s must be rounded up to a multiple of the alignment a, the allocated payload size is a \lceil s/a \rceil.

Definition

Internal Fragmentation

Internal fragmentation is allocated space inside a returned block that the program did not request. If malloc(24) returns a 32-byte payload class, 8 bytes are internal slack before metadata is counted.

Definition

External Fragmentation

External fragmentation is free memory that exists but cannot satisfy a request because it is split into pieces. Four free 32-byte blocks cannot satisfy one 96-byte request without coalescing.

The Allocator Interface and Kernel Boundary

The C interface is small:

#include <stdlib.h>

void *p = malloc(24);
if (p == NULL) abort();

free(p);

The allocator must hand out a pointer that the caller can use for any object requiring the platform alignment. On many 64-bit ABIs, malloc returns at least 16-byte aligned pointers. The allocator also must know the size of a block at free(p) time, even though free receives no size argument. That is why production allocators store metadata near the payload or in side tables.

Allocators get larger virtual memory regions from the kernel. Historically, sbrk() moves the program break, extending one contiguous heap region.

#include <unistd.h>
#include <stdio.h>

int main(void) {
    void *a = sbrk(0);
    void *r = sbrk(4096);
    if (r == (void *)-1) return 1;
    void *b = sbrk(0);
    printf("old break %p, returned %p, new break %p\n", a, r, b);
    return 0;
}

A modern allocator also uses mmap() for page-granular mappings, often for large allocations or allocator-controlled arenas.

#include <sys/mman.h>
#include <unistd.h>

void *p = mmap(NULL, 1 << 20,
               PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS,
               -1, 0);
if (p == MAP_FAILED) _exit(1);

munmap(p, 1 << 20);

sbrk grows one break-managed region. mmap creates independent mappings that can be returned to the kernel with munmap. Real libc allocators combine both, but the common abstraction is the same: get pages from the kernel, subdivide them into chunks, then reuse chunks in user space without a syscall on every malloc.

Implicit Free Lists and Boundary Tags

An implicit free list scans all heap blocks, using the size field in each header to jump to the next block. A minimal 64-bit layout can pack size and allocation bits into one word because aligned sizes leave low bits unused.

Suppose the block header is 8 bytes, the footer is 8 bytes for free blocks, and sizes are multiples of 16. A 24-byte request rounds to a 32-byte payload. If allocated blocks carry only a header, the total block size is 40, rounded to 48.

Offset from block start | Bytes | Meaning
0                       | 8     | header, size 48 plus allocated bit
8                       | 32    | user payload, first 24 bytes requested
40                      | 8     | padding or allocator-reserved space

A common header encoding uses the low bit as the allocated flag.

Header value | Binary low bits | Meaning
0x31         | ...0001         | size 48, allocated
0x30         | ...0000         | size 48, free

Boundary tags add a footer containing the same size and status, usually for free blocks. When free runs, it can inspect the previous block footer and next block header to coalesce adjacent free blocks in constant time.

Consider three consecutive blocks:

Block | Address range  | Header | State
A     | 0x1000..0x102f | 0x31   | allocated, 48 bytes
B     | 0x1030..0x106f | 0x41   | allocated, 64 bytes
C     | 0x1070..0x10bf | 0x50   | free, 80 bytes

If free(B) runs, B becomes free. Its footer at 0x1068 stores 0x40. The allocator sees C is free by reading C's header at 0x1070, merges B and C, and writes a 144-byte free block:

Field         | Address | Value
merged header | 0x1030  | 0x90
merged footer | 0x10b8  | 0x90

The next implicit-list traversal jumps from 0x1030 to 0x10c0. Without the footer, finding the previous block would require a scan from the heap start or a side data structure.

Explicit Free Lists and Segregated Fits

An explicit free list stores pointers inside the payload area of free blocks. Allocated blocks do not pay those pointer fields. A doubly linked free block with 8-byte pointers might look like this:

Offset      | Bytes | Meaning
0           | 8     | header
8           | 8     | prev free pointer
16          | 8     | next free pointer
24          | ...   | unused free payload
end minus 8 | 8     | footer

A simple insert-at-front free path:

typedef struct freeblk {
    size_t header;
    struct freeblk *prev;
    struct freeblk *next;
} freeblk;

static freeblk *free_head;

static void insert_free(freeblk *b) {
    b->prev = NULL;
    b->next = free_head;
    if (free_head) free_head->prev = b;
    free_head = b;
}

Singly linked lists are smaller but make removal from the middle slower unless the allocator tracks the previous pointer during search. Doubly linked lists cost one extra word per free block and allow constant-time unlink once a block is known.

Search policy changes fragmentation and CPU cost. First-fit scans from the start and takes the first block large enough. Next-fit resumes from the previous search position. Best-fit scans more blocks and takes the smallest adequate block.

With free block sizes [32, 80, 48, 128] and a 40-byte rounded request:

Policy                | Chosen block | Remainder
first-fit             | 80           | 40
best-fit              | 48           | 8
next-fit from index 3 | 128          | 88

Best-fit reduces immediate leftover space here, but an 8-byte remainder may be too small to become a valid free block. First-fit often costs fewer loads because it stops earlier. Next-fit can avoid repeatedly scanning the same hot prefix, but it can also skip an adequate block and increase fragmentation.

Segregated fits keep multiple free lists by size class, for example 16, 32, 48, 64, 80, 96, 128, then powers of two. A 40-byte rounded request searches the 48-byte bin first. If it is empty, the allocator searches a larger bin and splits a block if the remainder can form a valid free block.

Fragmentation With Numbers

Assume headers are 8 bytes, allocated payloads round to 16 bytes, and no footer is stored for allocated blocks.

Three requests have user sizes 1, 24, and 33 bytes.

User request | Rounded payload | Block size | Internal slack
1            | 16              | 24         | 15
24           | 32              | 40         | 8
33           | 48              | 56         | 15

User data totals 58 bytes. Allocator block space totals 120 bytes. Payload slack totals 38 bytes, and headers total 24 bytes. The utilization over allocator-managed block space is 58/120 \approx 0.483.

External fragmentation appears after frees. Suppose the heap contains:

Block | Size | State
A     | 64   | free
B     | 64   | allocated
C     | 64   | free
D     | 64   | allocated

Total free memory is 128 bytes, but a 96-byte request fails unless the allocator obtains more memory from the kernel. If B is freed, coalescing A, B, and C creates one 192-byte free block and the request can be satisfied.

A useful allocator metric is peak utilization:

U_k = \frac{\max_{i \leq k} P_i}{H_k}

Here P_i is live payload bytes after operation i, and H_k is the heap size obtained from the kernel by step k. A trace with peak live payload 600 KiB and heap size 1 MiB has U = 600/1024 \approx 0.586.

ptmalloc2, jemalloc, tcmalloc, and Slabs

glibc's default allocator descends from ptmalloc2, itself derived from Doug Lea's allocator. It uses chunks with metadata, bins for free chunks, and multiple arenas. Per-thread arenas reduce one global lock bottleneck, but memory freed in one arena is not freely reusable by another arena without allocator-specific transfer paths. Under many threads, arena-local free memory can raise total resident memory even when total live payload is steady.

jemalloc and tcmalloc use size classes and thread-local or CPU-local caches. A small allocation such as 24 bytes maps to a class, often 32 bytes. The fast path pops one object from a per-thread cache without taking a central heap lock. When the cache is empty, it refills a batch from a central structure. When the cache grows too large, it returns a batch.

A simplified 32-byte size-class run over one 4096-byte page:

Item                    | Value
page size               | 4096 bytes
run header bitmap       | 64 bytes
usable object area      | 4032 bytes
32-byte objects per run | 126

The allocator can track which slots are live with a bitmap. Slot 5 starts at page + 64 + 5 * 32, so its offset is 224. Freeing a slot clears bit 5. Reallocating pops the next zero bit.

tcmalloc's classic design separates thread caches, central free lists, and page heap spans. jemalloc uses arenas, runs or extents, bins, and per-thread caches. Details differ, but the invariant is shared: small objects are grouped by size class, and hot paths avoid one process-wide lock.

Kernel slab allocators solve a related but narrower problem. The kernel repeatedly allocates fixed-size objects such as inodes, dentries, task structures, and socket buffers. A slab cache stores already constructed objects of one type, often with per-CPU magazines. For a 192-byte object in an 8192-byte slab with a 128-byte slab header, at most \lfloor (8192 - 128)/192 \rfloor = 42 objects fit, leaving 0 bytes of slack because 42 \cdot 192 = 8064.

Key Result

General-purpose allocation is a constrained online packing problem. The allocator sees a sequence of requests and frees without knowing the future.

Two invariants matter most:

\text{block size} = \text{header bytes} + \text{rounded payload bytes} + \text{optional footer bytes}

\text{rounded payload} = a \left\lceil \frac{\text{requested bytes}}{a} \right\rceil

No online allocator can always choose the globally best placement for future requests. A best-fit placement that minimizes leftover bytes now can create unusable splinters. A first-fit placement that is cheap now can leave large holes at the end unused until coalescing.

For production servers, the design target is not only low fragmentation. It is the combination of bounded metadata cost, predictable fast paths, cache locality, and low lock contention. That is why jemalloc and tcmalloc make small allocations look like indexed operations on size-class caches rather than linked-list searches over the whole heap.

Common Confusions

Watch Out

Freeing memory does not always return pages to the OS

free(p) returns the block to the allocator. The allocator may keep its pages mapped for later allocations. A process's resident set size can stay high after a large phase ends, especially when free chunks are split across arenas or cannot form whole pages.

Watch Out

Double free is not the same bug as use-after-free

A double free calls free twice on the same live allocation identity. A use-after-free reads or writes through a pointer after the allocation has ended. A program can have one without the other.

Watch Out

Best-fit does not mean lowest real memory use

Best-fit can create small remainders that cannot hold allocator metadata and minimum payload. If the minimum free block is 32 bytes, splitting 80 bytes for a 56-byte block leaves 24 bytes, which must be absorbed as internal waste.

Watch Out

An off-by-one heap write can corrupt metadata

Writing one byte beyond a payload can change the low byte of the next chunk header. In allocators that store size and flags there, this can turn a valid size into a false size or alter the allocation bit.

Exercises

Exercise · Core

Problem

An allocator uses 8-byte headers, 8-byte footers only for free blocks, and 16-byte payload alignment. What block size does malloc(25) consume when allocated? How many bytes are internal slack, not counting the header?

Exercise · Core

Problem

A heap has four consecutive blocks with sizes and states: 32 allocated, 48 free, 64 free, 80 allocated. Headers and footers encode only the total block size plus an allocated bit. After coalescing all adjacent free blocks, list the resulting block sequence.

Exercise · Advanced

Problem

For free list sizes [96, 40, 72, 128], a 64-byte request arrives. The minimum splittable remainder is 32 bytes. Which block is chosen by first-fit and best-fit, and what free block remains after splitting?

References

Canonical:

  • Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau, Operating Systems: Three Easy Pieces (2018), ch. 13-23, virtual memory mechanisms, address spaces, paging, and free-space management
  • Randal E. Bryant and David R. O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd ed. (2016), ch. 9, dynamic memory allocation and implicit free lists
  • Donald E. Knuth, The Art of Computer Programming, Volume 1: Fundamental Algorithms, 3rd ed. (1997), §2.5, storage allocation and boundary-tag methods
  • Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles, "Dynamic Storage Allocation: A Survey and Critical Review" (1995), survey sections on placement, splitting, coalescing, and fragmentation
  • Marshall K. McKusick and George V. Neville-Neil, The Design and Implementation of the FreeBSD Operating System, 2nd ed. (2014), ch. 7, kernel memory allocation and UMA slab allocation
  • Daniel P. Bovet and Marco Cesati, Understanding the Linux Kernel, 3rd ed. (2005), ch. 8, slab allocator and kernel memory management

Accessible:

  • Doug Lea, "A Memory Allocator", notes on dlmalloc design and boundary tags
  • Jason Evans, "A Scalable Concurrent malloc(3) Implementation for FreeBSD" (2006), jemalloc design paper
  • Sanjay Ghemawat and Paul Menage, "TCMalloc: Thread-Caching Malloc", public design notes

Next Topics

  • /computationpath/virtual-memory
  • /computationpath/cache-locality
  • /computationpath/memory-safety
  • /computationpath/concurrency-locks