Operating Systems · 35 min

System Calls

The user/kernel boundary. How syscalls switch privilege, pass arguments in registers, and route file, process, memory, and synchronization operations through the kernel.

Why This Matters

A single write(1, "x\n", 2) on Linux crosses from user mode into kernel mode, checks the file descriptor table, copies 2 bytes from the process address space, and calls a device or filesystem path. At 3.5 GHz, a 300 ns syscall consumes about 1050 cycles before the disk, terminal, socket, or pipe does any work.

Neural network serving code calls read, write, mmap, futex, epoll_wait, and clock_gettime even when most tensor math runs in user space or on accelerators. The boundary sets the minimum cost of request handling, model loading, memory mapping, logging, and thread coordination.

Core Definitions

Definition

User mode

User mode is the processor privilege level used by ordinary process code. On x86 this is normally ring 3, with current privilege level CPL = 3. On ARMv8-A this is exception level EL0. User code cannot execute privileged instructions, program page tables directly, or access kernel-only virtual addresses.

Definition

Kernel mode

Kernel mode is the privilege level used by the operating system kernel. On x86 this is normally ring 0, with CPL = 0. On ARMv8-A this is EL1 for a non-hypervisor kernel. Kernel code can configure address translation, schedule threads, handle interrupts, and access kernel mappings.

Definition

System call

A system call is a controlled entry from user code to kernel code. It passes a syscall number and arguments in an ABI-defined form, switches privilege, runs a kernel handler, and returns a result or negative error code.

Definition

vDSO

The virtual dynamic shared object is a small kernel-provided shared library mapped into every user process. It contains user-mode code for operations such as clock_gettime, which can run without a privilege transition when the kernel exposes enough read-only state.

User Mode, Kernel Mode, and the Trap Boundary

The CPU enforces the boundary. A user process may load and store its own mapped pages, execute arithmetic, branch, call library functions, and issue a syscall instruction. It may not write control registers, update page tables, mask interrupts, or read arbitrary physical memory.

On x86-64 Linux, normal process code runs at ring 3. Kernel code runs at ring 0. The lower ring number has more privilege. The current privilege level is derived from the code segment selector, even in long mode where segmentation is mostly disabled.

On ARMv8-A, user applications run at EL0. The kernel runs at EL1. Hypervisors use EL2, and secure monitor firmware may use EL3. A Linux syscall from an application is normally svc #0, which raises an exception from EL0 into an EL1 vector.

A toy memory map makes the protection rule concrete.

Process virtual address space on a typical 48-bit x86-64 Linux system

0x0000000000400000  user text        r-x  /bin/demo
0x0000000000600000  user data        rw-  /bin/demo
0x00007ffff7dd0000  libc.so          r-x
0x00007ffffffde000  user stack       rw-
0xffff888000000000  kernel direct map rw-  supervisor only
0xffffffff81000000  kernel text      r-x  supervisor only

If user code dereferences 0xffffffff81000000, the page-table user/supervisor bit blocks the load and the CPU raises a page fault. The fault handler runs in the kernel, but the process does not get the data. The kernel may send SIGSEGV.

Every read and write goes through the kernel because file descriptors name kernel objects. Descriptor 1 is not a pointer to a terminal. It is an index into the process file descriptor table, which points to a kernel struct file, which points to filesystem, pipe, socket, or device operations.

process fd table

fd 0 -> struct file for terminal input
fd 1 -> struct file for terminal output
fd 2 -> struct file for terminal error
fd 3 -> struct file for model.bin

The x86-64 Syscall ABI

Linux on x86-64 uses the syscall instruction for the fast syscall path. User code places the syscall number in rax. It places up to 6 arguments in registers.

rax  syscall number
rdi  argument 1
rsi  argument 2
rdx  argument 3
r10  argument 4
r8   argument 5
r9   argument 6

The fourth argument uses r10, not rcx, because syscall itself overwrites rcx with the user return instruction pointer. It also overwrites r11 with saved flags. The kernel return path uses sysret or iretq depending on the case.

The instruction-level transition is precise. syscall loads the kernel entry instruction pointer from the model-specific register IA32_LSTAR, saves the next user RIP in RCX, saves RFLAGS in R11, masks selected flags, and changes the code privilege to ring 0. The instruction does not push a trap frame and does not by itself load a fresh stack pointer. Linux entry code uses per-CPU state to move execution onto the kernel stack before running ordinary C handlers.

A minimal write without libc shows the register contract.

.intel_syntax noprefix
.global _start
_start:
    mov rax, 1          # __NR_write on x86-64 Linux
    mov rdi, 1          # fd 1
    lea rsi, [rip + msg]
    mov rdx, 3          # byte count
    syscall             # returns bytes written or -errno in rax

    mov rax, 60         # __NR_exit
    xor rdi, rdi
    syscall

msg:
    .ascii "hi\n"

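Before syscall, the relevant state is as follows. The address in RSI is illustrative; the real value depends on where the assembler and linker place msg.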

RAX = 0x0000000000000001
RDI = 0x0000000000000001
RSI = 0x000000000040101b
RDX = 0x0000000000000003

memory at 0x40101b:
68 69 0a
 h  i \n

After a successful call, RAX = 3. If the file descriptor is invalid, the kernel returns -EBADF, which is -9. The raw register then contains 0xfffffffffffffff7. The C library wrapper converts that to -1 and sets errno = EBADF.

A raw C wrapper for a 3-argument syscall looks like this.

#include <stdint.h>

static inline long raw_syscall3(long nr, long a1, long a2, long a3) {
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)
        : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
        : "rcx", "r11", "memory"
    );
    return ret;
}

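The clobber list matters. The syscall instruction itself overwrites rcx and r11, so the compiler must be told not to keep live values in them. The "memory" clobber tells the compiler that the kernel may read or write memory through the pointer arguments.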

ARM SVC and the Same Contract

On AArch64 Linux, user code places the syscall number in x8 and arguments in x0 through x5. It executes svc #0. Return value comes back in x0.

.global _start
_start:
    mov x8, #64          // __NR_write on AArch64 Linux
    mov x0, #1           // fd 1
    adr x1, msg
    mov x2, #3
    svc #0

    mov x8, #93          // __NR_exit
    mov x0, #0
    svc #0

msg:
    .ascii "hi\n"

svc records exception syndrome information, saves the return address in ELR_EL1, saves processor state in SPSR_EL1, changes from EL0 to EL1, and jumps to the configured exception vector. Linux then dispatches by syscall number.

The programming model is the same across architectures. User code cannot jump to an arbitrary kernel address. It enters through a CPU-defined gate, with registers arranged by ABI.

Common Syscalls in ML and Systems Programs

The common calls fall into a small set.

read(fd, buf, count)                           copy bytes from kernel object to user memory
write(fd, buf, count)                          copy bytes from user memory to kernel object
open(path, flags), openat(dirfd, path, flags)  create an open file description and return an fd
close(fd)                                      drop one fd table reference
mmap(addr, len, prot, flags, fd, off)          map file or anonymous memory
brk(addr)                                      move the process heap break
fork()                                         create a child process
execve(path, argv, envp)                       replace the current program image
wait4(pid, status, options, rusage)            wait for child state change
futex(uaddr, op, val, timeout, uaddr2, val3)   block or wake using a user word

A model loader often uses openat, fstat, mmap, madvise, and close. With a 4096-byte page size, a 7 GiB weight file spans 1,835,008 pages, but mapping it does not copy 7 GiB at mmap time. The call records a virtual memory area, and page-table entries are filled in on demand as pages fault in.

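#include <fcntl.h>      // open, O_RDONLY
#include <sys/mman.h>   // mmap, PROT_READ, MAP_PRIVATE
int fd = open("weights.bin", O_RDONLY);                                    // often openat under libc
const unsigned char *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);  // maps the first page only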

If the file begins with bytes 7f 45 4c 46 02 01 01 00, the first user load from p[0] triggers a page fault if the page is not resident. The kernel reads or locates the file page, maps it into the process, and restarts the load. Then the process sees byte 0x7f.

futex is the syscall behind many mutex and condition-variable slow paths. The uncontended path is user-space atomic instructions. The contended path enters the kernel to sleep.

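// Sketch of a mutex slow path. Real libraries handle more states.
// Needs <linux/futex.h>, <sys/syscall.h>, and <unistd.h>.
while (__atomic_exchange_n(&lock_word, 1, __ATOMIC_ACQUIRE) == 1) {
    // Sleep only while lock_word is still 1, then retry the exchange after any wakeup.
    syscall(SYS_futex, &lock_word, FUTEX_WAIT, 1, NULL, NULL, 0);
}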

The kernel only needs the address and the expected value. If lock_word is no longer 1, FUTEX_WAIT returns without sleeping. That check prevents a missed wakeup.

Observing Syscalls with strace and perf

strace uses ptrace to stop a process at syscall entry and exit. A small program

#include <unistd.h>

int main(void) {
    write(1, "ok\n", 3);
    return 0;
}

produces output like this.

write(1, "ok\n", 3) = 3
exit_group(0) = ?
+++ exited with 0 +++

The string display is decoded from the user pointer passed in rsi. The kernel still copies from the traced process address space. strace is for diagnosis, not fine timing, because it adds ptrace stops.

perf trace records syscall tracepoints through the kernel's perf events interface instead of ptrace stops, which usually costs less than strace.

perf trace -e syscalls:sys_enter_write,syscalls:sys_exit_write ./demo

For aggregate counts, use perf stat.

perf stat -e syscalls:sys_enter_read,syscalls:sys_enter_write ./server

If a request path logs 5 lines and each line calls write, 100000 requests cause 500000 write syscalls. At 300 ns each, the syscall boundary alone is 150 ms of CPU time. If mitigations and kernel configuration push the cost to 900 ns, the same boundary time is 450 ms.

vDSO and io_uring

Some calls look like syscalls at the C API but do not need privilege on every invocation. clock_gettime(CLOCK_MONOTONIC, &ts) can often read a vDSO page containing timekeeping parameters and combine them with a user-mode counter such as the x86 time-stamp counter. No syscall instruction is needed on the fast path. If the clock id or the hardware clock source is unsupported, the vDSO routine falls back to a real syscall.

You can see the mapping.

$ grep vdso /proc/self/maps
7ffd2d5d3000-7ffd2d5d5000 r-xp 00000000 00:00 0  [vdso]

io_uring moves some I/O submission and completion traffic into shared rings. The process sets up rings with io_uring_setup, maps them with mmap, writes submission queue entries, and enters the kernel with io_uring_enter when it must submit or wait. Batching reduces boundary crossings.

A normal submission queue entry is 64 bytes. The completion queue entry is 16 bytes.

io_uring_sqe byte layout, common fields

offset  size  field
0       1     opcode
1       1     flags
2       2     ioprio
4       4     fd
8       8     off
16      8     addr
24      4     len
28      4     rw_flags
32      8     user_data
40      2     buf_index
42      2     personality
44      4     file_index or splice_fd_in
48      16    remaining fields and padding

For a read into buffer 0x70000000 from fd 5, length 4096, offset 0, the process writes one SQE with opcode IORING_OP_READ, fd 5, addr 0x70000000, len 4096, and a chosen user_data, for example 0xabc. The kernel later writes a CQE with user_data = 0xabc and res = 4096 or a negative error.

Key Result

For a Linux syscall path, keep two invariants in mind.

$$\text{raw return} \in [0, 2^{63}-1] \text{ means success for most calls}$$

$$\text{raw return} \in [-4095, -1] \text{ means } -\text{errno}$$

The error range is why libc wrappers can map a kernel return value of -2 to errno = ENOENT and return -1. Code that executes the syscall instruction directly must do this check itself if it wants C library behavior.

The other invariant is about copies and authority. User pointers passed to the kernel are not trusted capabilities. A pointer is an address in the caller's virtual address space, and the kernel must check access while copying. write(1, p, n) means "copy n bytes from my address p if valid, then write them to fd 1." It does not give the device direct permission to read arbitrary process memory unless a separate DMA or pinning path is set up.

Cost is bounded below by the entry and exit path.

$$\text{cycles} = \text{nanoseconds} \times \text{GHz}$$

At 3.5 GHz, 100 ns is 350 cycles, 300 ns is 1050 cycles, and 1000 ns is 3500 cycles. CPU vulnerability mitigations such as stronger kernel/user isolation and branch predictor controls increased the cost on many machines, so measurements from older systems understate current latency.

Common Confusions

Watch Out

Calling libc is not the same as entering the kernel

write in C is a libc symbol. It usually executes a real syscall. clock_gettime may run in the vDSO and never change privilege. open may call openat internally. Use strace or perf trace to see the boundary crossings, not function names alone.

Watch Out

syscall does not use the normal function-call ABI

The syscall ABI is not the System V function-call ABI. Argument 4 is in r10, not rcx. The syscall number is in rax. rcx and r11 are clobbered by the instruction.

Watch Out

mmap is not a giant read

Mapping a 7 GiB file creates address-space metadata, not a copy of the data. Pages are faulted in and their page-table entries are filled as the process touches them. Sequentially reading every page later still performs I/O or page-cache work, but that cost is not paid as one large copy inside mmap.

Exercises

Exercise (Core)

Problem

On x86-64 Linux, prepare registers for read(3, 0x7fffffffe000, 16). The syscall number for read is 0. Show rax, rdi, rsi, and rdx before syscall. If the kernel reads bytes 41 42 0a and then reaches end of available pipe data, what is the return value and what bytes are written?

Exercise (Core)

Problem

A service handles 200000 requests per second. Each request performs 2 read syscalls, 1 write syscall, and 1 futex syscall on average. Estimate CPU time per second spent only on syscall boundary cost if one syscall costs 250 ns. Give the same estimate for 900 ns.

Exercise (Advanced)

Problem

Write the register assignment for a 6-argument x86-64 syscall mmap(0, 4096, 1, 2, 5, 0). Use the Linux x86-64 syscall number 9. Then explain why argument 4 is not placed in rcx.
