Why This Matters
A VM booting a 1 GiB Linux guest has its own kernel, page tables, init process, device model, and virtual disks. A container running the same user program often adds only new namespace entries, cgroup accounting, and an overlay filesystem view. On the same host, the first choice buys a stronger kernel boundary; the second buys denser packing and faster start.
Cloud schedulers, CI systems, model-serving fleets, and notebook platforms all choose between these two abstractions. The difference is not syntax in a YAML file. It is whether privileged instructions, page-table writes, interrupts, block I/O, PIDs, network devices, and filesystem paths are mediated at the hardware boundary or at the Linux kernel boundary.
Core Definitions
Virtual machine monitor
A virtual machine monitor, or hypervisor, presents each guest with a virtual hardware interface. The guest OS executes as if it owns CPUs, memory, devices, interrupts, and storage, while the monitor multiplexes real hardware among guests.
Type-1 and Type-2 hypervisors
A Type-1 hypervisor runs directly on the machine and hosts guests above it, as with ESXi, Xen, and Hyper-V. A Type-2 hypervisor runs as a process on a host OS, as with VirtualBox or QEMU without direct kernel integration. QEMU with KVM is mixed: QEMU is a user process, while KVM puts CPU virtualization support inside the Linux kernel.
Container
A container is a process tree constrained by kernel mechanisms. Linux namespaces change what the processes can name, such as PIDs, mounts, network interfaces, hostnames, IPC objects, and users. Linux cgroups meter or limit CPU, memory, and I/O use.
OCI image
An OCI image is a content-addressed bundle of filesystem layers and metadata. A runtime such as runc unpacks or mounts those layers, applies namespace and cgroup settings, then starts the configured process.
Hardware Virtualization
Classical virtualization starts from a simple loop. Run guest code directly on the CPU until it performs an operation that must be mediated. Trap into the monitor. Emulate the operation against virtual machine state. Return to guest code.
A privileged instruction is a good first example. On x86, cli clears the interrupt flag and disables maskable interrupts. A guest kernel wants to execute it during critical sections. The host must not let the guest disable interrupts for the whole machine.
; guest kernel code
cli ; guest thinks interrupts are off
mov eax, 1
sti ; guest thinks interrupts are on
Under trap-and-emulate, cli causes a trap when executed outside the required privilege level. The monitor records guest_if = 0 in the virtual CPU state but leaves the real host interrupt setting under monitor control. sti later records guest_if = 1.
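A minimal sketch of the monitor's side of that loop, assuming a hypothetical struct vcpu and a decoder that has already extracted the one-byte opcode; a real monitor emulates far more instructions and re-checks pending virtual interrupts on sti.

#include <stdint.h>

/* Hypothetical virtual CPU state; real monitors track much more. */
struct vcpu {
    int guest_if;     /* virtual interrupt flag, not the host flag */
    uint64_t rip;     /* guest instruction pointer */
};

enum { OP_CLI = 0xFA, OP_STI = 0xFB };   /* x86 one-byte opcodes */

void emulate_privileged(struct vcpu *v, uint8_t opcode) {
    switch (opcode) {
    case OP_CLI: v->guest_if = 0; break;  /* guest sees interrupts off */
    case OP_STI: v->guest_if = 1; break;  /* guest sees interrupts on */
    default:     /* emulate other cases or inject a fault */ break;
    }
    v->rip += 1;   /* step past the one-byte instruction, resume guest */
}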
The trap path has fixed cost. If a trap costs 1500 cycles on a 3.0 GHz core, one trap costs 0.5 microseconds. A device driver that triggers 200000 exits per second spends about 0.1 CPU seconds per wall second just crossing the VM boundary. That is why CPU-bound numeric code often runs close to native speed in a VM, while naive device emulation can be much slower.
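A quick check of those numbers, using only the figures quoted above:

#include <stdio.h>

int main(void) {
    double hz = 3.0e9;                    /* 3.0 GHz core */
    double cycles_per_exit = 1500.0;
    double exits_per_sec = 200000.0;
    double us_per_exit = cycles_per_exit / hz * 1e6;              /* 0.5 us */
    double core_fraction = exits_per_sec * cycles_per_exit / hz;  /* 0.10 */
    printf("per exit: %.2f us, core fraction: %.2f\n",
           us_per_exit, core_fraction);
    return 0;
}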
Early x86 had instructions that were sensitive but did not trap when run outside ring 0. Hardware-assisted virtualization fixed this by adding a guest mode. Intel VT-x calls the transitions VM entry and VM exit. AMD-V uses comparable support under the name SVM. A guest kernel can run its ring-0 code in guest mode, while the processor exits to the hypervisor for configured events such as privileged control-register writes, I/O port access, selected exceptions, and external interrupts.
A Type-1 hypervisor owns the machine from boot and schedules guests directly. A Type-2 setup relies on a host kernel for memory management, device drivers, and process scheduling. KVM makes Linux act as the hypervisor for CPU and memory virtualization, while QEMU supplies device models and process-level management.
Memory Virtualization
Virtual memory already translates guest virtual addresses to guest physical addresses. The host must add one more translation, from guest physical addresses to host physical addresses.
With shadow page tables, the hypervisor builds hardware-visible page tables that compose both mappings. Suppose a guest maps virtual page 0x40012 to guest physical frame 0x9012, and the monitor maps guest frame 0x9012 to host frame 0x2ab7. For a 4 KiB page and offset 0x078, the actual memory address is the host frame times the page size plus the offset: 0x2ab7 * 0x1000 + 0x078 = 0x2ab7078.
The shadow PTE must contain host frame 0x2ab7, not guest frame 0x9012. When the guest OS writes its own page table, the hypervisor must notice, validate, and update the shadow entry. That write-tracking cost is the main problem.
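A sketch of the composition, using the frames from the example above; the lookups that produce the two frame numbers are assumed rather than shown.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12   /* 4 KiB pages */

int main(void) {
    uint64_t vpn = 0x40012, offset = 0x078;
    uint64_t guest_frame = 0x9012;   /* from the guest's page table */
    uint64_t host_frame  = 0x2ab7;   /* from the monitor's gpa -> hpa map */
    uint64_t gva = (vpn << PAGE_SHIFT) | offset;
    uint64_t gpa = (guest_frame << PAGE_SHIFT) | offset;
    uint64_t hpa = (host_frame  << PAGE_SHIFT) | offset;  /* shadow PTE points here */
    printf("gva %#llx -> gpa %#llx -> hpa %#llx\n",
           (unsigned long long)gva, (unsigned long long)gpa,
           (unsigned long long)hpa);
    return 0;
}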
Extended Page Tables on Intel and Nested Page Tables on AMD move the second translation into hardware. The MMU first walks guest page tables to get a guest physical address, then walks EPT or NPT tables to get a host physical address. The TLB caches the composed result.
A 4-level x86-64 page walk can read up to 4 entries. With EPT, a TLB miss can require up to 24 memory references in the worst case: each of the 4 guest page-table entries sits at a guest physical address that needs its own 4-level nested walk (16 references), the 4 guest entries themselves must be read, and the final guest physical address needs one more nested walk (4 references). Real CPUs cache intermediate page-walk data, but the bound explains why huge pages matter. A 2 MiB page maps 512 times as much memory as a 4 KiB page, reducing TLB pressure for large tensor buffers and database caches.
Paravirtual I/O
Device emulation copies the shape of real hardware. A guest driver writes to an emulated e1000 network card register; QEMU traps the access, decodes it, updates a virtual device, and eventually asks the host kernel to transmit bytes. This preserves compatibility but burns exits and copies.
Paravirtualization changes the contract. The guest uses a driver designed for virtualization. The common Linux interface is virtio. A virtio block or network device uses shared descriptor rings between guest and host.
A simplified descriptor has three fields plus flags:
struct vring_desc {
    uint64_t addr;    // guest physical address
    uint32_t len;     // byte count
    uint16_t flags;   // next, write, indirect
    uint16_t next;    // next descriptor index
};
On a little-endian machine, a descriptor for a 4096-byte transmit buffer at guest physical address 0x0000000010203000 with flags 0x0001 and next index 7 has this 16-byte layout:
00 30 20 10 00 00 00 00   addr  = 0x0000000010203000, little-endian
00 10 00 00               len   = 4096 (0x1000)
01 00                     flags = 0x0001
07 00                     next  = 7
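On a little-endian machine with no struct padding (the field sizes sum to 16 bytes and each field is naturally aligned), this small program prints exactly those bytes:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct vring_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

int main(void) {
    struct vring_desc d = { 0x0000000010203000ULL, 4096, 0x0001, 7 };
    unsigned char bytes[sizeof d];
    memcpy(bytes, &d, sizeof d);   /* view the descriptor as raw bytes */
    for (size_t i = 0; i < sizeof d; i++)
        printf("%02x ", bytes[i]);
    printf("\n");
    return 0;
}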
The guest appends the descriptor index to an available ring and kicks the host with one notification. The host reads the shared memory ring, maps the guest buffer, performs I/O, then writes completion data to the used ring. Batching 64 packets per notification replaces 64 exits with one exit plus shared-memory reads.
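A guest-side sketch of that batching, assuming a 256-entry queue and a hypothetical notify() standing in for the doorbell write that causes the single exit; real virtio drivers add memory barriers, index wrapping rules, and feature negotiation.

#include <stdint.h>

#define QSZ 256

struct vring_desc  { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct vring_avail { uint16_t flags; uint16_t idx; uint16_t ring[QSZ]; };

static struct vring_desc  desc[QSZ];
static struct vring_avail avail;

static void notify(void) { /* hypothetical doorbell write; one VM exit */ }

static void queue_tx(uint16_t slot, uint64_t gpa, uint32_t len) {
    desc[slot].addr  = gpa;   /* guest physical address of the packet */
    desc[slot].len   = len;
    desc[slot].flags = 0;     /* device reads this buffer; no chaining */
    desc[slot].next  = 0;
    avail.ring[avail.idx % QSZ] = slot;
    avail.idx++;              /* real drivers put a write barrier before this */
}

int main(void) {
    for (uint16_t i = 0; i < 64; i++)            /* queue 64 packets... */
        queue_tx(i, 0x10203000ULL + i * 4096, 1500);
    notify();                                    /* ...then kick once */
    return 0;
}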
Containers As Kernel Partitioning
A container does not boot another kernel. It starts ordinary Linux processes with altered namespaces and resource accounting.
The main namespace types map to distinct kernel tables. The PID namespace gives the process tree its own PID numbering. The mount namespace gives it a different filesystem tree. The network namespace gives it its own interfaces, routing table, ports, and firewall state. The UTS namespace separates hostname and domain name. The IPC namespace separates System V IPC and POSIX message queues. The user namespace maps user and group IDs inside the container to different IDs outside it.
This C fragment starts a child in new UTS, mount, PID, IPC, and network namespaces. The child sees itself as PID 1 after the fork into the new PID namespace.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* Requires privilege (or a user namespace); check the result. */
    if (unshare(CLONE_NEWUTS | CLONE_NEWNS | CLONE_NEWPID |
                CLONE_NEWIPC | CLONE_NEWNET) != 0) {
        perror("unshare");
        return 1;
    }
    if (fork() == 0) {
        /* The first child after CLONE_NEWPID is PID 1 in the namespace. */
        sethostname("box", 3);
        printf("inside pid namespace, pid=%d\n", getpid());
        while (1)
            pause();   /* keep the namespace alive for inspection */
    }
    wait(NULL);        /* blocks for as long as the child runs */
    return 0;
}
Cgroups account for resources and apply limits. In cgroup v2, memory and CPU controls are files in a cgroup directory. A container runtime might write these values before starting the process:
memory.max 536870912
cpu.max 50000 100000
io.max 8:0 rbps=10485760 wbps=10485760
The memory limit is 512 MiB. The CPU quota gives 50 ms of CPU time per 100 ms period, so the group receives half of one CPU. The I/O line limits reads and writes on block device major 8, minor 0 to 10 MiB per second each.
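A sketch of that sequence, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup and a pre-created group directory named demo (a hypothetical name); writing the caller's PID to cgroup.procs moves it into the group.

#include <stdio.h>
#include <unistd.h>

static int write_ctl(const char *path, const char *value) {
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fprintf(f, "%s\n", value);
    return fclose(f);
}

int main(void) {
    write_ctl("/sys/fs/cgroup/demo/memory.max", "536870912");   /* 512 MiB */
    write_ctl("/sys/fs/cgroup/demo/cpu.max", "50000 100000");   /* half a CPU */
    char pid[16];
    snprintf(pid, sizeof pid, "%d", (int)getpid());
    write_ctl("/sys/fs/cgroup/demo/cgroup.procs", pid);         /* join the group */
    /* exec the workload here; it inherits the limits */
    return 0;
}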
Docker is mostly orchestration around these kernel features. It pulls an OCI image, prepares a root filesystem, computes namespace and cgroup settings, then asks a runtime such as runc to create the container. Kubernetes adds fleet scheduling. It places containers on nodes, restarts failed ones, configures service networking, and records desired state.
Images And Layered Filesystems
OCI images avoid copying a full root filesystem for every container. Layers are content-addressed tar archives. overlay2 presents a merged view with lower read-only layers and one writable upper layer.
Consider three layers. Layer A contains /app/config with bytes:
6d 6f 64 65 3d 41 0a
That is the ASCII string mode=A\n. Layer B replaces the same path with:
6d 6f 64 65 3d 42 0a
Layer C adds /app/run with bytes:
2e 2f 73 65 72 76 65 72 0a
That is the ASCII string ./server\n.
The merged view reads /app/config from B, not A, because upper layers mask lower paths. When a running container writes /app/config, overlayfs copies the file into the writable upper directory and modifies that copy. A second container started from the same image gets its own upper directory, so the write does not mutate the image layers.
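The same stacking expressed as a mount(2) call; the directory names are hypothetical, and in overlayfs the leftmost lowerdir is the topmost layer, so B masks A here.

#include <stdio.h>
#include <sys/mount.h>

int main(void) {
    /* hypothetical layer directories unpacked from the image */
    const char *opts = "lowerdir=/layers/B:/layers/A,"
                       "upperdir=/layers/upper,workdir=/layers/work";
    if (mount("overlay", "/merged", "overlay", 0, opts) != 0) {
        perror("mount");   /* needs privilege and existing directories */
        return 1;
    }
    /* a write to /merged/app/config now copies the file up into
       /layers/upper and modifies the copy, not the image layers */
    return 0;
}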
Layering saves space when many containers share a base image. If the base is 300 MiB, the framework layer is 800 MiB, the model-server layer is 120 MiB, and each container writes 40 MiB, then 20 containers require about 300 + 800 + 120 + 20 * 40 = 2020 MiB of local storage. Twenty full copies would take 20 * (1220 + 40) = 25200 MiB.
The Model
Popek-Goldberg Classical Virtualization Criterion
Statement
For a conventional third-generation computer, a virtual machine monitor can construct an efficient and equivalent virtual machine if the set of sensitive instructions is a subset of the privileged instructions.
Intuition
A sensitive instruction can change resource allocation or observe privileged machine state. If every such instruction traps outside supervisor mode, the monitor can run ordinary guest instructions directly and intercept exactly the operations that need mediation.
Proof Sketch
Run guest user code and most guest kernel code directly at reduced privilege. When the guest executes a privileged sensitive instruction, the processor traps to the monitor. The monitor checks the virtual machine state, emulates the instruction against that state, updates any real resources needed, and resumes the guest. Non-sensitive instructions have the same effect when run directly, so direct execution preserves behavior except for timing.
Why It Matters
The theorem separates CPU virtualization from full machine emulation. It explains why trap-and-emulate worked naturally on some architectures and why early x86 needed binary translation, paravirtualization, or VT-x and AMD-V.
Failure Mode
If an instruction reads privileged state without trapping, a guest can observe the host privilege level instead of the virtual one. The monitor then cannot preserve equivalence by direct execution alone.
The practical cost model is: fraction of a core lost to boundary crossings ≈ (exits per second * cycles per exit) / (clock cycles per second). The earlier example gives 200000 * 1500 / (3.0 * 10^9) = 0.1 of a core.
For containers, the analogous overhead is usually in namespace setup, cgroup accounting, overlay copy-up, and network path choices. CPU-bound code that does not hit those paths often measures near native. VMs commonly add a few percent CPU overhead for compute-heavy work with hardware support, while also carrying a full guest OS image, guest kernel memory, and guest background services. I/O-heavy workloads depend on virtio, batching, page size, and host device assignment.
Security follows the boundary. A VM puts a guest kernel behind a hardware virtualization interface. A container shares the host kernel; a kernel bug reachable from inside the container can cross the boundary. User namespaces, seccomp filters, capabilities, LSMs such as SELinux or AppArmor, and read-only mounts reduce exposure, but they do not turn a container into a separate kernel.
Common Confusions
A container is not a small VM
A container has no private kernel. Running uname -a inside a container reports the host kernel version because the syscall is served by the host kernel. PID 1 inside a PID namespace is just a process with special signal and orphan-reaping behavior in that namespace.
KVM is not the same thing as QEMU
KVM is the Linux kernel facility that exposes hardware virtualization to user space. QEMU can emulate devices and manage VM process state. A common VM uses both: KVM for vCPU execution and memory virtualization, QEMU for device models and VM control.
Layered images are not layered running filesystems forever
OCI image layers are immutable inputs. A running container has a writable upper layer. The first write to a lower-layer file causes copy-up, so a small write to a large file can create a large private copy.
Exercises
Problem
A VM runs on a 3.2 GHz CPU. Its workload causes 80000 VM exits per second. Each exit costs 1800 cycles, including entry back to the guest. What fraction of one CPU core is spent on exits?
Problem
A cgroup v2 directory contains memory.max = 1073741824 and cpu.max = 25000 100000. Interpret both limits. Then compute the CPU time available over 10 seconds of wall time.
Problem
A guest maps guest virtual page 0x7f123 to guest physical frame 0x45555. EPT maps guest physical frame 0x45555 to host physical frame 0xabcde. The page size is 4 KiB and the byte offset is 0x2c0. Give the guest virtual address, guest physical address, and host physical address.
References
Canonical:
- Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau, Operating Systems: Three Easy Pieces (2018), ch. 4-12, 27-32, 36-43, 48, processes, virtual memory, concurrency, file systems, and distributed systems background
- James E. Smith and Ravi Nair, Virtual Machines: Versatile Platforms for Systems and Processes (2005), ch. 1-3, 8, virtual machine taxonomy, process VMs, system VMs, and implementation techniques
- John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 6th ed. (2019), §5.6, virtual machines and architectural support
- Intel, Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 3C (2025), ch. 24-29, VMX operation and extended page tables
- Advanced Micro Devices, AMD64 Architecture Programmer's Manual, Vol. 2 (2023), ch. 15, secure virtual machine architecture and nested paging
- Gerald J. Popek and Robert P. Goldberg, “Formal Requirements for Virtualizable Third Generation Architectures,” Communications of the ACM 17(7), 1974, classical virtualization theorem
Accessible:
- Linux kernel documentation, Control Group v2 and Namespaces man pages, concrete interfaces for cgroups and namespace creation
- Open Container Initiative, OCI Image Format Specification and OCI Runtime Specification, image layout and runtime contract
- Docker documentation, Use the overlay2 driver, practical description of overlay layers and copy-up behavior
Next Topics
- /computationpath/processes-and-system-calls
- /computationpath/virtual-memory
- /computationpath/file-systems-and-storage
- /computationpath/os-scheduling
- /topics/memory-hierarchy