The kernel is just a program
It is easy, after Chapter 1, to start thinking of the kernel as some abstract authority — a kind of governing law of the machine. It is not. The Bridge at the close of Part I showed the hardware contract that makes a kernel possible at all — the privilege bit, the trap, the MMU, the timer. This chapter examines what the program on the kernel side of that contract actually looks like. The kernel is a program: a set of instructions stored in memory, executed by the same CPU that runs everything else, written in a real programming language by real people. The Linux kernel, as of 2025, is roughly 30 million lines of C, with a small and growing amount of Rust. What makes it the kernel is not what it is made of, but where it sits — in Ring 0, with hardware-enforced privileges no other program has.
When you press the power button on a computer, a sequence unfolds. The CPU starts executing instructions from a fixed firmware address — historically called the BIOS, now usually UEFI — built into the motherboard. The firmware initializes hardware, finds the boot device, and loads a small program called a bootloader (GRUB on Linux, Windows Boot Manager, iBoot on Apple devices). The bootloader's only job is to find the kernel image on disk, load it into memory, and jump to its entry point. From that moment, the kernel runs forever — or until the machine shuts down.
Monolithic vs microkernel: the great schism
There are two philosophies about how to build a kernel. They have been arguing with each other since the late 1980s.
A monolithic kernel puts everything in Ring 0: process scheduling, memory management, filesystems, device drivers, network stacks, all of it. Performance is excellent because subsystems can call each other directly — no boundary crossings. The cost is fragility and security: a bug in any part of the kernel can corrupt the whole machine, and a single device driver with a vulnerability can be the entry point for total compromise. Linux is monolithic. Windows is monolithic. Most kernels in production use are monolithic.
A microkernel takes the opposite stance. It puts only the absolute minimum in Ring 0 — typically just message passing, basic memory protection, and minimal scheduling. Filesystems, drivers, and network stacks all run as separate user-space processes that communicate via messages. A bug in a driver crashes only the driver, not the kernel. The cost is performance: every interaction crosses the user/kernel boundary, and message passing adds latency. Notable microkernels: MINIX (Andrew Tanenbaum's research OS), L4 (used in many embedded systems), seL4 (formally verified — mathematically proven to have no kernel bugs of certain classes).
In 1992, Tanenbaum publicly criticized the then-new Linux on the comp.os.minix USENET group, calling its monolithic design "obsolete" and "a giant step back into the 1970s." Linus Torvalds responded — bluntly, and at length — and the exchange, which ran for weeks, became one of the most-cited debates in computing history.
Tanenbaum, the established expert, argued from architecture and theory: microkernels were the future, monolithic designs were a regression, and Linux was tied to one specific CPU. Torvalds, twenty-two and writing Linux from his bedroom, argued from running code and pragmatic constraints. Three decades later, both have been vindicated, just for different things. The cleanest microkernels (seL4, QNX, the formally verified ones) win in safety-critical embedded systems, where a single kernel bug is unacceptable. Linux runs everywhere else — server farms, phones, supercomputers, refrigerators — because "ships and works" beats "is theoretically correct" for almost every workload. Tanenbaum was correct in theory; Torvalds was correct in practice. The schism has never closed because the two stances answer different questions.
macOS sits in the middle. Its kernel, XNU, is a hybrid: a Mach microkernel core wrapped with a BSD UNIX layer that runs in the same address space — so you get message-passing primitives and monolithic-kernel performance. iOS uses the same kernel. Android uses Linux. Windows NT was originally microkernel-influenced but has drifted increasingly monolithic for performance reasons.
In a monolithic kernel, every subsystem runs in Ring 0 and shares one address space. Calls between them are direct and fast. In a microkernel, only message passing and the bare minimum live in Ring 0; filesystems, drivers, and network stacks run as ordinary user-space processes that communicate by IPC. Microkernels are safer; monolithic kernels are faster.
Kernel modules: a third path
Linux added a feature called loadable kernel modules
(LKMs) that softens the monolithic stance. Drivers and filesystems can be compiled
separately from the main kernel and loaded or unloaded at runtime with commands
like insmod and rmmod. They still run in Ring 0 — so a
buggy module can crash the kernel — but they don't have to be linked into the
main image. Most graphics drivers, filesystem drivers, and hardware support on
Linux ships as modules. lsmod on a running Linux system will list
hundreds of them.
Linux at runtime is a base kernel image plus a constellation of modules that have been linked into it dynamically. Each insmod resolves the module's symbols against the kernel's symbol table, allocates kernel memory, copies the module's code in, and calls its module_init() entry. From that moment the module runs in Ring 0 — a buggy module can panic the whole machine — but the source tree, build system, and distribution channel are independent of the main kernel. Distributions ship pre-compiled modules; vendors (NVIDIA, Broadcom) ship out-of-tree modules signed for specific kernel versions. rmmod reverses the process: call module_exit(), free the memory, remove the symbols. This is how Linux maintains both the speed of monolithic design and the operational flexibility microkernels promised.
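The skeleton of a module is small. A minimal sketch (the file name and log messages are invented; building it requires the headers for your running kernel and the usual kbuild Makefile):

```c
/* hello.c: a minimal loadable kernel module (sketch; build against your kernel's headers) */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Minimal example module");

/* Called by the kernel when insmod loads the module. */
static int __init hello_init(void)
{
    pr_info("hello: loaded into Ring 0\n");
    return 0;                  /* a non-zero return would abort the load */
}

/* Called when rmmod unloads it. */
static void __exit hello_exit(void)
{
    pr_info("hello: unloading\n");
}

module_init(hello_init);
module_exit(hello_exit);
```

Loading it with insmod and watching dmesg shows the two messages; the same printk path is used by every driver on the system.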
The pattern shows up everywhere: a clean theoretical model, modified by practical necessity. The kernel is monolithic but modular. It runs in Ring 0 but increasingly delegates risky work (drivers in user space, eBPF in a verified sandbox, see Section 6) to safer compartments. The history of operating systems is the history of these compromises.
The art of taking turns
A modern laptop runs hundreds of processes simultaneously. A server may run thousands. Your machine has, at most, a few dozen CPU cores. The arithmetic doesn't work — most processes cannot be running at any given instant. The kernel creates the illusion that they all are, by switching between them many times per second, fast enough that you cannot perceive the gaps. The component that decides which process gets the CPU at any given moment is called the scheduler, and it is one of the deepest and most-studied parts of any kernel.
Cooperative vs preemptive: who interrupts whom
The earliest multi-tasking systems were cooperative: each running program voluntarily yielded control back to the scheduler when it had nothing useful to do. Mac OS through version 9 worked this way. So did 16-bit Windows. The model is simple and lightweight, but it has a fatal flaw: a single misbehaving program that never yields freezes the entire machine. Every Mac OS 9 user remembers the experience of one frozen application taking down everything else.
Modern systems are preemptive: the kernel forcibly takes the CPU back from a running process at regular intervals, regardless of whether the process is ready to give it up. The mechanism is a hardware timer interrupt — a chip on the motherboard that sends an electrical signal to the CPU at a fixed frequency (typically 100–1000 times per second on Linux). Each interrupt forces the CPU into Ring 0, where the scheduler runs, decides whether to switch processes, and either resumes the current one or chooses a different one. The user perceives perfectly smooth multitasking because the timer fires faster than human reaction.
What "scheduling" actually has to decide
The scheduler's job sounds trivial — pick a process to run — but the choices have surprising depth. The classical algorithms each capture a different trade-off:
| Algorithm | How it works | Trade-off |
|---|---|---|
| FCFS | First-come, first-served. Run processes in arrival order until they finish. | Simple. Terrible response time — one long process blocks all the short ones (the "convoy effect"). |
| SJF | Shortest Job First. Pick the process with the shortest remaining time. | Provably optimal for average wait time. Requires knowing job length in advance — usually you don't. |
| Round Robin | Each process gets a fixed time slice ("quantum"). When it expires, move to the next. | Fair. Responsive. Doesn't optimize for total throughput. |
| Priority | Each process has a priority number. Higher priority runs first. | Lets the system favor critical work. Risk: low-priority processes can be starved indefinitely. |
| MLFQ | Multilevel Feedback Queue. Multiple priority levels; processes that use a lot of CPU drift to lower priorities, processes that wait often rise. | Approximates SJF without knowing job length. Used by Windows, classic UNIX. |
| CFS | Completely Fair Scheduler. Linux's default from 2007 until EEVDF (a refinement built on the same machinery) replaced it in kernel 6.6, 2023. Tracks how much CPU each process has used; always runs whichever has used the least. | Approximates "everyone runs at exactly 1/N speed on N processes." O(log N) per scheduling decision via a red-black tree. |
Linux's CFS deserves a closer look because it's a particularly elegant idea. The scheduler keeps every runnable process in a self-balancing binary tree (a red-black tree), keyed by virtual runtime — a measure of how much CPU time the process has consumed, weighted by its niceness. Whenever the scheduler needs to pick the next process to run, it picks the leftmost node of the tree (smallest virtual runtime). When a process runs, its virtual runtime increases; when it blocks, it's removed; when it wakes, it's reinserted with its saved value. This naturally implements approximate fairness: every process tends toward equal CPU usage, and any process behind catches up first.
CFS stores all runnable processes in a red-black tree, keyed by virtual runtime (a measure of CPU consumption). The leftmost node has the smallest value — the process most "behind" on its share of the CPU — and is the next to run. Insertion, deletion, and finding the minimum are all O(log N). This is why a Linux machine with thousands of runnable processes still schedules in microseconds.
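The accounting is easy to see in miniature. A toy sketch of the pick-the-smallest-vruntime loop (real CFS keeps tasks in a red-black tree and derives weights from nice values; this sketch uses a linear scan and invented weights):

```c
/* Toy CFS-style scheduler: always run whichever task has the smallest
 * virtual runtime. Lower weight means vruntime grows faster, so the
 * task gets the CPU less often. */
#include <stdio.h>

struct task { const char *name; double vruntime; double weight; };

int main(void)
{
    struct task tasks[] = {
        { "editor",  0.0, 1.0  },   /* interactive, full weight        */
        { "compile", 0.0, 0.5  },   /* niced: charged double           */
        { "backup",  0.0, 0.25 },   /* heavily niced: charged 4x       */
    };
    const int n = sizeof tasks / sizeof tasks[0];
    const double slice_ms = 4.0;    /* hypothetical time slice */

    for (int tick = 0; tick < 12; tick++) {
        int next = 0;               /* smallest vruntime = leftmost node in real CFS */
        for (int i = 1; i < n; i++)
            if (tasks[i].vruntime < tasks[next].vruntime)
                next = i;

        printf("tick %2d: run %-8s (vruntime %.1f)\n",
               tick, tasks[next].name, tasks[next].vruntime);

        /* Charge the task for the slice it just used, weighted by priority. */
        tasks[next].vruntime += slice_ms / tasks[next].weight;
    }
    return 0;
}
```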
The same four jobs scheduled four different ways. FCFS serves them in arrival order — the eight-unit job A blocks everyone behind it (the convoy effect). SJF reorders by length — A still runs first because it arrived alone, but D (5 units) jumps ahead of C (9 units), giving the lowest possible average wait time. Round Robin with quantum 2 interleaves all four constantly — responsive but never letting any job make a long uninterrupted run. MLFQ demotes processes that use a lot of CPU into lower-priority queues; long-running C ends up at Q3 and waits, while interactive bursts of A get repeated Q1/Q2 attention. Each algorithm is correct, for a different definition of "correct" — average wait, total throughput, responsiveness, fairness. The choice depends on workload.
The mathematics underlying scheduling is queueing theory, developed in the early 20th century by Agner Krarup Erlang for telephone networks. The single most important result is Little's Law:
L = λ · W
The average number of items in a queueing system (L) equals the average arrival rate (λ) multiplied by the average time each item spends in the system (W). It holds for any stable queue, regardless of arrival distribution or service distribution. Combine it with how queues behave near saturation (as utilization approaches 100%, waiting time grows without bound) and you get the explosive wait times of a slightly oversubscribed system. Every web server, every database, every operating system kernel obeys this. It is why the difference between a 95%-loaded and a 99%-loaded server is not a four-point difference; it is often an order of magnitude in latency.
Little's Law in two pictures. Top: a queue with arrivals (λ) flowing into a buffer of average length L; each item waits average W. The relationship L = λW is true for any stable queue, regardless of how arrivals are distributed or how long service takes. Bottom: the consequence — average wait time as utilization approaches 100% follows W ~ 1/(1−ρ), which is hyperbolic. At 50% utilization, doubling load barely budges the wait. At 95%, doubling load is catastrophic. This is why operations teams panic when a server crosses about 80% sustained utilization — they are not panicking about "the server being slow," they are panicking about the math curve they can see coming.
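The blow-up is easy to reproduce. A small sketch, assuming the simplest textbook queue (M/M/1 with a 1 ms mean service time); Little's Law itself needs no such assumption:

```c
/* The W ~ 1/(1 - rho) blow-up for an M/M/1 queue, plus Little's Law
 * (L = lambda * W) applied to the result. */
#include <stdio.h>

int main(void)
{
    const double service_ms = 1.0;                    /* mean service time  */
    const double utils[] = { 0.50, 0.80, 0.90, 0.95, 0.99 };

    for (int i = 0; i < 5; i++) {
        double rho    = utils[i];
        double W      = service_ms / (1.0 - rho);     /* mean time in system */
        double lambda = rho / service_ms;             /* arrival rate        */
        double L      = lambda * W;                   /* Little's Law        */
        printf("util %4.0f%%  ->  W = %6.1f ms,  L = %5.1f items in system\n",
               rho * 100, W, L);
    }
    return 0;
}
```

At 50% utilization the wait is 2 ms; at 99% it is 100 ms, with ninety-nine requests in flight at any moment.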
Real-time scheduling: when missing a deadline kills
General-purpose schedulers like CFS optimize for average performance. They make no guarantees about worst-case latency. For most software this is fine. For some software it is catastrophically not fine: anti-lock braking systems, pacemakers, avionics flight control, industrial robots. These systems run on real-time kernels — variants of Linux (PREEMPT_RT) or specialized OSes (VxWorks, QNX, FreeRTOS) — that guarantee a task will run within a bounded time after it becomes ready, even under load. The mathematical foundation is rate-monotonic scheduling and earliest-deadline-first scheduling, analyzed by Liu and Layland in a foundational 1973 paper.
A real-time task that runs at 100Hz must complete its work within each 10ms window. The first three deadlines are met (5ms work, 10ms window — half the budget). The fourth job runs longer than expected (12ms) and overruns its deadline. In a non-real-time system this would be a slow frame or a stutter; in an anti-lock braking controller, a pacemaker, or a flight control loop, it is a system failure with physical consequences. Real-time kernels use schedulers (rate-monotonic, earliest-deadline-first) that mathematically guarantee deadlines provided the total task utilization stays below a known bound — about 69% for rate-monotonic, 100% for EDF. Going above that bound, missed deadlines become possible, and in a hard real-time system that is the same as broken.
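The bound is easy to check for a concrete task set. A sketch with three invented periodic tasks, using the Liu and Layland utilization test (compile with -lm):

```c
/* Liu & Layland (1973): n periodic tasks are guaranteed schedulable under
 * rate-monotonic priorities if total utilization is at most n*(2^(1/n) - 1).
 * The task set below is hypothetical. */
#include <stdio.h>
#include <math.h>

struct rt_task { double wcet_ms; double period_ms; };

int main(void)
{
    struct rt_task set[] = {
        {  2.0,  10.0 },   /* 100 Hz control loop, 2 ms worst case */
        {  5.0,  40.0 },   /* 25 Hz sensor fusion                  */
        { 10.0, 100.0 },   /* 10 Hz logging                        */
    };
    int n = sizeof set / sizeof set[0];

    double U = 0.0;
    for (int i = 0; i < n; i++)
        U += set[i].wcet_ms / set[i].period_ms;       /* utilization = C/T   */

    double rm_bound = n * (pow(2.0, 1.0 / n) - 1.0);  /* ~0.780 for n = 3;
                                                         tends to ln 2 ~ 0.69 */
    printf("utilization U = %.3f\n", U);
    printf("RM bound      = %.3f -> %s under rate-monotonic\n",
           rm_bound, U <= rm_bound ? "guaranteed schedulable" : "no guarantee");
    printf("EDF bound     = 1.000 -> %s under EDF\n",
           U <= 1.0 ? "schedulable" : "overloaded");
    return 0;
}
```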
The lie of unlimited memory
Chapter 1 introduced virtual memory as the kernel's mechanism for isolating processes; the Bridge showed the MMU and TLB as the silicon that makes the mechanism enforceable. We saw the headline: every process believes it has its own private address space starting at zero. Now we look at the machinery that sustains the illusion — the page tables, the translation cache called the TLB, the page faults that quietly bring memory into existence on demand, and the mathematical structures that make 64-bit addressing tractable at all.
The translation problem
Every memory access a program performs uses a virtual address. The CPU cannot use this directly; physical RAM is addressed by physical addresses. Some hardware must, on every load and store, translate one to the other. That hardware is the Memory Management Unit (MMU), built into the CPU. The translation table it consults is the page table, maintained by the kernel.
Memory is divided into fixed-size blocks called pages, typically 4 KB. The address space is therefore divided into virtual pages, and physical RAM into physical pages (often called "page frames"). The page table maps one to the other. When a virtual page has no corresponding physical page in RAM — because it's never been used, or has been swapped to disk, or belongs to a memory-mapped file not yet loaded — the table entry is marked invalid, and the MMU raises a hardware exception called a page fault. The kernel's page-fault handler decides what to do.
Why the page table cannot be flat
A naive page table would be a single flat array — one entry for every possible virtual page. The number is enormous. On 64-bit x86, the architecture defines a 48-bit usable virtual address space (256 terabytes), giving 2³⁶ pages. A flat table with one 8-byte entry per page would need 512 GB just for the table. Per process. This is plainly impossible.
The solution is a multi-level page table — a tree. The 48-bit virtual address is split into four 9-bit fields plus a 12-bit page offset. Each 9-bit field indexes into one level of a four-level tree. Most branches are empty (the address space is sparse — your process only uses a tiny fraction of the 256 TB available), so most of the tree is never allocated. A real x86-64 process typically uses a few megabytes of page tables to map its actual memory, not 512 gigabytes.
Translating one virtual address into a physical address requires walking four levels of page tables. The CPU register CR3 points to the root (L4). Each level uses 9 bits of the address to index into a 512-entry table; the entry points to the next level. The bottom 12 bits give the byte offset within the final page. Without optimization, every memory access would cost five memory accesses.
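The walk itself is just bit-slicing. A sketch that splits an arbitrary user-space address exactly the way the MMU does:

```c
/* Split a 48-bit x86-64 virtual address into the four 9-bit page-table
 * indices and the 12-bit page offset used by a four-level walk. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t vaddr = 0x00007f8a1c2d3e4fULL;    /* an arbitrary user-space address */

    unsigned offset = vaddr & 0xfff;           /* bits 0-11: byte within the page   */
    unsigned l1 = (vaddr >> 12) & 0x1ff;       /* bits 12-20: page table index      */
    unsigned l2 = (vaddr >> 21) & 0x1ff;       /* bits 21-29: page directory index  */
    unsigned l3 = (vaddr >> 30) & 0x1ff;       /* bits 30-38: PDPT index            */
    unsigned l4 = (vaddr >> 39) & 0x1ff;       /* bits 39-47: root index (CR3 points here) */

    printf("vaddr = 0x%016llx\n", (unsigned long long)vaddr);
    printf("L4=%u  L3=%u  L2=%u  L1=%u  offset=0x%03x\n", l4, l3, l2, l1, offset);
    return 0;
}
```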
The TLB: a cache for translations
A naive page-table walk would be ruinously slow — every memory access would require five memory accesses (four for the table walk, one for the actual data). The fix is the Translation Lookaside Buffer (TLB), a small, very fast cache inside the CPU that stores recent virtual-to-physical translations. A typical x86-64 TLB has between 64 and 1500 entries. When the CPU needs to translate an address, it first checks the TLB. If the entry is there (a "TLB hit"), translation takes one cycle. If not (a "TLB miss"), the MMU walks the page tables and inserts the result into the TLB.
Hit rates on the TLB are typically above 99%. The 1% miss rate, multiplied by billions of memory accesses per second, still matters — and is why CPU designers have steadily grown TLB sizes and added second-level TLBs over the past two decades. When the kernel switches between processes, it must flush parts of the TLB (since the new process has different page tables). This is one of the hidden costs of context switching, and one reason why excessive switching hurts performance.
Page faults — the productive kind
A page fault sounds like an error. Most of them are not. There are several kinds, and the everyday ones are how the kernel implements many of its most useful features. They fall into three rough categories:
A minor page fault happens when the page is in physical memory but isn't yet mapped into this process. Example: when a program first reads a page of a file the kernel has cached. The kernel just adds an entry to the page table and returns. Cost: microseconds.
A major page fault happens when the page is not in RAM and must be fetched from disk — typically because it was swapped out, or because the program is reading a memory-mapped file for the first time. The kernel issues a disk read, suspends the process, and resumes it when the data arrives. Cost: milliseconds. Thousands of times slower than a minor fault.
An invalid page fault happens when the access truly is illegal —
writing to a read-only page, dereferencing a null pointer, executing data marked
non-executable. The kernel sends the offending process a SIGSEGV
signal, and unless the process catches it, the program dies with the famous
"segmentation fault" message.
A "page fault" is a CPU exception, but most faults are not errors — they are how the kernel implements memory management on demand. Minor faults are pure bookkeeping: the page is in RAM (perhaps already in another process's mapping or in the kernel's page cache), it just hasn't been wired into this process's page tables yet. Major faults involve disk I/O — the page must be paged in from swap, or read from a memory-mapped file for the first time — and are about a thousand times slower. Invalid faults are the real errors: the dereference of a NULL pointer, the write to a read-only page, the jump into non-executable data. Only invalid faults trigger SIGSEGV. On a typical Linux desktop, the kernel handles thousands of minor faults per second invisibly.
Two beautiful uses of page faults
Memory-mapped files. When you call mmap() on a file,
the kernel doesn't read the file. It sets up page table entries marking the
relevant virtual addresses as backed by that file, but invalid (not yet present).
The first time you actually access a page, you take a page fault, and the kernel
reads just that one page from disk. Reading a 100 GB file as if it were a
contiguous array in memory becomes trivial; only the pages you touch are loaded.
This is how databases, search engines, and many high-performance systems handle
large data — and how every shared library loads on Linux.
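A sketch of the idiom (error handling trimmed; the file name is just an example):

```c
/* Map a file and touch one byte. No read() happens up front; the kernel
 * loads pages on demand via page faults. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("server.log", O_RDONLY);     /* any existing, non-empty file */
    struct stat st;
    fstat(fd, &st);

    /* Setting up the mapping is instant, regardless of file size. */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* This access takes a page fault; the kernel reads just that one page. */
    printf("first byte: %c\n", data[0]);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```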
Copy-on-write fork. When a UNIX process calls fork()
to create a child process, the child receives a copy of the parent's entire
address space. Naively, this would require duplicating gigabytes of memory. It
doesn't. The kernel marks all of the parent's pages as read-only and shares them
with the child. As long as neither process writes to a page, they share it. Only
when one of them tries to write does a page fault occur, and the kernel makes a
private copy at that moment. Most pages are never written; most fork-and-exec
sequences (the standard way to launch a new program on UNIX) never copy any pages
at all. The cheapest way to make a copy is to lie about having made it.
After fork(), both parent and child point at the same physical pages, all marked read-only. The parent and child page tables are different — they could diverge — but no actual copying has happened. The instant either side writes to a shared page, the MMU raises a page fault. The kernel's COW handler allocates a fresh physical page, copies the contents, fixes up the page table of the writer to point at the new page (read-write), and resumes the instruction. The non-writing process never noticed. The famous fork-and-exec idiom — fork a child that immediately calls execve() to load a new program — copies zero pages, because the child throws away its address space before writing anything. This is why launching a UNIX program is essentially free.
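The idiom in code, roughly (here the child runs ls, an arbitrary choice):

```c
/* The fork-and-exec idiom: the child's copy-on-write address space is
 * discarded by exec before anything is written, so almost no pages are copied. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();              /* parent and child now share all pages, read-only */

    if (pid == 0) {
        /* Child: replace the shared address space with a new program. */
        execlp("ls", "ls", "-l", (char *)NULL);
        perror("execlp");            /* reached only if exec failed */
        return 1;
    }

    int status;
    waitpid(pid, &status, 0);        /* parent waits for the child to finish */
    printf("child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}
```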
Dirty COW (CVE-2016-5195). A bug in Linux's copy-on-write logic, present in the kernel for nine years before being discovered, allowed an unprivileged attacker to write to read-only files — including /etc/passwd — by exploiting a race condition between the page fault handler and the kernel's COW machinery. Privilege escalation followed trivially. The bug's existence and longevity is a reminder of how much of operating system security depends on subtle correctness in subsystems most users will never see.
Dirty COW exploited a race in the COW path. A thread tries to write a read-only file mapping, taking a page fault. The kernel begins COW: it allocates a copy and is about to install it as read-write in the writer's page table. Between "allocate copy" and "install copy in page table" there is a brief window. A second thread in the same process calls madvise(MADV_DONTNEED) on the same page — which legally discards the COW mapping. The first thread resumes, retries the write, but now the page table no longer has the copy installed, so the write goes through to the original mapping — the read-only file. An unprivileged process can therefore write to any file it can read. /etc/passwd is world-readable. Local privilege escalation in 60 lines of C. The bug had been in the kernel since 2007. Phil Oester reported it after seeing it abused on a production system; Linus Torvalds patched it within hours; every Linux distribution shipped fixes within days.
A tree of names on a flat array of blocks
A disk, at its lowest level, is a flat array of fixed-size blocks. A spinning hard drive presents itself to the OS as billions of 512-byte sectors numbered 0, 1, 2, … An SSD presents the same abstraction even though the physical reality underneath is radically different. The kernel sees: a sequence of bytes, addressed by index. From this raw substrate, the filesystem builds the structure you actually use — files with names, organized into directories, organized into a tree, with metadata about ownership and permissions and timestamps. None of this structure exists at the hardware level. It exists because the filesystem code pretends it does.
The inode: where a file actually lives
In UNIX-derived filesystems (which is most of them), the central data structure is the inode — short for "index node." Every file has exactly one inode, and the inode contains everything about the file except its name and its data: the file's size, its owner and group, its permission bits, its timestamps, a count of how many names link to it, and the list of disk blocks (or extents) holding its contents.
Notice what's not there: the filename. In UNIX, a filename is a property
of the directory it lives in, not the file. A directory is itself a special kind
of file whose contents are a list of (name, inode-number) pairs. To open
/home/yki/notes.txt, the kernel looks up the inode for /,
reads its directory contents to find the entry for home, follows
that to the next directory's inode, and so on, until it reaches
notes.txt's inode. Then it uses the inode's block list to read the
actual data.
This separation of name from data has elegant consequences. A single file can have multiple names — multiple directory entries pointing to the same inode. These are called hard links. The file is only deleted when its link count drops to zero. It is also why renaming a huge file is instant: only the directory entry changes. The inode and its data don't move.
A path lookup walks directories to find an inode. The inode contains metadata and a list of data blocks. Two directory entries pointing to the same inode (notes.txt and backup.txt above) are hard links — the same file with two names. Data blocks are not necessarily contiguous; finding them quickly is one of the filesystem's main jobs.
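A sketch that makes the name/inode separation visible with stat() (file names as in the figure; run it in a scratch directory):

```c
/* Two directory entries, one inode: create a hard link and compare
 * inode numbers and link counts. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    FILE *f = fopen("notes.txt", "w");
    fputs("hello\n", f);
    fclose(f);

    link("notes.txt", "backup.txt");     /* a second name for the same inode */

    struct stat a, b;
    stat("notes.txt", &a);
    stat("backup.txt", &b);

    printf("notes.txt  inode %lu, links %lu\n",
           (unsigned long)a.st_ino, (unsigned long)a.st_nlink);
    printf("backup.txt inode %lu, links %lu\n",
           (unsigned long)b.st_ino, (unsigned long)b.st_nlink);
    /* Same inode number, link count 2: one file, two names. */

    unlink("backup.txt");                /* link count drops back to 1 */
    unlink("notes.txt");                 /* drops to 0: the file is gone */
    return 0;
}
```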
The everything-is-a-file principle
UNIX took the inode concept further than just regular files. Every kernel-managed resource is exposed as a file:
- Your keyboard is /dev/input/event0.
- Your sound card is /dev/snd/pcmC0D0p.
- Random numbers come from /dev/urandom.
- The state of every running process lives under /proc/<pid>.
- Network connections are accessed through file descriptors with read and write.
- Hardware sensors expose temperature and voltage as files in /sys.
The same five system calls — open, read, write,
close, lseek — work on all of them. This radical
uniformity is one of the reasons UNIX-derived systems became so dominant: tools
written for files just work on devices, network connections, and process state
without modification. The shell pipeline you can use to count lines in a text file
(cat file | wc -l) works equally well on the output of a sensor
driver or a debugging tool, because to the kernel they are all just files.
Journaling: the database technique that saved filesystems
A filesystem that simply writes data and metadata to disk wherever convenient has
a fatal weakness: power loss. If the machine crashes between writing data and
updating its corresponding metadata (or vice versa), the filesystem becomes
inconsistent — files exist whose blocks are still listed as free, or directory
entries point at non-existent inodes. The classic UNIX response was
fsck, a program that scans the entire disk after a crash to find
and fix inconsistencies. On a multi-terabyte disk this could take hours.
The modern solution, borrowed from database theory, is the journal — a small, dedicated region of the disk where every intended modification is written first, before being applied to the main filesystem. After a crash, the kernel only has to replay the journal from where it left off, applying or discarding incomplete operations. Recovery takes seconds instead of hours. ext4 (Linux), NTFS (Windows), and HFS+ (older macOS) all journal. The technique is called write-ahead logging; we'll meet it again in Chapter 13 when we get to databases.
A journaling filesystem does every write twice. First, the kernel writes a description of the intended change to a small journal region, followed by a single-block commit record (whose write is atomic by hardware guarantee). Then it applies the change to the actual filesystem at its leisure — possibly reordered, batched, or coalesced with other writes. If the machine crashes between commit and apply, recovery is straightforward: replay every journal entry whose commit record is intact, drop the rest. The technique comes from database write-ahead logging (Gray, IBM, 1981), and is now in ext4, NTFS, HFS+, JFS, XFS, and every other serious modern filesystem. It is also why a sudden power loss on a modern machine takes seconds to recover from, not hours.
The newest generation of filesystems — copy-on-write filesystems like ZFS, btrfs, and Apple's APFS — go further. They never overwrite existing data. Every modification writes new blocks; only after the write succeeds is the metadata updated to point to them. The old version remains until garbage-collected, which gives you essentially-free snapshots, atomic operations on entire directories, and built-in checksumming to detect silent disk corruption.
A copy-on-write filesystem never modifies a block in place. Updating one byte of block 103 means: allocate a new block (103′), copy 103's contents, apply the change, then re-write the parent that pointed at 103 to point at 103′ instead — and so on, all the way up to the root. The old version of every block remains valid until explicitly freed. A snapshot is just a saved copy of the old root pointer; it costs zero bytes until something diverges. Every block can include a checksum that the filesystem verifies on read, so silent disk corruption (which happens, even on enterprise hardware) gets caught instead of propagating. The trade is write amplification: a one-byte change touches a full block at every level of the tree. ZFS, btrfs, and APFS all make this trade and consider it a bargain. WAFL (NetApp) made the trade first, in 1992, and built a multi-billion-dollar storage business on it.
How processes talk
Processes are isolated by design. The whole point of virtual memory and privilege separation is that one process cannot reach into another's memory and read or modify it. This is the foundation of operating system security. But isolation taken to its extreme produces a useless system — programs that cannot share anything cannot compose into pipelines, cannot coordinate, cannot even tell each other when work is done. So the kernel exposes a controlled set of mechanisms for processes to communicate. These are collectively called inter-process communication, or IPC.
Each IPC mechanism is a different point on the trade-off between convenience, performance, and flexibility. The set below is roughly chronological — older mechanisms at the top, newer at the bottom.
| Mechanism | How it works | Best for |
|---|---|---|
| Pipe | A unidirectional byte stream between two related processes. The shell's pipe operator creates one. | Chaining commands together. The classic UNIX pipeline. |
| Named pipe (FIFO) | A pipe with a filesystem name. Any process with permission can connect. | Letting unrelated processes communicate via a known path. |
| Signal | A small numeric notification sent to a process. SIGTERM, SIGKILL, SIGINT (Ctrl+C). | Asynchronous control: "stop," "reload config," "we're shutting down." |
| Shared memory | Two processes map the same physical pages into both their address spaces. | The fastest IPC. No copying. Used by databases and high-performance systems. |
| Message queue | A kernel-managed queue of typed messages between processes. | Structured communication; less common today. |
| Semaphore | A kernel-managed counter used to coordinate access to a shared resource. | Synchronizing without sharing data — "is it my turn yet?" |
| UNIX socket | Like a network socket, but local — kernel-mediated, with proper authentication. | Modern desktop IPC. Used by Docker, systemd, X11, Wayland. |
| Network socket | TCP or UDP connection, possibly to another machine. | Distributed systems. We'll cover these in depth in Part III. |
The pipe: UNIX's most beautiful idea
Pipes entered UNIX in 1973, proposed by Doug McIlroy and implemented by Ken Thompson. Mechanically, a pipe is a kernel-allocated ring buffer with two file descriptors attached: one process writes to one end, another reads from the other. The kernel handles flow control automatically — if the buffer is full, the writer blocks; if it is empty, the reader blocks. No shared memory. No locks. No protocol. Just a stream of bytes.
A pipe is two file descriptors attached to one kernel-allocated ring buffer. The writer's write() deposits bytes; the reader's read() consumes them. The kernel handles synchronisation: when the buffer fills, the writer blocks until space appears; when it empties, the reader blocks until bytes arrive. No userspace synchronisation is required because the synchronisation lives in the kernel's well-tested code paths. This is the simplest possible mechanism for streaming data between two processes — and is the same machinery that fifty years of UNIX shell pipelines have been built on.
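A minimal sketch of the mechanism, one pipe shared by a parent and child:

```c
/* One pipe, two processes: the parent writes, the child reads. The kernel's
 * ring buffer and blocking behaviour do all the synchronisation. */
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    pipe(fds);                       /* fds[0] = read end, fds[1] = write end */

    if (fork() == 0) {
        /* Child: read from the pipe until data arrives. */
        close(fds[1]);
        char buf[64];
        ssize_t n = read(fds[0], buf, sizeof buf - 1);
        if (n < 0) n = 0;
        buf[n] = '\0';
        printf("child read: %s\n", buf);
        return 0;
    }

    /* Parent: write into the pipe, then close so the child sees EOF. */
    close(fds[0]);
    const char *msg = "hello through the kernel's ring buffer";
    write(fds[1], msg, strlen(msg));
    close(fds[1]);
    wait(NULL);
    return 0;
}
```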
The pipe's syntactic appearance in the shell is dazzlingly simple:
```sh
# Count how many unique lines start with "ERROR" in a log:
cat server.log | grep "^ERROR" | sort | uniq | wc -l

# Five separate processes, each doing one small thing.
# The shell creates pipes between them. Output of one
# becomes input of the next, streamed byte-by-byte.
```
Each program in the pipeline does one small task, knows nothing about the others, and reads its input from standard input and writes to standard output as if they were ordinary files. The kernel arranges those file descriptors to be the ends of pipes. Five processes execute in parallel; the output of one streams into the next as fast as either can handle it. The model is so productive that it has been imitated in essentially every shell since. It is one reason why UNIX produced a culture of small composable tools rather than monolithic applications.
Five processes, each doing one small task, with pipes between every pair. cat reads the file. grep filters to lines starting with ERROR. sort orders them. uniq drops adjacent duplicates. wc -l counts what's left. None of these programs was written knowing about the others; each just reads from stdin and writes to stdout. The shell connects them with pipes, and the kernel runs all five in parallel — bytes stream through the pipeline as fast as the slowest stage allows. This is the canonical illustration of the UNIX philosophy: composition through tiny, sharp tools and a universal interface (text bytes), so any new tool you write joins the pipeline for free.
"This is the Unix philosophy: write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface."
— Doug McIlroy, inventor of the UNIX pipe
Signals: a doorbell to a process
Pipes carry data. Signals carry no data — only the
bare fact that something happened. A signal is a small, numbered notification
delivered asynchronously to a process: number 2 (SIGINT) is what the
kernel sends when you press Ctrl+C; number 9 (SIGKILL) is the
uncatchable "stop now"; number 15 (SIGTERM) is the polite "please
stop." A process can register a handler function for most signals; when the signal
arrives, the kernel hijacks the running process, runs the handler, then resumes.
Three things can happen when a signal arrives: the process handles it, ignores it, or suffers the default action. For SIGKILL the default is termination and for SIGSTOP it is suspension, and for those two the default cannot be overridden at all.
A signal arrives at an arbitrary moment in the target process's execution. The kernel queues it, then — at the next kernel-to-userspace transition (a syscall return, a page-fault return, etc.) — diverts the process to its registered handler. The handler runs as if it were a function call inserted at that exact instruction; when it returns, the kernel restores the original CPU state and the process resumes. Some signals are uncatchable: SIGKILL (number 9) terminates the process unconditionally; the kernel does not even consult the process. SIGSTOP suspends it. Both exist because a privileged user must always be able to stop a misbehaving process — even one that has registered handlers for everything else. Signals are crude: a single integer per notification, no data, no queueing of duplicates of the same signal. They predate every other Unix IPC mechanism and remain the standard way to control the lifecycle of a process.
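A sketch of registering a handler with sigaction() (SIGINT is catchable; the same code could not catch SIGKILL):

```c
/* Catch SIGINT (Ctrl+C) with a handler and exit cleanly. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigint = 0;

/* Runs asynchronously, inserted into the process at its next return to userspace. */
static void on_sigint(int signo)
{
    (void)signo;
    got_sigint = 1;              /* only async-signal-safe work belongs here */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigint;
    sigaction(SIGINT, &sa, NULL);

    printf("press Ctrl+C to deliver SIGINT...\n");
    while (!got_sigint)
        pause();                 /* sleep until any signal arrives */

    printf("caught SIGINT, exiting cleanly\n");
    return 0;
}
```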
Shared memory and semaphores: the no-copy IPC
Pipes and signals are convenient. They are not fast. Every byte through a pipe is copied: from the writer's userspace into the kernel buffer, then out of the kernel buffer into the reader's userspace. For high-throughput communication — gigabytes per second between two processes on the same machine — shared memory is the only viable mechanism. Two processes call into the kernel to map the same physical pages into both their virtual address spaces. After that, reading and writing the shared region is no slower than reading and writing any other memory. There is no copy. There is no kernel involvement on every access.
The cost of zero-copy is that you now have a classic concurrency problem: two processes touching the same memory must coordinate so that neither sees the other mid-update. The kernel exposes semaphores for this — counters that processes increment and decrement atomically, used as gates. A semaphore initialised to 1 acts as a mutex (one process at a time may pass). Initialised to N, it acts as a resource pool. The mathematics is Edsger Dijkstra's from 1965, formalised in his note "Cooperating Sequential Processes," and underlies every modern concurrent system.
Two processes share a physical page by mapping it at (potentially different) virtual addresses. Reads and writes against that region are ordinary CPU loads and stores — no kernel involvement, no copying, no system calls in the fast path. The cost is coordination: if both processes write at the same time, the result is undefined. A semaphore, initialised to 1, lets exactly one of them through at a time. sem_wait() atomically decrements; if the value would go negative, the caller blocks until sem_post() elsewhere releases the gate. This is shared memory + semaphore: the IPC of choice for performance-sensitive systems where the cost of even one extra copy per byte is unacceptable. Database engines, video pipelines, audio mixers, and high-frequency trading systems are all built on it.
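A sketch of both pieces together: an anonymous shared mapping set up before fork(), guarded by a process-shared semaphore (compile with -pthread; error handling trimmed):

```c
/* A shared counter protected by a process-shared semaphore. The page is
 * mapped MAP_SHARED before fork(), so parent and child see the same memory. */
#include <semaphore.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

struct shared { sem_t lock; long counter; };

int main(void)
{
    /* One region of memory shared between parent and child: no copying, ever. */
    struct shared *s = mmap(NULL, sizeof *s, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    sem_init(&s->lock, 1, 1);        /* pshared=1: usable across processes */
    s->counter = 0;

    pid_t pid = fork();
    for (int i = 0; i < 100000; i++) {
        sem_wait(&s->lock);          /* one process in the critical section at a time */
        s->counter++;
        sem_post(&s->lock);
    }

    if (pid == 0) return 0;          /* child done */
    wait(NULL);
    printf("counter = %ld (expected 200000)\n", s->counter);
    return 0;
}
```

Without the semaphore, the two increment loops would interleave and the final count would fall short; with it, the result is exact.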
Sockets: when the other process is on another machine
A socket is a generalization of the pipe. Instead
of communicating with another process on the same machine through a kernel-managed
pipe, you communicate over a connection — either to another local process (a
"UNIX domain socket") or to a process on a different machine entirely (a "network
socket," using TCP or UDP). The same system calls — read,
write, close — apply. From the application's
perspective, the network is just another file. The kernel, the network drivers,
the protocol stack, and the wire are all hidden beneath a four-call API.
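A sketch of the client side of a UNIX domain socket (the socket path is hypothetical, and a server must already be listening there):

```c
/* A UNIX-domain socket client: after connect(), the socket is just
 * another file descriptor. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    struct sockaddr_un addr;
    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, "/tmp/example.sock", sizeof addr.sun_path - 1);

    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        perror("connect");
        return 1;
    }

    /* From here on it is write/read/close, the same as any file. */
    write(fd, "ping\n", 5);
    char buf[128];
    ssize_t n = read(fd, buf, sizeof buf);
    if (n > 0) fwrite(buf, 1, n, stdout);
    close(fd);
    return 0;
}
```

Swapping AF_UNIX for AF_INET and the path for an IP address and port turns the same shape of code into a network client.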
This abstraction is one of the most important pieces of glue in modern computing. It is why a web browser, a database client, a video conferencing app, and an SSH session all use roughly the same code shape: open a socket, write a request, read a reply, close. We will spend Part III opening up what happens between write and read on a network socket — a journey through ARPANET, packet switching, TCP, DNS, TLS, HTTP, and the rest of the internet stack. For now: the kernel makes it look like a file. That, more than anything else, is the kernel's job.
When the kernel breaks
Everything we've built in this chapter rests on one assumption: the kernel is correct. The privilege boundary, the address-space isolation, the file permission checks, the IPC mediation — all of it depends on the kernel actually doing what it says it does. A bug in the kernel is therefore not an ordinary bug. It is potentially the end of every security guarantee on the machine. A user-space program with a buffer overflow can be exploited to take over that program. A kernel with the same kind of bug can be exploited to take over everything running on the entire computer, including the kernel itself, every other process, every file, and every keystroke. This is what kernel security is about.
Privilege escalation, more carefully
A privilege escalation attack is one where a process gains more privilege than it was supposed to have. The two important versions:
Vertical escalation — moving up the privilege ladder. A normal user process exploits a bug to become root, or root exploits a bug to enter Ring 0 (kernel mode). Once in Ring 0, an attacker controls the machine without restriction.
Horizontal escalation — moving sideways at the same privilege level into another process's data. Reading another user's files, hijacking another user's session.
Memorable kernel-level privilege escalation bugs in the last decade include Dirty COW (mentioned earlier — copy-on-write race condition, 2016), DirtyPipe (a flaw in pipe-buffer initialization that allowed writes to read-only files, 2022), and dozens of bugs in device drivers — driver code is vast, often less audited than the core kernel, and runs in Ring 0. Most modern kernel exploits chain together multiple bugs: an information leak (to defeat KASLR, the kernel's version of ASLR), then a memory-corruption primitive (to overwrite something useful), then a privilege escalation.
Containers: not virtual machines
You've probably heard of Docker, Kubernetes, and "containers." They are often described as "lightweight VMs," which is misleading. A virtual machine emulates an entire computer — its own kernel, its own everything — running on top of a hypervisor. Containers do not. A container is a normal Linux process whose view of the system has been restricted by the kernel. There is one shared kernel. The container is just a process the kernel has lied to.
The lying is done with two Linux features:
Namespaces partition kernel resources so that different processes see different views. There are namespaces for process IDs (a containerized process believes it is PID 1, even though to the host it might be PID 8472), for the network (the container has its own virtual network interface), for mounts (its own root filesystem), for user IDs, for hostnames, and more. Inside its namespaces, the container looks like a complete isolated system.
Cgroups (control groups) limit how much of each resource a process group can use — CPU time, memory, disk I/O, network bandwidth. Together with namespaces, this gives Docker its model: cheap, fast, lightweight isolation that doesn't require a hypervisor.
A virtual machine runs a complete guest kernel on emulated hardware — strong isolation, but heavy. A container is just a process the host kernel has restricted with namespaces (what it can see) and cgroups (what it can use). One shared kernel, much less overhead — but a kernel bug in that shared kernel can break out of every container at once.
The trade-off matters: VMs isolate kernels, so a compromise of one guest doesn't affect others. Containers share a kernel, so a kernel privilege-escalation bug can let a malicious container take over the host and every sibling container. Cloud providers running multi-tenant workloads typically use VMs (or lightweight micro-VMs such as AWS Firecracker) for this reason; single-tenant deployments use plain containers because they're far cheaper.
Namespaces are how the kernel lies to a process about the world. There are eight kinds — PID, network, mount, IPC, UTS (hostname), user, cgroup, and (since Linux 5.6) time — and each can be independently configured per process. Container A's ps shows only the processes inside its PID namespace; its ifconfig shows only its virtual interfaces; its / is the rootfs of its container image. Container B sees a completely different version of all of those, even though both processes are running under the same kernel, on the same machine, accessing the same physical RAM. From the host's perspective, both containers' PIDs are just numbers in the global PID table; namespaces are translation layers between the global state and what each container sees.
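The smallest possible demonstration is a single namespace. A sketch that gives a child its own UTS (hostname) namespace with clone(); it needs root or CAP_SYS_ADMIN, and it is only the first ingredient of what Docker assembles:

```c
/* Give a child its own UTS namespace: changing the hostname inside the
 * child does not affect the host. */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/utsname.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1024 * 1024];

static int child(void *arg)
{
    (void)arg;
    sethostname("container", 9);          /* visible only inside this namespace */
    struct utsname u;
    uname(&u);
    printf("inside  namespace: hostname = %s\n", u.nodename);
    return 0;
}

int main(void)
{
    pid_t pid = clone(child, child_stack + sizeof child_stack,
                      CLONE_NEWUTS | SIGCHLD, NULL);
    if (pid < 0) { perror("clone (are you root?)"); return 1; }
    waitpid(pid, NULL, 0);

    struct utsname u;
    uname(&u);
    printf("outside namespace: hostname = %s\n", u.nodename);
    return 0;
}
```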
If namespaces decide what a process can see, cgroups decide what it can use. Each cgroup is a node in a hierarchy, with explicit limits on every resource the kernel can meter — CPU shares, memory bytes, block-device IOPS, network bandwidth, even the number of file descriptors. The kernel enforces these on every syscall and every scheduling decision: a process in the "batch" cgroup that tries to exceed its memory limit gets killed by the OOM killer; a process exceeding its CPU share gets descheduled until its budget refills. Cgroups are how Kubernetes packs ten workloads onto one machine without any of them noticing the others, how Docker's --cpus=2 flag works, and how systemd ensures a misbehaving service does not bring down the rest of the host.
eBPF: safe code inside the kernel
A more recent and remarkable Linux feature is eBPF — "extended Berkeley Packet Filter." It allows user programs to upload small pieces of code into the running kernel, where they are attached to specific events (network packets, system calls, function entries) and run inside Ring 0 with kernel-level performance.
The obvious worry is that this lets unprivileged users execute code in the kernel — historically the worst possible security outcome. eBPF makes it safe through a verifier: a static analyzer in the kernel that examines every uploaded program and refuses to load it unless it can prove the program will always terminate, never read out-of-bounds memory, and never crash. Programs that pass the verifier are then JIT-compiled to native machine code and run essentially as fast as compiled kernel code.
eBPF flips a forty-year-old assumption: that any code running in the kernel must be hand-vetted, audited, signed, and trusted. Instead, the kernel ships a verifier — a static analyser that mathematically proves an uploaded program is safe before letting it run. The proof obligations are concrete: every memory access must be in-bounds (so the program cannot read random kernel memory); every loop must be provably bounded (so the program cannot hang the kernel); the program must terminate within a fixed instruction budget. Programs that pass are JIT-compiled to native instructions and attached to hooks — system call entry, packet receive, function entry, scheduler tick — where they run in Ring 0 with no syscall overhead. Cilium uses eBPF for high-performance Kubernetes networking; bpftrace exposes a tracing language built on it; modern Linux observability is increasingly eBPF underneath. The pattern — sandboxed verified execution inside a privileged context — shows up again in Chapter 12 (browser JavaScript) and Chapter 14 (TLS), and is one of the most important architectural ideas of the last twenty years.
eBPF has become the foundation of modern Linux observability and networking: tools like Cilium (high-performance network policy), bpftrace (kernel tracing), and Falco (runtime security monitoring) all build on it. The general pattern is important — providing a sandboxed, formally checked execution environment inside a privileged context — and we will see it again when we discuss browser security models in Chapter 12 and TLS in Chapter 14.
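For a feel of what actually gets loaded, here is a sketch of a kernel-side eBPF program in the libbpf style: it attaches to the execve tracepoint and logs every program launch. Building it requires clang with the BPF target plus a user-space loader (libbpf, bpftool, or similar), and only code the verifier accepts ever attaches.

```c
/* Minimal eBPF program (kernel side, libbpf conventions): log each execve. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("tracepoint/syscalls/sys_enter_execve")
int log_execve(void *ctx)
{
    /* By the time this runs in Ring 0, the verifier has already proven it
     * terminates and touches only memory it is allowed to touch. */
    bpf_printk("execve called");
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```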
The recurring pattern. Every layer in this chapter — the kernel itself, the privilege boundary, virtual memory, filesystem permissions, container isolation, eBPF verification — is a mechanism that restricts what less-trusted code can do. Operating system security, fundamentally, is the discipline of building such mechanisms and then living with the fact that any one of them can have a bug. The kernel is not a fortress. It is a series of carefully designed walls, each one with a guard at every gate, and the guards themselves have to be checked by other guards. There is no bottom — only better and worse layers.
What you now understand
The kernel is a program — written in C, loaded by a bootloader, run forever in Ring 0 — that owns the hardware and mediates everything every other program does. It comes in two main shapes: monolithic (everything in Ring 0, fast, fragile) and microkernel (minimal Ring 0, message-passing, safer and slower), with most real systems sitting somewhere on that spectrum. Its scheduler, anchored mathematically in queueing theory, decides which process gets the CPU at each microsecond — Linux's CFS does this in O(log N) using a red-black tree of virtual runtimes. Its virtual memory subsystem maintains a four-level page table per process, accelerated by a translation cache (TLB), and uses page faults productively to implement memory-mapped files and copy-on-write fork. Its filesystem turns a flat array of disk blocks into a tree of inodes and names, made crash-safe by journaling or copy-on-write. Its IPC mechanisms — pipes, signals, shared memory, sockets — are the controlled boundary across which isolated processes can still cooperate. And its security depends entirely on its own correctness — which is why kernel bugs are so dangerous, and why each new mechanism (containers, eBPF) is layered behind further verification.
With this, the kernel is in view from both sides. The Bridge at the close of Part I showed it as silicon — privilege rings as a CPU bit, the trap as a wire-level mechanism, the MMU as a hardware unit, the timer as the single piece of hardware that makes preemption possible. This chapter showed it as code. Same object, two genuinely different perspectives, and the kernel becomes legible only when both are visible at once: thirty million lines of C running in Ring 0, sustained by a hardware contract that lets it own the machine.
The next chapter climbs out of the kernel and into the language it was written in. C — Ritchie at Bell Labs, 1972 — is the language every kernel in production today still uses. We'll see why pointers are the same idea as the addresses we just walked through, and why they are simultaneously the source of C's power and the cause of every memory bug we discussed in Chapter 3. From C we'll move to C++, then to the radically different choice that is Python. Each language is a particular set of decisions about which kinds of mistakes to make easy and which to make hard.