Part Two

The Software
Layer

Part I built the substrate — sand to transistor, gate to instruction. Part II opens the program that owns it. The kernel first, examined as a real piece of code. Then the languages people actually write in: C, the portable systems language UNIX itself was written in; C++, the object-oriented answer to its complexity; Python, the opposite trade that chose humans over machines. Each is a particular set of decisions about which mistakes to make easy and which to make hard.

CHAPTER 04
The Code That Owns the Machine
The kernel as a real program — scheduling, virtual memory, filesystems, IPC, and the security boundary.
CHAPTER 05
C — The Language That Built Everything
The portability problem, Ritchie and Thompson at Bell Labs, pointers as math, manual memory, undefined behavior.
CHAPTER 06
C++ — When C Needed to Think in Objects
Stroustrup, object orientation as complexity management, vtables, RAII, zero-cost abstractions.
CHAPTER 07
Python — The Language That Chose Humans Over Machines
Van Rossum 1989-91, dynamic typing, the GIL, CPython internals, the two-language strategy that won data science.
Chapter 04

The Code That
Owns the
Machine

One program runs always. It owns the CPU, the memory, the disks, the network. Every other program lives at its mercy. Without it, modern computing is impossible: no multitasking, no security, no internet. This is the kernel — examined in the detail it deserves.

Topics: Scheduler · VM · FS · IPC · Security
Era covered: 1969 → present
01 — Anatomy

The kernel is just a program

It is easy, after Chapter 1, to start thinking of the kernel as some abstract authority — a kind of governing law of the machine. It is not. The Bridge at the close of Part I showed the hardware contract that makes a kernel possible at all — the privilege bit, the trap, the MMU, the timer. This chapter examines what the program on the kernel side of that contract actually looks like. The kernel is a program: a set of instructions stored in memory, executed by the same CPU that runs everything else, written in a real programming language by real people. The Linux kernel, as of 2025, is roughly 30 million lines of C, with a small and growing amount of Rust. What makes it the kernel is not what it is made of, but where it sits — in Ring 0, with hardware-enforced privileges no other program has.

When you press the power button on a computer, a sequence unfolds. The CPU starts executing instructions from a fixed firmware address — historically called the BIOS, now usually UEFI — built into the motherboard. The firmware initializes hardware, finds the boot device, and loads a small program called a bootloader (GRUB on Linux, Windows Boot Manager, iBoot on Apple devices). The bootloader's only job is to find the kernel image on disk, load it into memory, and jump to its entry point. From that moment, the kernel runs forever — or until the machine shuts down.

Monolithic vs microkernel: the great schism

There are two philosophies about how to build a kernel. They have been arguing with each other since the late 1980s.

A monolithic kernel puts everything in Ring 0: process scheduling, memory management, filesystems, device drivers, network stacks, all of it. Performance is excellent because subsystems can call each other directly — no boundary crossings. The cost is fragility and security: a bug in any part of the kernel can corrupt the whole machine, and a single device driver with a vulnerability can be the entry point for total compromise. Linux is monolithic. Windows is monolithic. Most kernels in production use are monolithic.

A microkernel takes the opposite stance. It puts only the absolute minimum in Ring 0 — typically just message passing, basic memory protection, and minimal scheduling. Filesystems, drivers, and network stacks all run as separate user-space processes that communicate via messages. A bug in a driver crashes only the driver, not the kernel. The cost is performance: every interaction crosses the user/kernel boundary, and message passing adds latency. Notable microkernels: MINIX (Andrew Tanenbaum's research OS), L4 (used in many embedded systems), seL4 (formally verified — mathematically proven to have no kernel bugs of certain classes).

In 1992, Tanenbaum publicly criticized the then-new Linux in the Usenet newsgroup comp.os.minix, calling its monolithic design "obsolete" and "a giant step back into the 1970s." Linus Torvalds responded — bluntly, and at length. The exchange is one of the most-cited debates in computing history. Tanenbaum was correct in theory. Torvalds was correct in practice. Both are still right, for different reasons, and the schism has never closed.

Fig 4.1 — The Tanenbaum–Torvalds debate, January 1992

The debate ran for weeks. Tanenbaum, the established expert, argued from architecture and theory: microkernels were the future, monolithic designs were a regression to the 1970s, Linux was tied to one specific CPU. Torvalds, twenty-two and writing Linux from his bedroom, argued from running code and pragmatic constraints. Three decades later both have been vindicated, just for different things. The cleanest microkernels (seL4, QNX, the formally-verified ones) win in safety-critical embedded systems where a single kernel bug is unacceptable. Linux runs everywhere else — server farms, phones, supercomputers, refrigerators — because "ships and works" beat "is theoretically correct" everywhere the microkernel's safety premium was not worth its cost. Both are still right. The argument never closed because the two stances are answering different questions.

macOS sits in the middle. Its kernel, XNU, is a hybrid: a Mach microkernel core wrapped with a BSD UNIX layer that runs in the same address space — so you get message-passing primitives and monolithic-kernel performance. iOS uses the same kernel. Android uses Linux. Windows NT was originally microkernel-influenced but has drifted increasingly monolithic for performance reasons.

Fig 4.2 — Monolithic vs microkernel

In a monolithic kernel, every subsystem runs in Ring 0 and shares one address space. Calls between them are direct and fast. In a microkernel, only message passing and the bare minimum live in Ring 0; filesystems, drivers, and network stacks run as ordinary user-space processes that communicate by IPC. Microkernels are safer; monolithic kernels are faster.

Kernel modules: a third path

Linux added a feature called loadable kernel modules (LKMs) that softens the monolithic stance. Drivers and filesystems can be compiled separately from the main kernel and loaded or unloaded at runtime with commands like insmod and rmmod. They still run in Ring 0 — so a buggy module can crash the kernel — but they don't have to be linked into the main image. Most graphics drivers, filesystem drivers, and hardware support on Linux ship as modules. lsmod on a running Linux system will list hundreds of them.

Fig 4.3 — A kernel module loading at runtime

Linux at runtime is a base kernel image plus a constellation of modules that have been linked into it dynamically. Each insmod resolves the module's symbols against the kernel's symbol table, allocates kernel memory, copies the module's code in, and calls its module_init() entry. From that moment the module runs in Ring 0 — a buggy module can panic the whole machine — but the source tree, build system, and distribution channel are independent of the main kernel. Distributions ship pre-compiled modules; vendors (NVIDIA, Broadcom) ship out-of-tree modules signed for specific kernel versions. rmmod reverses the process: call module_exit(), free the memory, remove the symbols. This is how Linux maintains both the speed of monolithic design and the operational flexibility microkernels promised.
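
To make the mechanics concrete, here is a minimal sketch of a module: the classic hello-world LKM, built against the installed kernel headers with the usual kbuild makefile (the filename and messages are illustrative).

c
// hello.c: a minimal loadable kernel module (sketch; built with kbuild against the kernel headers)
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Minimal hello-world LKM");

static int __init hello_init(void)
{
    pr_info("hello: loaded, printing from Ring 0\n");
    return 0;                     /* 0 = success; nonzero aborts the load */
}

static void __exit hello_exit(void)
{
    pr_info("hello: unloading\n");
}

module_init(hello_init);          /* called by insmod after symbols are resolved */
module_exit(hello_exit);          /* called by rmmod before the memory is freed */

sudo insmod hello.ko loads it and the message appears in dmesg; lsmod lists it; rmmod hello removes it. That is the full lifecycle described above, in a dozen lines of C.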

The pattern shows up everywhere: a clean theoretical model, modified by practical necessity. The kernel is monolithic but modular. It runs in Ring 0 but increasingly delegates risky work (drivers in user space, eBPF in a verified sandbox, see Section 6) to safer compartments. The history of operating systems is the history of these compromises.

02 — Scheduling

The art of taking turns

A modern laptop runs hundreds of processes simultaneously. A server may run thousands. Your machine has, at most, a few dozen CPU cores. The arithmetic doesn't work — most processes cannot be running at any given instant. The kernel creates the illusion that they all are, by switching between them many times per second, fast enough that you cannot perceive the gaps. The component that decides which process gets the CPU at any given moment is called the scheduler, and it is one of the deepest and most-studied parts of any kernel.

Cooperative vs preemptive: who interrupts whom

The earliest multi-tasking systems were cooperative: each running program voluntarily yielded control back to the scheduler when it had nothing useful to do. Mac OS through version 9 worked this way. So did 16-bit Windows. The model is simple and lightweight, but it has a fatal flaw: a single misbehaving program that never yields freezes the entire machine. Every Mac OS 9 user remembers the experience of one frozen application taking down everything else.

Modern systems are preemptive: the kernel forcibly takes the CPU back from a running process at regular intervals, regardless of whether the process is ready to give it up. The mechanism is a hardware timer interrupt — a chip on the motherboard that sends an electrical signal to the CPU at a fixed frequency (typically 100–1000 times per second on Linux). Each interrupt forces the CPU into Ring 0, where the scheduler runs, decides whether to switch processes, and either resumes the current one or chooses a different one. The user perceives perfectly smooth multitasking because the timer fires faster than human reaction.

What "scheduling" actually has to decide

The scheduler's job sounds trivial — pick a process to run — but the choices have surprising depth. The classical algorithms each capture a different trade-off:

Algorithm · How it works · Trade-off
FCFS · First-come, first-served. Run processes in arrival order until they finish. · Simple. Terrible response time — one long process blocks all the short ones (the "convoy effect").
SJF · Shortest Job First. Pick the process with the shortest remaining time. · Provably optimal for average wait time. Requires knowing job length in advance — usually you don't.
Round Robin · Each process gets a fixed time slice ("quantum"). When it expires, move to the next. · Fair. Responsive. Doesn't optimize for total throughput.
Priority · Each process has a priority number. Higher priority runs first. · Lets the system favor critical work. Risk: low-priority processes can be starved indefinitely.
MLFQ · Multilevel Feedback Queue. Multiple priority levels; processes that use a lot of CPU drift to lower priorities, processes that wait often rise. · Approximates SJF without knowing job length. Used by Windows, classic UNIX.
CFS · Completely Fair Scheduler. Linux's default from 2007 to 2023 (since succeeded by the closely related EEVDF). Tracks how much CPU each process has used; always runs whichever has used the least. · Approximates "everyone runs at exactly 1/N speed on N processes." O(log N) per scheduling decision via a red-black tree.

Linux's CFS deserves a closer look because it's a particularly elegant idea. The scheduler keeps every runnable process in a self-balancing binary tree (a red-black tree), keyed by virtual runtime — a measure of how much CPU time the process has consumed, weighted by its niceness. Whenever the scheduler needs to pick the next process to run, it picks the leftmost node of the tree (smallest virtual runtime). When a process runs, its virtual runtime increases; when it blocks, it's removed; when it wakes, it's reinserted with its saved value. This naturally implements approximate fairness: every process tends toward equal CPU usage, and any process that has fallen behind catches up first.
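
A toy model of the idea, not the kernel's code: keep a vruntime per task, always run the task with the smallest one, and charge it the slice it used divided by its weight. The real scheduler stores tasks in a red-black tree so that "find the smallest" is O(log N); the linear scan below is purely for clarity, and the task names and weights are made up.

c
// cfs_toy.c: a toy illustration of CFS-style fairness (the real kernel uses a
// red-black tree and nanosecond accounting; this is a sketch of the principle)
#include <stdio.h>

struct task {
    const char *name;
    double weight;     /* derived from niceness; higher weight = larger CPU share */
    double vruntime;   /* virtual runtime: actual runtime divided by weight */
};

static struct task *pick_next(struct task *t, int n)
{
    struct task *best = &t[0];
    for (int i = 1; i < n; i++)          /* real CFS: take the leftmost rb-tree node */
        if (t[i].vruntime < best->vruntime)
            best = &t[i];
    return best;
}

int main(void)
{
    struct task tasks[] = {
        { "editor",    1.0, 0.0 },
        { "compiler",  1.0, 0.0 },
        { "batch-job", 0.5, 0.0 },       /* lower weight: accrues vruntime twice as fast */
    };
    const double slice_ms = 4.0;

    for (int tick = 0; tick < 9; tick++) {
        struct task *next = pick_next(tasks, 3);
        next->vruntime += slice_ms / next->weight;   /* charge it for the slice it used */
        printf("tick %d: ran %-10s vruntime now %5.1f\n", tick, next->name, next->vruntime);
    }
    return 0;
}

Run it and the two weight-1.0 tasks alternate while the weight-0.5 task gets roughly half as many slices: exactly the "everyone runs at 1/N speed, adjusted for niceness" behaviour CFS aims for.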

Fig 4.4 — CFS chooses the leftmost node of a red-black tree

CFS stores all runnable processes in a red-black tree, keyed by virtual runtime (a measure of CPU consumption). The leftmost node has the smallest value — the process most "behind" on its share of the CPU — and is the next to run. Insertion, deletion, and finding the minimum are all O(log N). This is why a Linux machine with thousands of runnable processes still schedules in microseconds.

Fig 4.5 — Four schedulers, one workload

The same four jobs scheduled four different ways. FCFS serves them in arrival order — the eight-unit job A blocks everyone behind it (the convoy effect). SJF reorders by length — A still runs first because it arrived alone, but D (5 units) jumps ahead of C (9 units), giving the lowest possible average wait time. Round Robin with quantum 2 interleaves all four constantly — responsive but never letting any job make a long uninterrupted run. MLFQ demotes processes that use a lot of CPU into lower-priority queues; long-running C ends up at Q3 and waits, while interactive bursts of A get repeated Q1/Q2 attention. Each algorithm is correct, for a different definition of "correct" — average wait, total throughput, responsiveness, fairness. The choice depends on workload.

Queueing theory and Little's Law

The mathematics underlying scheduling is queueing theory, developed in the early 20th century by Agner Krarup Erlang for telephone networks. The single most important result is Little's Law:

L = λ · W

The average number of items in a queueing system (L) equals the average arrival rate (λ) multiplied by the average time each item spends in the system (W). It holds for any stable queue, regardless of arrival distribution or service distribution. Combined with how queues behave near saturation, it explains why a slightly oversubscribed system — λ approaching capacity — produces explosive wait times: as utilization approaches 100%, W grows without bound. Every web server, every database, every operating system kernel obeys this. It is why the difference between a 95%-loaded and a 99%-loaded server is not 4% — it is often 10× in latency.

Fig 4.6 — L = λ · W and the cliff at full utilization

Little's Law in two pictures. Top: a queue with arrivals (λ) flowing into a buffer of average length L; each item waits average W. The relationship L = λW is true for any stable queue, regardless of how arrivals are distributed or how long service takes. Bottom: the consequence — average wait time as utilization approaches 100% follows W ~ 1/(1−ρ), which is hyperbolic. At 50% utilization, doubling load barely budges the wait. At 95%, doubling load is catastrophic. This is why operations teams panic when a server crosses about 80% sustained utilization — they are not panicking about "the server being slow," they are panicking about the math curve they can see coming.
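
The blow-up is easy to reproduce with the simplest textbook queue. For an M/M/1 queue (Poisson arrivals, a single server, exponential service times; a modelling assumption, not a property of any particular server) the average time in system is W = 1/(μ − λ), and Little's Law then gives L = λW:

c
// mm1.c: wait time and queue length vs utilization for an M/M/1 queue
// W = 1 / (mu - lambda), L = lambda * W  (Little's Law)
#include <stdio.h>

int main(void)
{
    const double mu = 1000.0;                  /* service capacity: 1000 requests/sec */
    const double utilizations[] = { 0.50, 0.80, 0.90, 0.95, 0.99 };

    for (int i = 0; i < 5; i++) {
        double rho    = utilizations[i];
        double lambda = rho * mu;              /* arrival rate in requests/sec */
        double W      = 1.0 / (mu - lambda);   /* average seconds in the system */
        double L      = lambda * W;            /* average requests in the system */
        printf("rho=%2.0f%%  lambda=%4.0f/s  W=%7.2f ms  L=%5.1f\n",
               rho * 100, lambda, W * 1000, L);
    }
    return 0;
}

At 50% utilization the average wait is 2 ms; at 99% it is 100 ms. The last 49 percentage points of load buy a 50× increase in latency.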

Real-time scheduling: when missing a deadline kills

General-purpose schedulers like CFS optimize for average performance. They make no guarantees about worst-case latency. For most software this is fine. For some software it is catastrophically not fine: anti-lock braking systems, pacemakers, avionics flight control, industrial robots. These systems run on real-time kernels — variants of Linux (PREEMPT_RT) or specialized OSes (VxWorks, QNX, FreeRTOS) — that guarantee a task will run within a bounded time after it becomes ready, even under load. The mathematical foundation is rate-monotonic scheduling and earliest-deadline-first scheduling, analyzed by Liu and Layland in a foundational 1973 paper.

Fig 4.7 — A periodic task and the deadline it must meet

A real-time task that runs at 100Hz must complete its work within each 10ms window. The first three deadlines are met (5ms work, 10ms window — half the budget). The fourth job runs longer than expected (12ms) and overruns its deadline. In a non-real-time system this would be a slow frame or a stutter; in an anti-lock braking controller, a pacemaker, or a flight control loop, it is a system failure with physical consequences. Real-time kernels use schedulers (rate-monotonic, earliest-deadline-first) that mathematically guarantee deadlines provided the total task utilization stays below a known bound — about 69% for rate-monotonic, 100% for EDF. Above that bound, missed deadlines become possible, and in a hard real-time system that is the same as broken.
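
The Liu–Layland admission test is simple enough to run by hand: n periodic tasks are guaranteed schedulable under rate-monotonic priorities if their total utilization U = ΣCᵢ/Tᵢ stays at or below n(2^(1/n) − 1), a bound that falls toward ln 2 ≈ 0.693 as n grows. A sketch of the test with made-up task parameters:

c
// rm_check.c: the Liu & Layland (1973) schedulability test for rate-monotonic scheduling
// (compile with -lm for pow)
#include <stdio.h>
#include <math.h>

struct rt_task { double wcet_ms; double period_ms; };   /* worst-case execution time, period */

int main(void)
{
    /* hypothetical task set: a 100 Hz control loop plus two slower periodic tasks */
    struct rt_task tasks[] = { { 2.0, 10.0 }, { 5.0, 40.0 }, { 10.0, 100.0 } };
    int n = 3;

    double U = 0.0;
    for (int i = 0; i < n; i++)
        U += tasks[i].wcet_ms / tasks[i].period_ms;

    double bound = n * (pow(2.0, 1.0 / n) - 1.0);       /* about 0.780 for n = 3 */

    printf("total utilization U = %.3f, RM bound = %.3f\n", U, bound);
    if (U <= bound)
        printf("guaranteed schedulable under rate-monotonic priorities\n");
    else if (U <= 1.0)
        printf("maybe schedulable (the bound is sufficient, not necessary); EDF handles any U <= 1\n");
    else
        printf("overloaded: some deadline will be missed under any scheduler\n");
    return 0;
}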

03 — Virtual memory

The lie of unlimited memory

Chapter 1 introduced virtual memory as the kernel's mechanism for isolating processes; the Bridge showed the MMU and TLB as the silicon that makes the mechanism enforceable. We saw the headline: every process believes it has its own private address space starting at zero. Now we look at the machinery that sustains the illusion — the page tables, the translation cache called the TLB, the page faults that quietly bring memory into existence on demand, and the mathematical structures that make 64-bit addressing tractable at all.

The translation problem

Every memory access a program performs uses a virtual address. The CPU cannot use this directly; physical RAM is addressed by physical addresses. Some hardware must, on every load and store, translate one to the other. That hardware is the Memory Management Unit (MMU), built into the CPU. The translation table it consults is the page table, maintained by the kernel.

Memory is divided into fixed-size blocks called pages, typically 4 KB. The address space is therefore divided into virtual pages, and physical RAM into physical pages (often called "page frames"). The page table maps one to the other. When a virtual page has no corresponding physical page in RAM — because it's never been used, or has been swapped to disk, or belongs to a memory-mapped file not yet loaded — the table entry is marked invalid, and the MMU raises a hardware exception called a page fault. The kernel's page-fault handler decides what to do.

Why the page table cannot be flat

A naive page table would be a single flat array — one entry for every possible virtual page. The number is enormous. On 64-bit x86, the architecture defines a 48-bit usable virtual address space (256 terabytes), giving 2³⁶ pages. A flat table with one 8-byte entry per page would need 512 GB just for the table. Per process. This is plainly impossible.

The solution is a multi-level page table — a tree. The 48-bit virtual address is split into four 9-bit fields plus a 12-bit page offset. Each 9-bit field indexes into one level of a four-level tree. Most branches are empty (the address space is sparse — your process only uses a tiny fraction of the 256 TB available), so most of the tree is never allocated. A real x86-64 process typically uses a few megabytes of page tables to map its actual memory, not 512 gigabytes.

Fig 4.8 — x86-64 four-level page table walk

Translating one virtual address into a physical address requires walking four levels of page tables. The CPU register CR3 points to the root (L4). Each level uses 9 bits of the address to index into a 512-entry table; the entry points to the next level. The bottom 12 bits give the byte offset within the final page. Without optimization, every memory access would cost five memory accesses.
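
The split is nothing more than bit arithmetic, which is exactly why hardware can do it quickly. A sketch that decomposes an arbitrary user-space address into the four table indices and the page offset:

c
// va_split.c: decompose an x86-64 virtual address into page-table indices
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t va = 0x00007f3a12345678ULL;        /* an arbitrary user-space address */

    unsigned offset   = va & 0xFFF;             /* bits 0-11:  byte within the 4 KB page */
    unsigned pt_idx   = (va >> 12) & 0x1FF;     /* bits 12-20: index into the L1 page table */
    unsigned pd_idx   = (va >> 21) & 0x1FF;     /* bits 21-29: index into the L2 page directory */
    unsigned pdpt_idx = (va >> 30) & 0x1FF;     /* bits 30-38: index into the L3 PDPT */
    unsigned pml4_idx = (va >> 39) & 0x1FF;     /* bits 39-47: index into the L4 PML4 */

    printf("va = 0x%016llx\n", (unsigned long long)va);
    printf("PML4=%3u  PDPT=%3u  PD=%3u  PT=%3u  offset=0x%03x\n",
           pml4_idx, pdpt_idx, pd_idx, pt_idx, offset);
    return 0;
}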

The TLB: a cache for translations

A naive page-table walk would be ruinously slow — every memory access would require five memory accesses (four for the table walk, one for the actual data). The fix is the Translation Lookaside Buffer (TLB), a small, very fast cache inside the CPU that stores recent virtual-to-physical translations. A typical x86-64 TLB has between 64 and 1500 entries. When the CPU needs to translate an address, it first checks the TLB. If the entry is there (a "TLB hit"), translation takes one cycle. If not (a "TLB miss"), the MMU walks the page tables and inserts the result into the TLB.

Hit rates on the TLB are typically above 99%. The 1% miss rate, multiplied by billions of memory accesses per second, still matters — and is why CPU designers have steadily grown TLB sizes and added second-level TLBs over the past two decades. When the kernel switches between processes, it must flush parts of the TLB (since the new process has different page tables). This is one of the hidden costs of context switching, and one reason why excessive switching hurts performance.

Page faults — the productive kind

A page fault sounds like an error. Most of them are not. There are several kinds, and the everyday ones are how the kernel implements many of its most useful features. They fall into three rough categories:

A minor page fault happens when the page is in physical memory but isn't yet mapped into this process. Example: when a program first reads a page of a file the kernel has cached. The kernel just adds an entry to the page table and returns. Cost: microseconds.

A major page fault happens when the page is not in RAM and must be fetched from disk — typically because it was swapped out, or because the program is reading a memory-mapped file for the first time. The kernel issues a disk read, suspends the process, and resumes it when the data arrives. Cost: milliseconds. Tens of thousands of times slower than minor faults.

An invalid page fault happens when the access truly is illegal — writing to a read-only page, dereferencing a null pointer, executing data marked non-executable. The kernel sends the offending process a SIGSEGV signal, and unless the process catches it, the program dies with the famous "segmentation fault" message.

Fig 4.9 — A page fault, decision tree

A "page fault" is a CPU exception, but most faults are not errors — they are how the kernel implements memory management on demand. Minor faults are pure bookkeeping: the page is in RAM (perhaps already in another process's mapping or in the kernel's page cache), it just hasn't been wired into this process's page tables yet. Major faults involve disk I/O — the page must be paged in from swap, or read from a memory-mapped file for the first time — and are about a thousand times slower. Invalid faults are the real errors: the dereference of a NULL pointer, the write to a read-only page, the jump into non-executable data. Only invalid faults trigger SIGSEGV. On a typical Linux desktop, the kernel handles thousands of minor faults per second invisibly.

Two beautiful uses of page faults

Memory-mapped files. When you call mmap() on a file, the kernel doesn't read the file. It sets up page table entries marking the relevant virtual addresses as backed by that file, but invalid (not yet present). The first time you actually access a page, you take a page fault, and the kernel reads just that one page from disk. Reading a 100 GB file as if it were a contiguous array in memory becomes trivial; only the pages you touch are loaded. This is how databases, search engines, and many high-performance systems handle large data — and how every shared library loads on Linux.
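
In code, the laziness is invisible. The sketch below maps a hypothetical large file and touches one byte in the middle; only the 4 KB page containing that byte is ever read from disk.

c
// map_file.c: demand paging via mmap(); only touched pages are read from disk
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
    int fd = open("huge.dat", O_RDONLY);          /* hypothetical large file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    /* Map the whole file. No data is read here; the kernel only sets up page-table entries. */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* This first access faults; the kernel reads just this one 4 KB page and resumes the load. */
    char byte = data[st.st_size / 2];
    printf("byte in the middle of the file: 0x%02x\n", (unsigned char)byte);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}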

Copy-on-write fork. When a UNIX process calls fork() to create a child process, the child receives a copy of the parent's entire address space. Naively, this would require duplicating gigabytes of memory. It doesn't. The kernel marks all of the parent's pages as read-only and shares them with the child. As long as neither process writes to a page, they share it. Only when one of them tries to write does a page fault occur, and the kernel makes a private copy at that moment. Most pages are never written; most fork-and-exec sequences (the standard way to launch a new program on UNIX) never copy any pages at all. The cheapest way to make a copy is to lie about having made it.

Fig 4.10 — Fork, then write: when shared pages diverge

After fork(), both parent and child point at the same physical pages, all marked read-only. The parent and child page tables are different — they could diverge — but no actual copying has happened. The instant either side writes to a shared page, the MMU raises a page fault. The kernel's COW handler allocates a fresh physical page, copies the contents, fixes up the page table of the writer to point at the new page (read-write), and resumes the instruction. The non-writing process never noticed. The famous fork-and-exec idiom — fork a child that immediately calls execve() to load a new program — copies zero pages, because the child throws away its address space before writing anything. This is why launching a UNIX program is essentially free.
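
The effect is observable from user space: after fork(), a write by the child lands in a private copy that the parent never sees, even though nothing was copied up front. A sketch:

c
// cow_fork.c: after fork(), pages are shared until one side writes
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    static char buffer[4096] = "original contents";

    pid_t pid = fork();                   /* child gets the same pages, marked read-only (COW) */
    if (pid == 0) {
        /* Child: the first write faults, the kernel copies the page, and the write
           lands in the child's private copy. */
        strcpy(buffer, "child wrote this");
        printf("child  sees: %s\n", buffer);
        _exit(0);
    }

    waitpid(pid, NULL, 0);                /* wait so the output doesn't interleave */
    printf("parent sees: %s\n", buffer);  /* unchanged: the copied page belonged to the child */
    return 0;
}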


Dirty COW (CVE-2016-5195). A bug in Linux's copy-on-write logic, present in the kernel for nine years before being discovered, allowed an unprivileged attacker to write to read-only files — including /etc/passwd — by exploiting a race condition between the page fault handler and the kernel's COW machinery. Privilege escalation followed trivially. The bug's existence and longevity are a reminder of how much of operating system security depends on subtle correctness in subsystems most users will never see.

Fig 4.11 — Dirty COW: the race that broke Linux

Dirty COW exploited a race in the COW path. A thread tries to write a read-only file mapping, taking a page fault. The kernel begins COW: it allocates a copy and is about to install it as read-write in the writer's page table. Between "allocate copy" and "install copy in page table" there is a brief window. A second thread in the same process calls madvise(MADV_DONTNEED) on the same page — which legally discards the COW mapping. The first thread resumes, retries the write, but now the page table no longer has the copy installed, so the write goes through to the original mapping — the read-only file. An unprivileged process can therefore write to any file it can read. /etc/passwd is world-readable. Local privilege escalation in 60 lines of C. The bug had been in the kernel since 2007. Phil Oester reported it after seeing it abused on a production system; Linus Torvalds patched it within hours; every Linux distribution shipped fixes within days.

04 — Filesystems

A tree of names on a flat array of blocks

A disk, at its lowest level, is a flat array of fixed-size blocks. A spinning hard drive presents itself to the OS as billions of 512-byte sectors numbered 0, 1, 2, … An SSD presents the same abstraction even though the physical reality underneath is radically different. The kernel sees: a sequence of bytes, addressed by index. From this raw substrate, the filesystem builds the structure you actually use — files with names, organized into directories, organized into a tree, with metadata about ownership and permissions and timestamps. None of this structure exists at the hardware level. It exists because the filesystem code pretends it does.

The inode: where a file actually lives

In UNIX-derived filesystems (which is most of them), the central data structure is the inode — short for "index node." Every file has exactly one inode, and the inode contains everything about the file except its name and its data:

size · file size in bytes · e.g. 24,873
owner · user ID and group ID · uid, gid
perm · read/write/execute permissions · rwxr-xr--
times · last status change, modification, access · ctime, mtime, atime
links · how many filenames point here · link count
blocks · array of disk block numbers holding the data · 12 direct + indirect

Notice what's not there: the filename. In UNIX, a filename is a property of the directory it lives in, not the file. A directory is itself a special kind of file whose contents are a list of (name, inode-number) pairs. To open /home/yki/notes.txt, the kernel looks up the inode for /, reads its directory contents to find the entry for home, follows that to the next directory's inode, and so on, until it reaches notes.txt's inode. Then it uses the inode's block list to read the actual data.

This separation of name from data has elegant consequences. A single file can have multiple names — multiple directory entries pointing to the same inode. These are called hard links. The file is only deleted when its link count drops to zero. It is also why renaming a huge file is instant: only the directory entry changes. The inode and its data don't move.
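
stat() exposes the inode directly, and link() creates the second directory entry. In the sketch below (the filenames are hypothetical), both names report the same inode number and a link count of 2.

c
// inode_demo.c: a filename is a directory entry; the inode is the file
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

static void show(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0) { perror(path); return; }
    printf("%-12s inode=%llu  links=%lu  size=%lld bytes\n",
           path, (unsigned long long)st.st_ino,
           (unsigned long)st.st_nlink, (long long)st.st_size);
}

int main(void)
{
    show("notes.txt");                   /* hypothetical file in the current directory */
    link("notes.txt", "backup.txt");     /* second directory entry, same inode */
    show("notes.txt");
    show("backup.txt");                  /* same inode number, link count now 2 */
    unlink("backup.txt");                /* removes the name; link count drops back to 1 */
    return 0;
}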

Fig 4.12 — Filename → inode → data blocks

A path lookup walks directories to find an inode. The inode contains metadata and a list of data blocks. Two directory entries pointing to the same inode (notes.txt and backup.txt above) are hard links — the same file with two names. Data blocks are not necessarily contiguous; finding them quickly is one of the filesystem's main jobs.

The everything-is-a-file principle

UNIX took the inode concept further than just regular files. Every kernel-managed resource is exposed as a file:

  • Your keyboard is /dev/input/event0.
  • Your sound card is /dev/snd/pcmC0D0p.
  • Random numbers come from /dev/urandom.
  • The state of every running process lives under /proc/<pid>.
  • Network connections are accessed through file descriptors with read and write.
  • Hardware sensors expose temperature and voltage as files in /sys.

The same five system calls — open, read, write, close, lseek — work on all of them. This radical uniformity is one of the reasons UNIX-derived systems became so dominant: tools written for files just work on devices, network connections, and process state without modification. The shell pipeline you can use to count lines in a text file (cat file | wc -l) works equally well on the output of a sensor driver or a debugging tool, because to the kernel they are all just files.
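
The uniformity is literal: the same calls that read a text file read the kernel's random-number generator. A sketch:

c
// everything_is_a_file.c: the same syscalls work on a device node
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    unsigned char buf[8];

    int fd = open("/dev/urandom", O_RDONLY);   /* a device, opened like any file */
    if (fd < 0) { perror("open"); return 1; }

    read(fd, buf, sizeof buf);                 /* read() ends up in a driver, not a filesystem */
    close(fd);

    printf("8 random bytes from the kernel: ");
    for (size_t i = 0; i < sizeof buf; i++)
        printf("%02x", buf[i]);
    printf("\n");
    return 0;
}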

Journaling: the database technique that saved filesystems

A filesystem that simply writes data and metadata to disk wherever convenient has a fatal weakness: power loss. If the machine crashes between writing data and updating its corresponding metadata (or vice versa), the filesystem becomes inconsistent — files exist whose blocks are still listed as free, or directory entries point at non-existent inodes. The classic UNIX response was fsck, a program that scans the entire disk after a crash to find and fix inconsistencies. On a multi-terabyte disk this could take hours.

The modern solution, borrowed from database theory, is the journal — a small, dedicated region of the disk where every intended modification is written first, before being applied to the main filesystem. After a crash, the kernel only has to replay the journal from where it left off, applying or discarding incomplete operations. Recovery takes seconds instead of hours. ext4 (Linux), NTFS (Windows), and HFS+ (older macOS) all journal. The technique is called write-ahead logging; we'll meet it again in Chapter 13 when we get to databases.

Fig 4.13 — Journaling: write your intent before doing it

A journaling filesystem does every write twice. First, the kernel writes a description of the intended change to a small journal region, followed by a single-block commit record (whose write is atomic by hardware guarantee). Then it applies the change to the actual filesystem at its leisure — possibly reordered, batched, or coalesced with other writes. If the machine crashes between commit and apply, recovery is straightforward: replay every journal entry whose commit record is intact, drop the rest. The technique comes from database write-ahead logging (Gray, IBM, 1981), and is now in ext4, NTFS, HFS+, JFS, XFS, and every other serious modern filesystem. It is also why a sudden power loss on a modern machine takes seconds to recover from, not hours.
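
The shape of the protocol fits in a few lines. A toy sketch, not a real filesystem (the filenames, record format, and block size are invented): write the intent, make it durable, commit, and only then touch the main data.

c
// wal_sketch.c: the shape of write-ahead logging, reduced to a toy
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int journal = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    int mainfs  = open("data.db",     O_WRONLY | O_CREAT, 0644);
    const char *newdata = "new contents of block 502";

    /* 1. Describe the intended change in the journal and make it durable. */
    dprintf(journal, "INTENT block=502 len=%zu data=%s\n", strlen(newdata), newdata);
    fsync(journal);

    /* 2. Write the commit record. Once this single small write is durable,
          crash recovery will replay the change. */
    dprintf(journal, "COMMIT\n");
    fsync(journal);

    /* 3. Apply to the main "filesystem" at leisure; a crash before or during
          this step is recoverable from the journal. */
    pwrite(mainfs, newdata, strlen(newdata), 502 * 4096);
    fsync(mainfs);

    close(journal);
    close(mainfs);
    return 0;
}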

The newest generation of filesystems — copy-on-write filesystems like ZFS, btrfs, and Apple's APFS — go further. They never overwrite existing data. Every modification writes new blocks; only after the write succeeds is the metadata updated to point to them. The old version remains until garbage-collected, which gives you essentially-free snapshots, atomic operations on entire directories, and built-in checksumming to detect silent disk corruption.

Fig 4.14 — Copy-on-write filesystem: never overwrite, just re-point

A copy-on-write filesystem never modifies a block in place. Updating one byte of block 103 means: allocate a new block (103′), copy 103's contents, apply the change, then re-write the parent that pointed at 103 to point at 103′ instead — and so on, all the way up to the root. The old version of every block remains valid until explicitly freed. A snapshot is just a saved copy of the old root pointer; it costs zero bytes until something diverges. Every block can include a checksum that the filesystem verifies on read, so silent disk corruption (which happens, even on enterprise hardware) gets caught instead of propagating. The trade is write amplification: a one-byte change touches a full block at every level of the tree. ZFS, btrfs, and APFS all make this trade and consider it a bargain. WAFL (NetApp) made the trade first, in 1992, and built a multi-billion-dollar storage business on it.

05 — IPC

How processes talk

Processes are isolated by design. The whole point of virtual memory and privilege separation is that one process cannot reach into another's memory and read or modify it. This is the foundation of operating system security. But isolation taken to its extreme produces a useless system — programs that can't share anything would be unable to compose into pipelines, can't coordinate, can't even tell each other when work is done. So the kernel exposes a controlled set of mechanisms for processes to communicate. These are collectively called inter-process communication, or IPC.

Each IPC mechanism is a different point on the trade-off between convenience, performance, and flexibility. The set below is roughly chronological — older mechanisms at the top, newer at the bottom.

Mechanism · How it works · Best for
Pipe · A unidirectional byte stream between two related processes. The shell | operator creates one. · Chaining commands together. The classic UNIX pipeline.
Named pipe (FIFO) · A pipe with a filesystem name. Any process with permission can connect. · Letting unrelated processes communicate via a known path.
Signal · A small numeric notification sent to a process. SIGTERM, SIGKILL, SIGINT (Ctrl+C). · Asynchronous control: "stop," "reload config," "we're shutting down."
Shared memory · Two processes map the same physical pages into both their address spaces. · The fastest IPC. No copying. Used by databases and high-performance systems.
Message queue · A kernel-managed queue of typed messages between processes. · Structured communication; less common today.
Semaphore · A kernel-managed counter used to coordinate access to a shared resource. · Synchronizing without sharing data — "is it my turn yet?"
UNIX socket · Like a network socket, but local — kernel-mediated, with proper authentication. · Modern desktop IPC. Used by Docker, systemd, X11, Wayland.
Network socket · TCP or UDP connection, possibly to another machine. · Distributed systems. We'll cover these in depth in Part III.

The pipe: UNIX's most beautiful idea

The pipe was added to UNIX in 1973 at the insistence of Doug McIlroy. Mechanically, a pipe is a kernel-allocated ring buffer with two file descriptors attached: one process writes to one end, another reads from the other. The kernel handles flow control automatically — if the buffer is full, the writer blocks; if it is empty, the reader blocks. No shared memory. No locks. No protocol. Just a stream of bytes.

Fig 4.15 — A pipe is a kernel ring buffer with two ends

A pipe is two file descriptors attached to one kernel-allocated ring buffer. The writer's write() deposits bytes; the reader's read() consumes them. The kernel handles synchronisation: when the buffer fills, the writer blocks until space appears; when it empties, the reader blocks until bytes arrive. No userspace synchronisation is required because the synchronisation lives in the kernel's well-tested code paths. This is the simplest possible mechanism for streaming data between two processes — and is the same machinery that fifty years of UNIX shell pipelines have been built on.

The pipe's syntactic appearance in the shell is dazzlingly simple:

shell
# Count how many unique lines start with "ERROR" in a log:
cat server.log | grep "^ERROR" | sort | uniq | wc -l

# Five separate processes, each doing one small thing.
# The shell creates pipes between them. Output of one
# becomes input of the next, streamed byte-by-byte.

Each program in the pipeline does one small task, knows nothing about the others, and reads its input from standard input and writes to standard output as if they were ordinary files. The kernel arranges those file descriptors to be the ends of pipes. Five processes execute in parallel; the output of one streams into the next as fast as either can handle it. The model is so productive that it has been imitated in essentially every shell since. It is one reason why UNIX produced a culture of small composable tools rather than monolithic applications.
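
What the shell does for each | can be sketched in a handful of syscalls: create a pipe, fork a child per stage, and use dup2() to splice the pipe's ends over stdout and stdin before exec. A minimal two-stage version of the pipeline above, with error handling omitted:

c
// mini_pipeline.c: roughly what the shell does for   cat server.log | wc -l
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int fds[2];
    pipe(fds);                               /* fds[0] = read end, fds[1] = write end */

    if (fork() == 0) {                       /* first child: the producer */
        dup2(fds[1], STDOUT_FILENO);         /* its stdout now feeds the pipe */
        close(fds[0]); close(fds[1]);
        execlp("cat", "cat", "server.log", (char *)NULL);
        _exit(127);                          /* only reached if exec fails */
    }

    if (fork() == 0) {                       /* second child: the consumer */
        dup2(fds[0], STDIN_FILENO);          /* its stdin now drains the pipe */
        close(fds[0]); close(fds[1]);
        execlp("wc", "wc", "-l", (char *)NULL);
        _exit(127);
    }

    close(fds[0]); close(fds[1]);            /* the parent keeps no ends open */
    wait(NULL); wait(NULL);
    return 0;
}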

Fig 4.16 — The UNIX philosophy, in five processes

Five processes, each doing one small task, with pipes between every pair. cat reads the file. grep filters to lines starting with ERROR. sort orders them. uniq drops adjacent duplicates. wc -l counts what's left. None of these programs was written knowing about the others; each just reads from stdin and writes to stdout. The shell connects them with pipes, and the kernel runs all five in parallel — bytes stream through the pipeline as fast as the slowest stage allows. This is the canonical illustration of the UNIX philosophy: composition through tiny, sharp tools and a universal interface (text bytes), so any new tool you write joins the pipeline for free.

"This is the Unix philosophy: write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface."

— Doug McIlroy, inventor of the UNIX pipe

Signals: a doorbell to a process

Pipes carry data. Signals carry no data — only the bare fact that something happened. A signal is a small, numbered notification delivered asynchronously to a process: number 2 (SIGINT) is what the kernel sends when you press Ctrl+C; number 9 (SIGKILL) is the uncatchable "stop now"; number 15 (SIGTERM) is the polite "please stop." A process can register a handler function for most signals; when the signal arrives, the kernel hijacks the running process, runs the handler, then resumes. Three things can happen when a signal arrives: the process handles it, ignores it, or takes the default action; for SIGKILL and SIGSTOP the default (termination or suspension) is forced, with no recourse.

Fig 4.17 — A signal arrives mid-execution

A signal arrives at an arbitrary moment in the target process's execution. The kernel queues it, then — at the next kernel-to-userspace transition (a syscall return, a page-fault return, etc.) — diverts the process to its registered handler. The handler runs as if it were a function call inserted at that exact instruction; when it returns, the kernel restores the original CPU state and the process resumes. Some signals are uncatchable: SIGKILL (number 9) terminates the process unconditionally; the kernel does not even consult the process. SIGSTOP suspends it. Both exist because a privileged user must always be able to stop a misbehaving process — even one that has registered handlers for everything else. Signals are crude: a single integer per notification, no data, no queueing of duplicates of the same signal. They predate every other Unix IPC mechanism and remain the standard way to control the lifecycle of a process.
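
Registering a handler is one syscall. The sketch below catches SIGINT (Ctrl+C) and exits politely after three of them; the same sigaction() call on SIGKILL would simply fail with EINVAL.

c
// sigint_demo.c: catching SIGINT with sigaction()
#include <stdio.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t interrupts = 0;

static void on_sigint(int signo)
{
    (void)signo;
    interrupts++;                  /* only async-signal-safe work belongs in a handler */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);  /* SIGKILL and SIGSTOP cannot be registered at all */

    while (interrupts < 3) {
        printf("working... (Ctrl+C pressed %d times so far)\n", (int)interrupts);
        sleep(1);
    }
    printf("three interrupts received, exiting politely\n");
    return 0;
}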

Shared memory and semaphores: the no-copy IPC

Pipes and signals are convenient. They are not fast. Every byte through a pipe is copied: from the writer's userspace into the kernel buffer, then out of the kernel buffer into the reader's userspace. For high-throughput communication — gigabytes per second between two processes on the same machine — shared memory is the only viable mechanism. Two processes call into the kernel to map the same physical pages into both their virtual address spaces. After that, reading and writing the shared region is no slower than reading and writing any other memory. There is no copy. There is no kernel involvement on every access.

The cost of zero-copy is that you now have a classic concurrency problem: two processes touching the same memory must coordinate so that neither sees the other mid-update. The kernel exposes semaphores for this — counters that processes increment and decrement atomically, used as gates. A semaphore initialised to 1 acts as a mutex (one process at a time may pass). Initialised to N, it acts as a resource pool. The mathematics is Edsger Dijkstra's from 1965, formalised in his note "Cooperating Sequential Processes," and underlies every modern concurrent system.

Fig 4.18 — Two processes, one region, one gate

Two processes share a physical page by mapping it at (potentially different) virtual addresses. Reads and writes against that region are ordinary CPU loads and stores — no kernel involvement, no copying, no system calls in the fast path. The cost is coordination: if both processes write at the same time, the result is undefined. A semaphore, initialised to 1, lets exactly one of them through at a time. sem_wait() atomically decrements; if the value would go negative, the caller blocks until sem_post() elsewhere releases the gate. This is shared memory + semaphore: the IPC of choice for performance-sensitive systems where the cost of even one extra copy per byte is unacceptable. Database engines, video pipelines, audio mixers, and high-frequency trading systems are all built on it.
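
A minimal sketch of the pair using the POSIX APIs: an anonymous shared mapping plus a process-shared semaphore, here between a parent and its child (on Linux, compile with -pthread).

c
// shm_sem.c: shared memory guarded by a semaphore
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <semaphore.h>

int main(void)
{
    /* One page shared between parent and child; accesses are plain loads and stores. */
    char *region = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    /* A process-shared semaphore, initialised to 0: the child waits until the parent posts. */
    sem_t *gate = mmap(NULL, sizeof(sem_t), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    sem_init(gate, /*pshared=*/1, /*value=*/0);

    if (fork() == 0) {                    /* child */
        sem_wait(gate);                   /* blocks until the parent has written */
        printf("child read: %s\n", region);
        _exit(0);
    }

    strcpy(region, "hello through shared memory");  /* plain store: no syscall, no copy */
    sem_post(gate);                                  /* open the gate for the child */
    wait(NULL);
    return 0;
}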

Sockets: when the other process is on another machine

A socket is a generalization of the pipe. Instead of communicating with another process on the same machine through a kernel-managed pipe, you communicate over a connection — either to another local process (a "UNIX domain socket") or to a process on a different machine entirely (a "network socket," using TCP or UDP). The same system calls — read, write, close — apply. From the application's perspective, the network is just another file. The kernel, the network drivers, the protocol stack, and the wire are all hidden beneath a four-call API.

This abstraction is one of the most important pieces of glue in modern computing. It is why a web browser, a database client, a video conferencing app, and an SSH session all use roughly the same code shape: open a socket, write a request, read a reply, close. We will spend Part III opening up what happens between write and read on a network socket — a journey through ARPANET, packet switching, TCP, DNS, TLS, HTTP, and the rest of the internet stack. For now: the kernel makes it look like a file. That, more than anything else, is the kernel's job.
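
The "just another file" claim is visible in the shape of the code. A sketch of the universal client loop; the address and port are placeholders, and everything that happens between the write and the read is Part III's subject.

c
// tiny_client.c: open a socket, write a request, read a reply, close
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);        /* a TCP socket is just a file descriptor */

    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(8080);                   /* placeholder port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr); /* placeholder address */

    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) != 0) {
        perror("connect");
        return 1;
    }

    const char *request = "ping\n";
    write(fd, request, strlen(request));             /* the same write() used for files and pipes */

    char reply[256];
    ssize_t n = read(fd, reply, sizeof reply - 1);   /* the same read(), too */
    if (n > 0) { reply[n] = '\0'; printf("reply: %s", reply); }

    close(fd);
    return 0;
}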

06 — Security

When the kernel breaks

Everything we've built in this chapter rests on one assumption: the kernel is correct. The privilege boundary, the address-space isolation, the file permission checks, the IPC mediation — all of it depends on the kernel actually doing what it says it does. A bug in the kernel is therefore not an ordinary bug. It is potentially the end of every security guarantee on the machine. A user-space program with a buffer overflow can be exploited to take over that program. A kernel with the same kind of bug can be exploited to take over everything running on the entire computer, including the kernel itself, every other process, every file, and every keystroke. This is what kernel security is about.

Privilege escalation, more carefully

A privilege escalation attack is one where a process gains more privilege than it was supposed to have. The two important versions:

Vertical escalation — moving up the privilege ladder. A normal user process exploits a bug to become root, or root exploits a bug to enter Ring 0 (kernel mode). Once in Ring 0, an attacker controls the machine without restriction.

Horizontal escalation — moving sideways at the same privilege level into another process's data. Reading another user's files, hijacking another user's session.

Memorable kernel-level privilege escalation bugs in the last decade include Dirty COW (mentioned earlier — copy-on-write race condition, 2016), DirtyPipe (a flaw in pipe-buffer initialization that allowed writes to read-only files, 2022), and dozens of bugs in device drivers — driver code is vast, often less audited than the core kernel, and runs in Ring 0. Most modern kernel exploits chain together multiple bugs: an information leak (to defeat KASLR, the kernel's version of ASLR), then a memory-corruption primitive (to overwrite something useful), then a privilege escalation.

Containers: not virtual machines

You've probably heard of Docker, Kubernetes, and "containers." They are often described as "lightweight VMs," which is misleading. A virtual machine emulates an entire computer — its own kernel, its own everything — running on top of a hypervisor. Containers do not. A container is a normal Linux process whose view of the system has been restricted by the kernel. There is one shared kernel. The container is just a process the kernel has lied to.

The lying is done with two Linux features:

Namespaces partition kernel resources so that different processes see different views. There are namespaces for process IDs (a containerized process believes it is PID 1, even though to the host it might be PID 8472), for the network (the container has its own virtual network interface), for mounts (its own root filesystem), for user IDs, for hostnames, and more. Inside its namespaces, the container looks like a complete isolated system.

Cgroups (control groups) limit how much of each resource a process group can use — CPU time, memory, disk I/O, network bandwidth. Together with namespaces, this gives Docker its model: cheap, fast, lightweight isolation that doesn't require a hypervisor.

Fig 4.19 — VMs vs containers

A virtual machine runs a complete guest kernel on emulated hardware — strong isolation, but heavy. A container is just a process the host kernel has restricted with namespaces (what it can see) and cgroups (what it can use). One shared kernel, much less overhead — but a kernel bug in that shared kernel can break out of every container at once.

The trade-off matters: VMs isolate kernels, so a compromise of one guest doesn't affect others. Containers share a kernel, so a kernel privilege-escalation bug can let a malicious container take over the host and every sibling container. Cloud providers running multi-tenant workloads typically use VMs (or VM-strengthened containers like AWS Firecracker) for this reason; single-tenant deployments use plain containers because they're far cheaper.

Fig 4.20 — Namespaces: each container sees its own world

Namespaces are how the kernel lies to a process about the world. There are eight kinds — PID, network, mount, IPC, UTS (hostname), user, cgroup, and time — and each can be independently configured per process. Container A's ps shows only the processes inside its PID namespace; its ifconfig shows only its virtual interfaces; its / is the rootfs of its container image. Container B sees a completely different version of all of those, even though both processes are running under the same kernel, on the same machine, accessing the same physical RAM. From the host's perspective, both containers' PIDs are just numbers in the global PID table; namespaces are translation layers between the global state and what each container sees.
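
Entering a namespace is one syscall. The sketch below (it needs root, or an unprivileged user namespace) moves the process into a fresh UTS namespace and changes its hostname; the rest of the system never sees the change.

c
// uts_ns.c: a new UTS namespace; the hostname change is visible only to this process
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    if (unshare(CLONE_NEWUTS) != 0) {        /* leave the host's UTS namespace */
        perror("unshare(CLONE_NEWUTS)");     /* fails with EPERM without privilege */
        return 1;
    }

    const char *name = "inside-the-container";
    sethostname(name, strlen(name));         /* affects only the new namespace */

    char buf[64];
    gethostname(buf, sizeof buf);
    printf("hostname seen from the new namespace: %s\n", buf);
    /* Running `hostname` in another terminal still prints the host's real name. */
    return 0;
}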

Fig 4.21 — Cgroups: quotas as taps on the resource flow

If namespaces decide what a process can see, cgroups decide what it can use. Each cgroup is a node in a hierarchy, with explicit limits on every resource the kernel can meter — CPU shares, memory bytes, block-device IOPS, network bandwidth, even the number of processes a group may create (the pids controller). The kernel enforces these on every syscall and every scheduling decision: a process in the "batch" cgroup that tries to exceed its memory limit gets killed by the OOM killer; a process exceeding its CPU share gets descheduled until its budget refills. Cgroups are how Kubernetes packs ten workloads onto one machine without any of them noticing the others, how Docker's --cpus=2 flag works, and how systemd ensures a misbehaving service does not bring down the rest of the host.
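The interface to all of this is just files. A minimal sketch of the cgroup v2 filesystem interface, assuming it is mounted at /sys/fs/cgroup and the program runs with permission to create groups there; the group name "demo" and the 100 MB cap are illustrative:

/* cgroup_demo.c — sketch of the cgroup v2 interface (run as root). */
#include <stdio.h>
#include <sys/stat.h>   /* mkdir */
#include <unistd.h>     /* getpid */

static void write_file(const char *path, const char *value) {
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return; }
    fputs(value, f);
    fclose(f);
}

int main(void) {
    /* 1. create a new cgroup named "demo" (hypothetical name) */
    mkdir("/sys/fs/cgroup/demo", 0755);
    /* 2. cap its memory at 100 MB; the kernel's OOM killer enforces this */
    write_file("/sys/fs/cgroup/demo/memory.max", "104857600");
    /* 3. move ourselves into it: every future allocation is now accounted
          against the demo group's budget */
    char pid[32];
    snprintf(pid, sizeof pid, "%d", getpid());
    write_file("/sys/fs/cgroup/demo/cgroup.procs", pid);
    return 0;
}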

eBPF: safe code inside the kernel

A more recent and remarkable Linux feature is eBPF — "extended Berkeley Packet Filter." It allows user programs to upload small pieces of code into the running kernel, where they are attached to specific events (network packets, system calls, function entries) and run inside Ring 0 with kernel-level performance.

The obvious worry is that this lets unprivileged users execute code in the kernel — historically the worst possible security outcome. eBPF makes it safe through a verifier: a static analyzer in the kernel that examines every uploaded program and refuses to load it unless it can prove the program will always terminate, never read out-of-bounds memory, and never crash. Programs that pass the verifier are then JIT-compiled to native machine code and run essentially as fast as compiled kernel code.
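For concreteness, here is roughly what the kernel-side half of an eBPF program looks like, written in the restricted C dialect that clang compiles to BPF bytecode. This is a sketch assuming libbpf's helper headers; the tracepoint chosen and the program name are illustrative, and a separate user-space loader is still needed to submit the object through the bpf() syscall:

/* execsnoop.bpf.c — sketch of a kernel-side eBPF program.
   Compile with: clang -O2 -target bpf -c execsnoop.bpf.c */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Attached to the tracepoint fired on every execve(). The verifier accepts
   this only because it can prove it: no loops, no out-of-bounds access. */
SEC("tracepoint/syscalls/sys_enter_execve")
int trace_exec(void *ctx)
{
    bpf_printk("execve entered\n");   /* visible in the kernel trace_pipe */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";   /* some helpers are GPL-only */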

Fig 4.22 — eBPF: untrusted code in the kernel, made safe by proof. User space (Ring 3): write BPF C source, compile with clang -target bpf to bytecode, submit via the bpf() syscall. The in-kernel verifier must prove the program terminates, makes no out-of-bounds reads, and cannot crash; if any check fails, the program is rejected with EINVAL and never enters the kernel. If accepted, it is JIT-compiled to native code, attached to event hooks, and runs in Ring 0 at native speed.

eBPF flips a forty-year-old assumption: that any code running in the kernel must be hand-vetted, audited, signed, and trusted. Instead, the kernel ships a verifier — a static analyser that mathematically proves an uploaded program is safe before letting it run. The proof obligations are concrete: every memory access must be in-bounds (so the program cannot read random kernel memory); every loop must be provably bounded (so the program cannot hang the kernel); the program must terminate within a fixed instruction budget. Programs that pass are JIT-compiled to native instructions and attached to hooks — system call entry, packet receive, function entry, scheduler tick — where they run in Ring 0 with no syscall overhead. Sandboxed, verified execution inside a privileged context is one of the most important architectural ideas of the last twenty years.

eBPF has become the foundation of modern Linux observability and networking: tools like Cilium (high-performance network policy), bpftrace (kernel tracing), and Falco (runtime security monitoring) all build on it. The general pattern is important — providing a sandboxed, formally checked execution environment inside a privileged context — and we will see it again when we discuss browser security models in Chapter 12 and TLS in Chapter 14.


The recurring pattern. Every layer in this chapter — the kernel itself, the privilege boundary, virtual memory, filesystem permissions, container isolation, eBPF verification — is a mechanism that restricts what less-trusted code can do. Operating system security, fundamentally, is the discipline of building such mechanisms and then living with the fact that any one of them can have a bug. The kernel is not a fortress. It is a series of carefully designed walls, each one with a guard at every gate, and the guards themselves have to be checked by other guards. There is no bottom — only better and worse layers.

What you now understand

The kernel is a program — written in C, loaded by a bootloader, run forever in Ring 0 — that owns the hardware and mediates everything every other program does. It comes in two main shapes: monolithic (everything in Ring 0, fast, fragile) and microkernel (minimal Ring 0, message-passing, safer and slower), with most real systems sitting somewhere on that spectrum. Its scheduler, anchored mathematically in queueing theory, decides which process gets the CPU at each microsecond — Linux's CFS does this in O(log N) using a red-black tree of virtual runtimes. Its virtual memory subsystem maintains a four-level page table per process, accelerated by a translation cache (TLB), and uses page faults productively to implement memory-mapped files and copy-on-write fork. Its filesystem turns a flat array of disk blocks into a tree of inodes and names, made crash-safe by journaling or copy-on-write. Its IPC mechanisms — pipes, signals, shared memory, sockets — are the controlled boundary across which isolated processes can still cooperate. And its security depends entirely on its own correctness — which is why kernel bugs are so dangerous, and why each new mechanism (containers, eBPF) is layered behind further verification.

With this, the kernel is in view from both sides. The Bridge at the close of Part I showed it as silicon — privilege rings as a CPU bit, the trap as a wire-level mechanism, the MMU as a hardware unit, the timer as the single piece of hardware that makes preemption possible. This chapter showed it as code. Same object, two genuinely different perspectives, and the kernel becomes legible only when both are visible at once: thirty million lines of C running in Ring 0, sustained by a hardware contract that lets it own the machine.

The next chapter climbs out of the kernel and into the language it was written in. C — Ritchie at Bell Labs, 1972 — is the language every kernel in production today still uses. We'll see why pointers are the same idea as the addresses we just walked through, and why they are simultaneously the source of C's power and the cause of every memory bug we discussed in Chapter 3. From C we'll move to C++, then to the radically different choice that is Python. Each language is a particular set of decisions about which kinds of mistakes to make easy and which to make hard.

Chapter 05

The Language
That Built
Everything

Before C, every operating system was rewritten from scratch for every machine. The cost was civilizational. C broke the pattern by being portable — one source, many architectures. UNIX itself was rewritten in it. Half a century later, the kernel you used today, the browser rendering this page, the language runtime executing it — all C, or built on something that is.

Topics: Portability · Pointers · UB · The systems contract
Era covered: 1969 → present
Chapter 05 hero · The Language That Built Everything: main.c, one source, many targets — x86-64 (CISC, variable-length) · ARM64 (RISC, 4-byte) · RISC-V (open ISA) · PDP-11 (1973, K&R).
01 — The portability problem

One source. A thousand binaries.

Until 1973, an operating system was a thing you wrote in assembly for one specific machine. If you wanted to run that OS on a different computer, you didn't recompile — you rewrote, instruction by instruction, for that computer's CPU. The cost was civilizational: every new machine reset the OS landscape to zero. Then C arrived, and one source file became a thousand binaries.

To see why this was a revolution, you have to understand what the world looked like before. In 1965, IBM shipped the System/360. To write software for it, you wrote System/360 assembly — directly, in mnemonics like L, ST, BAL, encoded in EBCDIC. When DEC shipped the PDP-8 in 1965 and the PDP-11 in 1970, those machines spoke entirely different assembly languages. A program for the System/360 was unusable on the PDP-11. Not "needs porting." Unusable. The bits did not even mean the same opcodes.

This was the normal state of computing. Operating systems, compilers, every piece of systems software — all rewritten per machine. A small army of programmers spent their careers translating known-good programs into yet another assembly. When a new CPU appeared, the cost of porting useful software was so high that most software simply did not make the trip. Each generation of hardware was an island.

Fig 5.1 — One C source, many targets. main.c contains int add(int a, int b) { return a + b; }. Compiled with gcc -m64 for x86-64: lea eax,[rdi+rsi]; ret (7 bytes, variable-length encoding). With aarch64-gcc for ARM64: add w0,w0,w1; ret (8 bytes, fixed 4-byte encoding). With riscv64-gcc for RISC-V: add a0,a0,a1; ret (open ISA, 2010). Same C, different instructions, different registers, identical behavior.

A single C function, compiled for three architectures separated by fifty years of CPU design. The output is unrecognizably different — different mnemonics, different registers (eax vs w0 vs a0), different encodings (variable-length on x86, fixed 4-byte on ARM and RISC-V) — but the program behaves identically. This is what "portable" actually means: not "binary works everywhere" but "source compiles cleanly to whatever the target needs." Before C, this property did not exist for systems software.

What makes a language portable

Three things, principally. First, the language must abstract away CPU-specific details — register sets, instruction widths, calling conventions — without forcing the programmer to give up speed. Second, it must be small enough that writing a compiler for a new architecture is feasible by a small team. Third, its memory model must be loose enough to map cleanly onto whatever memory hardware actually provides — bytes here, words there, alignment requirements everywhere.

C achieves all three with a particular trade. It is barely above the metal. Its types map directly onto machine words. Its control flow maps directly onto conditional jumps. Its data structures are layouts you could draw on graph paper. The cost: you, the programmer, see the metal whether you want to or not. The benefit: a competent compiler-writer can target a new ISA in a few months, and a C program written for the PDP-11 in 1975 will recompile and run, unmodified, on a 2026 ARM Mac.

BCPL → B → C

The lineage matters because it explains why C looks the way it does. BCPL (Martin Richards, Cambridge, 1967) was a typeless, word-oriented language designed for writing compilers. Words. No types. Just memory. B (Ken Thompson at Bell Labs, 1969) was a smaller, cleaner BCPL — also typeless, also word-oriented. B ran on early UNIX. But the PDP-11 that arrived at Bell Labs in 1970 was byte-addressable, and B's word-only model fit it badly. In 1972 Ritchie added types — int, char, pointers, structs — and called the result C. The language was named, simply, as the next letter after B.

The key insight: C did not solve portability by being abstract. It solved portability by being just barely abstract enough — close enough to assembly to compile efficiently on any reasonable CPU, just far enough above it to never mention specific registers.

02 — Ritchie and Thompson at Bell Labs

Two men, a discarded minicomputer, four years.

In a windowless room at Bell Labs in 1969, Ken Thompson found a discarded PDP-7 and started writing a programming environment for himself. His friend Dennis Ritchie joined. Within four years they had built UNIX, written the C language to write more UNIX in, and laid the foundations of every operating system you use today. No marketing. No grand plan. Two men, a stripped-down minicomputer, and a little time.

The story begins, characteristically, with a project that failed. Bell Labs had been one of three institutions building Multics — a vast, ambitious time-sharing operating system meant to be a "computing utility" for hundreds of simultaneous users. Multics was technically remarkable and managerially catastrophic. By 1969 Bell Labs had pulled out. Thompson and Ritchie, who had been Multics contributors, were left with the appetite for that kind of system but no machine, no team, and no permission.

Thompson found an unused PDP-7 in a hallway. The PDP-7 was already old by 1969 — an 18-bit minicomputer DEC had stopped selling. He started writing a small operating system for it, partly to play a space-travel game he had ported from Multics, partly to have an environment to work in. By the end of 1969 there was a kernel, a shell, an editor, and a primitive filesystem. Ritchie joined. Brian Kernighan, walking past one day and noting that the system was a stripped-down Multics, quipped that they should call it "UNICS" — UNiplexed Information and Computing Service, a deliberate pun on Multics. The name became UNIX.

Fig 5.2 — UNIX's rewrite, in time. 1969: the discarded PDP-7; UNIX v1 written in PDP-7 assembly. 1970: the byte-addressable PDP-11 arrives; B, still typeless. 1972: C 1.0, with types and structs. 1973: UNIX rewritten in C, the first portable OS. 1978: K&R, 228 pages. From assembly to a portable kernel in four years, at Bell Labs, Murray Hill, by Thompson and Ritchie. The 1973 rewrite was the pivot: UNIX became the first OS you could move between machines.

UNIX's first version was assembly, like every operating system before it. By 1973, Ritchie had built C and rewritten UNIX in it — and that rewrite is the moment systems software becomes portable. Within five years, UNIX was running on machines Thompson had never seen, ported by people he had never met. The 1978 Kernighan and Ritchie book — universally called K&R — codified the language and is still in print half a century later, revised only once, in 1988, for the ANSI standard.

The PDP-11

The machine that shaped C deserves its own paragraph. The PDP-11 was a 16-bit minicomputer that DEC sold from 1970 onwards. It was small (the size of a refrigerator, not a room), had eight general-purpose registers, used byte-addressable memory, and supported indirect addressing through registers. Most importantly, it had a clean, regular instruction set. C's design choices — pointer arithmetic that adds sizeof(*p) bytes, structures laid out in memory in declaration order, the equivalence of arrays and pointers — are PDP-11 idioms made into a language. To a first approximation, C is a portable PDP-11 assembler.
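The graph-paper claim is checkable with offsetof. A minimal sketch; the struct is illustrative and the exact offsets depend on the platform's alignment rules:

#include <stdio.h>
#include <stddef.h>   /* offsetof */

struct packet {
    char tag;       /* offset 0                                        */
    int  length;    /* typically offset 4: padded up to int alignment  */
    char flags;     /* typically offset 8                              */
    long payload;   /* typically offset 16 on LP64: more padding       */
};

int main(void) {
    printf("tag     @ %zu\n", offsetof(struct packet, tag));
    printf("length  @ %zu\n", offsetof(struct packet, length));
    printf("flags   @ %zu\n", offsetof(struct packet, flags));
    printf("payload @ %zu\n", offsetof(struct packet, payload));
    printf("sizeof  = %zu\n", sizeof(struct packet));  /* includes trailing padding */
    return 0;
}

The fields land in declaration order, exactly as the paragraph above describes; only the padding between them is the ABI's business.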

K&R, 1978

Brian Kernighan and Dennis Ritchie published The C Programming Language with Prentice-Hall in 1978. Two hundred and twenty-eight pages. Every example a few lines, compilable, runnable. The famous opener — a program that prints hello, world — is on page 6. K&R defined the language until ANSI standardization in 1989, and it is still on the bookshelf of every working C programmer. The book is unfussy, exact, and assumes intelligence. It teaches by showing rather than explaining. It is one of the small number of programming books that someone could plausibly read in an afternoon and write working code that night.

"C is quirky, flawed, and an enormous success."

— Dennis Ritchie
Fig 5.3 — PDP-7 to PDP-11: the machines that shaped C. Bell Labs, Murray Hill, 1969–1973, the four years that produced a portable kernel. PDP-7 (1964, found in a corner): front-panel toggles, 18-bit words, 8K words of RAM, no byte addressing, paper-tape I/O; UNIX v1 (1971) written in its assembly, tied to that one machine. PDP-11 (1970, "now we have a real machine"): refrigerator-sized, 16-bit and byte-addressable, eight general-purpose registers, indirect addressing modes, an orthogonal instruction set; UNIX in C (1973), the first portable operating system, runs on any byte-addressable CPU. C is a portable PDP-11 assembler: pointer arithmetic, byte-addressable structs, arrays as decayed pointers — all PDP-11 idioms made into a language.

The two machines that shaped C, side by side. The PDP-7 was the discarded minicomputer in a corner of Bell Labs where Thompson wrote the first version of UNIX in 1969 — entirely in PDP-7 assembly, tied to that one specific machine forever. The PDP-11, which DEC released in 1970, gave Ritchie a clean 16-bit byte-addressable architecture with eight general-purpose registers and orthogonal addressing modes — and it is, almost literally, what C was designed for. Pointer arithmetic that adds sizeof(*p) bytes; structures laid out in memory in declaration order; arrays "decaying" into pointers in expressions — all of these are PDP-11 idioms abstracted just enough to be a language. The 1973 rewrite of UNIX in C took advantage of this fit: C compiled to good PDP-11 code by mapping almost directly to the hardware, while staying portable enough that the same source ran on a different machine the next year. For half a century afterward, every CPU vendor has been quietly designing for "what C expects" — flat byte-addressable memory, integer pointers, two's complement arithmetic, a stack. The shadow of the PDP-11 is still on the silicon you are reading this on.

03 — Pointers as math

A pointer is an integer with attitude.

A pointer is not a mysterious thing. It is an integer. Specifically, it is the integer address of a byte in memory. The reason C feels different from every language since is that C lets you do arithmetic on these integers — directly, without a safety net — and trusts you to know what you're doing.

Recall from Chapter 1 that memory is a giant array of bytes, and each byte has an integer address. On a 64-bit machine, those addresses run from 0x0000_0000_0000_0000 to 0xFFFF_FFFF_FFFF_FFFF, which is more bytes than will ever exist. A pointer in C is just one of those integer addresses, stored in a variable that the compiler has labelled with the type of thing the address is supposed to point at.

Three operators do nearly all the work. The unary & takes the address of a variable: &x is "the address where x lives." The unary * dereferences a pointer: if p is the address of an int, then *p is the int stored there. And pointer arithmetic — p + 1 — does not mean "the next byte." It means "the next thing of the type p points to." If p is an int * on a 32-bit-int machine, p + 1 advances four bytes. The compiler scales the arithmetic for you. This is the entire trick.
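A minimal sketch of all three operators at work; the specific values are illustrative:

#include <stdio.h>

int main(void) {
    int x = 42;
    int *p = &x;          /* & : p now holds the address where x lives     */
    printf("%d\n", *p);   /* * : read the int stored at that address → 42  */
    *p = 99;              /* writing through p changes x itself            */

    int arr[4] = {10, 20, 30, 40};
    int *q = arr;         /* q points at arr[0]                            */
    q = q + 1;            /* advances sizeof(int) bytes, not 1 byte        */
    printf("%d\n", *q);   /* 20: "the next int", wherever that lands       */
    return 0;
}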

Fig 5.4 — A pointer walks down memory. Eight ints (17, 42, 99, −7, 23, 8, 61, 12) at addresses 0x100, 0x104, … 0x11C. int *p starts at 0x100; p++ advances by sizeof(int) = 4 bytes because p's type is int*, and the compiler scales the arithmetic for you. A pointer is an integer plus a type; the type tells C how big a step is.

A pointer to an int starts at address 0x100. p++ doesn't move forward by one byte; it moves forward by sizeof(int), which is four bytes on most modern systems. The compiler does the multiplication. The same expression p++ on a double * would advance eight bytes; on a char *, one byte. The arithmetic always means "next thing," never "next byte." This single design choice unlocks the equivalence of arrays and pointers, the entire C string library, and approximately every memory-corruption vulnerability in computing history.

Arrays are pointers in disguise

Here is one of the deeper revelations in C. The expression a[i], where a is an array, is defined by the C standard to be exactly equivalent to *(a + i). Not "compiles to similar code." Not "behaves the same way." Defined to be equivalent. The array indexing operator is syntactic sugar for pointer arithmetic. A consequence: if you have a pointer p, you can write p[3], and that is a legal C expression meaning *(p + 3). And if you have an array a, you can write *a, and that is a[0].

Fig 5.5 — a[i] ≡ *(a + i). Array notation a[i], "the i-th element of a", is defined as the pointer expression *(a + i). C99 §6.5.2.1: "The definition of the subscript operator [] is that E1[E2] is identical to (*((E1)+(E2)))."

The C standard does not say a[i] behaves like *(a + i). It says they are the same expression. A consequence: i[a] is also legal — addition commutes — and means the same thing as a[i]. Nobody writes 3[arr] in production, but the standard permits it. Arrays in C are not really first-class data structures; they are a thin convention layered over pointer arithmetic. Most languages that came after C decided this was a mistake. C survived anyway.
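The equivalence is easy to check directly. A minimal sketch:

#include <stdio.h>

int main(void) {
    int a[5] = {10, 20, 30, 40, 50};
    int *p = a;                /* the array "decays" to &a[0]              */

    printf("%d\n", a[3]);      /* 40: array syntax                         */
    printf("%d\n", *(a + 3));  /* 40: what the standard defines it as      */
    printf("%d\n", p[3]);      /* 40: subscripting a plain pointer         */
    printf("%d\n", 3[a]);      /* 40: legal, because addition commutes     */
    return 0;
}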

What this lets you do

Direct memory traversal. Linked structures built by chasing pointers from one node to the next. Arrays of arrays, dynamically sized. Function pointers — pointers to executable code, the foundation of callbacks, interrupt handlers, and virtual method tables (which is how Chapter 6's vtables are implemented). Type-punning casts that let you treat the same bytes as different types — useful for binary protocols, image manipulation, and reading raw hardware registers. Everything systems software does, eventually, comes down to this.

What this lets you screw up

A pointer that points outside its valid range will happily write anywhere your process is allowed to write. There is no check. arr[i] with i too large does not raise an exception; it stores into whatever happens to be 4×i bytes past the start of arr. If that happens to be the saved return address on the stack — and you went back through Chapter 3 — you have a buffer overflow vulnerability. The mechanism we drew on paper there is exactly this, expressed as a few lines of C, compiled with bounds-check optimization off (which is to say: with a normal C compiler).
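A deliberately broken sketch of that failure mode; what actually gets overwritten depends on the compiler and the stack layout, which is exactly the point:

/* oob.c — sketch of an out-of-bounds write (this is undefined behavior;
   the outcome varies by compiler, optimization level, and stack layout). */
#include <stdio.h>

int main(void) {
    int canary = 7;
    int arr[4] = {0, 0, 0, 0};

    for (int i = 0; i <= 4; i++)   /* off-by-one: i == 4 is out of bounds  */
        arr[i] = 1234;             /* no check, no exception: just a store */

    printf("canary = %d\n", canary);  /* may print 1234, may crash, may "work" */
    return 0;
}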

The trade: Pointer arithmetic gives you the most direct access to memory of any high-level language. It is also the source of approximately every memory-corruption vulnerability in computing history. Both halves of that sentence are true, and the second is a price the first has to pay.

04 — Manual memory management

You allocated it. You free it.

Every byte of dynamic memory your C program uses, you allocated. Every byte you allocated, you must free. Forget to free, and your program leaks until the OS kills it. Free twice, or use a freed pointer, and your program corrupts itself in ways that may not show up for hours. C makes you the bookkeeper. There is no garbage collector. There is no safety net.

Recall from Chapter 4 that memory comes in two main flavours: the stack, which grows and shrinks automatically as functions enter and exit, and the heap, a region the kernel hands the process for long-lived allocations. Stack memory is automatic — local variables come and go without your involvement. Heap memory requires explicit requests. C exposes those requests through two functions: malloc(n) asks for n bytes of heap and returns the address; free(p) tells the allocator that the bytes at address p are no longer needed.
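The whole contract, in its smallest form; a minimal sketch:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *buf = malloc(1024);   /* ask the allocator for 1024 heap bytes        */
    if (buf == NULL)            /* malloc can fail; the contract says check     */
        return 1;

    strcpy(buf, "hello, heap");
    printf("%s\n", buf);

    free(buf);                  /* hand the bytes back, exactly once            */
    buf = NULL;                 /* common discipline: a freed pointer is poison */
    return 0;
}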

Behind these two functions lives an entire algorithm — the allocator. The kernel hands the process a single big chunk of address space (via the brk or mmap system calls). The allocator subdivides that chunk into smaller pieces in response to malloc calls, and stitches pieces back together when free calls return them. Over time, allocations and frees create a free list — chunks of various sizes interleaved with allocated chunks, each tagged with whether it is in use.

Fig 5.6 — The heap, as a free list. The heap region (handed over by the kernel via brk/mmap), ① after several malloc calls: used 64B, used 128B, used 32B, used 96B, free 240B. ② After free(p2) and free(p4): used 64B, free 128B, used 32B, free 96B, free 240B. The free list is now 128B → 96B → 240B, linked but fragmented; a malloc(150) must search the list, and none of these chunks alone fits.

The heap is a long ribbon of bytes. The allocator carves chunks out of it and tracks which ones are in use. After repeated allocations and frees, the heap becomes fragmented — many small free chunks scattered between in-use chunks, none of them large enough for the next big request, even though the total free space might be huge. This is why long-running C programs can run out of memory while having plenty available, and why allocator design (first-fit vs best-fit, slab allocators, jemalloc, glibc's ptmalloc) is its own dense subfield.

The two failure modes

There are exactly two ways to mishandle malloc and free, and both are catastrophic. The first is to allocate and never free — a leak. The chunk stays in use forever, even though no pointer in the program references it any more. Run a leaky program long enough and the OS kills it for exhausting memory. The second is the inverse: free a chunk and then keep using the pointer to it. This is a use-after-free. The chunk has gone back to the free list and may have been handed to a different malloc caller; reading it now reads someone else's data, writing it corrupts that data.

Fig 5.7 — Leak vs use-after-free. Memory leak: char *p = malloc(1024); use(p); return without calling free(p). The chunk stays marked "used", no pointer references it, the allocator cannot reclaim it; silent, growing over time, until the OS OOM-kills the process. Use-after-free: char *p = malloc(1024); free(p); strcpy(p, attacker). The chunk is back on the free list, the next malloc may hand it to a different caller, and writes through the dangling pointer corrupt that caller's data: an exploit vector. A large share of the major browser CVEs of the 2010s were exactly these two bugs.

A leak is silent and slow: memory accumulates until the OS kills the process. A use-after-free is silent and fast: the moment free returns, the chunk is back on the allocator's free list, and a future malloc may hand it to a totally different part of your program. Reads from the dangling pointer return whatever that other code has stored. Writes corrupt it. In a security context, the attacker arranges for a sensitive object — a function pointer, a credentials structure — to be allocated where the dangling pointer still points, then writes through the dangling pointer to take control. UAF is the modern equivalent of the buffer overflow.

Why this is still in production

Garbage collection was invented in 1959 by John McCarthy for Lisp. By 1990 it was well-understood. C kept refusing it for one reason: predictability. A garbage collector decides on its own schedule when to pause your program and walk the heap. For a kernel running 10,000 syscalls per second, or an audio stack filling a buffer 48,000 times per second, a 100-millisecond GC pause is catastrophic. C's manual model gives you a contract: memory is freed exactly when you free it, never sooner, never later. The price is that you must be right.

05 — Undefined behavior

The optimizer's cruel logic.

The C standard contains a list of operations whose behavior is "undefined." This does not mean they crash, or fail, or return garbage. It means the standard places no requirement at all on what happens — and modern optimizers will exploit that freedom in ways that look like betrayal. Code that worked for ten years can break catastrophically on the next compiler upgrade.

The list is short and dangerous: signed integer overflow; dereferencing a null pointer; reading or writing past the end of an array; using a value before initializing it; modifying a string literal; data races between threads. Each is a contract the programmer has implicitly accepted: "I promise I will never do this." The compiler trusts that promise — completely. Modern optimizers do not check whether you can actually keep it. They simply assume you will, and generate code that only makes sense if you did.

Here is the canonical example. Consider a function that wants to detect signed integer overflow before doing something dangerous:

Fig 5.8 — How "trust the programmer" deletes your bug check.

// the programmer's intent: refuse to operate if overflow has happened
int safe_add_check(int i) {
    if (i + 1 < i) abort();   // detect overflow
    return i + 1;
}

The compiler's reasoning, step by step: (1) signed overflow is undefined behavior, so the standard says it cannot happen; (2) therefore i + 1 > i for every legal i, since overflow is the only way to violate this; (3) therefore the test (i + 1 < i) is always false; (4) therefore abort() is dead code: delete it. Generated assembly: just lea eax, [rdi+1]; ret. The bug check has been silently deleted by the optimizer.

The programmer wrote a check for signed overflow. The check is correct in principle. But because signed overflow is undefined behavior, the compiler is allowed to assume it never occurs — and from that assumption, it can prove the check is always false, and delete it as dead code. The programmer's defensive check has been silently removed. The compiler is not malicious; it is following the standard exactly. This pattern has caused real security bugs in real production code, including in the Linux kernel, which now compiles with explicit flags (-fno-strict-overflow, -fno-delete-null-pointer-checks) to prevent the optimizer from exploiting these particular freedoms.
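The defensible version asks the question before the overflow can happen, in terms the standard defines, or uses a checked-arithmetic builtin. A sketch; note that __builtin_add_overflow is a GCC/Clang extension, not standard C:

#include <limits.h>
#include <stdlib.h>

/* Well-defined: compare against INT_MAX *before* adding, so no signed
   overflow is ever evaluated. The compiler cannot delete this check. */
int safe_inc(int i) {
    if (i > INT_MAX - 1)
        abort();
    return i + 1;
}

/* Equivalent, using the GCC/Clang checked-arithmetic builtin. */
int safe_inc2(int i) {
    int out;
    if (__builtin_add_overflow(i, 1, &out))   /* returns true on overflow */
        abort();
    return out;
}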

NULL dereference

A pointer with the value 0 — or NULL, the macro that means the same thing — is conventionally "the address that points nowhere." The kernel arranges for the very first page of every process's virtual address space to be unmapped. Touching that page raises a hardware fault. The kernel intercepts the fault and converts it into a signal called SIGSEGV ("segmentation violation"), which by default kills the process. This is the Segmentation fault (core dumped) message that has greeted C programmers for fifty years.

Fig 5.9 — A NULL dereference, traced. User code: char *p = NULL; *p = 'a'; and the fault begins. CPU: the MMU walks the page tables, finds no mapping for 0x0, raises #PF. Kernel: the page-fault handler looks up the faulting address, finds no VMA covering 0x0, rules it an illegal access. SIGSEGV: the kernel queues signal 11 to the process. Death: the default handler terminates the process and dumps core. User code → CPU → kernel → signal → process death, a round trip of roughly 10 microseconds, ending in the "Segmentation fault (core dumped)" every C programmer has seen ten thousand times.

A NULL dereference traverses every layer of the system. The CPU's memory management unit (Bridge — The Boundary, in Part I) detects that no page-table entry exists for address 0x0 and raises a hardware exception. The kernel's page-fault handler examines the faulting address, finds it lies outside any of the process's mapped regions, and decides this is an illegal access. It queues signal 11 — SIGSEGV — to the offending process. The process's default signal handler is "terminate, dump core for the debugger," which is what you see. All of this happens in roughly the time it takes to read this sentence.

The optimizer's contract, the programmer's homework

"Trust the programmer" was C's original philosophy and is still its operating stance. Modern optimizers take it to its logical limit: the compiler trusts you so completely that it removes code based on assumptions you never knowingly made. The Linux kernel has accumulated dozens of -f flags telling GCC to not exploit specific UB optimizations. Modern security-aware C codebases compile with UndefinedBehaviorSanitizer in development to catch the cases where the programmer's and the compiler's models of "what is allowed" have drifted apart.

The right way to think about UB: not "the program might crash" but "the program's behavior is no longer governed by the source you wrote." The optimizer is allowed to assume the input is well-defined, and once that assumption is wrong, the output may bear no resemblance to your code's intent. This is the cost of C's performance — and the cost is real.

06 — Why C survives

Half a century later, still in the foundation.

Every five years, someone writes a clickbait piece announcing that C is dying. Every five years, more C runs in production than ever before. Linux is C. macOS XNU is mostly C. Windows NT is mostly C. The Python interpreter is C. The browser rendering this page is mostly C. The TLS library that secured your connection is C. C is not dying. C is so deep in the foundations that it has become invisible.

Three properties keep C in the floor of every system. Performance: C compiles to assembly with no runtime overhead and no hidden machinery — what you write is, very nearly, what runs. Predictability: every operation has a known, bounded cost; there is no garbage collector to pause your program on its own schedule, no JIT to recompile your hot loop in the middle of a frame. Ubiquity: every hardware vendor ships a C compiler before any other language, every embedded system speaks C, every kernel API is described in C.

Fig 5.10 — Where C still runs in 2026. Software domains by share of C in production code: kernels (Linux, the BSDs, XNU, NT); embedded (MCUs, IoT, automotive, MedTech); language runtimes (CPython, Ruby, Lua); databases (SQLite, Postgres, MySQL); TLS and crypto (OpenSSL); network infrastructure (NGINX, HAProxy). "C is not the language you reach for. It is what is already there when you arrive." The price: Heartbleed (2014), Shellshock (2014), roughly 70% of serious CVEs traced to memory-safety bugs. Rust is beginning to take its place, slowly, in the specific layers where safety matters more than heritage.

Every domain where C remains dominant has the same property: when the cost of a bug is measured in machine cycles or microseconds, garbage collection is unaffordable; when the cost is measured in dollars or lives, predictability beats convenience. The price is paid in vulnerabilities. Roughly two-thirds of all serious security bugs reported in the past decade are memory-safety bugs in C or C++ code. Rust — designed in the 2010s explicitly to give C's performance and control without C's footguns — is beginning to displace C in specific layers (the Linux kernel started accepting Rust modules in 2022, Microsoft is rewriting parts of Windows in Rust, and parts of Android's Bluetooth stack have moved to Rust). It is a slow displacement. C will outlive the people reading this.

Fig 5.11 — The C family tree: half a century of descendants. C didn't just survive; almost every working language since is a descendant or a reaction. C (1972, Ritchie, Bell Labs) at the root. First wave, keep the speed and add objects: C++ (1985, Stroustrup, same labs), Objective-C (1984, Cox, later NeXT and Apple). Second wave, keep the syntax and add a VM and GC: Perl (1987, Wall), Java (1995, Gosling, Sun, "C without footguns"), JavaScript (1995, Eich, Netscape), PHP (1995, Lerdorf), C# (2000, Hejlsberg, Microsoft), Kotlin (2011, JetBrains, JVM). Third wave, keep the speed and add memory safety: Go (2009, Pike and Thompson, Google, "C with goroutines"), Rust (2010, Hoare, Mozilla, "C performance, no UB"), Swift (2014, Lattner, Apple), Zig (2016, Kelley, a "better C": explicit, no hidden control flow). C-family syntax runs 80%+ of production code written in 2026: curly braces, semicolons, int/float/pointer, for/while/if, function-call syntax, all C, all the way down.

A partial family tree. C at the root, three rough generations of descendants. The first wave (Stroustrup's C++, Cox's Objective-C) kept C's speed and added object orientation directly on top of C compilation; Objective-C is a strict superset of C, C++ very nearly one, and both compile to roughly the same instructions. The second wave (Java, C#, Kotlin, JavaScript, PHP, Perl) kept C's syntactic surface — curly braces, semicolons, the for loop, the cast operator — but ran on a virtual machine with garbage collection, trading raw performance for safety. The third wave (Go, Rust, Zig, Swift) returned to C's performance contract but added memory safety either through ownership analysis (Rust), runtime checks (Go), or explicit control (Zig). Almost every language a working programmer encounters in 2026 is a descendant of C or a reaction against C's specific costs. Languages that look fundamentally different — Python, Lisp, Haskell, OCaml — are themselves often implemented in C. The shadow is everywhere.

"There are only two kinds of languages: the ones people complain about and the ones nobody uses."

— Bjarne Stroustrup

The seam to Chapter 6

Stroustrup, who said that, was reflecting on his own creation. By 1979 he was sitting in the same Bell Labs hallway as Ritchie and Thompson, watching C run things it had never been designed to run — large systems, with millions of lines of code, written by hundreds of people who all had to understand each other's intentions. C had given them speed and portability. It had not given them tools to manage complexity at that scale. Stroustrup set out to make C scale, without giving up anything it did well. The result was the next chapter.

Chapter 06

When C Needed
To Think In
Objects

By 1979, C was running million-line systems and the seams were showing. Bjarne Stroustrup, in the same Bell Labs hallway as Ritchie and Thompson, asked: what if a language gave you all of C's speed but let you think in larger units than functions? The answer was not a new language. It was C with classes added on top, in a way that costs you nothing if you don't use them. That language now runs trading systems, game engines, browsers, and most of your operating system's user-space.

Topics: Encapsulation · Vtables · RAII · Zero-cost abstraction
Era covered: 1979 → present
Chapter 06 hero · When C Needed To Think In Objects: class Shape with virtual area() and draw(); Circle, Square, Triangle, and Polygon each override area().
01 — The complexity problem

The crisis that wasn't a crash.

The 1980s software crisis was not a single failure. It was a slow, systemic realization. Codebases that had been small enough for one team to keep in their heads were now too large for any human to model. C gave you no tools above the function. Beyond fifty thousand lines, C codebases became unmaintainable — not because the code was wrong, but because no one could tell when it was right.

Frederick Brooks had already named the disease. The Mythical Man-Month, 1975, observed that adding people to a late software project makes it later — that complexity does not reduce by parallelism, because the cost of communication grows faster than the cost of work. By the late 1970s, Bell Labs and IBM were running projects with hundreds of programmers and millions of lines of code, and watching them fail. The MIT-led Multics system — which Bell Labs had pulled out of a decade earlier — was the canonical case study, but every large vendor had its own.

The diagnosis was that C, for all its virtues, gave the programmer one and only one tool for organising code: the function. Functions could group statements; nothing grouped functions. There was no language-level concept of module, no class, no interface. A header file was a list of declarations and a polite request that callers stick to it. A struct was a list of fields with no associated behaviour. If your subsystem had thirty types and three hundred functions, the language had nothing to say about which functions belonged to which types. You enforced it by convention, in comments, in code review.

Fig 6.1 — The 1980s software crisis, in one chart. From 1975 to 2000, codebase size climbs from roughly 10K toward 10M lines of code while bug density per KLOC rises in lockstep (roughly 5 to 35); markers at Brooks's Mythical Man-Month (1975), the 1980s software crisis, and the release of C++. Codebases grew by 100×; the language that built them had not evolved since 1973.

Through the 1980s, codebases grew about a hundred-fold. Bug density grew with them, faster than linearly — every doubling of code roughly doubled the rate of defects per thousand lines, because the surface area for unintended interactions grew quadratically. C, frozen in its 1973 design, had no language tools to slow this. Modules, namespaces, encapsulation, type-checked interfaces — none existed. The "crisis" was not a single dramatic failure; it was the slow recognition across the industry that a language built for systems written by two people in a hallway did not extend cleanly to systems written by two hundred people across continents.

The fix needed two things. First, a way to bind data and code together — to say "these functions operate on this kind of thing, and only on this kind of thing." Second, a way to talk about relationships between kinds — a Circle is a Shape; a TextEditor has a Buffer; a Logger can be any object that exposes a log() method. None of this is hard to do in C, by convention. The point is to make the convention enforceable by the compiler, so the cost of breaking it is paid once, at build time, by the person who broke it — rather than once per bug report, by everyone.

02 — Stroustrup at Bell Labs

Simula's children, raised on C.

Bjarne Stroustrup arrived at Bell Labs in 1979, not far down the corridor from Ritchie and Thompson. He had just finished a doctoral thesis on simulation systems, written in a language called Simula 67. Simula had introduced a strange new idea called the class. Stroustrup loved Simula's expressiveness and hated its speed. C, in the same building, was the inverse: fast and ugly. The thought that organised the next forty years of his life was the simplest one available — could he have both?

Simula 67 — designed by Ole-Johan Dahl and Kristen Nygaard at the Norwegian Computing Center in the mid-1960s and finalized in 1967, expressly for simulating physical systems — is the language that introduced the modern object-oriented vocabulary. The words class, object, inheritance, virtual method, and even this (referring to "the current object") all originate there. By 1979 Simula was a niche academic language, but its ideas had spread: Smalltalk had taken them in one direction, building an entire dynamic system around them. Stroustrup decided to take them in the opposite direction. He would graft them onto C, but in a way that imposed no runtime overhead unless you used them — a zero-cost philosophy that became the language's defining trait.

The first version, in 1979, was called "C with Classes." It was implemented as a preprocessor that translated the new constructs back to plain C. The name C++ — a literal pun on C's increment operator, ++, meaning "C, but more" — was adopted in 1983, and in 1985 Stroustrup published the language's first book and shipped the first release built as a true compiler. ANSI standardization began in 1989, and the first standard, C++98, arrived in 1998. By 2000 it was the dominant language for systems where C alone could not scale: trading platforms, telecommunications switches, browser engines, game runtimes.

Fig 6.2 — C++, in time. 1965–67: Simula 67, the first "class". 1979: "C with Classes", a preprocessor. 1985: C++ 1.0, a true compiler. 1998: ANSI/ISO C++98, with the STL. 2011: C++11, the modern era. 2017: C++17, structured bindings. 2023: C++23, modules. From Norwegian academic curiosity to half the world's systems software. Stroustrup, Bell Labs, Murray Hill: "I wanted Simula's expressiveness and BCPL's efficiency, on a real machine, for real systems."

From Simula 67's introduction of the class to C++23's standardised modules. The crucial pivot is 2011 — C++11 — when the language gained smart pointers, lambdas, move semantics, type inference, and threading primitives. Before C++11, "modern C++" was a discipline some practitioners maintained against the language; after C++11, it was the language's official voice. Every release since (14, 17, 20, 23) has added more, all backwards compatible. C++ has the strange honour of being a language you can spend a career using without ever writing in the same dialect twice.

What Stroustrup actually built, with the patience of someone who would still be tending it forty years later, was a deal: you may add abstraction without losing speed. Every C++ feature is held to the same rule. Classes compile to structs with associated functions — no overhead. Inheritance with virtual methods compiles to a struct containing one extra pointer and an indirect call — the cheapest possible version of dynamic dispatch. Templates compile, per-instantiation, to type-specialised assembly. None of this is magic. All of it is C with the bookkeeping pushed up into the type system, where the compiler can check it and inline it away.

03 — Classes, inheritance, and the vtable

What "OO" looks like in memory.

A class is a struct with associated functions. A virtual method is a function pointer the compiler manages for you. Inheritance is one struct embedded inside another. Polymorphism is dereferencing one of those function pointers at runtime. None of these are abstract concepts. They are layouts in memory, and you can draw every one of them on graph paper.

Start with a non-virtual class. A class Point with two integer fields and an equals method compiles to exactly the same memory layout as struct { int x; int y; } in C. The method compiles to a free function with an extra hidden argument named this, a pointer to the object. Calling p.equals(q) compiles to, in effect, Point_equals(&p, &q). There is nothing else. No vtable, no runtime, no overhead. This is what "zero-cost" means at the smallest scale.

Add the keyword virtual to a method, however, and the compiler must arrange for that method to be dispatched at runtime — that is, the actual function called depends on the dynamic type of the object, not the type of the pointer the caller is using. To support this, the compiler adds one hidden field to every object of a class with virtual methods: a pointer to a per-class array of function pointers, traditionally called the vtable. Each virtual method gets a fixed slot in that array. Calling shape->area() compiles to "load the vtable pointer, index slot 0, jump to that address."

Fig 6.3 — A class, an object, and a vtable. class Shape declares virtual double area() = 0 and virtual void draw() = 0; class Circle : public Shape adds double r and overrides both. For Shape* s = new Circle{2.0}, the object s in memory is a vptr (pointing at Circle's vtable) followed by r = 2.0. Circle's vtable, one per class, has slot 0 → Circle::area and slot 1 → Circle::draw. The call s->area() compiles to mov rax, [rdi] (load the vptr) then call [rax+0] (jump through slot 0): two instructions, one indirect call, roughly 3 cycles when branch-predicted. Dynamic dispatch is a function-pointer lookup.

A class with virtual methods adds one pointer per object — the vptr, set automatically by the constructor. Each class has one vtable, shared by all its objects. Calling a virtual method compiles to two memory accesses (load vptr, load slot), one indirect call, and the body of the chosen function. On a modern CPU with branch prediction, this is roughly three cycles in steady state — barely more expensive than a direct call. The C equivalent is "a struct containing function pointers, manually managed" — which is exactly how the Linux kernel's file_operations and many other systems are built. C++ just makes the compiler do the bookkeeping.
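That C equivalent is worth seeing once, because it makes clear how little magic a vtable involves. A minimal sketch; the shape_ops name and the shapes themselves are illustrative, not the kernel's:

#include <stdio.h>

/* A hand-rolled "vtable": one struct of function pointers per kind of object.
   This is the pattern behind tables like the Linux kernel's file_operations. */
struct shape_ops {
    double (*area)(const void *self);
};

struct circle { const struct shape_ops *ops; double r; };
struct square { const struct shape_ops *ops; double s; };

static double circle_area(const void *self) {
    const struct circle *c = self;
    return 3.14159265358979 * c->r * c->r;
}
static double square_area(const void *self) {
    const struct square *sq = self;
    return sq->s * sq->s;
}

static const struct shape_ops circle_ops = { circle_area };
static const struct shape_ops square_ops = { square_area };

/* The caller dispatches through the table without knowing the concrete type,
   which is exactly what s->area() compiles to in C++. */
static double area_of(const void *shape) {
    const struct shape_ops *const *ops = shape;   /* first field is the "vptr" */
    return (*ops)->area(shape);
}

int main(void) {
    struct circle c = { &circle_ops, 2.0 };
    struct square s = { &square_ops, 3.0 };
    printf("%f %f\n", area_of(&c), area_of(&s));
    return 0;
}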

Inheritance is layout

Fig 6.4 — Inheritance: same vptr slot, different function. Shape declares virtual area() = 0. Circle returns π·r·r, Square returns s·s, Triangle returns ½·b·h, and slot 0 of each subclass's vtable points at its own implementation. A caller such as void print(Shape* s) { cout << s->area(); } knows nothing about the actual class.

Three classes inherit from Shape. Each overrides area() with a different body. At runtime, slot 0 of each subclass's vtable points to its own implementation. A function that takes Shape* can call s->area() without knowing — or caring — what concrete type s actually is. This is polymorphism: one call site, three possible callees, dispatched by the object itself. The caller is decoupled from the implementations. New shape types can be added without changing print(). This is the central organisational tool C lacked — a way to write code against an interface rather than a specific implementation.

Encapsulation as a tool

Public, private, protected. Classes can hide their state from the outside world, forcing access through methods. This is enforced at compile time only — at runtime, a private field is just a field, and a determined caller can cast their way to it. Encapsulation in C++ is not a security boundary; it is a discipline that the language helps you maintain. Used well, it lets you change a class's internals freely, knowing the rest of the codebase cannot have grown to depend on them.

04 — RAII

The signature C++ idiom.

Stroustrup's most original contribution was not classes — Simula had those — nor operator overloading, nor templates. It was an idea so plain, once stated, that it sounds obvious: tie the lifetime of every resource to the lifetime of an object on the stack, and let the compiler clean up automatically when the object goes out of scope. The acronym is RAII — Resource Acquisition Is Initialisation — and once you see it you stop being able to write C code without missing it.

Recall from Chapter 5 that C makes you the bookkeeper. Every malloc needs a free. Every fopen needs an fclose. Every pthread_mutex_lock needs a pthread_mutex_unlock. And every one of those calls must be paired correctly across every possible exit path from the function — including the paths created by exceptions, which C does not have but which C++ does. The C answer is goto-cleanup style: a single label at the end of the function that frees everything in the right order, with every branch jumping to it. It works. It is fragile. It does not survive refactoring.
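For reference, that idiom looks like this; parse() and emit() are placeholder names standing in for real work:

#include <stdio.h>
#include <stdlib.h>

static int parse(char *buf, FILE *f) { (void)buf; (void)f; return 0; }  /* stubs */
static int emit(const char *buf)     { (void)buf;          return 0; }

int do_thing(void) {
    int rc = -1;
    FILE *f = NULL;
    char *buf = NULL;

    f = fopen("x", "r");
    if (!f) goto out;
    buf = malloc(1024);
    if (!buf) goto out;

    if (parse(buf, f) != 0) goto out;
    if (emit(buf) != 0) goto out;
    rc = 0;

out:                    /* every exit path funnels through one cleanup label */
    free(buf);          /* free(NULL) is a no-op, so this is safe everywhere */
    if (f) fclose(f);
    return rc;
}

int main(void) { return do_thing(); }

It works, but every new resource adds another line to the label and another way to get the order wrong; the C++ answer below makes the compiler write that label for you.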

The C++ answer is an object whose constructor acquires the resource and whose destructor releases it. Place that object on the stack. The compiler emits destructor calls automatically when the object goes out of scope — for any reason. Function returned normally? Destructor runs. Function exited via break? Destructor runs. Function exited because an exception was thrown three frames down? Destructor runs as the stack unwinds. The bookkeeping cannot be skipped, because there is no human writing it any more.

Fig 6.5 — RAII: scope acquires, scope releases. C, manual and error-prone: fopen, malloc, pthread_mutex_lock, then every path out of the function (success or goto err) must unlock, free, and fclose in the right order. C++, automatic: std::ifstream f{"x"}, std::vector<char> buf(1024), std::lock_guard lock{m}; ~ifstream closes the file, ~vector frees the buffer, ~lock_guard unlocks, all automatically and exception-safely. Same effect, half the code, cannot be skipped: the compiler emits destructor calls in reverse order of construction, on every exit path, including exception unwind.

In the C version, every error path must remember to unlock the mutex, free the buffer, and close the file. Add a fourth resource and you have eight cleanup paths. Miss one and you leak. In the C++ version, the resources are stack objects. Their destructors run automatically when the function exits, in reverse construction order, on every exit path — return, break, exception, anything. The "cleanup" code does not exist because it does not need to be written. RAII is the single feature that lets C++ scale to large codebases without leaking; it is the reason exception safety is even possible; it is what every "modern" feature added since C++11 is built on top of.

Once you understand RAII, large parts of C++ become re-readable. std::vector is RAII for a heap-allocated array. std::unique_ptr is RAII for a single heap object. std::lock_guard is RAII for a mutex. std::ifstream is RAII for a file descriptor. The standard library is, to a first approximation, a catalogue of RAII wrappers around C resources. The whole language is structured around the principle that resources are objects, and objects clean up after themselves.

"Resource management is the C++ idiom that I am most proud of."

— Bjarne Stroustrup, The Design and Evolution of C++, 1994
05 — The performance contract

What "zero-cost" actually means.

C++ has one promise that has held for forty years: you do not pay for what you do not use, and what you do use is as fast as anything you could have written by hand. The slogan is zero-cost abstraction, and like most slogans it is slightly misleading — there are costs, just not at runtime. The cost is compile time, build complexity, and a generation of programmers who have to understand both layers to write C++ at all.

Consider std::sort. C's qsort takes a function pointer to a comparator, and at every comparison the CPU performs an indirect call — jumping through a function pointer that the compiler cannot see through. std::sort takes a comparator as a template argument, which means the compiler generates a fresh copy of std::sort per comparator type and inlines the comparator directly into the loop. The result is a tight, branch-predictable loop with no indirect calls. On a modern CPU sorting a million integers, std::sort typically runs about three times faster than qsort — same algorithm, same data, ostensibly the same task. The difference is entirely the dispatch mechanism.

Fig 6.6 — qsort vs std::sort, same algorithm, different cost. C: a comparator function int cmp(const void*, const void*) passed to qsort(arr, n, sizeof(int), cmp); the inner loop performs an indirect call per comparison (call rdx). C++: std::sort(arr.begin(), arr.end(), [](int a, int b){ return a < b; }); the comparator is inlined per instantiation, so the inner loop is a direct cmp and a predicted branch, roughly 3× faster. Template instantiation removes the indirect call; the cost is paid at compile time, not at runtime, and that is the trade.

Both versions implement the same comparison-based sort. The C version compiles to a loop where every comparison is an indirect call through a function pointer — the CPU cannot see what code is about to run, cannot prefetch, cannot inline. The C++ version specialises the entire sort at compile time for the specific lambda passed in: the comparison becomes a single cmp instruction in the loop body, no call, no indirection. Roughly threefold speedup on integer sorting. The price is paid in compile time and binary size — every std::sort call site with a different comparator type produces a new copy of std::sort in the binary. C++ trades disk and seconds-of-build for cycles-of-runtime, in volumes that strongly favour the runtime side.

The same logic runs through the entire C++ value system. Templates instead of function pointers, when you can. constexpr instead of runtime computation, when the inputs are known. Move semantics instead of copying, when the source will not be needed again. Each is a way to push work from runtime to compile time, where the compiler can see everything at once. C++ programmers tend to internalise this trade so completely that they forget they are making it. To a C programmer, that move can look like luxury. To a C++ programmer, it is just the cost the compiler should obviously pay so the program does not have to.

06 — Modern C++

The language that ate its own complexity.

Through the 1990s and 2000s, C++ accumulated a reputation as the language with every feature, none of which were quite right. By 2010, the standard committee had reached the same conclusion. C++11 — the first major standard release in thirteen years — added enough new mechanisms to obsolete most of the patterns the previous generation had hated. Smart pointers replaced raw new/delete. Move semantics replaced expensive copies. auto replaced verbose type declarations. Lambdas replaced function pointers. None of the old features were removed; the new ones were just so much better that competent C++ stopped using them.

Smart pointers

A raw new in modern C++ is rare to the point of suspicion. Three class templates from <memory> handle nearly every heap allocation case, each with explicit ownership semantics. std::unique_ptr owns exclusively — there is exactly one owner; copying is forbidden; moving transfers. std::shared_ptr owns jointly — a reference count tracks how many shared_ptrs point to the same object, and the object is deleted when the count drops to zero. std::weak_ptr observes — it does not affect the count, and accessing it requires explicitly upgrading to a shared_ptr (which may fail if the object is gone).

Fig 6.7 — Three pointer flavours, three ownership semantics
Panels: unique_ptr — exclusive ownership, a single owner, freed when the pointer dies, cannot copy but can move; shared_ptr — reference-counted joint ownership, freed when the count reaches zero; weak_ptr — a non-owning observer that does not extend the object's life and may be expired. Raw new/delete is rare in modern C++ — ownership is part of the type.

The three smart pointers cover almost every dynamic-allocation case. unique_ptr for "I own this; nobody else does" — has zero overhead beyond a raw pointer. shared_ptr for "this is jointly owned; clean up when nobody needs it" — adds an atomic reference count, paid for in cycles. weak_ptr for "I want to know when this object is alive but never extend its life" — used to break the cycles that shared_ptr graphs can otherwise leak. The right answer is almost always unique_ptr; shared_ptr is a red flag for "I have not finished thinking about ownership"; weak_ptr is for the cases where ownership genuinely cannot be made acyclic.
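A minimal sketch of the three ownership flavours in one place — the Sensor struct is invented for illustration:

#include <iostream>
#include <memory>

struct Sensor { int id; };

int main() {
    // unique_ptr: exactly one owner; freed when `owner` (or its successor) dies.
    auto owner = std::make_unique<Sensor>(Sensor{1});
    // auto copy = owner;                  // would not compile: copying is forbidden
    auto new_owner = std::move(owner);     // ownership transfers; `owner` is now null

    // shared_ptr: the object lives while any shared_ptr to it exists.
    auto a = std::make_shared<Sensor>(Sensor{2});
    auto b = a;                            // refcount is now 2
    std::cout << a.use_count() << "\n";    // prints 2

    // weak_ptr: observes without extending the object's life.
    std::weak_ptr<Sensor> w = a;
    a.reset();
    b.reset();                             // count hits 0, the Sensor is freed
    if (auto alive = w.lock()) std::cout << "still alive\n";
    else                       std::cout << "expired\n";   // this branch runs
}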

Move semantics

For decades, returning a large object from a function meant copying it. C++11 added a way to move it instead — to declare "I am done with this; take it" and have the destination object steal the source's internal buffers, leaving the source in a valid-but-empty state. The mechanism is the rvalue reference, written T&&: a reference that can only bind to a temporary, signalling that the caller has no further use for the value. A class with a properly written move constructor can transfer ownership of a million-element vector in three pointer assignments.

Fig 6.8 — A move is "I'm done with this, take it"
Before auto v2 = std::move(v1): v1 holds size=1M and a pointer to the heap buffer; v2 is empty. After: v1 is size=0 with a null pointer, and v2 points at the same buffer — three pointer copies, no allocation. No copying; the buffer's address is reassigned, and v1 is left empty but valid.

A move transfers ownership of a heap buffer by reassigning pointers. No copying. No allocation. A vector of one million elements moves in roughly three machine instructions, regardless of size. The source is left in a valid-but-empty state — destructible, assignable, but no longer holding the data. Move semantics are why "return by value" is the right default in modern C++ for large containers: the compiler proves the local is going out of scope, marks the return as an rvalue, and the receiver moves rather than copies. The performance trick that used to require careful manual code is now automatic.
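A minimal sketch of the move in Fig 6.8 — the vector size and the names are illustrative:

#include <utility>
#include <vector>

std::vector<int> build_big() {
    std::vector<int> v(1'000'000, 42);
    return v;               // returned by value: moved (or elided), never deep-copied
}

int main() {
    std::vector<int> v1 = build_big();

    // The explicit move: v2 steals v1's heap buffer. A few pointer-sized writes,
    // no allocation, no element copies, regardless of how large the vector is.
    std::vector<int> v2 = std::move(v1);

    // v1 is now valid but empty (in practice size() == 0); v2 holds the million ints.
    // Using v1's old contents from here on would be a bug -- assign to it first.
}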

Fig 6.9 — C++ standards · the language eating its own complexity
Timeline — six standards in twenty-five years, each one rewriting what "modern C++" meant: C++98 (templates, STL, exceptions, namespaces, RTTI — "the original"), C++03 (defect repair, small housekeeping), C++11 (auto, lambdas, unique_ptr, shared_ptr, move semantics, range-for, nullptr, std::thread — "a new language"), C++14 (generic lambdas, make_unique, expanded constexpr), C++17 (structured bindings, std::optional, std::variant, if constexpr, filesystem, parallel STL), C++20 (concepts, modules, ranges, coroutines, the spaceship operator, std::format, designated initialisers, consteval — a second "new language"), C++23 (std::expected, std::print, stacktrace, flat_map, deducing this, [[assume]]). C++11 obsoleted raw new/delete, function-pointer callbacks, manual loop indices, and the TR1 hacks; C++20 is obsoleting SFINAE-for-constraints and the header-include build model. Roughly nine years between major rewrites: C++11 retired most of "Effective C++" (2005), and C++20 is retiring most of "Effective Modern C++" (2014).

This is C++'s release cadence after the long wait between C++98/C++03 and what came next. C++11 was the rewrite — auto, lambdas, smart pointers, move semantics, range-for, threads — and the language stopped feeling like "C plus quirks." C++14 and C++17 polished. C++20 was the second rewrite — concepts (template constraints that finally fail with readable errors), modules (replacing fifty years of header-include cruft), ranges (composable algorithms), coroutines (suspendable functions). C++23 added quality-of-life. The pattern is consistent: roughly every nine years, the standard committee absorbs enough community experience that the previous "modern C++" books are out of date. A working C++ programmer from 2002 who never updated is, in 2026, writing a different language than C++ has become — same syntax, different idioms, different bug profiles, different performance shape. The language eats its own complexity by replacing the patterns the previous generation hated, without removing the original mechanisms those patterns were workarounds for. The mechanisms accumulate; the recommended subset shifts.

The seam to Chapter 7

C and C++ are both languages that face the machine. They give the programmer enormous power — direct memory access, deterministic destruction, zero-cost abstraction — at the cost of demanding the programmer pay careful attention to what the machine actually does. By 1990 these languages were running every system that mattered, and many programmers assumed this was the trade every language must make. Around the same time, in Amsterdam, a man named Guido van Rossum was asking the opposite question. What if a language did not face the machine at all?

"C++ leaves few opportunities for cleverness."

— Bjarne Stroustrup, defending the language he made

Where C++ still bleeds. Smart pointers and RAII closed most of the manual-memory wounds, but three classes survive. Use-after-move: a moved-from object is in a valid-but-unspecified state, and using it as if it still held its data is a logic error the compiler will not warn about. Vtable corruption: a use-after-free on a polymorphic object lets an attacker overwrite the vtable pointer and route every virtual call to attacker-chosen code — the modern descendant of the buffer overflow from Chapter 3. Iterator invalidation: the moment a vector reallocates, every outstanding iterator is a dangling pointer. The continuing CVE rate in C++ codebases is what motivates the slow migration to Rust in security-critical paths.

Chapter 07

The Language
That Chose Humans
Over Machines

C and C++ are languages designed around the machine. Python is designed around the programmer. The trade is explicit: you give up an order of magnitude of performance, and you get back code that reads like the pseudocode in your head. For most problems most programmers actually have, that trade is the right one.

TopicsInterpretation · Dynamic typing · GIL · The two-language strategy
Era covered1989 → present
Chapter 07 hero: main.py ("for x in numbers: print(x ** 2)" — human-readable) feeding the CPython VM (tokenize · parse · compile · eval; ~1M lines of C underneath; slow but predictable), which hands the heavy lifting to NumPy (SIMD · BLAS), PyTorch (CUDA · ATen), and SciPy (Fortran kernels).
01 — Compiled vs interpreted

Two pipelines, two trades.

The first Lisp interpreter ran on an IBM 704 at MIT in 1959. John McCarthy had published a paper proposing a new language defined recursively in terms of itself — its core function, eval, evaluated Lisp expressions in Lisp — but had not implemented it. A research assistant named Steve Russell, who had been transcribing the paper, realised that eval looked like working code rather than mathematical theory, and spent a few weeks turning it into machine code by hand. The result was the first interpreter that mattered: a program that read each Lisp expression at runtime and executed it directly, without ever producing a separate binary. BASIC, designed at Dartmouth in 1964, reached most of its users as an interpreted language on the microcomputers that followed. The argument over when a program should be translated to machine instructions — once, on the developer's machine, into a self-sufficient binary; or every time it runs, in flight, on the user's — is older than C, older than UNIX, older than most of the languages this book has met. C took one side. Python took the other. The first gives you speed. The second gives you back something C never thought worth bargaining for: your time.

A compiled language passes through a long pipeline before it ever runs. A C source file becomes preprocessor output, becomes parser tokens, becomes an abstract syntax tree, becomes intermediate representation (often LLVM IR), becomes optimised IR, becomes target-specific assembly, becomes machine code in an object file, becomes a linked executable. All of this happens in advance, on the developer's machine. By the time a user runs the program, the CPU loads instructions and starts executing them. Nothing translates at runtime. Speed is bounded only by the hardware and the compiler's cleverness.

An interpreted language does most of the same pipeline, but at runtime, on the end-user's machine, every time. Python source becomes tokens, becomes an AST, becomes bytecode — a sequence of small, simple opcodes for a virtual machine. The CPython interpreter is a giant switch statement in C that reads each opcode and performs its action. The CPU executes the interpreter; the interpreter executes your code. Every Python statement becomes hundreds of CPU instructions in the interpreter's dispatch loop, plus the actual work.
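The dispatch loop is easier to picture as code than as prose. A toy sketch of the "read an opcode, jump to its handler" shape, written in Python itself; the opcode names and the three-variable program are invented for illustration — the real loop lives in ceval.c and is written in C:

# A toy stack-based eval loop -- the same shape as CPython's, with invented opcodes.
def run(bytecode, env):
    stack = []
    for op, arg in bytecode:            # the dispatch loop: one opcode at a time
        if op == "LOAD":                # push a variable's current value
            stack.append(env[arg])
        elif op == "MUL":               # pop two values, push their product
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "ADD":               # pop two values, push their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack.pop()

# a + b * c, flattened into postfix opcodes for the stack machine
program = [("LOAD", "a"), ("LOAD", "b"), ("LOAD", "c"), ("MUL", None), ("ADD", None)]
print(run(program, {"a": 2, "b": 3, "c": 4}))   # prints 14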

Fig 7.1 — The two pipelines, side by side
Compiled — done once, ahead of time: main.c source → preprocess + AST → optimise (LLVM IR) → assemble (.s → .o) → a.out native binary → the CPU runs it; all of this happens on the developer's machine, before shipping. Interpreted — done every time the program runs: main.py source → tokenize + AST → bytecode (.pyc cache) → the CPython eval loop (switch on opcode → handler) → the CPU runs the interpreter, which runs your code one bytecode at a time. Same work; a different place to pay the translation cost.

A compiled language pays its translation cost on the developer's machine, once, and ships native instructions. An interpreted language pays its translation cost on the user's machine, every run, but ships source code that needs no per-platform build. The same Python script runs on Windows, macOS, Linux, an Android phone, and a Raspberry Pi without modification — provided each has a Python interpreter installed. The user pays for that portability in cycles every time the script runs.

02 — Van Rossum 1989-91

A Christmas hobby that ate the world.

Christmas 1989, at the Centrum Wiskunde & Informatica in Amsterdam, Guido van Rossum was bored. He had spent three years working on a research language called ABC, which had failed to gain traction outside CWI. Over the holiday break, working from home, he started writing an interpreter for a new language that borrowed ABC's clarity and added the things ABC had refused — modules, exceptions, real escape hatches into C. He named it after a British comedy troupe he liked. By 1991 he had released it to USENET, where it began the slow, improbable march that ended with it eating data science, scientific computing, and most of the world's introductory programming courses.

Python's design choices, all of them, can be read off van Rossum's design notes from those years. Readability counts: indentation is part of the grammar, so badly-indented Python is also syntactically wrong; this enforces consistent layout across every codebase. Explicit is better than implicit: no hidden conversions, no surprise side effects in operator overloading, self spelled out as the first parameter of every method. There should be one — and preferably only one — obvious way to do it: the anti-Perl principle, deliberately the opposite of "many ways to skin a cat." These were aesthetic commitments before they were technical ones.

Fig 7.2 — Python's slow conquest
Timeline: ABC at CWI (research language, 1980s) → the Christmas 1989 hack → Python 0.9 released on USENET (1991) → Python 2.0 (2000, list comprehensions) → Python 3.0 (2008, the painful split) → Python 2 end-of-life (2020) → Python 3.13 no-GIL preview (2024). From a CWI hallway to half the world's data scientists. "I needed a name that was short, unique, and slightly mysterious. I was reading the published scripts of Monty Python's Flying Circus." — Guido van Rossum, on the name.

Python 1.0 was a small interpreted language with a small but devoted following — scientific computing, sysadmins, education. Python 2 added the features that made it usable for serious work. Python 3 fixed long-standing design mistakes (Unicode strings, integer division, print as a function) at the cost of breaking compatibility — and the migration took twelve years and divided the community. Python 2 was finally retired in 2020. By that point Python had become, almost by accident, the dominant language of data science, machine learning, and the world's introductory CS courses. The ascent was slow, and it happened without winning a single benchmark.

03 — Dynamic typing

Types are properties of values, not variables.

In C and C++, a variable has a type, declared once and fixed forever. In Python, a variable is just a name that can refer to any value, and a value carries its own type. Write x = 5, then x = "hello", then x = [1, 2, 3] — all legal. The interpreter checks types at every operation, paying for that check in cycles. The programmer pays nothing up front — no declarations to write — and pays later, when a type mistake surfaces only at the moment the offending line runs.

This is the central trade. Statically typed languages — C, C++, Rust, Go, Java — check types once, at compile time. If you try to add a string to an integer, the compiler refuses; the program never runs at all. Dynamically typed languages — Python, JavaScript, Ruby — check types at runtime. The same mistake compiles fine, runs fine until it doesn't, and dies with a TypeError at the offending statement. The static side is safer; the dynamic side is more fluid. Reasonable people disagree about which trade is correct, and have for thirty years.
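A minimal sketch of that difference, in Python — the function is invented for illustration; the equivalent mistake in C or C++ would be refused before the program ever ran:

# The name rebinds freely; each value carries its own type.
x = 5
x = "hello"
x = [1, 2, 3]              # all legal -- three values, three types, one name

def add_tax(price):
    return price * 1.2     # nothing checks what `price` is until this line runs

print(add_tax(100))        # 120.0
try:
    add_tax("100")         # the same call with a string: nothing refuses it up front...
except TypeError as e:
    print("dies at the offending statement:", e)   # ...it fails here, at runtime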

Python pushes the dynamic side to its logical limit with duck typing: "If it walks like a duck and quacks like a duck, it is a duck." Translation: an object is acceptable wherever its methods are acceptable, regardless of its declared class. A function that calls obj.read() works on any object that has a read() method. Files, network sockets, strings wrapped in a buffer adapter, your own custom class, a mock for testing — all interchangeable, with no shared base class required. The price is that "acceptable" is checked at runtime; the benefit is that you can compose objects in ways the original designers never anticipated.

Fig 7.3 — Duck typing: behaviour, not lineage
Panels: class Duck with def speak(self): return "quack" (no parent class); class Robot with def speak(self): return "BEEP" (no parent class, unrelated to Duck); def announce(thing): print(thing.speak()) — accepts both, with no shared interface declared anywhere.

Two classes with nothing in common except a method name. A function that calls thing.speak() works on either, with no inheritance, no abstract base, no Protocol declared. The interpreter looks up speak on the actual object at the call site; if it exists, it gets called; if not, AttributeError. The strength of duck typing is that you can pass anything, including objects from libraries that did not exist when the function was written. The weakness is that the contract — "must have a speak() method" — is enforced at runtime, by the absence of an exception, rather than at compile time, by the type system. Neither side is universally right. The right answer depends on whether your bugs come from the people writing the code or the people calling it.
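The figure's example, as a few runnable lines — a minimal sketch, with the failure case added to show where the runtime contract is enforced:

class Duck:
    def speak(self):
        return "quack"

class Robot:                 # unrelated to Duck: no shared base class, no Protocol
    def speak(self):
        return "BEEP"

def announce(thing):
    print(thing.speak())     # `speak` is looked up on the actual object, at the call site

announce(Duck())             # quack
announce(Robot())            # BEEP
try:
    announce(42)             # ints have no speak() method
except AttributeError as e:
    print("contract enforced at runtime:", e)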

04 — The GIL and concurrency

One Python thread runs at a time. There are reasons. They are running out.

Python supports threads. Python's threads are real OS threads — each is its own thread of execution at the kernel level. But CPython, the reference interpreter, contains a single global lock — the GIL, Global Interpreter Lock — that ensures only one thread holds the right to execute Python bytecode at any moment. On an eight-core CPU running a multithreaded Python program, seven cores sit idle. This sounds like a bug. It is, in fact, the consequence of a deliberate design decision in 1992 that has only become controversial in the last decade.

The reason is reference counting. Every Python object carries a count of how many names point to it; when the count drops to zero, the object is freed. Updating the count must be atomic — two threads incrementing or decrementing simultaneously would corrupt it. The GIL is the simplest possible solution: serialise everything Python-level, so refcounts cannot race. Removing the GIL has been attempted many times; the difficulty is not theoretical but practical — fine-grained locking slows down single-threaded code, which is most code, by 30–50%. PEP 703, accepted in 2023, finally lays out a way to remove the GIL while keeping single-threaded performance acceptable, and Python 3.13 ships an opt-in no-GIL mode. Default-on no-GIL is targeted for the late 2020s.

Fig 7.4 — Two Python threads, one GIL
Timeline: thread 1 and thread 2 alternate, each running only while it holds the GIL, with a handoff between every burst. Two threads, one GIL, taking turns — on an 8-core CPU, threading still occupies only one core's worth of time.

Two Python threads alternate. Only one holds the GIL at any instant; the other waits. The interpreter releases and re-acquires the GIL at intervals (on a timer — roughly every 5 milliseconds by default in modern CPython; older versions switched every 100 bytecodes), and around any blocking I/O call. So Python threads do help with I/O-bound work — while one thread is waiting on disk or network, another can run. They do not help with CPU-bound work, which has been the source of complaints since the late 1990s. PEP 703 (accepted 2023) removes the GIL through fine-grained per-object locking and biased reference counting; it is shipping as opt-in in Python 3.13. The single-lock era is ending, slowly, after thirty years.
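A minimal sketch of the difference the GIL makes — timings are machine-dependent and the exact numbers are not the point; on a stock CPython build the threaded CPU-bound run is not meaningfully faster than doing the work back-to-back:

import threading
import time

def burn(n=10_000_000):            # CPU-bound: pure bytecode, no I/O to release the GIL around
    total = 0
    for i in range(n):
        total += i
    return total

start = time.perf_counter()
burn(); burn()                     # the same work, done back-to-back on one thread
sequential = time.perf_counter() - start

threads = [threading.Thread(target=burn) for _ in range(2)]
start = time.perf_counter()
for t in threads: t.start()
for t in threads: t.join()
threaded = time.perf_counter() - start

# With the GIL, the threaded time stays close to the sequential time: the two
# threads take turns. Swap `burn` for time.sleep() and the threads overlap instead.
print(f"sequential {sequential:.2f}s, two threads {threaded:.2f}s")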

Fig 7.5 — asyncio: cooperative scheduling on one thread
Timeline: tasks A, B, and C interleave on a single thread, each yielding at every await. One thread, many tasks, cooperative scheduling — no GIL contention because there is only one thread. The Python answer to massive I/O concurrency: bypass threads entirely.

asyncio, introduced in Python 3.4, sidesteps the GIL by sidestepping threads. A single thread runs an event loop. Functions marked async def are coroutines; when they hit await on an I/O operation, they suspend and the event loop runs a different ready coroutine. The thread is never blocked. A single Python process can handle tens of thousands of simultaneous network connections this way — far more than the GIL would allow with threads, with no concurrency bugs because there is no parallelism. The trade is that async is contagious — it propagates through every function that calls an async function — and integrating async with synchronous libraries is famously awkward. JavaScript made the same trade in Chapter 12; the event loop is one of the great twentieth-century compromises in software design.
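A minimal sketch of the event loop in use — asyncio.sleep stands in for a network read; the task names are invented:

import asyncio

async def fetch(name, delay):
    await asyncio.sleep(delay)      # suspend here; the loop runs other ready tasks
    return f"{name} done"

async def main():
    # Three "requests" on one thread: total time is roughly the longest delay,
    # not the sum, because each task yields to the loop while it waits.
    results = await asyncio.gather(fetch("A", 1.0), fetch("B", 1.0), fetch("C", 1.0))
    print(results)                  # ['A done', 'B done', 'C done'] after ~1 second

asyncio.run(main())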

05 — How Python is built

CPython is one million lines of C running your script.

"Python" is a language specification. CPython is the reference implementation — the interpreter you almost certainly used the last time you typed python at a shell. It is roughly one million lines of C, maintained since 1991 by Guido van Rossum and now a global team of core developers. It is the bridge between the language's textbook semantics and the silicon's actual instructions.

The pipeline inside CPython is short and unromantic. A source file is read. The tokenizer splits it into tokens — identifiers, numbers, operators, indents and dedents. The parser assembles tokens into an abstract syntax tree. The compiler walks the AST and emits a list of bytecode instructions for a stack-based virtual machine. The evaluator — the famous ceval.c, ten thousand lines long — is a giant switch statement that reads each bytecode and dispatches to a C handler that performs its action. Bytecode is cached on disk in __pycache__/ as .pyc files, so subsequent runs skip the first three steps unless the source has changed.

Fig 7.6 — CPython's pipeline, in detail
Pipeline for the source line a + b * c: tokens [NAME a, +, NAME b, *, NAME c] → AST BinOp(Add, a, BinOp(Mul, b, c)) → bytecode (.pyc): LOAD_NAME a, LOAD_NAME b, LOAD_NAME c, BINARY_MUL, BINARY_ADD → ceval.c's eval loop: while (true) { switch (opcode) { case LOAD_NAME: ...; case BINARY_ADD: ...; } }. Every Python statement is a stream of bytecodes dispatched through a switch in C.

CPython compiles your source into a stream of stack-based bytecodes — the same idea Java's JVM uses, arrived at years earlier and simpler in design. The eval loop reads bytecodes one at a time and dispatches to a C handler. Each handler manipulates a stack of Python objects, performs the operation, and returns to the dispatch. This is the "switch on opcode" pattern at scale: a giant switch with well over a hundred cases, hand-tuned for branch prediction, with the most common opcodes (load/store local, binary add, function call) inlined for speed.
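You can watch this pipeline yourself with the standard-library disassembler; a minimal sketch, noting that the exact opcode names shift between CPython versions (BINARY_MULTIPLY and BINARY_ADD, for instance, collapsed into BINARY_OP in 3.11):

import dis

def expr(a, b, c):
    return a + b * c

dis.dis(expr)
# Typical output on an older 3.x (names vary by version):
#   LOAD_FAST     a
#   LOAD_FAST     b
#   LOAD_FAST     c
#   BINARY_MULTIPLY
#   BINARY_ADD
#   RETURN_VALUE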

Fig 7.7 — Why Python is slow (and why nobody minds)
Cycles per arithmetic operation, log scale: C ~1, Java JIT ~1.2, V8 (JS) ~1.5, PyPy ~3, CPython ~150. CPython's per-operation work — unbox operands, check types, dispatch via vtable, bump refcounts, box the result — is what makes it ~150× C.

A single integer addition in C compiles to a single CPU instruction. The same operation in CPython performs roughly 150 instructions of work: unbox both operands from their PyObject boxes, check that they are both ints (and not, say, a custom class with __add__ overridden), dispatch to long_add, allocate a new PyObject for the result, bump its refcount, push it onto the eval stack, decrement the refcounts of the operands. CPython is slow because it does, at runtime, the work the C compiler does once at compile time. PyPy — an alternative Python implementation with a tracing JIT — recovers most of the gap for hot loops. CPython remains the reference implementation, slow and stable. For most workloads, this gap is invisible because the bottleneck is elsewhere — disk, network, the database, the model on the GPU — and Python is not the thing being measured.

06 — Why Python won data science

The two-language strategy.

Python is slow. Numerical work is the most cycle-hungry thing computers are routinely asked to do. By every naive analysis, Python should not be the language of numerical computing, scientific simulation, or machine learning. It is, anyway. The reason is a single design pattern, used so consistently that it has become the unspoken contract of the entire ecosystem: write the human-facing code in Python, write the performance-critical code in C or Fortran, and bridge them with a thin wrapper. Python becomes the dashboard. C is the engine. Both run.

Fig 7.8 — How NumPy actually does work
Layers: Python user code — np.dot(A, B), one line of Python, readable, debuggable, interactive → NumPy, a thin C wrapper — type-check, dispatch to a BLAS routine, return a view (~20 lines of C per NumPy operation) → BLAS/LAPACK, Fortran and assembly — hand-tuned matmul (SIMD, cache blocking, AVX-512), written by numerical analysts since the 1970s → CPU SIMD lanes / GPU cores.

A single NumPy call descends through three layers and ends in hand-tuned numerical kernels that have been refined by specialists since the 1970s. The Python programmer never touches Fortran; the BLAS authors never touch Python. The wrapper layer in NumPy is what makes the bridge work — it accepts Python objects, validates types and shapes, calls the right BLAS routine, and returns a Python object pointing into the result buffer. The runtime cost of the Python layer is amortised across the giant compute the BLAS layer performs. For a 1000×1000 matmul, Python's overhead is microseconds; BLAS's compute is milliseconds. The Python layer is invisible.

Fig 7.9 — The same matmul, two ways
Pure Python: 12.4 seconds. np.dot: 0.012 s, roughly 1000× faster — a 1000 × 1000 matrix multiplication. Pure-Python numerics is not really a thing anyone does.

A million-element matrix multiplication in pure Python takes about twelve seconds — three nested loops, each iteration paying the interpreter tax. The same multiplication in NumPy takes a hundredth of a second. Three orders of magnitude. The difference is not Python improving; the difference is that np.dot never executes Python in its inner loop. Once the BLAS call returns, you are back in Python, plotting or printing or storing — work where the human's time matters more than the CPU's. This is the deal: humans use Python. The machine uses C and Fortran. They meet at the wrapper.
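A minimal sketch of the figure's comparison — a smaller matrix so the pure-Python side finishes in reasonable time; absolute numbers depend on the machine and on which BLAS the local NumPy links against:

import time
import numpy as np

n = 300
A = [[1.0] * n for _ in range(n)]
B = [[1.0] * n for _ in range(n)]

start = time.perf_counter()
# Pure Python: three nested loops, every multiply paying the interpreter tax.
C = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
pure = time.perf_counter() - start

A2, B2 = np.array(A), np.array(B)
start = time.perf_counter()
C2 = A2 @ B2                        # one call; no Python executes in the inner loop
blas = time.perf_counter() - start

print(f"pure Python {pure:.2f}s, NumPy {blas:.4f}s, ~{pure / blas:.0f}x apart")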

The verdict

Python beat every "fast" language for data science not because Python won a benchmark — it lost every benchmark — but because the people doing the work were not benchmark writers. They were physicists, biologists, traders, climate modellers, machine-learning researchers. They had problems to solve, not loops to tune. Python let them spend their time thinking about the problem instead of thinking about the language. The two-language strategy delegated the parts where cycles mattered to people whose career was making cycles cheap. Both groups got what they wanted.

It is worth saying out loud that this is, almost exactly, the trade C and C++ refused to make. Stroustrup believed any abstraction must be free; Ritchie believed any abstraction must be simple. Van Rossum believed the cost of an abstraction was paid in human attention, and he was willing to spend a hundred machine cycles to save a programmer five minutes of typing. Three languages, three positions on the same axis. All three are correct. They serve different problems, written by different people, on different schedules.

"There should be one — and preferably only one — obvious way to do it. Although that way may not be obvious at first unless you're Dutch."

— The Zen of Python, Tim Peters, 1999

Python's three sharp edges. Pickle deserialization is arbitrary code execution by design — pickle.loads() on attacker-controlled bytes lets the attacker run any Python they want. Every web framework that accepted pickled session cookies has been bitten by this; the fix is to never deserialize pickled data you didn't sign. PyPI typosquatting: requests and requestz are one keystroke apart, and the malicious package can os.system() on install. eval() and exec() on user input are the same data-as-code mistake that gives Chapter 13's SQL injection its name. Dynamism is convenience; dynamism is also attack surface.

End of Part Two

The kernel and the languages.

Part I built the substrate; Part II opened the program that owns it and the languages we write on top of it. The kernel as code, not silicon. C, the language UNIX rewrote itself in. C++, scaling C up to systems too big for one human's head. Python, inverting the trade — slower than C by orders of magnitude, but readable, and built so the parts that need to be fast can be fast in someone else's language. Three different deals with the same machine. All three still in production. All three necessary, for different reasons, for different problems.

Part III steps out of the box entirely. The wire leaves the CPU, crosses the room, and joins the network — voltage on copper becoming voltage across a continent. It is, once again, a story about people: Shannon, Baran, Davies, Cerf, Berners-Lee, Eich. And it is, once again, a story about silicon and the abstractions we build over it.