The two ways to make your program use more than one core.
By 1990, the kernel had given UNIX programmers a clean answer to
"how do I run two things at once": fork. The child
process ran in parallel with the parent, isolated from it, with
its own address space, its own file descriptors, its own
everything (Chapter 4). The
isolation was exactly the problem when the two processes wanted
to cooperate. A parent that wanted to share a
million-element array with its child had to copy it across the
kernel boundary every time, paying for privacy it did not want.
POSIX 1003.1c, ratified in 1995, codified the
alternative: the thread — a separate stream of
execution that shares the address space of the process it
belongs to, can be scheduled onto a different core, and can read
and write the same memory directly. Threads are the cheap, fast,
dangerous answer to "how do I use more than one core"; processes
are the slow, safe one. The next thirty years of concurrent
programming are essentially about the price of the danger.
Threads share the heap, the globals, the file descriptor table, and most other process state — only the call stack and the registers are per-thread. Processes share none of this; each lives behind its own page table, with the kernel mediating any communication. The trade is real and consequential: threads make IPC free (just write to a variable; the other thread can read it) but make every concurrent access a potential bug. Processes make every interaction a system call but make memory corruption between them impossible. Modern programming languages reach for both: C, C++, Java, Rust default to threads; Python's multiprocessing reaches for processes (because the GIL — Chapter 7 §04 — makes threads less useful); Erlang and Go invented their own lightweight thread-like primitives (BEAM processes, goroutines) that get the best of both.
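A minimal sketch in C++ (the language the code in this chapter uses) of what "IPC is free" means for threads, with illustrative names: the spawned thread reads and writes the parent's vector directly, where a forked child would need a pipe, a socket, or an explicitly shared segment to send its results back.

#include <thread>
#include <vector>

int main() {
    std::vector<int> shared(1'000'000, 1);   // one million elements, one copy

    // The new thread operates on the parent's memory directly: no copying,
    // no kernel crossing per access.
    std::thread worker([&shared] {
        for (int& x : shared) x *= 2;
    });
    worker.join();                           // after join, the writes are visible here

    // A fork()ed child would get its own (copy-on-write) address space; the
    // doubled values would have to travel back through a pipe, a socket,
    // or a shared-memory segment set up on purpose.
    return shared.front() == 2 ? 0 : 1;
}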
Where parallelism actually helps
Not every program benefits from more cores. Programs whose work naturally divides into independent chunks — image filters processing pixels, web servers handling unrelated requests, Monte Carlo simulations running independent samples — are called embarrassingly parallel and scale almost linearly with core count. Programs whose steps depend on each other — most stateful computations, anything sequential — benefit much less, or not at all. The mathematics of this constraint is Amdahl's law, which we get to in §05; for now the practical observation is that adding cores is not a substitute for serial speed. A 64-core laptop runs a pure-sequential program no faster than a 1-core one with the same clock speed.
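A minimal sketch of the embarrassingly parallel shape, assuming the per-element work is fully independent: the array is cut into one contiguous chunk per hardware thread, and the chunks never communicate. The function and variable names are illustrative.

#include <algorithm>
#include <cmath>
#include <thread>
#include <vector>

// Each element's result depends only on that element, so the chunks can run
// in parallel with no locks and no coordination.
void parallel_transform(std::vector<double>& data) {
    unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk  = (data.size() + n_threads - 1) / n_threads;
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < n_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end   = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                data[i] = std::sqrt(data[i]) * 2.0;   // purely local work
        });
    }
    for (auto& w : workers) w.join();
}

int main() {
    std::vector<double> data(10'000'000, 9.0);
    parallel_transform(data);                 // every element becomes 6.0
}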
Two threads · one variable · everything depends on luck.
The price of shared memory shows up the first time two threads
modify the same variable concurrently. The classic example is so
simple it sounds harmless: two threads each incrementing a shared
counter a million times. The expected result is two million. The
actual result, on any real machine, is somewhere between one
million and two million, with the exact value depending on
scheduling, cache coherence, and luck. The increment that looks
atomic in source code (counter++) is in fact three
operations underneath: read the value into a register, add one,
write it back. If two threads' three-step sequences overlap, the
writes can stomp on each other.
The reference race condition. Both threads execute counter++ when the counter is 100. Thread 1 reads 100, adds 1, plans to store 101. Before Thread 1 stores, Thread 2 also reads 100 (the not-yet-updated value), adds 1, plans to store 101. Both stores happen; the counter ends at 101 instead of 102. Thread 2's increment vanished. Multiply this across a million iterations and the expected result of 2,000,000 routinely comes out as 1,000,000 + ε for some small ε. The bug is invisible in source code, depends entirely on scheduling, fails to reproduce reliably, and is one of the most frustrating classes of error in software. Decades of language design — atomics, locks, channels, Rust's borrow checker — exist to prevent this category specifically.
The mutex · serialise the critical section
The classic fix is a mutex (mutual exclusion lock).
Each thread, before touching the shared counter, calls
mutex.lock(); after touching it, calls
mutex.unlock(). Only one thread can hold the lock at
a time — any other thread that calls lock() while it
is held blocks until the holder releases. The protected
region (between lock and unlock) is called the critical
section, and the rule is that all critical sections that
touch the same data are serialised. The mutex itself is implemented
using atomic CPU instructions (compare-and-swap or
load-link/store-conditional) — operations the CPU
guarantees are indivisible at the hardware level. Underneath every
thread-safe data structure is a chain of these primitives.
// the race — counter ends somewhere between 1,000,000 and 2,000,000
int counter = 0;
void worker() {
    for (int i = 0; i < 1'000'000; ++i)
        ++counter;                  // LOAD; ADD; STORE — three instructions
}

// the fix — std::mutex serialises the critical section
std::mutex m;
int counter = 0;
void worker() {
    for (int i = 0; i < 1'000'000; ++i) {
        std::lock_guard<std::mutex> g(m);
        ++counter;                  // only one thread holds the lock at a time
    }                               // g's destructor releases the lock — RAII
}

// or, for this specific case, an atomic — no lock at all, single CPU instruction
std::atomic<int> counter{0};
void worker() {
    for (int i = 0; i < 1'000'000; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);
}
Deadlock · the trap when locks compose
Locks solve the race; they introduce a new failure mode. Deadlock happens when two or more threads each hold a lock the other one needs, and both wait forever. The canonical recipe: thread A holds lock L1 and tries to acquire L2; thread B holds L2 and tries to acquire L1. Each is blocked waiting for the other; neither can make progress; the program freezes. Deadlock is mathematically characterised by the existence of a cycle in the "wait-for" graph: an arrow from each waiting thread to the thread whose lock it wants. If the graph has a cycle, there is deadlock; if it has no cycle, there isn't. The standard prevention is lock ordering — every thread acquires locks in the same global order — which ensures the wait-for graph remains acyclic.
The mutex closes the race, but introduces deadlock as a new failure mode. Two threads, two locks, each holding what the other needs — both block forever. The same shape with spinning threads (each repeatedly trying and failing rather than blocking) is called livelock — equally fatal, harder to debug because the program looks busy. The standard prevention is lock ordering: define a global order over all locks, and require every thread to acquire them in that order. The wait-for graph can then only go forward, never form a cycle. The discipline is real: large codebases (Linux, PostgreSQL, V8) document their lock hierarchies, and acquiring locks out of order is a code-review red flag. Above the mutex, every higher-level concurrency primitive — semaphores, condition variables, read-write locks, channels — is built from the same atomic foundation, with the same potential for these three bugs.
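A minimal sketch of the recipe and the fix, with placeholder locks m1 and m2. The broken pair acquires the two locks in opposite orders, so the wait-for graph can form the cycle described above; the fixed pair acquires them in one global order (C++17's std::scoped_lock(m1, m2) achieves the same end by locking both together with a built-in avoidance algorithm).

#include <mutex>
#include <thread>

std::mutex m1, m2;

// Deadlock-prone: thread A runs broken_a, thread B runs broken_b.
// A holds m1 and waits for m2; B holds m2 and waits for m1.
void broken_a() { std::lock_guard<std::mutex> a(m1); std::lock_guard<std::mutex> b(m2); }
void broken_b() { std::lock_guard<std::mutex> a(m2); std::lock_guard<std::mutex> b(m1); }   // reversed order

// Lock ordering: every thread takes m1 before m2, so no cycle can form.
void fixed_a()  { std::lock_guard<std::mutex> a(m1); std::lock_guard<std::mutex> b(m2); }
void fixed_b()  { std::lock_guard<std::mutex> a(m1); std::lock_guard<std::mutex> b(m2); }

int main() {
    std::thread t1(fixed_a), t2(fixed_b);    // the broken pair may hang forever
    t1.join(); t2.join();
}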
Therac-25, 1985–87. The first software-caused medical fatalities in history were race conditions. The Therac-25 radiation therapy machine used a single computer to control beam intensity and targeting, with no hardware interlock — its predecessors had hardware interlocks, but the engineers had concluded the software was reliable enough to remove them. A specific keystroke sequence, fast enough to interleave with the beam-mode-select state machine, could leave the machine in high-power X-ray mode while the targeting magnet was set for low-power electron mode. Six patients received radiation overdoses of up to a hundred times the prescribed dose; three died. The lesson is older than the field: any concurrent state machine without explicit synchronisation will eventually be observed in every reachable state, including the states the designer assumed were impossible.
Concurrency without locks · the deeper rabbit hole.
Locks have a fundamental cost beyond their nanosecond overhead: they serialise. A queue protected by a single mutex can be used by one thread at a time regardless of how many cores are available. In high-performance contexts — kernels, databases, network stacks, JavaScript engines, garbage collectors — engineers reach for lock-free data structures: queues, stacks, hash tables built so that multiple threads can read and write concurrently without ever blocking each other. The mechanism is the same atomic CPU instruction that mutexes are built on, used directly: compare-and-swap (CAS). With CAS, a thread can perform an "if the value is still x, change it to y; otherwise tell me what it is now" operation as a single indivisible step. Built carefully, CAS-based data structures can be wait-free (every thread always makes progress) or lock-free (the system as a whole always makes progress, even if individual threads can starve). Both terms come from Maurice Herlihy's 1991 paper Wait-Free Synchronization, which proved a hierarchy of exactly which data structures can be implemented wait-free on which atomic primitives — the result that turned an engineering practice into a mathematical theory. The cost is that the code is genuinely difficult to write correctly, and reasoning about it requires understanding memory orderings — the rules by which writes from one core become visible to others.
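A minimal sketch of a CAS-based counter, the shape the next paragraph walks through. compare_exchange_weak is C++'s spelling of CAS; it may also fail spuriously on some hardware, which the retry loop absorbs. The function names here are illustrative.

#include <atomic>
#include <thread>

std::atomic<int> counter{0};

void increment_lock_free() {
    int observed = counter.load(std::memory_order_relaxed);
    // "If the value is still `observed`, replace it with observed + 1;
    //  otherwise put the current value into `observed` and try again."
    while (!counter.compare_exchange_weak(observed, observed + 1,
                                          std::memory_order_relaxed)) {
        // another thread won this round; retry with the value it left behind
    }
}

int main() {
    std::thread t1([] { for (int i = 0; i < 1'000'000; ++i) increment_lock_free(); });
    std::thread t2([] { for (int i = 0; i < 1'000'000; ++i) increment_lock_free(); });
    t1.join(); t2.join();
    // counter is exactly 2,000,000: no locks, no blocking, no lost updates
}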
Compare-and-swap is the foundation primitive of all modern multi-core synchronisation. The CPU provides it as a single atomic instruction (cmpxchg on x86, ldxr/stxr on ARM); higher-level languages expose it as std::atomic's compare_exchange_weak and compare_exchange_strong in C++, AtomicInteger.compareAndSet in Java, and the CompareAndSwap functions in Go's sync/atomic package. The lock-free counter shown reads the current value, computes the increment, and tries to atomically swap the new value in only if the old value is still there. If another thread got there first, the CAS fails and the loop retries with the new value. No locks, no blocking, no deadlock — but every operation can race-lose and retry. Lock-free queues, stacks, and hash tables are built on top of this primitive, with great difficulty.
Memory orderings · why the world is not sequentially consistent
The deeper subtlety is that on modern multi-core CPUs, the order
in which one core's writes become visible to another core is
not the order they were issued. Out-of-order execution
(Chapter 1 §04), store buffers, and cache coherence all conspire
to let writes appear reordered from another core's perspective.
A thread that writes x = 1; flag = true; may have
another thread on a different core observe flag = true
while still reading x = 0 — the writes were visible
out of order. Programs that depend on cross-thread ordering must
explicitly request guarantees by annotating atomic operations
with a memory ordering: relaxed (atomicity only, no
ordering guarantees), release (a store that publishes every
write this thread made before it), acquire (a load that, on
seeing a released value, makes those earlier writes visible
to this thread), or sequentially consistent (the strongest,
most expensive, easiest to reason about). Reasoning
about which ordering each operation needs is the deepest skill in
multi-threaded systems programming, and the one most often gotten
wrong.
Memory ordering is the second hard problem in lock-free programming, after CAS. Modern multi-core CPUs reorder reads and writes within a single core for performance — out-of-order execution, store buffers, prefetching — and these reorderings are visible across cores. Languages like C++, Rust, and Java let the programmer specify, per atomic operation, what ordering the operation needs to provide. Stricter orderings (sequentially consistent at the top) are easier to reason about but cost performance on weak-memory-model hardware (ARM, POWER, RISC-V); weaker orderings (relaxed) are faster but require careful argument. The mismatch between hardware models is itself a portability hazard: code that depends on x86's nearly-sequential-consistency can fail on ARM. Java and Go default to sequentially consistent atomics for safety; C++, Rust, and modern kernel code use the weaker orderings deliberately for performance, accepting the reasoning burden.
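A minimal sketch of the x/flag publication pattern from the paragraph above, with illustrative names. The release store pairs with the acquire load: once the reader observes ready as true, it is guaranteed to also see x == 1. Replace both orderings with memory_order_relaxed and that guarantee disappears on weakly ordered hardware.

#include <atomic>
#include <cassert>
#include <thread>

int x = 0;                        // plain data, not atomic
std::atomic<bool> ready{false};

void writer() {
    x = 1;                                            // 1. write the data
    ready.store(true, std::memory_order_release);     // 2. publish it
}

void reader() {
    while (!ready.load(std::memory_order_acquire)) { /* spin until published */ }
    assert(x == 1);               // acquire saw the release, so the write to x is visible
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join(); t2.join();
}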
The other shape of parallelism · thousands of small cores doing the same thing.
The CPU we have built — out-of-order execution, branch prediction, caches, all the machinery of Chapter 1 — is optimised for fast sequential execution: do one thing, do it as fast as possible, then move to the next. The GPU is the opposite. It has thousands of small cores running in lockstep, each executing the same instruction on different data. It evolved from the requirements of 3D graphics: every pixel on a million-pixel screen needs the same shading calculation applied to different inputs, and the natural way to do that is many small processors running the same code on different pixels. Around 2007, NVIDIA realised that this hardware was useful for any embarrassingly parallel computation, exposed it through the CUDA programming model, and watched a new computational substrate emerge. Today GPUs do most of the world's machine learning training and inference, most of its scientific simulation, all of its real-time graphics. They are a fundamentally different shape of computer, and software that uses them well looks very different from software that uses CPUs well.
A modern CPU spends most of its transistor budget on making one stream of instructions go fast: out-of-order execution, branch prediction, multi-megabyte caches, sophisticated SIMD units. Eight to sixteen of these expensive cores fit on a typical chip. A modern GPU spends its budget on doing the same thing to many things at once: thousands of small cores, very little branch-prediction logic, no out-of-order execution, much smaller caches. The two architectures are good at different problems. CPUs win on branchy, sequential, low-latency code (most application logic, operating system kernels, web servers). GPUs win on workloads that apply the same operation to a large array of data — graphics shading, neural network training and inference, scientific simulations, video transcoding. The deep-learning revolution of 2012 onward is, mechanically, the realisation that backpropagation through a neural network is one giant matrix multiply, and matrix multiplies are exactly what GPUs are built for.
SIMD and the CUDA hierarchy
The GPU's organising principle is SIMD — Single Instruction, Multiple Data. A SIMD instruction operates on a vector of values at once: add these 32 numbers to those 32 numbers in one instruction. CPUs have SIMD too (SSE, AVX on x86; NEON on ARM) — typically 128 or 256 bits wide, processing 4 or 8 values at once. GPUs take this much further: a CUDA warp is 32 threads running in lockstep, executing the same instruction on different data. Many warps are grouped into a block; many blocks make a grid; the entire grid is dispatched to the GPU as a single kernel launch. The programmer writes code that looks like it runs on a single thread; the runtime replicates it across the grid.
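A minimal sketch of CPU-side SIMD, assuming an x86-64 machine with AVX and a compiler invoked with something like -mavx: one _mm256_add_ps instruction performs eight float additions at once, the same single-instruction-multiple-data idea the GPU scales up to warps of 32.

#include <immintrin.h>
#include <cstdio>

int main() {
    alignas(32) float a[8]   = {0, 1, 2, 3, 4, 5, 6, 7};
    alignas(32) float b[8]   = {10, 10, 10, 10, 10, 10, 10, 10};
    alignas(32) float out[8];

    __m256 va = _mm256_load_ps(a);       // load 8 floats into one 256-bit register
    __m256 vb = _mm256_load_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);   // one instruction, eight additions
    _mm256_store_ps(out, vc);

    for (float f : out) std::printf("%.0f ", f);   // 10 11 12 13 14 15 16 17
    std::printf("\n");
}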
CUDA's four-level hierarchy maps directly onto the GPU's silicon. A thread is the smallest execution unit; threads are bundled into warps of 32 that execute in lockstep (any divergent branches in a warp serialise both paths — a major performance pitfall). Warps are grouped into blocks that share a fast scratchpad memory (~100 KB, called shared memory) and can synchronise with each other; blocks make up the grid that constitutes one kernel launch. The memory hierarchy is steep: registers are essentially free; shared memory is fast but small and per-block; L2 is grid-wide but moderate latency; global memory (HBM, High-Bandwidth Memory) is enormous but slow on a per-access basis (though with extremely high bandwidth — terabytes per second on modern GPUs). The art of writing fast CUDA kernels is keeping data in shared memory as long as possible and arranging global memory accesses so that an entire warp's threads access contiguous addresses (called coalesced access) — uncoalesced access is often the difference between fast and slow code.
The mathematics of how much speedup more cores can possibly give you.
Adding cores to a computation is not free, and it is not unbounded. Two laws govern what you can expect. Amdahl's law (Gene Amdahl, 1967) says that if a fraction p of a program is parallelisable and the rest is sequential, then the maximum speedup from N cores is 1 / ((1−p) + p/N). As N grows, the speedup approaches 1/(1−p) as a hard ceiling — even with infinite cores, a program whose sequential portion is 5% can never exceed a 20× speedup. The ceiling is brutal in practice: a program with 90% parallel code maxes out at 10×; with 95% parallel, 20×. Adding cores beyond the point where the parallel portion is fully exploited buys you nothing.
If a fraction p of a program parallelises perfectly and the remaining 1−p is strictly sequential, the speedup on N cores is:
S(N) = 1 / ((1 − p) + p / N)
The p/N term shrinks toward zero as you add cores; the (1−p) term does not move. So S(N) → 1 / (1 − p) as N → ∞. A program that is 95% parallel cannot exceed 20× speedup no matter how much hardware you buy. Gustafson's law (1988) inverts the question: if you grow the problem with the cores, the parallel portion grows while the sequential portion stays constant, and effective speedup becomes S(N) = N − (1 − p)·(N − 1) — nearly linear in N. Both laws are right; they describe different scenarios.
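A quick numerical sketch of both formulas, with p values chosen to match the text. amdahl() is the fixed-problem speedup S(N); gustafson() is the scaled-problem speedup; the printed ceiling is 1/(1−p).

#include <cstdio>

double amdahl(double p, double n)    { return 1.0 / ((1.0 - p) + p / n); }
double gustafson(double p, double n) { return n - (1.0 - p) * (n - 1.0); }

int main() {
    for (double p : {0.50, 0.90, 0.95, 0.99}) {
        std::printf("p = %.2f  ceiling = %6.1fx ", p, 1.0 / (1.0 - p));
        for (double n : {8.0, 64.0, 1024.0})
            std::printf("  S(%4.0f) = %6.2f", n, amdahl(p, n));
        std::printf("\n");
    }
    // Gustafson: grow the problem with the cores and the speedup stays near-linear.
    std::printf("Gustafson, p = 0.95: S(64) = %.1f\n", gustafson(0.95, 64));
}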
"The effort expended on achieving high parallel performance is wasted unless it is accompanied by achievements in sequential performance of very nearly the same magnitude."
— Gene Amdahl, "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities," AFIPS, 1967

Amdahl's law plotted for four different parallel fractions. A program with 90% parallel code (p=0.90) plateaus at 10× speedup no matter how many cores you throw at it; the sequential 10% becomes the dominant cost as N grows. With 95% parallel, the ceiling is 20×; with 99%, 100×; with 50%, only 2×. The implication is severe and counterintuitive: getting from 90% to 95% parallel is harder engineering than going from 8 cores to 64 cores. Most working programs hover around 80–95% parallel — which is why a 64-core machine very rarely runs an actual program 64× faster than a 1-core machine. Amdahl's law explains why CPUs have stopped at "tens of cores" for desktop and laptop loads while supercomputers and GPUs (where workloads are 99%+ parallel) keep adding more.
Gustafson's law (1988) appears to contradict Amdahl's, but actually answers a different question. Amdahl asks: "I have a fixed problem. How much faster can I make it with more cores?" Gustafson asks: "I have more cores. How much bigger a problem can I tackle in the same time?" If problem size grows with the available compute — more particles in a physics simulation, more pixels in a rendering, more samples in a Monte Carlo — the sequential portion stays roughly constant while the parallel portion scales, and effective speedup grows nearly linearly with cores. This is why GPU-driven workloads (training a bigger neural network on more GPUs) and HPC workloads (simulating a finer-grained physical model on more nodes) really do scale to thousands of cores: the problem grew with the resources.
Amdahl and Gustafson asked complementary questions and got opposite-looking answers. Amdahl: "I have a one-hour computation. Will doubling the cores let me run it in 30 minutes?" — usually no, because the sequential portion limits speedup. Gustafson: "I have a one-hour budget. Will doubling the cores let me run a problem twice as big?" — usually yes, because the parallel portion grows while the sequential portion stays roughly constant. Both are right; both apply; the choice of which to invoke depends on whether your problem size is fixed or scaling with hardware. The deep-learning revolution is a Gustafson story: bigger models trained on more GPUs in roughly the same wall-clock time. Real-time graphics is an Amdahl story: render the next frame in 16 ms, no matter how many cores you have.
Memory is not flat any more · cache coherence is the only reason it pretends to be.
Two more facts about modern multi-core machines deserve naming. First: memory is not uniformly accessible. On a server with two CPU sockets, each socket has its own local DRAM, and accessing memory attached to your own socket is dramatically faster than accessing memory attached to the other socket. This is NUMA — Non-Uniform Memory Access — and a thread that runs on socket 0 but accesses memory allocated on socket 1 pays a 50–100% latency penalty per access. Operating systems and runtimes try to keep threads near their data automatically; when they fail, performance silently halves. Second: caches must stay coherent across cores even though each core has its own L1 and L2. The mechanism is a hardware protocol called MESI — Modified, Exclusive, Shared, Invalid — that tracks the state of every cache line on every core and ensures that no two cores have conflicting writes to the same line.
A two-socket server is two computers sharing a coherence protocol. Each socket has its own DRAM controllers and its own local memory; accessing the other socket's memory traverses the inter-socket interconnect (Intel calls it UPI, AMD calls it Infinity Fabric) and pays substantially more latency. The OS scheduler tries to keep threads on the same socket as their data; allocators try to put new memory on the requesting thread's socket; runtimes (Java, Go, .NET) have NUMA-aware schedulers. When all of this works, you don't notice; when it fails, your program runs at half speed and looks otherwise correct. NUMA is also the architecture inside high-performance GPUs (each GPU is its own NUMA node) and across multi-GPU setups (NVLink interconnects between GPUs). The pattern repeats: parallelism brings speed, distance brings latency, and good engineering keeps the two aligned.
MESI is the protocol that lets every core have its own cache while pretending memory is globally consistent. Every cache line is in one of four states; every read or write is mediated by bus transactions that update the states across all cores. The protocol is invisible to software but pervasive in performance: when two threads on different cores write the same cache line repeatedly, the line "ping-pongs" between caches with each write, costing a hundred or more cycles per access. This is why false sharing — two unrelated variables that happen to share a 64-byte cache line — can silently halve a program's throughput, and why high-performance code pads structures to cache-line boundaries. The protocol predates multi-core CPUs (it was developed for cache-coherent multiprocessor systems in the 1980s) and survives essentially unchanged because the alternatives are worse. Variants exist (MOESI, MESIF — adding "owned" or "forward" states for optimisations) but the four-state core is universal.
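A minimal sketch of the false-sharing effect and the padding fix described above. The two threads never touch the same variable; in the unpadded struct the two counters are likely to share one 64-byte line and ping-pong between caches, while alignas(64) gives each its own line. The exact numbers are machine-dependent.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct Unpadded { std::atomic<long> a{0}, b{0}; };          // likely the same cache line
struct Padded   { alignas(64) std::atomic<long> a{0};
                  alignas(64) std::atomic<long> b{0}; };    // one line each

template <class Counters>
double hammer_ms(Counters& c) {
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (int i = 0; i < 20'000'000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (int i = 0; i < 20'000'000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join(); t2.join();
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - start).count();
}

int main() {
    Unpadded u;
    Padded   p;
    std::printf("sharing a cache line: %7.1f ms\n", hammer_ms(u));
    std::printf("padded to own lines:  %7.1f ms\n", hammer_ms(p));
}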
The cache as a side channel. In January 2018, Spectre and Meltdown showed that the speculative-execution machinery of every modern CPU since 1995 — the same machinery that gave us the speed described in Chapter 1 — could be coerced into reading memory across security boundaries. The trick: speculative loads pulled secret data into the cache; even though the speculation was rolled back at the architectural level, the cache state survived, and an attacker measuring access timings could read out the secret bit by bit. The fix required microcode patches, kernel page-table isolation (KPTI), and a permanent ~5–30% performance regression on syscall-heavy workloads. Cache coherence is not just a performance mechanism; it is also the most expensive security boundary in the machine.
Closing the chapter · seam to Chapter 17
Parallelism scales the single computer up: from one core to sixty-four, from one socket to two, from CPU to GPU. It runs against three constraints we have just named — Amdahl's ceiling on speedup, NUMA's penalty for non-local memory, and MESI's cache-line-ping-pong — and the discipline of high-performance code is the discipline of designing around all three. Above the single computer is the next category: many computers, not sharing memory, communicating over a network, having to reach agreement despite the network being unreliable and the computers being individually fallible. That is the problem of Chapter 17. The mathematics that governs it — distributed consensus, the CAP theorem — is one of the most consequential bodies of work in computer science.