Part Five

The Modern
Machine

Parts I through IV built one computer. A single CPU, with its kernel, speaking through one network stack to one TLS endpoint, holding bytes in one relational database, defended by a layered stack we can now name. Part V steps back. The machine that runs nearly all code in 2026 is not a computer at all but a great many computers — sixty-four cores in your laptop, ten thousand in a cloud region, a planetary mesh tied together by software that must reach agreement despite delay and failure. Part V is about that scale-up, mathematically and physically. Then Chapter 18 closes the book.

CHAPTER 16
Parallelism — When One Core Isn't Enough
Threads vs processes. Race conditions and the locks that prevent them. Lock-free programming and the hard math of memory ordering. GPU architecture. Amdahl's law and Gustafson's law. NUMA, MESI, the modern hardware reality.
CHAPTER 17
The Cloud — Distributed Systems at Scale
Virtualization. Containers revisited. Kubernetes and the control loop. Distributed consensus — Paxos, Raft, the FLP impossibility. The CAP theorem. Microservices, serverless, and the architectural pendulum.
CHAPTER 18
Epilogue — What You Now See
The closing synthesis. The whole arc in one view. The computer as a mathematical object. From sand to civilization. What's permanent and what isn't. Where computer science intersects with everything else. The reading list. The final note.
Chapter 16

Parallelism
When One Core
Isn't Enough

For most of computing's history, faster software meant faster clocks. Around 2004 the clock speed of consumer CPUs stopped rising — heat density made it impossible — and the industry pivoted to putting many cores on the same chip and asking software to use them. Twenty years later, every phone has six to ten cores, every laptop has eight to sixteen, and every cloud server has sixty-four or more. Writing code that uses them correctly is one of the deepest skills in programming. The mathematics underneath it is concurrency theory; the engineering is forty years of careful primitive-building; and the failure modes are some of the most frustrating bugs ever produced.

Topics — Threads · races · CAS · GPU · Amdahl · NUMA · MESI
Era covered — ~1965 → present
01 — Threads vs processes

The two ways to make your program use more than one core.

By 1990, the kernel had given UNIX programmers a clean answer to "how do I run two things at once": fork. The child process ran in parallel with the parent, isolated from it, with its own address space, its own file descriptors, its own everything (Chapter 4). The isolation was exactly the problem when the two processes wanted to cooperate. A parent that wanted to share a million-element array with its child had to copy it across the kernel boundary every time, paying for privacy it did not want. POSIX 1003.1c, ratified in 1995, codified the alternative: the thread — a separate stream of execution that shares the address space of its parent process, can be scheduled onto a different core, and can read and write the same memory directly. Threads are the cheap, fast, dangerous answer to "how do I use more than one core"; processes are the slow, safe one. The next thirty years of concurrent programming are essentially about the price of the danger.

Fig 16.1 — Threads vs processes · the same parallelism, two isolation models: shared memory and shared peril vs separate everything and copy costs.

Threads share the heap, the globals, the file descriptor table, and most other process state — only the call stack and the registers are per-thread. Processes share none of this; each lives behind its own page table, with the kernel mediating any communication. The trade is real and consequential: threads make IPC free (just write to a variable; the other thread can read it) but make every concurrent access a potential bug. Processes make every interaction a system call but make memory corruption between them impossible. Modern programming languages reach for both: C, C++, Java, Rust default to threads; Python's multiprocessing reaches for processes (because the GIL — Chapter 7 §04 — makes threads less useful); Erlang and Go invented their own lightweight thread-like primitives (BEAM processes, goroutines) that get the best of both.
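
To make the trade concrete, here is a minimal C++ sketch (illustrative only; names and sizes are invented, and it assumes C++17): four threads fill disjoint quarters of one shared vector. Nothing is copied and no IPC happens; the data is simply visible to everyone. A fork()-based version of the same job would need shared memory or a pipe to collect the results.

C++ — four threads, one shared vector
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data(1'000'000);
    const std::size_t chunk = data.size() / 4;

    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([&data, chunk, t] {
            for (std::size_t i = t * chunk; i < (t + 1) * chunk; ++i)
                data[i] = static_cast<int>(i);   // disjoint ranges: no race, no lock needed
        });
    for (auto& w : workers) w.join();

    long long sum = std::accumulate(data.begin(), data.end(), 0LL);
    // with processes, getting `data` back to the parent would mean mmap/shm or a pipe;
    // with threads, it was never anywhere else
    (void)sum;
}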

Where parallelism actually helps

Not every program benefits from more cores. Programs whose work naturally divides into independent chunks — image filters processing pixels, web servers handling unrelated requests, Monte Carlo simulations running independent samples — are called embarrassingly parallel and scale almost linearly with core count. Programs whose steps depend on each other — most stateful computations, anything sequential — benefit much less, or not at all. The mathematics of this constraint is Amdahl's law, which we get to in §05; for now the practical observation is that adding cores is not a substitute for serial speed. A 64-core laptop runs a pure-sequential program no faster than a 1-core one with the same clock speed.

02 — Race conditions and locks

Two threads · one variable · everything depends on luck.

The price of shared memory shows up the first time two threads modify the same variable concurrently. The classic example is so simple it sounds harmless: two threads each incrementing a shared counter a million times. The expected result is two million. The actual result, on any real machine, is somewhere between one million and two million, with the exact value depending on scheduling, cache coherence, and luck. The increment that looks atomic in source code (counter++) is in fact three operations underneath: read the value into a register, add one, write it back. If two threads' three-step sequences overlap, the writes can stomp on each other.

Fig 16.2 — The classic race · two threads, one counter, lost updates: "counter++" is three machine instructions, and interleaving them produces lost updates.

The reference race condition. Both threads execute counter++ when the counter is 100. Thread 1 reads 100, adds 1, plans to store 101. Before Thread 1 stores, Thread 2 also reads 100 (the not-yet-updated value), adds 1, plans to store 101. Both stores happen; the counter ends at 101 instead of 102. Thread 2's increment vanished. Multiply this across a million iterations and the expected result of 2,000,000 routinely comes out as 1,000,000 + ε for some small ε. The bug is invisible in source code, depends entirely on scheduling, fails to reproduce reliably, and is one of the most frustrating classes of error in software. Decades of language design — atomics, locks, channels, Rust's borrow checker — exist to prevent this category specifically.

The mutex · serialise the critical section

The classic fix is a mutex (mutual exclusion lock). Each thread, before touching the shared counter, calls mutex.lock(); after touching it, calls mutex.unlock(). Only one thread can hold the lock at a time — any other thread that calls lock() while it is held blocks until the holder releases. The protected region (between lock and unlock) is called the critical section, and the rule is that all critical sections that touch the same data are serialised. The mutex itself is implemented using atomic CPU instructions (compare-and-swap or load-link/store-conditional) — operations the CPU guarantees are indivisible at the hardware level. Underneath every thread-safe data structure is a chain of these primitives.

C++ — the bug and the fix
// the race — counter ends somewhere between 1,000,000 and 2,000,000
int counter = 0;
void worker() {
    for (int i = 0; i < 1'000'000; ++i)
        ++counter;                  // LOAD; ADD; STORE — three instructions
}

// the fix — std::mutex serialises the critical section
std::mutex m;
int counter = 0;
void worker() {
    for (int i = 0; i < 1'000'000; ++i) {
        std::lock_guard<std::mutex> g(m);
        ++counter;                  // only one thread holds the lock at a time
    }                               // g's destructor releases the lock — RAII
}

// or, for this specific case, an atomic — no lock at all, single CPU instruction
std::atomic<int> counter = 0;
void worker() {
    for (int i = 0; i < 1'000'000; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);
}

Deadlock · the trap when locks compose

Locks solve the race; they introduce a new failure mode. Deadlock happens when two or more threads each hold a lock the other one needs, and both wait forever. The canonical recipe: thread A holds lock L1 and tries to acquire L2; thread B holds L2 and tries to acquire L1. Each is blocked waiting for the other; neither can make progress; the program freezes. Deadlock is mathematically characterised by the existence of a cycle in the "wait-for" graph: an arrow from each waiting thread to the thread whose lock it wants. If the graph has a cycle, there is deadlock; if it has no cycle, there isn't. The standard prevention is lock ordering — every thread acquires locks in the same global order — which ensures the wait-for graph remains acyclic.

Fig 16.3 — Mutex protects · deadlock is the new failure mode: a cycle in the wait-for graph means everyone waits forever.

The mutex closes the race, but introduces deadlock as a new failure mode. Two threads, two locks, each holding what the other needs — both block forever. The same shape with spinning threads (each repeatedly trying and failing rather than blocking) is called livelock — equally fatal, harder to debug because the program looks busy. The standard prevention is lock ordering: define a global order over all locks, and require every thread to acquire them in that order. The wait-for graph can then only go forward, never form a cycle. The discipline is real: large codebases (Linux, PostgreSQL, V8) document their lock hierarchies, and acquiring locks out of order is a code-review red flag. Above the mutex, every higher-level concurrency primitive — semaphores, condition variables, read-write locks, channels — is built from the same atomic foundation, with the same potential for these three bugs.
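
Some of the discipline can be delegated to the library. In this minimal C++ sketch (function and mutex names are invented for illustration), the first two functions acquire the same two mutexes in opposite orders and can deadlock against each other; the third hands both mutexes to std::scoped_lock, which acquires them with a deadlock-avoidance algorithm, so the programmer no longer has to maintain the global order by hand.

C++ — two locks, acquired safely
#include <mutex>

std::mutex accounts_lock, audit_lock;

void risky_a() {                                      // accounts_lock, then audit_lock
    std::lock_guard<std::mutex> g1(accounts_lock);
    std::lock_guard<std::mutex> g2(audit_lock);
}
void risky_b() {                                      // audit_lock, then accounts_lock
    std::lock_guard<std::mutex> g1(audit_lock);
    std::lock_guard<std::mutex> g2(accounts_lock);    // risky_a and risky_b can deadlock against each other
}
void safe() {
    std::scoped_lock both(accounts_lock, audit_lock); // acquires both without deadlock (C++17)
}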

Therac-25, 1985–87. The first software-caused medical fatalities in history were race conditions. The Therac-25 radiation therapy machine used a single computer to control beam intensity and targeting, with no hardware interlock — its predecessors had hardware interlocks, but the engineers had concluded the software was reliable enough to remove them. A specific keystroke sequence, fast enough to interleave with the beam-mode-select state machine, could leave the machine in high-power X-ray mode while the targeting magnet was set for low-power electron mode. Six patients received radiation overdoses of up to a hundred times the prescribed dose; three died. The lesson is older than the field: any concurrent state machine without explicit synchronisation will eventually be observed in every reachable state, including the states the designer assumed were impossible.

03 — Lock-free programming

Concurrency without locks · the deeper rabbit hole.

Locks have a fundamental cost beyond their nanosecond overhead: they serialise. A queue protected by a single mutex can be used by one thread at a time regardless of how many cores are available. In high-performance contexts — kernels, databases, network stacks, JavaScript engines, garbage collectors — engineers reach for lock-free data structures: queues, stacks, hash tables built so that multiple threads can read and write concurrently without ever blocking each other. The mechanism is the same atomic CPU instruction that mutexes are built on, used directly: compare-and-swap (CAS). With CAS, a thread can perform an "if the value is still x, change it to y; otherwise tell me what it is now" operation as a single indivisible step. Built carefully, CAS-based data structures can be wait-free (every thread always makes progress) or lock-free (the system as a whole always makes progress, even if individual threads can starve). Both terms come from Maurice Herlihy's 1991 paper Wait-Free Synchronization, which proved a hierarchy of exactly which data structures can be implemented wait-free on which atomic primitives — the result that turned an engineering practice into a mathematical theory. The cost is that the code is genuinely difficult to write correctly, and reasoning about it requires understanding memory orderings — the rules by which writes from one core become visible to others.

Fig 16.4 — Compare-and-swap · the atomic primitive everything else is built on: "if the value is still what I expect, replace it; otherwise tell me what it is" — atomically.

Compare-and-swap is the foundation primitive of all modern multi-core synchronisation. The CPU provides it as a single atomic instruction (cmpxchg on x86; the ldxr/stxr load-linked/store-conditional pair on ARM); higher-level languages expose it as std::atomic's compare_exchange_strong and compare_exchange_weak in C++, AtomicInteger.compareAndSet in Java, and the CompareAndSwap functions of Go's sync/atomic package. The lock-free counter shown reads the current value, computes the increment, and tries to atomically swap the new value in only if the old value is still there. If another thread got there first, the CAS fails and the loop re-reads and retries. No locks, no blocking, no deadlock — but every operation can race-lose and retry. Lock-free queues, stacks, and hash tables are built on top of this primitive, with great difficulty.
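
In real C++ the retry loop looks like this; a minimal sketch of the figure's pseudocode, using compare_exchange_weak, which conveniently reloads the expected value on failure.

C++ — the CAS retry loop
#include <atomic>

std::atomic<int> counter{0};

void increment() {
    int old = counter.load(std::memory_order_relaxed);
    // try to swap in old+1; on failure, compare_exchange_weak writes the
    // current value back into `old`, so the loop simply tries again
    while (!counter.compare_exchange_weak(old, old + 1,
                                          std::memory_order_relaxed))
        ;   // nothing to do: `old` already holds the fresh value
}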

Memory orderings · why the world is not sequentially consistent

The deeper subtlety is that on modern multi-core CPUs, the order in which one core's writes become visible to another core is not the order they were issued. Out-of-order execution (Chapter 1 §04), store buffers, and cache coherence all conspire to let writes appear reordered from another core's perspective. A thread that writes x = 1; flag = true; may have another thread on a different core observe flag = true while still reading x = 0 — the writes were visible out of order. Programs that depend on cross-thread ordering must explicitly request guarantees by annotating atomic operations with a memory ordering: relaxed (no ordering), acquire (subsequent loads see prior writes by the releaser), release (this write becomes visible after preceding writes on this thread), or sequentially consistent (the strongest, most expensive, easiest to reason about). Reasoning about which ordering each operation needs is the deepest skill in multi-threaded systems programming, and the one most often gotten wrong.

Fig 16.5 — Memory orderings · four levels, four guarantees, four costs: stricter ordering is easier to reason about, weaker ordering is faster on hardware.

Memory ordering is the second hard problem in lock-free programming, after CAS. Modern multi-core CPUs reorder reads and writes within a single core for performance — out-of-order execution, store buffers, prefetching — and these reorderings are visible across cores. Languages like C++, Rust, and Java let the programmer specify, per atomic operation, what ordering the operation needs to provide. Stricter orderings (sequentially consistent at the top) are easier to reason about but cost performance on weak-memory-model hardware (ARM, POWER, RISC-V); weaker orderings (relaxed) are faster but require careful argument. The mismatch between hardware models is itself a portability hazard: code that depends on x86's nearly-sequential-consistency can fail on ARM. Java and Go default to sequentially consistent atomics for safety; C++, Rust, and modern kernel code use the weaker orderings deliberately for performance, accepting the reasoning burden.
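
The flag example from above, written out as a minimal C++ sketch (names are illustrative): the release store on flag guarantees that the earlier plain write to data is visible to any thread whose acquire load sees flag as true. Drop the release/acquire pair and there is no happens-before edge; on weakly ordered hardware the consumer could see the flag set while data still reads 0, and the program would contain a data race.

C++ — release/acquire pairing
#include <atomic>
#include <cassert>
#include <thread>

int data = 0;                                        // ordinary, non-atomic payload
std::atomic<bool> flag{false};

void producer() {
    data = 42;                                       // (1) plain write
    flag.store(true, std::memory_order_release);     // (2) publishes (1) to acquirers
}

void consumer() {
    while (!flag.load(std::memory_order_acquire))    // spin until the flag is set
        ;
    assert(data == 42);                              // guaranteed by the release/acquire pair
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join(); t2.join();
}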

04 — GPU architecture

The other shape of parallelism · thousands of small cores doing the same thing.

The CPU we have built — out-of-order execution, branch prediction, caches, all the machinery of Chapter 1 — is optimised for fast sequential execution: do one thing, do it as fast as possible, then move to the next. The GPU is the opposite. It has thousands of small cores running in lockstep, each executing the same instruction on different data. It evolved from the requirements of 3D graphics: every pixel on a million-pixel screen needs the same shading calculation applied to different inputs, and the natural way to do that is many small processors running the same code on different pixels. Around 2007, NVIDIA realised that this hardware was useful for any embarrassingly parallel computation, exposed it through the CUDA programming model, and watched a new computational substrate emerge. Today GPUs do most of the world's machine learning training and inference, most of its scientific simulation, all of its real-time graphics. They are a fundamentally different shape of computer, and software that uses them well looks very different from software that uses CPUs well.

Fig 16.6 — CPU vs GPU · few clever cores vs thousands of simple ones: two completely different ways to spend a transistor budget.

A modern CPU spends most of its transistor budget on making one stream of instructions go fast: out-of-order execution, branch prediction, multi-megabyte caches, sophisticated SIMD units. Eight to sixteen of these expensive cores fit on a typical chip. A modern GPU spends its budget on doing the same thing to many things at once: thousands of small cores, very little branch-prediction logic, no out-of-order execution, much smaller caches. The two architectures are good at different problems. CPUs win on branchy, sequential, low-latency code (most application logic, operating system kernels, web servers). GPUs win on workloads that apply the same operation to a large array of data — graphics shading, neural network training and inference, scientific simulations, video transcoding. The deep-learning revolution of 2012 onward is, mechanically, the realisation that backpropagation through a neural network is one giant matrix multiply, and matrix multiplies are exactly what GPUs are built for.

SIMD and the CUDA hierarchy

The GPU's organising principle is SIMD — Single Instruction, Multiple Data. A SIMD instruction operates on a vector of values at once: add these 32 numbers to those 32 numbers in one instruction. CPUs have SIMD too (SSE, AVX on x86; NEON on ARM) — typically 128 or 256 bits wide, processing 4 or 8 values at once. GPUs take this much further: a CUDA warp is 32 threads running in lockstep, executing the same instruction on different data. Many warps are grouped into a block; many blocks make a grid; the entire grid is dispatched to the GPU as a single kernel launch. The programmer writes code that looks like it runs on a single thread; the runtime replicates it across the grid.

Fig 16.7 — The CUDA hierarchy · thread → warp → block → grid: millions of threads in a single kernel launch, organised in a four-level hierarchy over a steep memory hierarchy (registers, per-block shared memory, L2, global HBM).

CUDA's four-level hierarchy maps directly onto the GPU's silicon. A thread is the smallest execution unit; threads are bundled into warps of 32 that execute in lockstep (any divergent branches in a warp serialise both paths — a major performance pitfall). Warps are grouped into blocks that share a fast scratchpad memory (~100 KB, called shared memory) and can synchronise with each other; blocks make up the grid that constitutes one kernel launch. The memory hierarchy is steep: registers are essentially free; shared memory is fast but small and per-block; L2 is grid-wide but moderate latency; global memory (HBM, High-Bandwidth Memory) is enormous but slow on a per-access basis (though with extremely high bandwidth — terabytes per second on modern GPUs). The art of writing fast CUDA kernels is keeping data in shared memory as long as possible and arranging global memory accesses so that an entire warp's threads access contiguous addresses (called coalesced access) — uncoalesced access is often the difference between fast and slow code.
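
A minimal CUDA C++ sketch of the hierarchy in action (kernel name and sizes are illustrative): every thread computes one element of a vector sum, the block and grid dimensions tile the array, and the index guard handles the overhang in the final block.

CUDA C++ — one kernel launch, a million threads
#include <cuda_runtime.h>

__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's global index
    if (i < n) c[i] = a[i] + b[i];                   // the last block may run past the end
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));        // unified memory, for brevity
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;                               // 8 warps per block
    int blocks = (n + threads - 1) / threads;        // enough blocks to cover n
    vec_add<<<blocks, threads>>>(a, b, c, n);        // the whole grid, one launch
    cudaDeviceSynchronize();
}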

05 — Amdahl & Gustafson

The mathematics of how much speedup more cores can possibly give you.

Adding cores to a computation is not free, and it is not unbounded. Two laws govern what you can expect. Amdahl's law (Gene Amdahl, 1967) says that if a fraction p of a program is parallelisable and the rest is sequential, then the maximum speedup from N cores is 1 / ((1−p) + p/N). As N grows, the speedup approaches 1/(1−p) as a hard ceiling — even with infinite cores, the sequential 5% of the program limits speedup to 20×. The ceiling is brutal in practice: a program with 90% parallel code maxes out at 10×; with 95% parallel, 20×. Adding cores beyond the point where the parallel portion is fully exploited buys you nothing.

The ceiling, in one equation

If a fraction p of a program parallelises perfectly and the remaining 1−p is strictly sequential, the speedup on N cores is:

S(N) = 1 / ((1 − p) + p / N)

The p/N term shrinks toward zero as you add cores; the (1−p) term does not move. So S(N) → 1 / (1 − p) as N → ∞. A program that is 95% parallel cannot exceed 20× speedup no matter how much hardware you buy. Gustafson's law (1988) inverts the question: if you grow the problem with the cores, the parallel portion grows while the sequential portion stays constant, and effective speedup becomes S(N) = N − (1 − p)·(N − 1) — nearly linear in N. Both laws are right; they describe different scenarios.
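
The two formulas, evaluated side by side for p = 0.95. A small C++ sketch, nothing more than the arithmetic above:

C++ — Amdahl and Gustafson at p = 0.95
#include <cstdio>

double amdahl(double p, double n)    { return 1.0 / ((1.0 - p) + p / n); }
double gustafson(double p, double n) { return n - (1.0 - p) * (n - 1.0); }

int main() {
    for (double n : {8.0, 64.0, 1024.0})
        std::printf("N = %4.0f   Amdahl %6.2fx   Gustafson %7.2fx\n",
                    n, amdahl(0.95, n), gustafson(0.95, n));
    // N =    8   Amdahl   5.93x   Gustafson    7.65x
    // N =   64   Amdahl  15.42x   Gustafson   60.85x
    // N = 1024   Amdahl  19.64x   Gustafson  972.85x
}

Even at a thousand cores the fixed-size problem stays under its 20× ceiling; the scaled problem keeps climbing.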

"The effort expended on achieving high parallel performance is wasted unless it is accompanied by achievements in sequential performance of very nearly the same magnitude."

— Gene Amdahl, "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities," AFIPS, 1967
Fig 16.8 — Amdahl's law · the ceiling that won't move: speedup = 1 / ((1−p) + p/N), plotted for p = 0.50, 0.90, 0.95, 0.99, with ceilings of 2×, 10×, 20×, and 100×.

Amdahl's law plotted for four different parallel fractions. A program with 90% parallel code (p=0.90) plateaus at 10× speedup no matter how many cores you throw at it; the sequential 10% becomes the dominant cost as N grows. With 95% parallel, the ceiling is 20×; with 99%, 100×; with 50%, only 2×. The implication is severe and counterintuitive: getting from 90% to 95% parallel is harder engineering than going from 8 cores to 64 cores. Most working programs hover around 80–95% parallel — which is why a 64-core machine very rarely runs an actual program 64× faster than a 1-core machine. Amdahl's law explains why CPUs have stopped at "tens of cores" for desktop and laptop loads while supercomputers and GPUs (where workloads are 99%+ parallel) keep adding more.

Gustafson's law (1988) appears to contradict Amdahl's, but actually answers a different question. Amdahl asks: "I have a fixed problem. How much faster can I make it with more cores?" Gustafson asks: "I have more cores. How much bigger a problem can I tackle in the same time?" If problem size grows with the available compute — more particles in a physics simulation, more pixels in a rendering, more samples in a Monte Carlo — the sequential portion stays roughly constant while the parallel portion scales, and effective speedup grows nearly linearly with cores. This is why GPU-driven workloads (training a bigger neural network on more GPUs) and HPC workloads (simulating a finer-grained physical model on more nodes) really do scale to thousands of cores: the problem grew with the resources.

Fig 16.9 — Amdahl vs Gustafson · same hardware, different question: two laws, two questions, two completely different practical implications.

Amdahl and Gustafson asked complementary questions and got opposite-looking answers. Amdahl: "I have a one-hour computation. Will doubling the cores let me run it in 30 minutes?" — usually no, because the sequential portion limits speedup. Gustafson: "I have a one-hour budget. Will doubling the cores let me run a problem twice as big?" — usually yes, because the parallel portion grows while the sequential portion stays roughly constant. Both are right; both apply; the choice of which to invoke depends on whether your problem size is fixed or scaling with hardware. The deep-learning revolution is a Gustafson story: bigger models trained on more GPUs in roughly the same wall-clock time. Real-time graphics is an Amdahl story: render the next frame in 16 ms, no matter how many cores you have.

06 — NUMA & MESI · the modern hardware reality

Memory is not flat any more · cache coherence is the only reason it pretends to be.

Two more facts about modern multi-core machines deserve naming. First: memory is not uniformly accessible. On a server with two CPU sockets, each socket has its own local DRAM, and accessing memory attached to your own socket is dramatically faster than accessing memory attached to the other socket. This is NUMA — Non-Uniform Memory Access — and a thread that runs on socket 0 but accesses memory allocated on socket 1 pays a 50–100% latency penalty per access. Operating systems and runtimes try to keep threads near their data automatically; when they fail, performance silently halves. Second: caches must stay coherent across cores even though each core has its own L1 and L2. The mechanism is a hardware protocol called MESI — Modified, Exclusive, Shared, Invalid — that tracks the state of every cache line on every core and ensures that no two cores have conflicting writes to the same line.

Fig 16.10 — NUMA · two sockets, two memories, asymmetric latency: a thread on socket 0 reading socket-1 memory pays a real, measurable, surprising tax (roughly 80 ns local vs 140 ns remote).

A two-socket server is two computers sharing a coherence protocol. Each socket has its own DRAM controllers and its own local memory; accessing the other socket's memory traverses the inter-socket interconnect (Intel calls it UPI, AMD calls it Infinity Fabric) and pays substantially more latency. The OS scheduler tries to keep threads on the same socket as their data; allocators try to put new memory on the requesting thread's socket; runtimes (Java, Go, .NET) have NUMA-aware schedulers. When all of this works, you don't notice; when it fails, your program runs at half speed and looks otherwise correct. NUMA is also the architecture inside high-performance GPUs (each GPU is its own NUMA node) and across multi-GPU setups (NVLink interconnects between GPUs). The pattern repeats: parallelism brings speed, distance brings latency, and good engineering keeps the two aligned.

Fig 16.11 — MESI · how four cores keep one cache line consistent: every cache line on every core is in one of four states (Modified, Exclusive, Shared, Invalid), with transitions tracked in hardware.

MESI is the protocol that lets every core have its own cache while pretending memory is globally consistent. Every cache line is in one of four states; every read or write is mediated by bus transactions that update the states across all cores. The protocol is invisible to software but pervasive in performance: when two threads on different cores write the same cache line repeatedly, the line "ping-pongs" between caches with each write, costing a hundred or more cycles per access. This is why false sharing — two unrelated variables that happen to share a 64-byte cache line — can silently halve a program's throughput, and why high-performance code pads structures to cache-line boundaries. The protocol predates multi-core CPUs (it was developed for cache-coherent multiprocessor systems in the 1980s) and survives essentially unchanged because the alternatives are worse. Variants exist (MOESI, MESIF — adding "owned" or "forward" states for optimisations) but the four-state core is universal.
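
False sharing is easy to demonstrate and easy to fix. A minimal C++ sketch (struct and counter names invented for illustration): in the first struct the two counters typically land on the same 64-byte line and ping-pong between cores; aligning each to its own line removes the contention without changing a single line of the threads' logic.

C++ — false sharing and the padding fix
#include <atomic>
#include <thread>

struct Hot {                                    // both counters usually share one cache line
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

struct Padded {                                 // one cache line each
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
void hammer(Counters& c) {
    std::thread t1([&] { for (int i = 0; i < 10'000'000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (int i = 0; i < 10'000'000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join(); t2.join();                       // Padded typically runs several times faster than Hot
}

int main() {
    Hot hot;        hammer(hot);
    Padded padded;  hammer(padded);
}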

🔐

The cache as a side channel. In January 2018, Spectre and Meltdown showed that the speculative-execution machinery of every modern CPU since 1995 — the same machinery that gave us the speed described in Chapter 1 — could be coerced into reading memory across security boundaries. The trick: speculative loads pulled secret data into the cache; even though the speculation was rolled back at the architectural level, the cache state survived, and an attacker measuring access timings could read out the secret bit by bit. The fix required microcode patches, kernel page-table isolation (KPTI), and a permanent ~5–30% performance regression on syscall-heavy workloads. Cache coherence is not just a performance mechanism; it is also the most expensive security boundary in the machine.

Closing the chapter · seam to Chapter 17

Parallelism scales the single computer up: from one core to sixty-four, from one socket to two, from CPU to GPU. It runs against three constraints we have just named — Amdahl's ceiling on speedup, NUMA's penalty for non-local memory, and MESI's cache-line-ping-pong — and the discipline of high-performance code is the discipline of designing around all three. Above the single computer is the next category: many computers, not sharing memory, communicating over a network, having to reach agreement despite the network being unreliable and the computers being individually fallible. That is the problem of Chapter 17. The mathematics that governs it — distributed consensus, the CAP theorem — is one of the most consequential bodies of work in computer science.

Chapter 17

The Cloud
Distributed Systems
at Scale

Most code in 2026 runs in someone else's datacenter, on hardware its developers will never touch, alongside thousands of other programs they will never know about, coordinating across a network that is unreliable in ways that look like physics. The cloud is the substrate of nearly every web service, every mobile backend, every AI training run, every database larger than a laptop. It is built from a handful of layered abstractions — virtualization, containers, orchestration, consensus protocols, service meshes — each of which solves a problem the layer below could not. This chapter walks through them in order.

Topics — Hypervisor · Docker · Kubernetes · Paxos/Raft · CAP · microservices
Era covered — ~1965 → present
01 — Virtualization

One physical machine · pretending to be many.

The cloud's foundational abstraction is the virtual machine: software that emulates a complete computer — CPU, memory, disks, network — well enough that an unmodified operating system runs inside it without knowing it isn't running on real hardware. The technology dates to IBM's CP/CMS in 1968: a hypervisor on the System/360 that gave each user an isolated virtual machine running CMS, many of them sharing one expensive mainframe. The x86 world rediscovered virtualization in the late 1990s (VMware founded 1998), and CPU-level support arrived with Intel VT-x in 2005 and AMD-V in 2006 — special instruction-set extensions that let a hypervisor run guest operating systems at near-native speed. Public clouds — AWS launched EC2 in 2006 — took the technology and turned it into a commodity: pay by the hour, get an isolated VM in any datacenter on Earth, billed to the second.

Fig 17.1 — Type 1 vs Type 2 hypervisors · bare metal vs hosted: two ways to slot a hypervisor into a stack.

Two hypervisor architectures. Type 1 ("bare metal") sits directly on the hardware with no host OS underneath; KVM, ESXi, Xen, and Microsoft's Hyper-V are all type 1, and this is what every public cloud uses. Type 2 ("hosted") runs as an application inside a normal operating system; VMware Workstation, VirtualBox, and Parallels are type 2, used for desktop development. Both types rely on hardware extensions (Intel VT-x, AMD-V) that introduce a privilege level below Ring 0 — sometimes called "Ring -1" — so a guest OS can run at Ring 0 thinking it controls the machine while the hypervisor below intercepts privileged operations and emulates them. Without VT-x, virtualization required heroic software tricks (binary translation, paravirtualization) and ran an order of magnitude slower; with VT-x, virtualization overhead is typically under 5%, and the entire public cloud became economically possible.

The economic consequence is the on-demand, elastic, pay-per-use cloud. Before EC2, deploying a web service required buying or leasing physical servers, racking them in a colocation facility, and waiting weeks for capacity. After EC2 (and its successors), you ran a single API call and got a fresh VM in 60 seconds, billed by the second, in any of dozens of regions worldwide. The capability transformed startup economics — Airbnb and Dropbox and Pinterest could each scale from zero to millions of users without ever owning a server — and changed how the rest of the industry thought about infrastructure. The hypervisor underneath all of this is invisible to the customer; the abstraction it provides is what they pay for.

02 — Containers revisited

Lighter than VMs · faster to start · the deployment unit of the modern web.

Chapter 4 §06 introduced containers as kernel-level isolation built on namespaces and cgroups. The economics that followed turned containers into the dominant deployment unit of the 2020s. Where a VM takes minutes to boot, megabytes to gigabytes of memory, and contains an entire operating system, a container starts in milliseconds, weighs a few megabytes, and shares the host's kernel. Docker — released 2013, built initially on LXC, a userspace wrapper around the kernel's namespaces and cgroups — gave containers a usable interface and a portable, layered image format. Running docker run nginx downloads a layered image, creates the namespaces, configures cgroups, and starts the process; the whole thing takes a second. Compared to launching a VM, that's a hundredfold improvement in start time and a tenfold improvement in density. Modern Kubernetes clusters routinely run tens of thousands of containers per node-group; hyperscale operators launch billions of containers every week.

Fig 17.2 — Docker image layers · copy-on-write filesystems for software distribution: each instruction in a Dockerfile becomes one immutable, content-addressed layer, shared between images that don't change it.

A Docker image is a stack of immutable, content-addressed layers — each layer is the filesystem diff produced by one instruction in the Dockerfile, identified by the SHA-256 hash of its contents. The runtime presents the layers as a unified read-only filesystem to the container, with a thin writable layer on top. The economics are powerful: a host running 100 microservices that all use the same Ubuntu base layer stores the base layer's 80 MB once, not 100 times. Pushing a new version of an app means uploading only the changed layer, not the entire image. Container registries (Docker Hub, GitHub Container Registry, Amazon ECR) deduplicate aggressively, and a typical CI/CD pipeline pushes only kilobytes per code change. The mechanism — layered immutable content-addressed storage — is borrowed from Git's object model and from the copy-on-write filesystems of Chapter 4 §04. The pattern is recurring in modern systems engineering: address things by their content hash; share by reference; never rewrite an immutable thing.

⚙️

The MicroVM convergence. The trade between VM isolation and container speed has been narrowing. AWS Firecracker (2018) and Kata Containers (2018) build "microVMs" — extremely lightweight VMs that boot in around 100 ms, weigh a few megabytes, and provide near-VM isolation while preserving most of containers' speed and density; Google's gVisor (2018) attacks the same gap from the other side, interposing a user-space kernel between the container and the host. AWS Lambda runs every function instance inside a dedicated Firecracker microVM. Modern serverless platforms have effectively pushed past the container/VM dichotomy by making both fast enough that the practical difference dissolves. The deeper trajectory: as hardware virtualization extensions get cheaper and faster, the security argument for VMs over containers gets stronger, and the speed argument against VMs gets weaker. Expect the boundary between the two to keep eroding.

03 — Kubernetes & the control loop

The orchestrator that took over deployment.

A handful of containers on one host is easy. A thousand containers across a hundred hosts, with health checks, automatic restarts, rolling updates, traffic routing, secret management, and storage attachment, is not. The system that solved this problem at scale — and won the orchestration wars in the late 2010s — is Kubernetes, originally built at Google (where its ancestor, Borg, had been running internally since 2003), open-sourced in 2014, and now the dominant deployment substrate for cloud-native software. Kubernetes is not a single program; it is a set of cooperating control-plane components and a fleet of worker nodes, glued together by a deceptively simple architectural idea — the control loop. The user declares what the system should look like; the controllers continuously observe what it does look like and reconcile the difference.

Fig 17.3 — Kubernetes architecture · control plane + workers: a coordinated set of control-plane components (API server, etcd, scheduler, controller manager) managing pods across worker nodes (kubelet, kube-proxy, container runtime).

Kubernetes' five major components, divided between control plane and workers. The API server is the only entrypoint — every other component talks to it, never to each other directly. etcd (a distributed KV store that runs Raft consensus internally) is the source of truth: every desired-state object lives there. The scheduler watches for unscheduled pods and assigns them to nodes based on resources and constraints. The controller manager runs the reconcile loops for every kind of object — Deployments, ReplicaSets, Services, etc. On worker nodes, the kubelet reports node state up and pulls pod assignments down; kube-proxy programs the network rules for service discovery; the container runtime (containerd, CRI-O) actually runs the pods. The whole architecture is decentralised — no single component is in charge — and built around watch-and-reconcile rather than command-and-respond.

The control loop · the deepest pattern in Kubernetes

The heart of Kubernetes is not its components but its architectural pattern. Every controller — every piece of logic in Kubernetes that maintains some kind of object — is a reconcile loop: a tight loop that observes the current state of the world, compares it to the desired state, and takes whichever action will move current closer to desired. Run forever. The loop is idempotent: if it runs twice without anything changing, the second run does nothing. The loop is robust: if a command fails, the loop runs again, observes that the change still hasn't happened, and tries again. The loop is stateless: it does not remember what it did last time; it only looks at what is versus what should be. This pattern — declarative desired state, observed actual state, continuous reconciliation — is borrowed from control theory (Chapter 10's TCP congestion control is the same shape) and is the architectural genus of all modern infrastructure-as-code: Terraform, Pulumi, Argo CD, everything.

Fig 17.4 — The reconcile loop · observe, diff, act, repeat: three boxes, two arrows, the architectural genus of infrastructure-as-code.

The reconcile loop is Kubernetes' deepest architectural idea, and it generalises far beyond container orchestration. Define what the system should look like (the desired state) as data. Observe what the system actually looks like (the actual state) as data. Run a tight loop that reads both and takes whatever action moves actual closer to desired. The loop is idempotent (running it twice does nothing if nothing has changed), stateless (it doesn't remember what it did before), and self-healing (failed actions just get retried). This is the same shape as Terraform plans, Argo CD GitOps, every modern infrastructure-as-code tool, and — in spirit — TCP's congestion control loop from Chapter 10. Declare the intent; let convergence happen is one of the most powerful ideas in modern systems engineering.

Go — every Kubernetes controller, in essence
// the reconcile loop, simplified
func Reconcile(req Request) (Result, error) {
    // 1. read desired state (from etcd, via the API server)
    desired, err := apiServer.Get(req.Name)
    if errors.IsNotFound(err) {
        // the user deleted the object — clean up and stop
        return Result{}, cleanup(req.Name)
    }

    // 2. observe actual state (from the cluster)
    actual, err := cluster.Get(req.Name)

    // 3. diff and act — exactly one step toward convergence
    switch {
    case actual == nil:
        return Result{}, cluster.Create(desired)
    case !equal(desired, actual):
        return Result{}, cluster.Update(desired)
    default:
        // already converged — re-run later in case the world drifts
        return Result{RequeueAfter: 30 * time.Second}, nil
    }
}

04 — Distributed consensus

Many machines · agreement · the math says it shouldn't quite work · we make it work anyway.

Kubernetes' "source of truth" — etcd — runs a distributed consensus protocol called Raft. Why? Because the moment you need multiple machines to agree on something — what is the current version of the configuration, who is the current leader, what is the next entry in the log — the network's unreliability becomes a mathematical problem. Messages can be delayed arbitrarily. Machines can crash. The network can partition into halves that can't reach each other. Distributed consensus is the formal problem of getting n machines to agree on a value despite up to f of them failing. The math underneath it is some of the most consequential in computer science: Lamport's Paxos (1989, finally clearly explained 1998), and Diego Ongaro and John Ousterhout's Raft (2014), which was explicitly designed to be more understandable than Paxos and has since become the dominant consensus protocol in industry.

Paxos · the harder ancestor

Paxos works in two-phase rounds. A proposer first sends a prepare message with a unique proposal number to a quorum of acceptors; each acceptor that has not already promised a higher number replies with a promise not to accept anything lower, returning whatever value (if any) it had previously accepted. The proposer then sends an accept request carrying the highest-numbered previously-accepted value, or its own if none was returned; once a majority accepts, the value is chosen. The mechanism is correct under any pattern of message loss, duplication, or reordering — Lamport proved it in 1989. The difficulty is cognitive. Reasoning about Paxos requires holding several rounds and proposal numbers in your head simultaneously, and Lamport's original presentation used a parable about a fictional Greek parliament that, instead of clarifying things, famously left the paper unpublished for nine years (written in 1989, it did not appear until 1998) and left the field waiting until Paxos Made Simple (2001) for a plain-language account. Raft's contribution was the same correctness with a single leader, sequential terms, and an explicit log — half the cognitive load. That is why etcd uses Raft.
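
The acceptor's half of the protocol is small enough to sketch. A single-decree acceptor in C++ (types and names are illustrative, not from any library): prepare promises never to accept anything numbered lower and reports what it has already accepted; accept takes the value only if no higher promise has intervened.

C++ — a single-decree Paxos acceptor, in essence
#include <optional>
#include <utility>

using Proposal = long;                        // globally unique, totally ordered proposal numbers
using Value    = int;

struct Promise {
    bool granted;
    std::optional<std::pair<Proposal, Value>> already_accepted;   // proposer must adopt this if present
};

struct Acceptor {
    Proposal promised = -1;                                       // highest number promised so far
    std::optional<std::pair<Proposal, Value>> accepted;           // highest-numbered accepted proposal

    Promise prepare(Proposal n) {                                 // phase 1
        if (n <= promised) return {false, {}};                    // already promised a higher round
        promised = n;
        return {true, accepted};
    }

    bool accept(Proposal n, Value v) {                            // phase 2
        if (n < promised) return false;                           // a higher prepare arrived meanwhile
        promised = n;
        accepted = std::make_pair(n, v);
        return true;
    }
};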

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."

— Leslie Lamport, attributed (~1987) · Turing Award 2013
Why a quorum is N/2 + 1

Raft tolerates up to f node failures with a cluster of N = 2f + 1 nodes. A quorum — the smallest set of nodes whose agreement counts as the cluster's decision — is:

Q = ⌊N / 2⌋ + 1

So 5 nodes need 3, 7 need 4, 9 need 5. The reason: any two quorums must intersect in at least one node, so two leaders elected in different terms cannot both have a majority without sharing a witness who would force one to step down. This intersection property is what prevents split brain — and why even-numbered cluster sizes (4 nodes need 3 — same fault tolerance as 3) are wasteful. Add nodes only in odd numbers.

Fig 17.5 — Raft · leader election and log replication: elect a leader, the leader replicates the log, a majority confirms, the entry is committed.

Raft in two phases. Leader election: nodes run with randomised timeouts; the first one to time out increments the term and requests votes; with a majority it becomes leader. Log replication: the leader appends client commands to its log, sends them to followers, and once a majority of nodes acknowledge, the entry is committed and applied to the state machine. The protocol works correctly as long as a majority of nodes are alive and reachable — for a 5-node cluster, that means up to 2 can fail. The math is general: an N-node cluster tolerates ⌊(N-1)/2⌋ failures, which is why production Raft clusters run with 3, 5, or 7 nodes (odd numbers maximise fault tolerance for the resource cost). etcd, Consul, CockroachDB, TiDB, HashiCorp Vault — all run Raft underneath (ZooKeeper runs ZAB, a close cousin), with the leader-and-replication shape of this diagram.
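
The "majority acknowledged" rule has a compact leader-side form. A sketch (field names are illustrative): sort the per-node replication indexes and take the one that a quorum has reached. Real Raft adds one further constraint not shown here: the entry at that index must come from the leader's current term before it may be committed this way.

C++ — the leader's commit rule, in essence
#include <algorithm>
#include <vector>

// matchIndex[i] = highest log index known to be replicated on node i (including the leader itself)
long committed_index(std::vector<long> matchIndex) {
    std::sort(matchIndex.begin(), matchIndex.end());
    const std::size_t n      = matchIndex.size();
    const std::size_t quorum = n / 2 + 1;
    return matchIndex[n - quorum];            // the largest index that a majority has reached
}

// five nodes with matchIndex {5, 3, 3, 2, 4} → sorted {2, 3, 3, 4, 5} → index 3 is on 4 of 5 nodes → committed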

FLP impossibility · why this is harder than it looks

The deep reason consensus is hard is the FLP impossibility result (Fischer, Lynch, Paterson, 1985), one of the most important negative results in computer science. It states: in an asynchronous distributed system where messages can be delayed arbitrarily and even one process can crash, no deterministic consensus protocol can guarantee both safety (never reaching an inconsistent decision) and liveness (always eventually reaching a decision). Some run of the protocol can always be constructed in which it spins forever without deciding. Every real-world consensus protocol — Paxos, Raft, PBFT, Tendermint — works around FLP either by assuming partial synchrony (eventually messages will arrive within some bound) or by giving up determinism (using randomisation, as in some BFT protocols). Neither escape is free; both are pragmatic compromises that the math forces.

Fig 17.6 — FLP impossibility · safety vs liveness vs asynchronous failure
Fig 17.6 — FLP impossibility · safety vs liveness vs asynchronous failure FLP 1985 — pick at most two of safety, liveness, and pure asynchrony SAFETY no two nodes decide differently LIVENESS eventually decides ASYNCHRONY unbounded delays + even 1 crash pick two PRACTICAL ESCAPES → partial synchrony: assume messages eventually arrive within some bound (Paxos, Raft default) → randomisation: flip coins to break ties (Ben-Or, randomised BFT) → give up safety in unusual conditions (eventually consistent stores · Dynamo) every real consensus protocol gives up one of the three · most give up pure asynchrony

FLP is the deepest negative result in distributed systems and the reason consensus is harder than it looks. In a fully asynchronous model with even one crashing node, no deterministic protocol can guarantee both that it always terminates AND that all nodes always agree. Real-world protocols escape by adding assumptions: partial synchrony (Paxos and Raft assume messages eventually arrive within some bound — they may pause but they don't decide wrong), randomisation (Ben-Or's protocol flips coins to break the bad cases), or giving up safety in adverse conditions (eventually-consistent stores like DynamoDB return possibly-stale reads but never block). FLP doesn't say consensus is impossible — only that it cannot be both safe and live in pure asynchrony with crash failures. Once you understand which assumption a system is making, you understand which failure mode it cannot handle.

FLP, in one line

Fischer, Lynch, and Paterson, 1985:

∄ deterministic protocol that solves consensus in an asynchronous system with even one crash failure.

The proof is short: some initial configuration of any protocol is bivalent — its outcome can still go either way — and from every bivalent configuration the adversary can delay exactly the messages that would resolve it, steering the run into another bivalent configuration. The protocol either decides early (and risks deciding wrong) or waits forever. Real systems escape by ruling out pure asynchrony: partial synchrony (Paxos, Raft) assumes messages eventually arrive within some bound; randomisation (Ben-Or) flips coins to break adversarial schedules; eventual consistency (Dynamo) keeps serving but stops insisting all nodes agree right now. Every working protocol is FLP plus one assumption.

05 — The CAP theorem

Pick two · the unavoidable trade in distributed storage.

A different impossibility result governs distributed storage rather than distributed agreement. Eric Brewer's CAP theorem (conjectured 2000, proven 2002 by Gilbert & Lynch) says that any distributed data store can provide at most two of three properties: Consistency (every read returns the most recent write), Availability (every request gets a response), and Partition tolerance (the system keeps working when the network splits the cluster). Since network partitions are a real fact of distributed systems — they happen, whether you like it or not — the practical choice during a partition is between C and A. A consistent system, partitioned, refuses to serve requests on the wrong side of the split (you lose A). An available system, partitioned, serves possibly-stale data (you lose C). Different applications choose different points on the spectrum.

"The 'two of three' formulation was always misleading because it tended to oversimplify the tensions among properties. Now such nuances matter. CAP prohibits only a tiny part of the design space: perfect availability and consistency in the presence of partitions, which are rare."

— Eric Brewer, "CAP Twelve Years Later: How the 'Rules' Have Changed," IEEE Computer, 2012
Fig 17.7 — CAP triangle · pick two when the network splits
Fig 17.7 — CAP triangle · pick two when the network splits network partitions are not optional · so the real choice is C vs A C CONSISTENCY all reads see latest write A AVAILABILITY every request answered P PARTITION TOL. survives network splits CP etcd · Spanner HBase · ZooKeeper AP Dynamo · Cassandra CouchDB · Riak CA (impossible on real networks) PACELC (2010) — even WITHOUT a partition, there's a trade between Latency and Consistency "if Partition: choose A or C; Else: choose Latency or Consistency"

CAP forced a generation of distributed-system designers to be honest about the trade. CP systems — etcd, ZooKeeper, Google Spanner, traditional relational databases with synchronous replication — refuse to serve writes during a partition rather than risk inconsistency. AP systems — Amazon DynamoDB, Cassandra, CouchDB, Riak — serve writes always, accept that nodes on different sides of a partition may temporarily disagree, and reconcile (last-writer-wins, vector clocks, CRDTs) when the partition heals. CA — total consistency and total availability — is unattainable on a real network because partitions are not optional. The PACELC refinement (Daniel Abadi 2010) added the observation that even without a partition, every system trades latency against consistency: synchronously replicated writes are slow but consistent; asynchronously replicated writes are fast but eventually-consistent. There is no free lunch on this dimension; the practical question is always which failure mode hurts your application less.
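The trade is visible in what a single replica does the moment it loses contact with its peers. A deliberately crude sketch, not modelled on any particular database — the CP replica refuses to answer rather than risk being wrong, the AP replica answers and accepts that it may be stale:

class Replica:
    """Toy replica that holds one value and may be cut off from its peers."""
    def __init__(self, mode: str):
        self.mode = mode          # "CP" or "AP"
        self.value = None
        self.partitioned = False  # True = cannot reach a majority of peers

    def write(self, value):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot replicate to a quorum")
        self.value = value        # AP: accept locally, reconcile later

    def read(self):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot confirm this is the latest value")
        return self.value         # AP: answer, possibly stale

cp, ap = Replica("CP"), Replica("AP")
for r in (cp, ap):
    r.write("v1")
    r.partitioned = True          # the network splits

ap.write("v2")                    # AP keeps taking writes (gives up C)
print(ap.read())                  # → "v2" on this side; the other side may still say "v1"
try:
    cp.write("v2")                # CP refuses (gives up A)
except RuntimeError as e:
    print(e)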

Fig 17.8 — CAP in production · which database picked which corner
Fig 17.8 — CAP in production · which database picked which corner the theorem on the left · the procurement decision on the right C consistency A availability P partition tol. CP corner AP corner CA edge: unreachable on real network CP — refuse to lie block writes during partition etcd · ZooKeeper Raft · config-store role Google Spanner TrueTime · atomic clocks · GPS CockroachDB software clocks · Spanner-like HBase · MongoDB (default) AP — answer anyway serve possibly-stale data Amazon DynamoDB 2007 paper · vector clocks Cassandra tunable consistency · LWW CouchDB · Riak CRDT-friendly · gossip DNS · Git · email PROCUREMENT QUESTION If a partition cuts your cluster in half: → refuse all writes & return errors use CP — bank ledger, k8s config, leader election → keep accepting writes, reconcile later use AP — shopping cart, social feed, telemetry CAP IS NOT ABOUT WHICH IS BETTER — IT IS ABOUT WHICH FAILURE MODE YOUR APP CAN TOLERATE Spanner spent ten years engineering atomic clocks just to push the boundary; the trade itself never moves

CAP made concrete. The theorem says you can't have all three; this figure shows which corner each production system actually picked, and why. etcd and ZooKeeper are CP because losing track of "who is the leader" or "what is the current config" silently is catastrophic — better to fail loudly than to lie. Google Spanner achieves something close to CA in the common case by spending engineering effort on TrueTime — GPS-and-atomic-clock-backed timestamps that bound clock skew across continents — but in the rare partition case it still has to choose, and it chooses C. DynamoDB and Cassandra are AP because Amazon's shopping cart is more valuable than perfect consistency: better to occasionally show a stale cart than to refuse to take orders. The procurement decision is always: which failure mode does my application's domain make tolerable? Bank ledgers — refuse rather than risk wrong totals. Shopping carts — accept rather than refuse. Telemetry — accept always. The theorem is the constraint; the corner is the design choice.

06 — Microservices, serverless, and the pendulum

Architecture is fashion · the same problems return every fifteen years.

The dominant application architecture of the late 2010s was microservices: take what would once have been a single monolithic application and decompose it into dozens or hundreds of small services, each owning its own data, communicating over the network. The motivation was real: independent deployability, team autonomy, technology heterogeneity, fault isolation. The cost was also real: every former function call became a network call, every former invariant became a distributed consistency problem, every former crash became a partial failure. By the early 2020s the industry had partially walked the pendulum back — many companies that had eagerly decomposed into microservices discovered they had built a "distributed monolith" with all the complexity of microservices and none of the benefits, and consolidated. The lesson is older than computing: architecture is a series of trades, not a series of progressions, and the same problems return every fifteen years in slightly different costume.

Fig 17.9 — Monolith vs microservices · same business logic, different deployment
Fig 17.9 — Monolith vs microservices · same business logic, different deployment one process · or a hundred · the work is the same · the costs aren't MONOLITH single process auth module user module orders module payment module notifications module function calls between modules · single DB MICROSERVICES auth svc own DB own pods user svc own DB own pods orders own DB payment svc own DB own pods notifications svc queue worker · own DB · own pods network calls between services · per-service DB distributed tracing · service mesh · saga pattern · … THE TRADE — what microservices buy and what they cost + independent deploy · + team autonomy · + tech heterogeneity · + fault isolation – network latency for every former function call · – distributed transactions · – partial failure · – operational cost

The same business logic, two architectures. The monolith runs as a single process with internal modules; calls between modules are normal function calls (nanoseconds, in-process). The microservices architecture splits each module into its own service with its own database, deployed independently, communicating over the network — so what was a function call becomes an HTTP or gRPC call (milliseconds, with retries, with timeouts, with potential failure). The microservices version gains independent deployability, team autonomy, fault isolation. It also pays for distributed-systems complexity: distributed tracing to follow a request across services, the saga pattern for transactions that span databases, service mesh for retries and circuit breaking, and the operational cost of running dozens or hundreds of services. The right answer is rarely "as many services as possible"; it is the smallest decomposition that buys the autonomy you actually need.
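The difference shows up in the call site itself. In the monolith, orders charging payments is a function call; in the microservice version the same operation crosses the network and the caller must own the timeout and the retry policy. The sketch below is illustrative only — the module, the service URL, and the function names are invented:

import time
import urllib.error
import urllib.request

class payments:                              # stand-in for the in-process payments module
    @staticmethod
    def charge(order_id: str, amount_cents: int) -> bool:
        return True

# Monolith: a plain function call — nanoseconds, no partial failure.
def charge(order_id: str, amount_cents: int) -> bool:
    return payments.charge(order_id, amount_cents)

# Microservice: the same operation as an HTTP call — milliseconds, and the caller
# now owns the timeout, the retries, and the case where no answer ever comes.
def charge_remote(order_id: str, amount_cents: int, retries: int = 3) -> bool:
    url = f"http://payments.internal/charge?order={order_id}&amount={amount_cents}"
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=0.5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, TimeoutError):
            time.sleep(0.1 * 2 ** attempt)   # exponential backoff between attempts
    return False    # a failure mode the in-process version simply does not have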

Fig 17.10 — Service mesh and the sidecar · how microservices actually talk in 2026
Fig 17.10 — Service mesh and the sidecar · how microservices actually talk in 2026 every service gets a proxy attached · the network becomes programmable POD · auth-service SERVICE app code just business logic no retries no TLS SIDECAR Envoy proxy retries · timeouts mTLS · auth metrics · tracing localhost · plain HTTP POD · orders-service SERVICE app code just business logic SIDECAR Envoy proxy retries · timeouts mTLS · auth metrics · tracing mTLS · gRPC · retried · traced app code never sees any of this CONTROL PLANE — Istio · Linkerd · Consul routes · policies · certs · all delivered to every sidecar via xDS push platform team owns this · app teams never touch it WHAT THE SIDECAR HANDLES — so app code doesn't have to → retries with exponential backoff (transient failures absorbed) → circuit breakers (stop hammering a downed service) → mTLS (every pod-to-pod call encrypted and mutually authenticated) → traffic shifting (5% canary, blue-green, A/B) → observability (RED metrics + distributed traces) auto-generated → rate limiting (per-route quotas) → outlier detection (eject misbehaving instances) → JWT auth at the edge of every service → topology routing (locality-aware load balance) → same logic for every service · same cert rotation the network becomes a platform-team responsibility · app teams write business logic only

In 2026, the standard pattern for microservices in production is the service mesh with the sidecar proxy. Every service runs in a pod alongside a small proxy — Envoy is the canonical choice, originally written at Lyft — and all of the service's network traffic, in and out, goes through that proxy on localhost. The proxy handles retries, timeouts, circuit breakers, mutual TLS, rate limiting, observability, traffic shifting. The application code never sees any of this; it makes plain HTTP calls to localhost and the sidecar does the rest. A central control plane (Istio, Linkerd, Consul Connect) pushes route rules, certificates, and policies to every sidecar via xDS — the standard discovery protocol Envoy invented. The architectural payoff is that all the operational concerns of running a distributed system — encryption, retries, traffic shaping, observability — become a platform-team responsibility delivered uniformly, instead of every app team reinventing the same patterns in their own languages. The cost is real: every pod runs an extra proxy (memory, latency, complexity), the control plane is itself a non-trivial system to operate, and the configuration surface (Istio's CRDs alone fill hundreds of pages) is its own learning curve. Like every layer in this book, the trade is paid in the operational budget.
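Two of the policies in that list — retries with backoff and the circuit breaker — fit in a few dozen lines. The sketch below is not Envoy's code or configuration, just an illustration of what a sidecar applies around every outbound call so that application code never has to:

import random
import time

class CircuitBreaker:
    """Stop calling a service that keeps failing; probe again after a cooldown."""
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 10.0):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.opened_at = None                         # None = closed (traffic flows)

    def allow(self) -> bool:
        if self.opened_at is None:
            return True                               # closed: traffic flows
        if time.monotonic() - self.opened_at > self.reset_after_s:
            return True                               # half-open: let a probe through
        return False                                  # open: fail fast, don't hammer the service

    def record(self, ok: bool):
        if ok:
            self.failures = 0
            self.opened_at = None                     # close the breaker
            return
        self.failures += 1
        if self.failures >= self.failure_threshold or self.opened_at is not None:
            self.opened_at = time.monotonic()         # open (or re-open after a failed probe)

def call_with_policies(send, breaker: CircuitBreaker, retries: int = 3):
    """Retry with jittered exponential backoff, behind a circuit breaker."""
    if not breaker.allow():
        raise RuntimeError("circuit open: upstream considered down")
    for attempt in range(retries):
        try:
            result = send()                           # the actual outbound call
            breaker.record(ok=True)
            return result
        except Exception:
            breaker.record(ok=False)
            time.sleep((0.05 * 2 ** attempt) * (1 + random.random()))
    raise RuntimeError("upstream failed after retries")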

Serverless · and the architectural pendulum

The newer wave is serverless: write a function, upload it to AWS Lambda or Google Cloud Functions or Azure Functions, and the platform runs it on demand, scales to whatever number of invocations you receive, charges you by the millisecond of execution (some platforms still bill in hundred-millisecond increments), and otherwise abstracts away every notion of a server. The trade is sharp: serverless is cheap and elastic for sporadic workloads (a function called a thousand times per day costs cents); it is expensive and slow for sustained workloads (the same function called a thousand times per second is dramatically cheaper on a dedicated server). The cold-start problem — the first invocation after a quiet period may take hundreds of milliseconds while the platform spins up a runtime — has gotten dramatically better with microVMs (Firecracker again) but still exists. As of 2026, serverless is the right answer for event-driven, bursty workloads (image processing, webhooks, infrequent APIs); it is the wrong answer for sustained high-throughput services. The pendulum keeps swinging.
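The crossover can be put into a back-of-envelope formula. The prices below are placeholders rather than anyone's actual price list; the shape is the point — per-invocation billing grows linearly with traffic, while a dedicated server is a flat monthly cost:

def serverless_monthly_cost(invocations_per_month: float,
                            avg_duration_ms: float = 100,
                            memory_gb: float = 0.5,
                            price_per_gb_second: float = 0.0000167,    # placeholder price
                            price_per_million_requests: float = 0.20): # placeholder price
    """Rough per-invocation cost model: compute time plus request count."""
    gb_seconds = invocations_per_month * (avg_duration_ms / 1000) * memory_gb
    return (gb_seconds * price_per_gb_second
            + invocations_per_month / 1e6 * price_per_million_requests)

server_flat_cost = 70.0    # placeholder: one always-on VM per month

for per_second in (0.01, 0.1, 1, 10, 100, 1000):
    per_month = per_second * 30 * 24 * 3600
    print(f"{per_second:>7} req/s   serverless ≈ ${serverless_monthly_cost(per_month):>10.2f}"
          f"   dedicated ≈ ${server_flat_cost:.2f}")
# at a handful of requests per second the function is cheaper;
# at hundreds per second the always-on server wins by a wide margin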

Fig 17.11 — The architectural pendulum · same problems, different costumes
Fig 17.11 — The architectural pendulum · same problems, different costumes monolith → SOA → microservices → serverless → ? · every 10–15 years a new costume 1990s monolith single binary ~2005 SOA SOAP · ESB · WSDL ~2014 microservices REST · containers · K8s ~2018 serverless Lambda · FaaS · cloud-only → 2026? "modular monolith" deploy unit ≠ org unit decompose · feel pain of distribution · partially recompose · repeat

Software architecture has its fashions. In the 1990s everything was a monolith. In the early 2000s "Service-Oriented Architecture" promised loose coupling via SOAP/WSDL/ESB middleware (it worked, but the tooling was heavy and unloved). Around 2014, with Docker and Kubernetes maturing, microservices became the consensus answer; Netflix, Amazon, and Spotify all famously decomposed. By the early 2020s, several large companies had partially reversed course — Amazon Prime Video published a 2023 article describing how they consolidated a microservices monitoring system back into a monolith for 90% cost savings. The honest pattern: every architectural style has trades; the right answer depends on the team, the workload, and what kind of pain the organisation is currently best equipped to absorb. Distributed-systems complexity is expensive whether you choose to pay it or not.

🔁

The recurring pattern. Every layer of the cloud abstracts away the layer beneath while preserving its essential trades. Hypervisors abstract physical hardware but preserve the cost of virtualization overhead. Containers abstract operating systems but preserve the cost of process isolation. Kubernetes abstracts cluster operations but preserves the cost of distributed-systems complexity. Serverless abstracts servers entirely but preserves the cost of cold starts and per-invocation latency. None of these abstractions are leak-free; every one of them eventually requires the engineer to understand the layer below. The skill of working in the cloud is, in part, the skill of knowing which layer is currently the source of your problem. We have been building stacks of leaky abstractions for sixty years; we will probably keep doing it for at least sixty more.

🔐

SolarWinds, December 2020. Eighteen thousand customers of a network-management vendor installed a routine update of Orion. The update had been silently poisoned at build-time inside SolarWinds's own CI pipeline — a backdoor compiled into the binary the vendor itself signed and shipped. Among the customers were US federal agencies, parts of Microsoft, and SolarWinds itself. The cloud era amplifies the blast radius of supply-chain compromise: one compromised vendor reaches every customer simultaneously, the update mechanism is the attack vector, and the trust boundary that mattered was not in any customer's network. SBOMs, reproducible builds, and signed-but-also-attested artifacts are the industry's response to discovering that a software-update channel is architecturally identical to a worm.

Knight Capital, 1 August 2012, 9:30 a.m. A high-frequency trading firm pushed a new build to seven of its eight production servers. The new code reused a feature flag that, years earlier, had activated a test-order routine; the eighth server, still running the old code, saw the flag and began dispatching those test orders into the live market. In forty-five minutes Knight Capital sent roughly four million unintended orders, taking on positions worth around seven billion dollars; the firm lost $440 million and was acquired weeks later. Heterogeneous deployment state across a cluster is the distributed-systems version of a race condition: a single machine out of phase with its peers, doing the wrong thing at the right speed, is enough to end a company. Every modern deployment system — blue-green, canary, progressive rollout — is a response to the cost of this specific failure mode.

Closing the chapter · seam to Chapter 18

Chapter 17 closes the substrate-and-scale arc that began with the single transistor in Chapter 1. We have followed the layers all the way out: from one transistor to one CPU to one operating system to one network to one TLS connection to one database to one cryptographic primitive, then up again — many cores, many machines, many datacenters, many regions. There is no further "out" we can go without reaching the human social systems that computing intersects with. That is, in some sense, where Chapter 18 looks. Eighteen chapters of mechanism, ending in the observation that the mechanism is now a substrate for everything else.

Chapter 18 · Epilogue

What You
Now See

Seventeen chapters ago this book began with a single transistor — a switch made of doped silicon, controlled by voltage, doing nothing on its own. Around it we have now built an entire civilization: gates, instructions, kernels, languages, networks, cryptography, security, scale. The computer is no longer a thing. It is a layered system that runs everywhere, holds nearly all knowledge produced by humans, and increasingly shapes the decisions of every domain it touches. This chapter does no new teaching. It looks back at the whole, names what's permanent and what isn't, points at where the field is going, and ends.

TopicsSynthesis · history · futures · intersections · reading list
Era coveredall of it
Chapter 18 hero · What You Now See WEB SERVICE network process kernel instruction CPU gate transistor SAND every layer made of the layer below it
01 — The computer as a mathematical object

Every layer is math made physical · made physical again, on top.

The deepest claim this book has made — implicit in every chapter, visible in retrospect — is that the modern computer is, in a precise sense, a mathematical object. It is a stack of layers in which each layer is a mathematical structure realised in the layer below, and the next layer up is another mathematical structure realised in this one. Boolean algebra is realised in voltages and gates. Arithmetic is realised in adders and registers. The instruction set is realised in microcode. The kernel is a state machine over processes. The relational model is set theory over rows. TLS is number theory over network packets. Distributed consensus is a graph-theoretic protocol over asynchronous nodes. The whole tower is built by translating one kind of mathematics into the operations of another.

Fig 18.1 — The full stack · sand to web service · math at every layer
Fig 18.1 — The full stack · sand to web service · math at every layer eighteen chapters · one stack · the whole arc in one diagram DISTRIBUTED SYSTEMS consensus theory · CAP · queueing WEB SERVICE graph theory · request-response semantics CRYPTOGRAPHY number theory · elliptic curves · lattices DATABASE set theory · relational algebra NETWORK PROTOCOLS control theory · information theory PROGRAMMING LANGUAGE type theory · lambda calculus PROCESS / KERNEL queueing theory · Little's Law INSTRUCTION SET finite-state machine · register algebra ARITHMETIC LOGIC UNIT two's complement · IEEE 754 LOGIC GATES Boolean algebra TRANSISTOR solid-state physics · field-effect

The whole book in one column. Each layer is a mathematical structure realised in the layer below it; each layer up is a different kind of math that fits the affordances of the one beneath. The transistor at the bottom is solid-state physics. The logic gates above it are Boolean algebra. The ALU is two's-complement arithmetic and IEEE 754. The instruction set is a finite-state machine over registers. The kernel is queueing theory and scheduling theory. Programming languages bring type theory and lambda calculus. Network protocols bring control theory (TCP) and information theory (Shannon). Databases bring set theory. Cryptography brings number theory and elliptic curves. Distributed systems bring consensus theory and CAP. The whole stack is, structurally, a chain of mathematical translations from physics to people. The remarkable thing is not that any one layer works; it's that all of them work at once, each defending the abstraction the next one above relies on.

01.5 — The trace

Eighteen chapters. One keystroke.

Every senior engineer is supposed to be able to answer one question: "what happens when you type google.com into your browser and press Enter?" The honest answer touches every layer of computing. The book you have just read is, structurally, the long-form answer. This section is the visual one. One figure, one keystroke, every chapter activated at once.

The trace below follows a single keystroke from the moment your finger closes a contact on a key — voltage rising on a wire, in the same way Chapter 1 described — through every layer of your machine, out across the network, into Google's server stack, back through the network, and onto pixels on your screen. Eighteen waypoints, one for each chapter (almost). The pulse animates the journey end to end. The reader who has finished the book can name what is happening at every station.

Two chapters are not numbered stations and warrant a note. Chapter 15 (security) is woven across the trace rather than located at one point — each time the journey crosses from less-trusted to more-trusted territory (user input → kernel; client → DNS resolver; client → server's TLS endpoint; HTTP query → SQL parser), a threat boundary is crossed and a defence layer applies. The figure marks these transitions with thin red dotted lines. Chapter 18 is not a station because Chapter 18 is this figure: the moment the layers stop being a list and become a system.

Fig 18.1.5 — The trace · what happens when you press Enter on google.com
Fig 18.1.5 — The trace · 18 waypoints across every chapter the book has taught eighteen chapters · one keystroke A LIVE ROUND TRIP THROUGH THE BOOK YOUR MACHINE THE NETWORK GOOGLE trust ↔ trust ↔ trust ↔ trust ↔ Ch 15 Ch 15 Ch 15 Ch 15 KEY contact closes 1 Ch 1 GATE keycode decoded 2 Ch 2 CPU interrupt handler 3 Ch 3 TRAP → ring 0 4 Bridge KERN event queued 5 Ch 4 libc to user space 6 Ch 5 V8 C++ event loop 7 Ch 6 JS build request 8 Ch 12 DNS name → IP 9 Ch 11 TCP handshake 10 Ch 10 TLS ECDH · AES-GCM 11 Ch 14 HTTP GET /search 12 Ch 11 IP BGP routing 13 Ch 9 WIRE Ethernet · fibre 14 Ch 8 CLOUD LB + sidecar 15 Ch 17 SVC Python service 16 Ch 7 INDEX SQL · sharded 17 Ch 13 RANK cores + GPUs 18 Ch 16 return path · response decrypts, packets reassemble, browser parses HTML, GPU rasterizes pixels Ch 14 · Ch 10 · Ch 8 · Ch 4 · Ch 12 · Ch 1 YOU GOOGLE EIGHTEEN CHAPTERS · ONE KEYSTROKE · ROUGHLY 200 MILLISECONDS

The trace, end to end. The pulse leaves your finger as voltage on a key, climbs through the abstraction layers of your machine — gates, CPU, kernel, libc, V8, JavaScript — descends into the network protocols (DNS, TCP, TLS, HTTP, IP, the wire itself) and out across continents, lands at Google's load balancer, threads through a Python microservice, queries a sharded inverted index, runs parallel ranking on many cores, and turns around. The return packet (lower arc, in green) carries an HTTP response back through TLS decryption, TCP reassembly, the wire, your kernel, your browser's HTML parser, and finally the GPU that rasterizes the pixels you read. Eighteen chapters of the book, all activated, in roughly two hundred milliseconds. Trust boundaries — places where data crosses from less-trusted to more-trusted territory and Chapter 15's defences apply — are marked with red dotted lines. The figure is the only one in the book that visits every chapter; that is its point. Watch the pulse for a minute. The thing you are looking at is the substrate of modern civilization — the same substrate the previous 235 figures were quietly preparing you to see at this final scale.

That is the long-form answer the book has been preparing you to give. The interview question — "what happens when you press Enter on google.com?" — has been answered countless times in blog posts, in lectures, in technical books. Most answers are lists. The book's contribution is the visual answer: the pulse traveling along a single arc through eighteen distinct mathematical and engineering structures, every one of which had to be invented by specific people in specific decades to be available at this moment.

"The remarkable thing is not that any one layer works; it's that all of them work at once."

— from §01 of this chapter, applied here

That sentence has been the through-line of Under the Code from the first transistor to this trace. The figures that came before — more than two hundred of them — have, in retrospect, been preparation for the figure above. A reader who has finished the book has, in some precise sense, watched a keystroke travel through their entire civilization and come back. Whatever they go on to build or break or maintain, they will see those layers from now on. That is what reading the book has done.

02 — From sand to civilization

The chain of decisions, in time, by people in real places.

Every layer in that diagram is also a piece of history. Each one was invented by specific people, in specific places, in response to specific problems they were trying to solve. None of them was inevitable — each could have happened differently, or not at all, or much later. The Boolean algebra in your CPU was invented by a Lincolnshire schoolteacher who died in 1864. The transistor came out of three physicists at Bell Labs in 1947. The kernel structure of every modern operating system descends from UNIX, which two engineers wrote on a spare PDP-7 to play a game called Space Travel. The web came out of a particle physics lab in 1989 because a researcher wanted a better way to organise lab notes. The cloud you deploy your code into is built on virtualization that IBM invented in 1968 to let many users share one mainframe. Civilization is the long sequence of these decisions, each one made by a real person under real constraints, and accumulated over a few decades into the substrate of nearly everything modern.

Fig 18.2 — The civilization timeline · seven moments that made the rest possible
Fig 18.2 — The civilization timeline · seven moments that made the rest possible a few rooms · a few people · a few decades · the substrate of everything since 1854 Boole Laws of Thought Lincolnshire 1936 Turing universal machine Cambridge 1947 Shockley + 2 transistor Bell Labs NJ 1969 ARPANET + UNIX 2 networks · 1 kernel UCLA · Bell Labs 1976 Diffie + Hellman public-key crypto Stanford 1989 Berners-Lee the web CERN, Geneva 1991 Torvalds Linux Helsinki SEVEN MOMENTS · ONE CIVILIZATION a Lincolnshire schoolteacher · a Cambridge mathematician · three physicists · two networks · two cryptographers · a CERN researcher · a Helsinki student. Less than 150 years from beginning to here.

Seven moments. Everything we have used in this book traces back, eventually, to a small set of decisions made by named people in named places. Boole's Laws of Thought in 1854 made the algebra. Turing's 1936 paper made the universal machine. Shockley, Bardeen, and Brattain's 1947 transistor made the substrate. ARPANET and UNIX both arrived in 1969 — a network of computers and a kernel of an operating system, neither aware of the other for years. Diffie and Hellman's 1976 paper made public-key cryptography. Berners-Lee's 1989 proposal at CERN made the web. Torvalds's 1991 Usenet announcement, posted from Helsinki, made the kernel that runs every cloud. None of these were predictable; in retrospect, all of them were necessary. The civilization we now live inside is the cumulative consequence of these specific human acts.

03 — What's permanent and what isn't

The substrate is changing again · which parts will survive it?

Three forces are visibly reshaping the substrate as this book is written. Moore's Law is ending — the transistor-density doubling we tracked from Chapter 1 has slowed to a near-halt below 3 nanometres because device features are approaching atomic dimensions. Quantum computing is going from laboratory demonstrations to early useful machines — not replacing classical computers but attacking specific problems (factoring, simulating quantum systems, optimisation) that classical computers cannot solve in any practical amount of time. Neuromorphic and analog AI accelerators are becoming a real alternative substrate for one specific workload — large neural networks — that classical von Neumann hardware is, by 2026, visibly straining to run. None of these will replace the stack this book described. All of them will sit alongside it as additional substrates. The question, in the long view, is which mathematical primitives travel — Boolean algebra survives anything; relational algebra survives most things; specific cryptographic schemes do not survive quantum.

Fig 18.3 — Three things the next decade will change
Fig 18.3 — Three things the next decade will change none of them replace classical · all of them sit alongside it as new substrates MOORE'S DEATH ~1971 → 2030? below 3 nm transistors are atoms wide consequence: specialised chips for specific workloads QUANTUM |0⟩ |1⟩ qubits in superposition solve specific problems consequence: classical crypto must migrate post-quantum NEUROMORPHIC + AI spiking neuron analog/spiking hardware ~1000× efficient on NN consequence: CPUs/GPUs joined by a third substrate PERMANENT — Boolean algebra · relational algebra · the layered model · most of this book TRANSIENT — RSA / ECDH (post-quantum), single-thread clock scaling, von-Neumann-only thinking

Three substrates that will sit alongside the classical CPU stack over the next decade. Moore's Law is genuinely flattening; below the 3 nm node the smallest device features are only tens of atoms across and the physics gets in the way. The industry's response — visible already — is specialised chips per workload (TPUs for matrix multiply, custom silicon for video, dedicated DSPs for wireless) rather than ever-faster general CPUs. Quantum computers, with somewhere between hundreds and thousands of stable qubits expected within the decade, will solve specific structured problems (factoring, simulation, certain optimisations) that classical machines cannot tackle in practical time — but they will not replace classical computing for the kinds of work this book has covered. Neuromorphic and analog AI accelerators are emerging as a third substrate specifically optimised for neural-network inference, with order-of-magnitude efficiency gains over GPUs for that workload. None of these are replacements; all of them are additions. The mathematical primitives that survive every substrate change — Boolean algebra, set theory, the layered abstraction model — survive because they are independent of the physics.

The deeper claim under all three forces is about which intellectual content of this book will still be useful in fifty years. The mathematical primitives almost always survive — Boolean algebra (Chapter 2), set theory and relational algebra (Chapter 13), information theory (Chapter 8), graph theory (Chapter 9), control theory (Chapter 10), number theory (Chapter 14) — because they are independent of the substrate that computes them. The architectural patterns generally survive — the layered model, the kernel-as-mediator, the request-response cycle, the read-replicate-converge shape of consensus — because they answer questions about how computation should be organised, not about what computes it. The specific implementations are the most fragile: x86 will eventually be a curiosity, RSA will retire when fault-tolerant quantum computers can run Shor's algorithm at scale, particular libraries will become archaic. The reader who has internalised the patterns rather than the products will not be inconvenienced by any of those losses. The reader who has memorised the products will need to keep relearning. That is the whole bet of writing a book about perspective rather than facts.

04 — Where computer science intersects with everything else

The substrate of this book is now the substrate of much else.

Computer science as a discipline began as a branch of mathematics and electrical engineering. It is now the substrate that every other discipline runs on, and increasingly the methodological lens that other fields adopt. Modern biology is computational — genomes are bytes, protein folding is an optimisation problem, drug discovery is a search through chemical space. Modern economics is computational — markets are distributed systems, financial contracts are state machines, trading is a latency competition. Modern governance is increasingly computational — every public service is a software system, public records are databases, the shape of public discourse is shaped by recommendation algorithms. Modern art is increasingly computational — generative models, procedural environments, real-time graphics. The sentences in this paragraph are slogans, but each one names a real arc that is in progress now, that this book's reader will be in the middle of.

Fig 18.4 — CS intersections · the disciplines that now run on the substrate
Fig 18.4 — CS intersections · the disciplines that now run on the substrate computing as the lens · or computing as the substrate · either way, present everywhere COMPUTER SCIENCE BIOLOGY genomics · folding ECONOMICS HFT · DLT · markets GOVERNANCE IDs · platforms · regs ART generative · games EDUCATION platforms · LLMs PHYSICS simulation · ML "every field becomes computational" — slogan, but also true now

Six fields, each one transformed by becoming computational. Biology after the human genome project (2003) and AlphaFold (2020) is essentially impossible without computation; an entire generation of biologists are now also programmers. Economics — high-frequency trading, distributed ledgers, market microstructure — is run on millisecond-latency software systems and modeled with techniques borrowed from distributed-systems research. Governance — digital identity, platform regulation, electoral systems, the public infrastructure many countries are now rebuilding — is increasingly software. Art — generative models, procedural environments, real-time graphics, AI-assisted music — has a computational layer that didn't exist twenty years ago. Education is being reshaped by platforms and large language models. Physics — climate models, particle-physics analyses, materials science — runs on supercomputers and increasingly on neural networks trained on physical data. The reader who has finished this book has, in principle, the vocabulary to understand the substrate of any of these.

The opposite direction matters too. A computer scientist who understands the substrate of biology can talk to biologists about why a particular algorithm scales; one who understands market microstructure can talk to economists about why a particular distributed-ledger design will or will not work. The vocabulary moves both ways. The deeper consequence is that "computer science" is becoming less a separate field and more a kind of literacy — one that other disciplines increasingly assume in the way they once assumed statistics or basic mathematics. Within fifty years, the question may not be whether a biologist or an economist or an artist knows computer science, but which parts of it they know to which depth, and whether they can recognise when a problem in their own domain is actually a computational problem they are trying to solve without the vocabulary. The reader who has finished this book has, on top of that vocabulary, the discipline that produced it: resolve into layers, locate the math, name the security boundary, recognise the historical decision. Those four moves work in any computational domain, regardless of which surface field it is wearing.

05 — Reading list

The books that go deeper · one shelf per part of the book.

Every chapter of Under the Code compresses what is, in its target field, several books worth of detail. The point of compression was to give a reader the whole arc and the vocabulary to navigate; the point of the books below is to give the reader the depth to do real work. The list is curated for durability — every entry has been the reference text of its field for at least a decade. None of them are substitutes for this book; all of them are deeper.

Fig 18.5 — The reading list · grouped by part of this book
Fig 18.5 — The reading list · grouped by part of this book five shelves · one per part of this book · all of them durable PART I — THE PHYSICAL WORLD Charles Petzold · Code: The Hidden Language of Computer Hardware and Software Hennessy & Patterson · Computer Architecture: A Quantitative Approach also: Tanenbaum · Structured Computer Organization; Patterson & Hennessy · Computer Organization and Design PART II — THE SOFTWARE LAYER Kernighan & Ritchie · The C Programming Language (still the reference) Stroustrup · The C++ Programming Language Tanenbaum & Bos · Modern Operating Systems also: Fluent Python (Ramalho); The Linux Programming Interface (Kerrisk) PART III — THE NETWORK W. Richard Stevens · TCP/IP Illustrated (the canonical reference) Ilya Grigorik · High Performance Browser Networking (online, free) Tanenbaum & Wetherall · Computer Networks also: Beej's Guide to Network Programming (online, free) PART IV — DATA AND SECURITY Silberschatz et al. · Database System Concepts Ferguson, Schneier & Kohno · Cryptography Engineering Erickson · Hacking: The Art of Exploitation (genuinely the best intro) also: Boneh & Shoup · A Graduate Course in Applied Cryptography (free, draft) PART V — THE MODERN MACHINE Kleppmann · Designing Data-Intensive Applications (the modern reference)

Five shelves of further reading, one per part of this book. Each entry has been the reference text in its field for at least a decade and will likely remain so. Code by Petzold is the rare popular-science book that takes the reader from a flashlight signal all the way to a CPU; it overlaps Part I in spirit. Stevens's TCP/IP Illustrated is the network bible and the source from which many of Chapters 8–10's references descend. Erickson's Hacking: The Art of Exploitation is unusual in that it teaches binary exploitation by actually walking through it on a Linux system the book provides; it is the deepest practical follow-up to Chapter 3 §05–§06. Kleppmann's Designing Data-Intensive Applications is, as of 2026, the single best book on modern distributed systems engineering. None of these are substitutes for Under the Code; all of them are deeper. The reading list is your map of where to go from here.

06 — Final note

What you now see.

A book about how computers work could be a book about facts. This one tried to be a book about seeing. The discipline of this book has been to introduce nothing without its history, explain nothing without its mathematics, name no specific vulnerability without the structural pattern beneath it, describe no protocol without the human decisions that produced it. The reader who has finished — really finished, having followed the diagrams, the worked examples, the pull quotes, the cross-chapter threads — is now able, in principle, to look at any piece of software, any network, any device, and resolve it into the layers it is made of. To recognize that each of those layers is the consequence of specific human decisions made under specific historical pressures. To recognize the math underneath each one. To see where the security lives, and where it doesn't.

That recognition is not easily acquired and is rarely taught. The book has tried to teach it without compression, without skipping, and without ever assuming that an idea introduced in the third chapter will be remembered three hundred pages later — every reference back is explicit, every primitive is named, every figure builds on the figures before. If the discipline has worked, the reader now sees the layers when they didn't before, and that seeing will be portable to new domains the book did not cover.

The substrate is changing. New chips, new substrates, new paradigms — quantum, neuromorphic, who knows what else — will sit alongside the stack this book described. Some of the specific content will go out of date; most of the structural content won't. The reasoning skills the book asked for — resolve into layers, follow the math underneath, locate the security boundary, name the historical decision — are not specific to any year of computing. They will be useful in fifty years for the same reason they are useful now: every layer of every system, no matter what substrate it runs on, will still be a mathematical structure realised in the layer below, built by specific people under specific constraints, defended by carefully chosen assumptions. The vocabulary is durable.

Eighteen chapters from a single transistor to a planet-scale distributed service is a long arc. The reader who has walked it now has, in some real sense, the substrate of modern civilization laid out in their head. The book ends. The work the reader does with what they now see — that is the part the book cannot end.

"The competent programmer is fully aware of the strictly limited size of his own skull; therefore he approaches the programming task in full humility, and among other things he avoids clever tricks like the plague."

— Edsger W. Dijkstra, "The Humble Programmer," ACM Turing Award Lecture, 1972
Fig 18.6 — The closing zoom-out · the inverse of where Chapter 1 began
Fig 18.6 — The closing zoom-out · the inverse of where Chapter 1 began we started with one transistor · we end with the world running on top of them civilization cloud network kernel CPU transistor ~3 nm Chapter 1 Chapter 18 you can now see all the way through

The book's first diagram showed a single transistor. The book's last diagram shows what was built around it: a CPU, a kernel, a network, a cloud, a civilization. The same layered structure that Chapter 1 introduced as a single switch is now visible at every scale, in every direction. You have followed the chain from voltage on a wire to JSON on a screen and back. You have followed it from a thumb-sized transistor on a 1947 lab bench to a billion-transistor CPU running JavaScript inside a virtualised container inside a Kubernetes cluster inside an AWS region. You have followed it from Boole's algebra to elliptic-curve cryptography to the quantum computers that will retire it. The thing you can now see is not any one of these layers. It is the way they are all the same kind of object — a mathematical structure made physical, and re-realised in the layer above. That is what was inside the code.

end of Under the Code

Colophon
Author
Yki Lähteenmäki

Helsinki-based. Wrote this book to see all the layers at once.

Published by
Atheric

A small studio for tools and writing.
YTD Holdings Oy · Helsinki

End of the Book

Under the Code is built.

Eighteen chapters. Five parts. From a single transistor in 1947 to a planet-scale distributed service in 2026 — voltage to gates to instructions to kernels to languages to networks to cryptography to distributed consensus to the closing synthesis. The substrate of modern civilization, named layer by layer.

The discipline of this book — resolve into layers · follow the math · locate the security boundary · name the historical decision — is now yours.