The two ways to make your program use more than one core.
By 1990, the kernel had given UNIX programmers a clean answer to
"how do I run two things at once": fork. The child
process ran in parallel with the parent, isolated from it, with
its own address space, its own file descriptors, its own
everything (Chapter 4). The
isolation was exactly the problem when the two processes wanted
to cooperate. A parent that wanted to share a
million-element array with its child had to copy it across the
kernel boundary every time, paying for privacy it did not want.
POSIX 1003.1c, ratified in 1995, codified the
alternative: the thread — a separate stream of
execution that shares the address space of the process it
belongs to, can be scheduled onto a different core, and can read
and write the same memory directly. Threads are the cheap, fast,
dangerous answer to "how do I use more than one core"; processes
are the slow, safe one. The next thirty years of concurrent
programming are essentially about the price of the danger.
Threads share the heap, the globals, the file descriptor table, and most other process state — only the call stack and the registers are per-thread. Processes share none of this; each lives behind its own page table, with the kernel mediating any communication. The trade is real and consequential: threads make IPC free (just write to a variable; the other thread can read it) but make every concurrent access a potential bug. Processes make every interaction a system call but make memory corruption between them impossible. Modern programming languages reach for both: C, C++, Java, Rust default to threads; Python's multiprocessing reaches for processes (because the GIL — Chapter 7 §04 — makes threads less useful); Erlang and Go invented their own lightweight thread-like primitives (BEAM processes, goroutines) that get the best of both.
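A minimal sketch in C++ (the language the code in this chapter uses) of what "IPC is free" means for threads, with illustrative names: the spawned thread reads and writes the parent's vector directly, where a forked child would need a pipe, a socket, or an explicitly shared segment to send its results back.

#include <thread>
#include <vector>

int main() {
    std::vector<int> shared(1'000'000, 1);   // one million elements, one copy

    // The new thread operates on the parent's memory directly: no copying,
    // no kernel crossing per access.
    std::thread worker([&shared] {
        for (int& x : shared) x *= 2;
    });
    worker.join();                           // after join, the writes are visible here

    // A fork()ed child would get its own (copy-on-write) address space; the
    // doubled values would have to travel back through a pipe, a socket,
    // or a shared-memory segment set up on purpose.
    return shared.front() == 2 ? 0 : 1;
}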
Where parallelism actually helps
Not every program benefits from more cores. Programs whose work naturally divides into independent chunks — image filters processing pixels, web servers handling unrelated requests, Monte Carlo simulations running independent samples — are called embarrassingly parallel and scale almost linearly with core count. Programs whose steps depend on each other — most stateful computations, anything sequential — benefit much less, or not at all. The mathematics of this constraint is Amdahl's law, which we get to in §05; for now the practical observation is that adding cores is not a substitute for serial speed. A 64-core laptop runs a pure-sequential program no faster than a 1-core one with the same clock speed.
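A minimal sketch of the embarrassingly parallel shape, assuming the per-element work is fully independent: the array is cut into one contiguous chunk per hardware thread, and the chunks never communicate. The function and variable names are illustrative.

#include <algorithm>
#include <cmath>
#include <thread>
#include <vector>

// Each element's result depends only on that element, so the chunks can run
// in parallel with no locks and no coordination.
void parallel_transform(std::vector<double>& data) {
    unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk  = (data.size() + n_threads - 1) / n_threads;
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < n_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end   = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                data[i] = std::sqrt(data[i]) * 2.0;   // purely local work
        });
    }
    for (auto& w : workers) w.join();
}

int main() {
    std::vector<double> data(10'000'000, 9.0);
    parallel_transform(data);                 // every element becomes 6.0
}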
Two threads · one variable · everything depends on luck.
The price of shared memory shows up the first time two threads
modify the same variable concurrently. The classic example is so
simple it sounds harmless: two threads each incrementing a shared
counter a million times. The expected result is two million. The
actual result, on any real machine, is somewhere between one
million and two million, with the exact value depending on
scheduling, cache coherence, and luck. The increment that looks
atomic in source code (counter++) is in fact three
operations underneath: read the value into a register, add one,
write it back. If two threads' three-step sequences overlap, the
writes can stomp on each other.
The reference race condition. Both threads execute counter++ when the counter is 100. Thread 1 reads 100, adds 1, plans to store 101. Before Thread 1 stores, Thread 2 also reads 100 (the not-yet-updated value), adds 1, plans to store 101. Both stores happen; the counter ends at 101 instead of 102. Thread 2's increment vanished. Multiply this across a million iterations and the expected result of 2,000,000 routinely comes out as 1,000,000 + ε for some small ε. The bug is invisible in source code, depends entirely on scheduling, fails to reproduce reliably, and is one of the most frustrating classes of error in software. Decades of language design — atomics, locks, channels, Rust's borrow checker — exist to prevent this category specifically.
The mutex · serialise the critical section
The classic fix is a mutex (mutual exclusion lock).
Each thread, before touching the shared counter, calls
mutex.lock(); after touching it, calls
mutex.unlock(). Only one thread can hold the lock at
a time — any other thread that calls lock() while it
is held blocks until the holder releases. The protected
region (between lock and unlock) is called the critical
section, and the rule is that all critical sections that
touch the same data are serialised. The mutex itself is implemented
using atomic CPU instructions (compare-and-swap or
load-link/store-conditional) — operations the CPU
guarantees are indivisible at the hardware level. Underneath every
thread-safe data structure is a chain of these primitives.
// the race — counter ends somewhere between 1,000,000 and 2,000,000
int counter = 0;
void worker() {
    for (int i = 0; i < 1'000'000; ++i)
        ++counter;                  // LOAD; ADD; STORE — three instructions
}

// the fix — std::mutex serialises the critical section
std::mutex m;
int counter = 0;
void worker() {
    for (int i = 0; i < 1'000'000; ++i) {
        std::lock_guard<std::mutex> g(m);
        ++counter;                  // only one thread holds the lock at a time
    }                               // g's destructor releases the lock — RAII
}

// or, for this specific case, an atomic — no lock at all, single CPU instruction
std::atomic<int> counter{0};
void worker() {
    for (int i = 0; i < 1'000'000; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);
}
Deadlock · the trap when locks compose
Locks solve the race; they introduce a new failure mode. Deadlock happens when two or more threads each hold a lock the other one needs, and both wait forever. The canonical recipe: thread A holds lock L1 and tries to acquire L2; thread B holds L2 and tries to acquire L1. Each is blocked waiting for the other; neither can make progress; the program freezes. Deadlock is mathematically characterised by the existence of a cycle in the "wait-for" graph: an arrow from each waiting thread to the thread whose lock it wants. If the graph has a cycle, there is deadlock; if it has no cycle, there isn't. The standard prevention is lock ordering — every thread acquires locks in the same global order — which ensures the wait-for graph remains acyclic.
The mutex closes the race, but introduces deadlock as a new failure mode. Two threads, two locks, each holding what the other needs — both block forever. The same shape with spinning threads (each repeatedly trying and failing rather than blocking) is called livelock — equally fatal, harder to debug because the program looks busy. The standard prevention is lock ordering: define a global order over all locks, and require every thread to acquire them in that order. The wait-for graph can then only go forward, never form a cycle. The discipline is real: large codebases (Linux, PostgreSQL, V8) document their lock hierarchies, and acquiring locks out of order is a code-review red flag. Above the mutex, every higher-level concurrency primitive — semaphores, condition variables, read-write locks, channels — is built from the same atomic foundation, with the same potential for these three bugs.
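A minimal sketch of the recipe and the fix, with placeholder locks m1 and m2. The broken pair acquires the two locks in opposite orders, so the wait-for graph can form the cycle described above; the fixed pair acquires them in one global order (C++17's std::scoped_lock(m1, m2) achieves the same end by locking both together with a built-in avoidance algorithm).

#include <mutex>
#include <thread>

std::mutex m1, m2;

// Deadlock-prone: thread A runs broken_a, thread B runs broken_b.
// A holds m1 and waits for m2; B holds m2 and waits for m1.
void broken_a() { std::lock_guard<std::mutex> a(m1); std::lock_guard<std::mutex> b(m2); }
void broken_b() { std::lock_guard<std::mutex> a(m2); std::lock_guard<std::mutex> b(m1); }   // reversed order

// Lock ordering: every thread takes m1 before m2, so no cycle can form.
void fixed_a()  { std::lock_guard<std::mutex> a(m1); std::lock_guard<std::mutex> b(m2); }
void fixed_b()  { std::lock_guard<std::mutex> a(m1); std::lock_guard<std::mutex> b(m2); }

int main() {
    std::thread t1(fixed_a), t2(fixed_b);    // the broken pair may hang forever
    t1.join(); t2.join();
}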
Therac-25, 1985–87. The first software-caused medical fatalities in history were race conditions. The Therac-25 radiation therapy machine used a single computer to control beam intensity and targeting, with no hardware interlock — its predecessors had hardware interlocks, but the engineers had concluded the software was reliable enough to remove them. A specific keystroke sequence, fast enough to interleave with the beam-mode-select state machine, could leave the machine in high-power X-ray mode while the targeting magnet was set for low-power electron mode. Six patients received radiation overdoses of up to a hundred times the prescribed dose; three died. The lesson is older than the field: any concurrent state machine without explicit synchronisation will eventually be observed in every reachable state, including the states the designer assumed were impossible.
Concurrency without locks · the deeper rabbit hole.
Locks have a fundamental cost beyond their nanosecond overhead: they serialise. A queue protected by a single mutex can be used by one thread at a time regardless of how many cores are available. In high-performance contexts — kernels, databases, network stacks, JavaScript engines, garbage collectors — engineers reach for lock-free data structures: queues, stacks, hash tables built so that multiple threads can read and write concurrently without ever blocking each other. The mechanism is the same atomic CPU instruction that mutexes are built on, used directly: compare-and-swap (CAS). With CAS, a thread can perform an "if the value is still x, change it to y; otherwise tell me what it is now" operation as a single indivisible step. Built carefully, CAS-based data structures can be wait-free (every thread always makes progress) or lock-free (the system as a whole always makes progress, even if individual threads can starve). Both terms come from Maurice Herlihy's 1991 paper Wait-Free Synchronization, which proved a hierarchy of exactly which data structures can be implemented wait-free on which atomic primitives — the result that turned an engineering practice into a mathematical theory. The cost is that the code is genuinely difficult to write correctly, and reasoning about it requires understanding memory orderings — the rules by which writes from one core become visible to others.
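A minimal sketch of a CAS-based counter, the shape the next paragraph walks through. compare_exchange_weak is C++'s spelling of CAS; it may also fail spuriously on some hardware, which the retry loop absorbs. The function names here are illustrative.

#include <atomic>
#include <thread>

std::atomic<int> counter{0};

void increment_lock_free() {
    int observed = counter.load(std::memory_order_relaxed);
    // "If the value is still `observed`, replace it with observed + 1;
    //  otherwise put the current value into `observed` and try again."
    while (!counter.compare_exchange_weak(observed, observed + 1,
                                          std::memory_order_relaxed)) {
        // another thread won this round; retry with the value it left behind
    }
}

int main() {
    std::thread t1([] { for (int i = 0; i < 1'000'000; ++i) increment_lock_free(); });
    std::thread t2([] { for (int i = 0; i < 1'000'000; ++i) increment_lock_free(); });
    t1.join(); t2.join();
    // counter is exactly 2,000,000: no locks, no blocking, no lost updates
}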
Compare-and-swap is the foundation primitive of all modern multi-core synchronisation. The CPU provides it as a single atomic instruction (cmpxchg on x86, ldxr/stxr on ARM); higher-level languages expose it as std::atomic's compare_exchange_weak and compare_exchange_strong in C++, AtomicInteger.compareAndSet in Java, and the CompareAndSwap functions in Go's sync/atomic package. The lock-free counter shown reads the current value, computes the increment, and tries to atomically swap the new value in only if the old value is still there. If another thread got there first, the CAS fails and the loop retries with the new value. No locks, no blocking, no deadlock — but every operation can race-lose and retry. Lock-free queues, stacks, and hash tables are built on top of this primitive, with great difficulty.
Memory orderings · why the world is not sequentially consistent
The deeper subtlety is that on modern multi-core CPUs, the order
in which one core's writes become visible to another core is
not the order they were issued. Out-of-order execution
(Chapter 1 §04), store buffers, and cache coherence all conspire
to let writes appear reordered from another core's perspective.
A thread that writes x = 1; flag = true; may have
another thread on a different core observe flag = true
while still reading x = 0 — the writes were visible
out of order. Programs that depend on cross-thread ordering must
explicitly request guarantees by annotating atomic operations
with a memory ordering: relaxed (atomicity only, no
ordering guarantees), release (a store that publishes every
write this thread made before it), acquire (a load that, on
seeing a released value, makes those earlier writes visible
to this thread), or sequentially consistent (the strongest,
most expensive, easiest to reason about). Reasoning
about which ordering each operation needs is the deepest skill in
multi-threaded systems programming, and the one most often gotten
wrong.
Memory ordering is the second hard problem in lock-free programming, after CAS. Modern multi-core CPUs reorder reads and writes within a single core for performance — out-of-order execution, store buffers, prefetching — and these reorderings are visible across cores. Languages like C++, Rust, and Java let the programmer specify, per atomic operation, what ordering the operation needs to provide. Stricter orderings (sequentially consistent at the top) are easier to reason about but cost performance on weak-memory-model hardware (ARM, POWER, RISC-V); weaker orderings (relaxed) are faster but require careful argument. The mismatch between hardware models is itself a portability hazard: code that depends on x86's nearly-sequential-consistency can fail on ARM. Java and Go default to sequentially consistent atomics for safety; C++, Rust, and modern kernel code use the weaker orderings deliberately for performance, accepting the reasoning burden.
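A minimal sketch of the x/flag publication pattern from the paragraph above, with illustrative names. The release store pairs with the acquire load: once the reader observes ready as true, it is guaranteed to also see x == 1. Replace both orderings with memory_order_relaxed and that guarantee disappears on weakly ordered hardware.

#include <atomic>
#include <cassert>
#include <thread>

int x = 0;                        // plain data, not atomic
std::atomic<bool> ready{false};

void writer() {
    x = 1;                                            // 1. write the data
    ready.store(true, std::memory_order_release);     // 2. publish it
}

void reader() {
    while (!ready.load(std::memory_order_acquire)) { /* spin until published */ }
    assert(x == 1);               // acquire saw the release, so the write to x is visible
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join(); t2.join();
}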
The other shape of parallelism · thousands of small cores doing the same thing.
The CPU we have built — out-of-order execution, branch prediction, caches, all the machinery of Chapter 1 — is optimised for fast sequential execution: do one thing, do it as fast as possible, then move to the next. The GPU is the opposite. It has thousands of small cores running in lockstep, each executing the same instruction on different data. It evolved from the requirements of 3D graphics: every pixel on a million-pixel screen needs the same shading calculation applied to different inputs, and the natural way to do that is many small processors running the same code on different pixels. Around 2007, NVIDIA realised that this hardware was useful for any embarrassingly parallel computation, exposed it through the CUDA programming model, and watched a new computational substrate emerge. Today GPUs do most of the world's machine learning training and inference, most of its scientific simulation, all of its real-time graphics. They are a fundamentally different shape of computer, and software that uses them well looks very different from software that uses CPUs well.
A modern CPU spends most of its transistor budget on making one stream of instructions go fast: out-of-order execution, branch prediction, multi-megabyte caches, sophisticated SIMD units. Eight to sixteen of these expensive cores fit on a typical chip. A modern GPU spends its budget on doing the same thing to many things at once: thousands of small cores, very little branch-prediction logic, no out-of-order execution, much smaller caches. The two architectures are good at different problems. CPUs win on branchy, sequential, low-latency code (most application logic, operating system kernels, web servers). GPUs win on workloads that apply the same operation to a large array of data — graphics shading, neural network training and inference, scientific simulations, video transcoding. The deep-learning revolution of 2012 onward is, mechanically, the realisation that backpropagation through a neural network is one giant matrix multiply, and matrix multiplies are exactly what GPUs are built for.
SIMD and the CUDA hierarchy
The GPU's organising principle is SIMD — Single Instruction, Multiple Data. A SIMD instruction operates on a vector of values at once: add these 32 numbers to those 32 numbers in one instruction. CPUs have SIMD too (SSE, AVX on x86; NEON on ARM) — typically 128 or 256 bits wide, processing 4 or 8 values at once. GPUs take this much further: a CUDA warp is 32 threads running in lockstep, executing the same instruction on different data. Many warps are grouped into a block; many blocks make a grid; the entire grid is dispatched to the GPU as a single kernel launch. The programmer writes code that looks like it runs on a single thread; the runtime replicates it across the grid.
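A minimal sketch of CPU-side SIMD, assuming an x86-64 machine with AVX and a compiler invoked with something like -mavx: one _mm256_add_ps instruction performs eight float additions at once, the same single-instruction-multiple-data idea the GPU scales up to warps of 32.

#include <immintrin.h>
#include <cstdio>

int main() {
    alignas(32) float a[8]   = {0, 1, 2, 3, 4, 5, 6, 7};
    alignas(32) float b[8]   = {10, 10, 10, 10, 10, 10, 10, 10};
    alignas(32) float out[8];

    __m256 va = _mm256_load_ps(a);       // load 8 floats into one 256-bit register
    __m256 vb = _mm256_load_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);   // one instruction, eight additions
    _mm256_store_ps(out, vc);

    for (float f : out) std::printf("%.0f ", f);   // 10 11 12 13 14 15 16 17
    std::printf("\n");
}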
CUDA's four-level hierarchy maps directly onto the GPU's silicon. A thread is the smallest execution unit; threads are bundled into warps of 32 that execute in lockstep (any divergent branches in a warp serialise both paths — a major performance pitfall). Warps are grouped into blocks that share a fast scratchpad memory (~100 KB, called shared memory) and can synchronise with each other; blocks make up the grid that constitutes one kernel launch. The memory hierarchy is steep: registers are essentially free; shared memory is fast but small and per-block; L2 is grid-wide but moderate latency; global memory (HBM, High-Bandwidth Memory) is enormous but slow on a per-access basis (though with extremely high bandwidth — terabytes per second on modern GPUs). The art of writing fast CUDA kernels is keeping data in shared memory as long as possible and arranging global memory accesses so that an entire warp's threads access contiguous addresses (called coalesced access) — uncoalesced access is often the difference between fast and slow code.
The mathematics of how much speedup more cores can possibly give you.
Adding cores to a computation is not free, and it is not unbounded. Two laws govern what you can expect. Amdahl's law (Gene Amdahl, 1967) says that if a fraction p of a program is parallelisable and the rest is sequential, then the maximum speedup from N cores is 1 / ((1−p) + p/N). As N grows, the speedup approaches 1/(1−p) as a hard ceiling — even with infinite cores, a program whose sequential portion is 5% can never exceed a 20× speedup. The ceiling is brutal in practice: a program with 90% parallel code maxes out at 10×; with 95% parallel, 20×. Adding cores beyond the point where the parallel portion is fully exploited buys you nothing.
If a fraction p of a program parallelises perfectly and the remaining 1−p is strictly sequential, the speedup on N cores is:
S(N) = 1 / ((1 − p) + p / N)
The p/N term shrinks toward zero as you add cores; the (1−p) term does not move. So S(N) → 1 / (1 − p) as N → ∞. A program that is 95% parallel cannot exceed 20× speedup no matter how much hardware you buy. Gustafson's law (1988) inverts the question: if you grow the problem with the cores, the parallel portion grows while the sequential portion stays constant, and effective speedup becomes S(N) = N − (1 − p)·(N − 1) — nearly linear in N. Both laws are right; they describe different scenarios.
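A quick numerical sketch of both formulas, with p values chosen to match the text. amdahl() is the fixed-problem speedup S(N); gustafson() is the scaled-problem speedup; the printed ceiling is 1/(1−p).

#include <cstdio>

double amdahl(double p, double n)    { return 1.0 / ((1.0 - p) + p / n); }
double gustafson(double p, double n) { return n - (1.0 - p) * (n - 1.0); }

int main() {
    for (double p : {0.50, 0.90, 0.95, 0.99}) {
        std::printf("p = %.2f  ceiling = %6.1fx ", p, 1.0 / (1.0 - p));
        for (double n : {8.0, 64.0, 1024.0})
            std::printf("  S(%4.0f) = %6.2f", n, amdahl(p, n));
        std::printf("\n");
    }
    // Gustafson: grow the problem with the cores and the speedup stays near-linear.
    std::printf("Gustafson, p = 0.95: S(64) = %.1f\n", gustafson(0.95, 64));
}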
"The effort expended on achieving high parallel performance is wasted unless it is accompanied by achievements in sequential performance of very nearly the same magnitude."
— Gene Amdahl, "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities," AFIPS, 1967

Amdahl's law plotted for four different parallel fractions. A program with 90% parallel code (p=0.90) plateaus at 10× speedup no matter how many cores you throw at it; the sequential 10% becomes the dominant cost as N grows. With 95% parallel, the ceiling is 20×; with 99%, 100×; with 50%, only 2×. The implication is severe and counterintuitive: getting from 90% to 95% parallel is harder engineering than going from 8 cores to 64 cores. Most working programs hover around 80–95% parallel — which is why a 64-core machine very rarely runs an actual program 64× faster than a 1-core machine. Amdahl's law explains why CPUs have stopped at "tens of cores" for desktop and laptop loads while supercomputers and GPUs (where workloads are 99%+ parallel) keep adding more.
Gustafson's law (1988) appears to contradict Amdahl's, but actually answers a different question. Amdahl asks: "I have a fixed problem. How much faster can I make it with more cores?" Gustafson asks: "I have more cores. How much bigger a problem can I tackle in the same time?" If problem size grows with the available compute — more particles in a physics simulation, more pixels in a rendering, more samples in a Monte Carlo — the sequential portion stays roughly constant while the parallel portion scales, and effective speedup grows nearly linearly with cores. This is why GPU-driven workloads (training a bigger neural network on more GPUs) and HPC workloads (simulating a finer-grained physical model on more nodes) really do scale to thousands of cores: the problem grew with the resources.
Amdahl and Gustafson asked complementary questions and got opposite-looking answers. Amdahl: "I have a one-hour computation. Will doubling the cores let me run it in 30 minutes?" — usually no, because the sequential portion limits speedup. Gustafson: "I have a one-hour budget. Will doubling the cores let me run a problem twice as big?" — usually yes, because the parallel portion grows while the sequential portion stays roughly constant. Both are right; both apply; the choice of which to invoke depends on whether your problem size is fixed or scaling with hardware. The deep-learning revolution is a Gustafson story: bigger models trained on more GPUs in roughly the same wall-clock time. Real-time graphics is an Amdahl story: render the next frame in 16 ms, no matter how many cores you have.
Memory is not flat any more · cache coherence is the only reason it pretends to be.
Two more facts about modern multi-core machines deserve naming. First: memory is not uniformly accessible. On a server with two CPU sockets, each socket has its own local DRAM, and accessing memory attached to your own socket is dramatically faster than accessing memory attached to the other socket. This is NUMA — Non-Uniform Memory Access — and a thread that runs on socket 0 but accesses memory allocated on socket 1 pays a 50–100% latency penalty per access. Operating systems and runtimes try to keep threads near their data automatically; when they fail, performance silently halves. Second: caches must stay coherent across cores even though each core has its own L1 and L2. The mechanism is a hardware protocol called MESI — Modified, Exclusive, Shared, Invalid — that tracks the state of every cache line on every core and ensures that no two cores have conflicting writes to the same line.
A two-socket server is two computers sharing a coherence protocol. Each socket has its own DRAM controllers and its own local memory; accessing the other socket's memory traverses the inter-socket interconnect (Intel calls it UPI, AMD calls it Infinity Fabric) and pays substantially more latency. The OS scheduler tries to keep threads on the same socket as their data; allocators try to put new memory on the requesting thread's socket; runtimes (Java, Go, .NET) have NUMA-aware schedulers. When all of this works, you don't notice; when it fails, your program runs at half speed and looks otherwise correct. NUMA is also the architecture inside high-performance GPUs (each GPU is its own NUMA node) and across multi-GPU setups (NVLink interconnects between GPUs). The pattern repeats: parallelism brings speed, distance brings latency, and good engineering keeps the two aligned.
MESI is the protocol that lets every core have its own cache while pretending memory is globally consistent. Every cache line is in one of four states; every read or write is mediated by bus transactions that update the states across all cores. The protocol is invisible to software but pervasive in performance: when two threads on different cores write the same cache line repeatedly, the line "ping-pongs" between caches with each write, costing a hundred or more cycles per access. This is why false sharing — two unrelated variables that happen to share a 64-byte cache line — can silently halve a program's throughput, and why high-performance code pads structures to cache-line boundaries. The protocol predates multi-core CPUs (it was developed for cache-coherent multiprocessor systems in the 1980s) and survives essentially unchanged because the alternatives are worse. Variants exist (MOESI, MESIF — adding "owned" or "forward" states for optimisations) but the four-state core is universal.
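A minimal sketch of the false-sharing effect and the padding fix described above. The two threads never touch the same variable; in the unpadded struct the two counters are likely to share one 64-byte line and ping-pong between caches, while alignas(64) gives each its own line. The exact numbers are machine-dependent.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct Unpadded { std::atomic<long> a{0}, b{0}; };          // likely the same cache line
struct Padded   { alignas(64) std::atomic<long> a{0};
                  alignas(64) std::atomic<long> b{0}; };    // one line each

template <class Counters>
double hammer_ms(Counters& c) {
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (int i = 0; i < 20'000'000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (int i = 0; i < 20'000'000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join(); t2.join();
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - start).count();
}

int main() {
    Unpadded u;
    Padded   p;
    std::printf("sharing a cache line: %7.1f ms\n", hammer_ms(u));
    std::printf("padded to own lines:  %7.1f ms\n", hammer_ms(p));
}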
The cache as a side channel. In January 2018, Spectre and Meltdown showed that the speculative-execution machinery of every modern CPU since 1995 — the same machinery that gave us the speed described in Chapter 1 — could be coerced into reading memory across security boundaries. The trick: speculative loads pulled secret data into the cache; even though the speculation was rolled back at the architectural level, the cache state survived, and an attacker measuring access timings could read out the secret bit by bit. The fix required microcode patches, kernel page-table isolation (KPTI), and a permanent ~5–30% performance regression on syscall-heavy workloads. Cache coherence is not just a performance mechanism; it is also the most expensive security boundary in the machine.
Closing the chapter · seam to Chapter 17
Parallelism scales the single computer up: from one core to sixty-four, from one socket to two, from CPU to GPU. It runs against three constraints we have just named — Amdahl's ceiling on speedup, NUMA's penalty for non-local memory, and MESI's cache-line-ping-pong — and the discipline of high-performance code is the discipline of designing around all three. Above the single computer is the next category: many computers, not sharing memory, communicating over a network, having to reach agreement despite the network being unreliable and the computers being individually fallible. That is the problem of Chapter 17. The mathematics that governs it — distributed consensus, the CAP theorem — is one of the most consequential bodies of work in computer science.