Part One

The Physical World

Before software, there is sand. Before logic, there is voltage. Before any program runs, a billion tiny switches must already agree on what "true" and "false" mean — and one program, the kernel, must already be standing watch. This is where the story has to begin — in the place most books skip.

CHAPTER 01
The Machine Beneath Everything
Transistors, Von Neumann's insight, the CPU, and why the kernel had to be invented.
CHAPTER 02
The Algebra of Switches
Boolean logic, binary numbers, how arithmetic emerges from electricity, and IEEE 754.
CHAPTER 03
The Language the CPU Speaks
Assembly, registers, the stack, calling conventions, and how buffer overflows actually work.
BRIDGE
The Boundary
The hardware face of the kernel. Privilege rings, the trap mechanism, the MMU, the timer, MMIO, and DMA — what the silicon must provide before any kernel can exist.
Chapter 01

The Machine Beneath Everything

Before there was software, there was sand. Before there was the internet, there was a mathematician who died before seeing what he built. This is the story of computer architecture — told in the order civilization needed it.

Topics: CPU · Memory · Kernel · OS
Era covered: 1936 → 1974
[Chapter 01 hero · The Machine Beneath Everything — CPU core (ALU · control unit · registers · L1 cache) joined by data, address, and control buses to RAM (main memory), I/O devices, disk storage, and the kernel (OS core).]
01 — Context

Why does any of this exist?

To understand a modern computer, you have to understand that it was never inevitable. For most of human history, calculation was something humans did. The word computer itself originally referred to a person — usually a woman — employed to perform arithmetic by hand. The machine you are reading this on replaced an entire profession.

The transition from human computer to machine computer required a particular kind of insight: that thinking itself could be mechanized. Not all thinking — Alan Turing was careful about this — but a specific kind. The kind that follows rules. The kind that, given the same input, always produces the same output. Mathematics, in other words. And once you have a machine that can do mathematics, you have a machine that can do anything mathematics can describe. And mathematics, it turns out, can describe a remarkable amount of the world.

The 1936 paper that changed everything

In 1936, a 24-year-old Cambridge mathematician named Alan Turing published a paper titled On Computable Numbers, with an Application to the Entscheidungsproblem. It had nothing to do with machines, ostensibly. It was about a question David Hilbert had posed in 1928: is there a mechanical procedure that, given any mathematical statement, will decide whether the statement is true or false?

Turing's answer was no — but the way he proved it was extraordinary. To show that no such procedure could exist, he had to first define what "mechanical procedure" meant precisely. He invented an imaginary device: an infinite tape divided into cells, a read/write head that could move along the tape, and a finite set of rules that said what to do based on the current symbol and the machine's current state.

He called it an a-machine — automatic machine. We now call it a Turing machine. And he proved something that took decades to fully appreciate: a single, sufficiently complex Turing machine could simulate any other Turing machine, given the right rules on its tape. He called this a universal machine. It is the first formal description of what we now call a computer.

Fig 1.1 — A Turing machine, in motion
[Tape: 1 0 1 1 0 1 1 0 _ · head in state q3. Rules: (q3, 0) → write 1, move right, enter q4 · (q4, 1) → write 0, move right, enter q5 · (q5, _) → write 1, halt and accept.]

A Turing machine has three parts: an infinite tape of symbols, a head that reads and writes one cell at a time, and a finite table of rules. Each rule says: given the current state and the symbol under the head, write a new symbol, move left or right, and change to a new state. Watch the head crawl along the tape, rewrite cells, and hand off to the next state — three transitions on a nine-second loop. Anything that can be computed, can be computed by a machine like this.
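
To make the rule table concrete, here is a minimal simulator in C, hard-coding the three rules from Fig 1.1 onto a tiny tape. Everything here (the rule struct, the three-cell tape window, the state numbering) is invented for this sketch; real formalisms differ only in bookkeeping.

C · a three-rule Turing machine
#include <stdio.h>

/* One rule: in `state`, reading `read`, write `write`, move the head
   (+1 = right), and enter `next` (-1 means halt and accept). */
struct rule { int state; char read; char write; int move; int next; };

/* The three transitions from Fig 1.1. */
static const struct rule rules[] = {
    {3, '0', '1', +1,  4},
    {4, '1', '0', +1,  5},
    {5, '_', '1', +1, -1},
};

int main(void) {
    char tape[] = "01_";        /* a finite window of the infinite tape */
    int head = 0, state = 3;    /* start in q3, head on the first cell  */

    while (state != -1) {
        const struct rule *r = 0;
        for (unsigned i = 0; i < sizeof rules / sizeof rules[0]; i++)
            if (rules[i].state == state && rules[i].read == tape[head])
                r = &rules[i];
        if (!r) break;              /* no rule applies: machine rejects */
        tape[head] = r->write;      /* write the new symbol             */
        head += r->move;            /* move the head                    */
        state = r->next;            /* change state                     */
        if (state != -1)
            printf("tape: %s   state: q%d\n", tape, state);
    }
    printf("halted · final tape: %s\n", tape);   /* prints 101 */
    return 0;
}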

"We can only see a short distance ahead, but we can see plenty there that needs to be done."

— Alan Turing, 1950

The war that built the first computers

Theory met urgency in 1939. Britain needed to break German naval codes — the Enigma cipher — faster than humans could. At Bletchley Park, Turing built the Bombe, an electromechanical machine that searched the Enigma key space mechanically. Later, the engineer Tommy Flowers built Colossus, the first programmable electronic digital computer, used to break the higher-level Lorenz cipher used by Hitler's high command. These machines were not Turing machines in his theoretical sense — they were special-purpose. But they proved that mechanical computation worked at scale.

Fig 1.2 — Enigma · why it had to be broken by machine
[Enigma, the cipher that built the computer: 3 rotors × 26 positions each × plugboard × daily setting ≈ 158 quintillion possible keys. Each rotor steps after every key press, so the same key gives a different output: press 1, A → R · press 2, A → Y · press 3, A → F. The Bombe, Turing's electromechanical brute-force at Bletchley Park: ~210 Bombes ran 24/7 by 1945, ruling out millions of rotor settings per day using guessed plaintext "cribs".]

Why Enigma was so hard: the same key pressed three times in a row gives three different letters because the rotors step between presses, changing the internal wiring path. Multiplied across the rotor stack, plugboard pairings, and daily reflector settings, the keyspace was about 158 × 10¹⁸ possibilities. No pencil-and-paper attack could keep up. Turing's Bombe — the gold-tinted strip below the rotors — automated the search by exploiting structural weaknesses in the encryption: it tested rotor positions in parallel, stopping when the math became consistent. It was the first time a machine was built specifically to think through a problem human minds could not.

Historians have estimated that the Bombe and Colossus shortened the war by as much as two years, saving perhaps fourteen million lives. The machines worked. The question, after the war, became: how do you build one that isn't purpose-built for a single problem? How do you build a universal machine — Turing's theoretical idea, in actual hardware?

Historical note. Turing was prosecuted for homosexuality in 1952, sentenced to chemical castration by court order, and died in 1954 — officially by cyanide poisoning, likely suicide. He never saw the computer revolution he made possible. Britain issued a formal apology in 2009. He now appears on the £50 note.

02 — The Transistor

Everything is sand

The first general-purpose electronic computer, ENIAC (1945), was built from vacuum tubes — glass bulbs that controlled electrical current using a heated filament inside a vacuum. ENIAC had 17,468 of them. It filled a room thirty meters long, weighed thirty tons, consumed 150 kilowatts of power (by popular legend, enough to dim the lights of Philadelphia when it switched on), and broke down on average every two days because a tube would burn out. Its mean time between failures was measured in hours, not years.

This was unsustainable. To make computers smaller, faster, more reliable, and affordable, the vacuum tube had to be replaced. The replacement was invented in 1947 at Bell Labs in New Jersey by three physicists — William Shockley, John Bardeen, and Walter Brattain. They called it the transistor, a contraction of transfer resistor.

Fig 1.3 — The replacement that made everything possible
[Vacuum tube, 1906: plate, grid, heated filament · ~10 cm tall · ~5 W of heat · fragile glass. Replaced 41 years later by the transistor, 1947: emitter, base, collector on a silicon die · ~5 mm · solid state · ~5 mW · works for decades. Roughly 1,000× smaller, 1,000× less power, effectively unbreakable.]

Both devices are amplifiers and switches — both can take a small input signal and use it to control a much larger one. The vacuum tube does it by boiling electrons off a heated filament inside an evacuated glass bulb, then steering them toward a positively charged plate using a control grid in between. The transistor does the same job by gating electrons through a sliver of doped silicon — no glass, no filament, no heat, no wear. Once you can build one, you can build a billion of them; once you can build a billion, you have a CPU.

What a transistor actually is

A transistor is a switch. Specifically, it is a switch with no moving parts, controlled by electricity rather than by a finger or a relay arm. It has three terminals: a base (or gate, in the modern field-effect design), a collector (or drain), and an emitter (or source). When you apply a small voltage to the base, it lets a much larger current flow between collector and emitter. No voltage at the base, no flow.

That's it. That is the entire foundational mechanism of every computer ever built. Everything else — every program, every webpage, every video game, every neural network — is layers of abstraction built on top of switches turning on and off.

Fig 1.4 — Transistor as a switch
[Base at 0 V (no signal): switch OPEN, no current flows. Base at 0.7 V (signal): switch CLOSED, current flows.]

A transistor with no voltage at the base is open: no current flows from collector to emitter. With a small voltage at the base, the channel opens and current flows — the green dots above are electrons, drifting through the channel as long as the gate signal is present. A modern CPU contains roughly fifty billion of these switching at gigahertz speeds.

Why semiconductors

The transistor works because it is built from a semiconductor — a material whose conductivity sits between a conductor (like copper) and an insulator (like glass), and which can be precisely tuned by adding tiny amounts of impurities. Almost all modern transistors are built from silicon, which is the second most abundant element in Earth's crust. Sand is mostly silicon dioxide. The entire digital economy is, in a literal sense, built on purified sand.

Fig 1.5 — Inside the switch · isometric cross-section
[Isometric cross-section: P-type silicon substrate · n⁺ source and drain regions · gate oxide (SiO₂) · poly-silicon gate · an inversion channel forms under the oxide when V_gate rises from 0 V to 0.7 V. Scale: a modern gate length is ~3 nanometres, roughly 15 silicon atoms wide.]

Underneath the schematic symbol is a stack of materials. The substrate is silicon doped with electron-acceptor atoms (P-type). Two regions on either side are doped with electron donors (N⁺) — these are the source and drain. Between them, an insulating layer of silicon dioxide separates the substrate from a conducting gate. With no voltage on the gate, no path exists between source and drain. Apply a small positive voltage and the substrate just under the oxide inverts — a thin layer of free electrons forms a conducting channel, and current flows. This effect — the field effect — is why these are called field-effect transistors, or MOSFETs (Metal-Oxide-Semiconductor Field-Effect Transistors). Every modern CPU is built from billions of these, with gate lengths of only a few nanometres.

In 1958, Jack Kilby at Texas Instruments etched multiple transistors onto a single piece of germanium, creating the first integrated circuit. Within months, Robert Noyce at Fairchild Semiconductor independently did the same on silicon, with a more manufacturable process. Both filed patents. Both were right. Noyce's process became the basis of the entire industry.

In 1965, Gordon Moore — co-founder of Intel and a former colleague of Noyce — published an observation: the number of transistors that could fit on a single integrated circuit was doubling roughly every twelve months, and would likely continue to do so. He later revised this to every two years. This pattern, Moore's Law, held with remarkable accuracy for fifty years and drove the entire semiconductor industry's roadmap. It is why your phone, with perhaps fifteen billion transistors, exists at all.

Fig 1.6 — Moore's Law · transistors per chip, 1971—2024
[Log scale, doubling every ~24 months: 4004 · 2,300 (1971) → 8086 · 29 K → 386 · 275 K → Pentium · 3.1 M → P4 · 42 M → i7 · 731 M → Epyc · 19.2 B → M4 · 28 B (2024). Each step on the y-axis is a 10× jump; a straight line on this chart means exponential growth in reality. From 2,300 transistors to 28 billion in 53 years: twelve million times more, on chips that fit on a fingernail.]

Moore's Law plotted on a logarithmic y-axis. The points fall almost on a straight line, which is what exponential growth looks like in log space — each gridline is ten times the previous. The slope corresponds to a doubling roughly every two years. The pattern began to slow after 2015 as transistor features shrank toward atomic scales — a modern gate is only tens of atoms across — but it has lasted longer than nearly any technological prediction in history.

1947
The transistor invented at Bell Labs

Shockley, Bardeen, and Brattain demonstrate the first working transistor on December 23, 1947. Nobel Prize 1956. The replacement of vacuum tubes makes miniaturization possible.

1958
First integrated circuit

Kilby (TI) and Noyce (Fairchild) independently put multiple transistors on one chip. Kilby uses germanium, Noyce uses silicon. Silicon wins because it forms a superior native oxide insulator.

1965
Moore's Law stated

Gordon Moore predicts transistor density doubles every year (later revised to two years). The prediction becomes a self-fulfilling industrial roadmap.

1971
Intel 4004 — first commercial microprocessor

2,300 transistors on a 10μm process. 740 kHz clock. Originally designed for a Japanese calculator manufacturer. Marked the transition from "computers fill rooms" to "computers fit in pockets."

2024
Apple M4 chip

28 billion transistors. 3-nanometer process. Each transistor smaller than the width of a flu virus. Moore's Law is slowing — physical limits are being reached — but it has lasted over half a century.

Fig 1.7 — The same idea, 79 years apart
[ENIAC, 1945: ~30 metres long · 17,468 vacuum tubes · 30 tons · 150 kW · fails every two days. 79 years later, the Apple M4, 2024: ~13 mm, drawn beside a fingertip for scale · 28,000,000,000 transistors · 3 g · 8 W · runs for years uninterrupted.]

Both machines compute. ENIAC filled a thirty-metre hall; the M4 fits under your thumbnail and contains roughly twelve million times as many switches. Both fundamentally do the same thing — flip switches according to instructions — but the engineering distance between them is the entire history of the integrated circuit. The figures and silhouettes above are drawn at the same on-screen size, but the actual size ratio is closer to 2,000 to 1.

03 — The Architecture

Von Neumann's insight: instructions are data

Early computers were hardwired. ENIAC, to be reprogrammed, had to be physically rewired — operators (almost all women, called "the ENIAC girls") would unplug and replug cables for days to set up a new calculation. The hardware encoded the program. To change what the machine did, you changed the machine.

In 1945, John von Neumann — a Hungarian-American polymath who had worked on the Manhattan Project and had a hand in nearly every important mathematical development of the mid-20th century — circulated a paper called the First Draft of a Report on the EDVAC. It described a different kind of architecture, one that would become the dominant design for every computer built afterward.

His insight, distilled: store the program itself in memory, alongside the data it operates on. Instructions are just patterns of bits. Data is just patterns of bits. They can live in the same memory. The CPU reads instructions from memory exactly the same way it reads data — which means a program can read another program, modify it, and then run it. A machine can write programs.

This is not a small idea. It is the difference between a machine that does one thing very well and a machine that, given the right instructions, can do anything. The compiler that turns C code into machine code is a program. The web browser running on your computer is a program. The kernel that runs the browser is a program. They all live as data in memory until the CPU reads them as instructions.

Fig 1.8 — Von Neumann architecture, in motion
[One shared memory holds instructions and data in a single address space. Inside the CPU: the ALU (+ − × ÷, AND OR XOR NOT), the control unit (fetch, decode/execute), registers (PC, SP, AX, BX, CX, DX, …), and the L1 cache, joined by an internal bus. Beyond the bus, the outside world: I/O peripherals · keyboard · display · network · disk/SSD · GPU.]

Von Neumann architecture in its essence: one shared memory holds both program instructions and data. The CPU fetches an instruction (gold packet flowing right on the bus), decodes it, executes it on the ALU, and writes results back to memory or registers (blue packet flowing left). Then it fetches the next instruction. Nearly every computer built since 1945 follows this design — and the diagram above plays out one cycle of it on a continuous loop.

The Von Neumann bottleneck

The architecture has one famous weakness, and it is a consequence of its core design choice. Because instructions and data share the same memory, they share the same bus — the same physical wires between the CPU and main memory. The CPU can only do one transfer at a time. And while CPUs got dramatically faster over the decades (clock speeds rose from kilohertz to gigahertz), main memory became faster much more slowly. Today, a modern CPU can perform a basic operation in under a nanosecond. A round trip to main RAM takes about a hundred nanoseconds. The CPU spends most of its time waiting.

The solution is a hierarchy of caches — small, fast memory built directly into the CPU that holds copies of recently used data. We'll cover this in detail later in this chapter. For now, it's enough to know that this bottleneck is one of the deepest design constraints of modern computing, and essentially every CPU optimization since the 1980s has been an attempt to work around it.

04 — The CPU

Fetch. Decode. Execute. Repeat.

Every CPU ever built does the same four things in a loop, billions of times per second. This loop is the heartbeat of every program you have ever run.

The instruction cycle of fetch, decode, execute, and writeback is the fundamental unit of computation at the hardware level.

Fig 1.9 — The instruction cycle, in motion
[The PC advances every cycle (0x4012 → 0x4016). Step 01 Fetch: read the instruction at memory[PC]. Step 02 Decode: opcode + operands say what to do. Step 03 Execute: the ALU performs the op · add · jump · load. Step 04 Writeback: store the result, advance the PC. The cycle restarts and loops forever.]

A modern CPU executes billions of these cycles per second on each of its cores. Watch the gold packet circulate above: it represents one instruction making its way through the four stages, then the program counter advances and the next instruction begins. Everything else — your operating system, your browser, this page — is a particular sequence of instructions threaded through this loop.
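
The loop is simple enough to sketch in C. The miniature machine below is invented for this page (four made-up opcodes, a 32-word memory), but its while loop is literally fetch, decode, execute, writeback, and its program and data share one array, exactly as Von Neumann prescribed.

C · a toy fetch-decode-execute machine
#include <stdio.h>

enum { HALT = 0, LOAD = 1, ADD = 2, STORE = 3 };   /* invented opcodes */

int main(void) {
    /* One memory for both program and data — the stored-program idea. */
    int mem[32] = {
        LOAD,  16,    /* acc = mem[16]   */
        ADD,   17,    /* acc += mem[17]  */
        STORE, 18,    /* mem[18] = acc   */
        HALT,  0,
    };
    mem[16] = 5; mem[17] = 3;         /* data lives beside the code */

    int pc = 0, acc = 0, running = 1;
    while (running) {
        int op  = mem[pc];            /* FETCH the opcode at memory[PC] */
        int arg = mem[pc + 1];        /*   ...and its operand           */
        pc += 2;                      /* advance the program counter    */
        switch (op) {                 /* DECODE, then EXECUTE           */
        case LOAD:  acc = mem[arg];   break;
        case ADD:   acc += mem[arg];  break;
        case STORE: mem[arg] = acc;   break;   /* WRITEBACK to memory   */
        case HALT:  running = 0;      break;
        }
    }
    printf("mem[18] = %d\n", mem[18]);   /* prints 8 */
    return 0;
}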

What an instruction actually is

At the hardware level, every instruction is simply a binary number — a specific pattern of ones and zeros. The CPU's control unit is a circuit that, when given a particular pattern, activates a particular sequence of internal signals. Different patterns trigger different operations. The mapping from binary patterns to operations is called the instruction set architecture, or ISA. Intel and AMD CPUs use the x86 ISA. Apple Silicon, your phone, the Raspberry Pi, and every modern Mac use ARM. They are mutually incompatible — code compiled for one will not run on the other unless translated.

Assembly language is just a human-readable label for these patterns. The mnemonic mov eax, 5 is the assembler's name for the binary instruction 10111000 00000101 00000000 00000000 00000000 on x86. They mean exactly the same thing — assembly is a one-to-one translation. We will spend all of Chapter 3 on this.

x86 assembly
; A simple addition program
; This is roughly what C's "int x = 5 + 3;" compiles to

mov  eax, 5        ; load the value 5 into register EAX
mov  ebx, 3        ; load the value 3 into register EBX
add  eax, ebx      ; EAX = EAX + EBX, result is now 8

; In memory, "add eax, ebx" is just two bytes: 01 D8
; The CPU's decoder maps this pattern to "ALU add" with
; source EBX and destination EAX. Then the ALU performs it.
Fig 1.10 — From assembly to binary · one instruction decoded
[What you write: mov eax, 5. What the CPU sees, five bytes in memory: B8 05 00 00 00. The opcode byte B8 (10111000) splits into two fields: a 5-bit opcode prefix ("MOV r32, imm32") and a 3-bit register selector (000 = EAX). The four immediate bytes 05 00 00 00 reverse into one 32-bit integer, 5, stored low-byte first; the CPU reassembles them at decode. "Take the integer 5 and put it into register EAX." Five bytes. One instruction. Roughly 0.3 nanoseconds.]

A single x86 instruction in memory, taken apart byte-by-byte. The first byte (gold) is the opcode — but it isn't atomic; its top five bits encode the operation type ("move 32-bit immediate into register"), and its bottom three bits encode which register (000 = EAX). The next four bytes (blue) are the literal value 5, stored in little-endian order — the lowest byte first, which is why 05 appears closest to the opcode and the zero-padding follows. The CPU's instruction decoder is a circuit that recognises these bit patterns and routes them to the appropriate units in microscopic time.
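
What the decoder does can be imitated in a few lines of C. The sketch below handles only this one instruction family; the real x86 decoder deals with prefixes, ModRM bytes, and hundreds of opcodes. But the bit-slicing of B8 and the little-endian reassembly are exactly as described above.

C · decoding mov eax, 5 by hand
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* the five bytes of "mov eax, 5", exactly as they sit in memory */
    uint8_t inst[5] = {0xB8, 0x05, 0x00, 0x00, 0x00};

    uint8_t opcode = inst[0] & 0xF8;   /* top five bits: the operation */
    uint8_t reg    = inst[0] & 0x07;   /* bottom three: which register */

    /* reassemble the little-endian immediate into one 32-bit value */
    uint32_t imm = (uint32_t)inst[1]
                 | (uint32_t)inst[2] << 8
                 | (uint32_t)inst[3] << 16
                 | (uint32_t)inst[4] << 24;

    if (opcode == 0xB8)                /* 10111xxx = MOV r32, imm32 */
        printf("mov r%u, %lu\n", (unsigned)reg, (unsigned long)imm);
    return 0;                          /* prints: mov r0, 5 */
}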

Pipelining and parallelism inside one core

A naive CPU would do each step of the instruction cycle one at a time, finishing one instruction completely before starting the next. Modern CPUs do not. They use pipelining — overlapping the stages of different instructions, like an assembly line. While instruction 3 is being executed, instruction 4 is being decoded, and instruction 5 is being fetched. A modern pipeline may have 14 to 20 stages.

Fig 1.11 — Pipelined execution · five instructions, five stages
[Five instructions, five stages · IF fetch · ID decode · EX execute · MEM memory access · WB writeback · each instruction staggered one cycle behind the last, across nine cycles. Without pipelining: 5 instructions × 5 cycles each = 25 cycles. With pipelining: 9 cycles. Throughput: ~1 instruction per cycle.]

Each instruction still takes five cycles to fully complete — but the CPU works on five instructions simultaneously, each in a different stage. The diagonal pattern is the signature of a pipeline: a new instruction enters the IF stage every cycle, and a finished instruction leaves the WB stage every cycle. Modern CPUs may have 14 to 20 pipeline stages, multiple of each kind, and execute several instructions in parallel per cycle. The cursor above sweeps across one cycle at a time so you can see what's happening inside the chip at each tick.

They also use out-of-order execution — if instruction 5 doesn't depend on instruction 4's result, the CPU may execute 5 first while 4 waits for memory. And speculative execution: if there's a branch (an if statement), the CPU guesses which way it will go and starts executing that path before the condition has been computed. If it guesses right, it saves time. If it guesses wrong, it discards the work. Modern CPUs guess right about 95% of the time.
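
Branch prediction is one of the few microarchitectural effects you can observe from ordinary code. The classic demonstration below sums the same data twice, once shuffled and once sorted: the instruction counts are identical, but after sorting the branch becomes predictable. It is a sketch; exact numbers vary by machine, and an optimizing compiler may replace the branch with a branchless select and erase the difference.

C · making the branch predictor visible
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)

/* Sum only the elements >= 128; the `if` is the branch being guessed. */
static long sum_big(const int *a) {
    long s = 0;
    for (int i = 0; i < N; i++)
        if (a[i] >= 128)
            s += a[i];
    return s;
}

static int cmp(const void *x, const void *y) {
    return *(const int *)x - *(const int *)y;
}

int main(void) {
    static int a[N];
    for (int i = 0; i < N; i++) a[i] = rand() % 256;

    clock_t t0 = clock();
    long s1 = sum_big(a);             /* random data: ~50% mispredicts    */
    clock_t t1 = clock();

    qsort(a, N, sizeof a[0], cmp);    /* now the branch is two long runs  */

    clock_t t2 = clock();
    long s2 = sum_big(a);             /* sorted data: near-perfect guesses */
    clock_t t3 = clock();

    printf("unsorted: sum=%ld  %.0f ms\n",
           s1, 1e3 * (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("sorted:   sum=%ld  %.0f ms\n",
           s2, 1e3 * (double)(t3 - t2) / CLOCKS_PER_SEC);
    return 0;
}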

Fig 1.12 — Speculative execution · the CPU guesses, then checks
[Every branch is a fork in time: at a conditional (jne loop_start) whose condition is not yet evaluated, the CPU starts down the predicted path (add eax, 1 · load [ebx], ecx · cmp ecx, 0), ignoring the alternate path (mov eax, 0 · jmp exit) unless the guess was wrong. ✓ guess correct (~95%): commit. ✗ guess wrong (~5%): discard. But even discarded speculative work leaves traces in the cache · Spectre and Meltdown exploit this.]

When the CPU reaches a conditional branch, it can't afford to wait for the condition to be computed — that would idle the pipeline for many cycles. So it guesses, based on the branch's history, and starts executing one path immediately. If the guess turns out right, the speculative work is committed and time was saved. If wrong, the work is discarded and the alternate path begins. Modern CPUs guess right about 95% of the time. The catch — exposed in 2018 — is that "discarded" only means the architectural state is rolled back. Microarchitectural side effects, especially in caches, persist. Memory read by code that "shouldn't have run" leaves a fingerprint in the cache, and a careful attacker can measure that fingerprint.

🔬

Spectre and Meltdown (2018). Two catastrophic CPU vulnerabilities discovered in nearly every processor made since 1995. They exploited speculative execution: when the CPU guessed wrong and discarded the work, traces of that work remained in the cache — traces that an attacker could measure to read memory they should not have access to. Hardware has bugs too, and they are far harder to fix than software bugs. We will return to this in Chapter 15.

05 — The Kernel

Why the kernel had to be invented

In the early 1950s, running a program meant booking the entire computer for yourself. You brought your stack of punched cards to the machine room, the operator loaded them, the machine ran your program, printed the output, and you came back an hour later to read it. One program at a time. No sharing.

Fig 1.13 — Running a program in 1955 · one job, one user, one machine
[Punch-card era, batch processing: your program on cards (~80 columns × 12 rows, one card ≈ one line) → submitted to the operator → the mainframe runs it through the card reader, CPU 100% on your job, one user at a time → your printout, about an hour later. If your program had a bug, you wouldn't know for an hour. Then you'd punch new cards and queue again.]

For nearly two decades, this was the entire user experience of computing. The machine was a precious shared resource; the user — a programmer, scientist, or engineer — was a supplicant who handed over a stack of cards and waited. The CPU often sat idle while the operator loaded the next deck.

As computers got faster and programs got longer, this became absurd. The CPU sat idle most of the time, waiting for slow input/output devices like punch card readers or magnetic tape drives. Universities started asking the obvious question: can multiple programs share a computer? Can one program run while another waits for I/O? Can different users be logged in simultaneously?

The answer was yes — but only if something managed the sharing. That something became the operating system kernel. The kernel is the one program that always runs. It owns the hardware. Every other program must ask it for permission to do anything that touches the outside world.

Fig 1.14 — Privilege rings · syscalls cross the boundary
[Bottom: hardware · CPU · RAM · disk · GPU · network interface. Middle: the kernel, Ring 0, privileged · memory management · process scheduling · device drivers · filesystems · interrupt handling · system calls · networking stack. Top: user space, Ring 3, unprivileged · your program · browser · database · SSH daemon. Protection rings are hardware-enforced privilege levels, physically built into the CPU: user code running in Ring 3 executes a syscall instruction → the CPU switches to Ring 0 → the kernel handler runs, checks permissions, talks to hardware → return to user, result delivered.]

x86 CPUs have four privilege levels (rings 0–3), but in practice operating systems only use two: Ring 0 (kernel mode) and Ring 3 (user mode). Code in Ring 0 can execute any CPU instruction and access any memory. Code in Ring 3 cannot. The boundary is enforced by the CPU hardware itself, not by software. Watch the gold packet: an ordinary user program executes a syscall instruction, the CPU traps into Ring 0, the kernel handles the request, and control returns to user space — typically in less than a microsecond, and a typical desktop performs millions of these per second.

The system call: crossing the boundary

When your Python script opens a file, it doesn't access the disk directly. Your program — running in Ring 3 — has no permission to talk to the disk controller. Instead, it calls open() in Python, which calls down into the C library, which executes a special CPU instruction (syscall on modern x86-64). The CPU saves the program's state, switches to Ring 0, and jumps to a fixed kernel entry point. The kernel then checks whether your process has permission to read that file, finds it on disk, and returns a numeric handle (a file descriptor) to your program.

Your program never touched the hardware. It asked the kernel, the kernel decided. This is the entire foundation of operating system security. Without this boundary, every program would have full access to every other program's memory, every file, every network packet. With it, programs are isolated from each other by hardware-enforced rules.

C → kernel
// What you write in C:
FILE *f = fopen("data.csv", "r");

// What the C library does internally:
int fd = syscall(SYS_open, "data.csv", O_RDONLY, 0);

// What happens at the CPU level:
//   1. The syscall instruction triggers a switch from Ring 3 to Ring 0
//   2. The kernel reads the syscall number (SYS_open == 2 on Linux)
//   3. It dispatches to sys_open() inside the kernel
//   4. sys_open checks process permissions against file ownership
//   5. It walks the filesystem (directory tree) to find the file
//   6. It allocates a file descriptor in this process's table
//   7. It switches back to Ring 3 and returns the descriptor (e.g. 3)
//
// Your program sees: fd == 3. It never touched the disk hardware.

UNIX and the kernels we still use

In 1969, at Bell Labs, Ken Thompson and Dennis Ritchie designed an operating system called UNIX. Its design was austere and elegant: everything is a file. A regular file is a file. A keyboard is a file. A network connection is a file. A running process exposes a directory of files describing it. All of them accessed through the same system call interface: open, read, write, close.
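
The uniformity is easy to see from C. The sketch below is Linux-specific, and data.csv stands in for any ordinary file you have lying around; it reads a regular file, a device, and the kernel's description of the running process itself through the identical open/read/close interface.

C · everything is a file
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* a regular file, a device, and a view into this very process */
    const char *paths[] = { "data.csv", "/dev/urandom", "/proc/self/status" };
    char buf[64];

    for (int i = 0; i < 3; i++) {
        int fd = open(paths[i], O_RDONLY);       /* same syscall for all */
        if (fd < 0) { perror(paths[i]); continue; }
        ssize_t n = read(fd, buf, sizeof buf);   /* same read() for all  */
        printf("%-18s  fd=%d  read %zd bytes\n", paths[i], fd, n);
        close(fd);                               /* same close() for all */
    }
    return 0;
}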

UNIX became, directly or indirectly, the ancestor of nearly every operating system in current use. Linux is a UNIX-like kernel written from scratch by Linus Torvalds in 1991. macOS uses a kernel called XNU, derived from a hybrid of Mach and BSD (a UNIX descendant). iOS and Android are both built on UNIX-derived kernels. Even Windows, originally not UNIX-based, now ships with a Linux subsystem. The architectural ideas in UNIX — processes, file descriptors, the system call interface — define what an operating system is.

🛡️

Why this matters for security. A privilege escalation attack is one in which an unprivileged process — your malicious program running in Ring 3 — finds a way to gain Ring 0 access. If it succeeds, it has full control of the machine: it can read any file, watch any keypress, install any rootkit. Every major operating system has had privilege escalation vulnerabilities. The kernel is the most security-critical code on any computer. We will spend significant time on this in Part II's kernel chapter and again in Part IV's unified security chapter.

06 — Memory

The hierarchy of forgetting

Memory in a computer is not one thing. It is a hierarchy of increasingly larger, slower, cheaper storage, managed at different levels by the CPU and the kernel. Each level holds a copy of part of the level below it. Closer to the CPU means faster but smaller. Further away means larger but slower.

The numbers below tell the entire story of why optimizing software is hard. The CPU performs an arithmetic operation in roughly 0.3 nanoseconds. A round trip to main memory takes 60 nanoseconds — two hundred times longer. A read from a fast SSD takes 50,000 nanoseconds — 167,000 times longer than a register access. Most of computer architecture for the past forty years has been about hiding this gap.

Level · Location · Typical size · Access time · Managed by
Registers · inside the CPU core · ~1 KB · 1 cycle (~0.3 ns) · compiler, CPU
L1 cache · on the CPU die · 32–64 KB · 4 cycles (~1 ns) · CPU hardware
L2 cache · on the CPU die · 256 KB – 1 MB · 12 cycles (~4 ns) · CPU hardware
L3 cache · on the CPU package · 8–64 MB · 40 cycles (~13 ns) · CPU hardware
RAM (DRAM) · on the motherboard · 8–128 GB · ~100 cycles (~60 ns) · OS kernel
SSD (NVMe) · on the PCIe bus · 256 GB – 4 TB · ~50,000 ns · OS + filesystem
HDD (spinning disk) · SATA bus · 1–20 TB · ~5,000,000 ns · OS + filesystem
Fig 1.15 — The hierarchy of forgetting · width = capacity, dot speed = latency
[Closer to the CPU = faster; further away = bigger. Registers ~1 KB · 0.3 ns → L1 cache ~32 KB · 1 ns → L2 cache ~512 KB · 4 ns → L3 cache ~32 MB · 13 ns → RAM (DRAM) ~16 GB · 60 ns → SSD (NVMe) ~1 TB · 50,000 ns → HDD (spinning disk). A register is 17 million times faster than a hard drive. The whole job of modern memory architecture is to hide that gap from you.]

The same data as the table above, drawn so the gaps are visible. Each row's width is roughly proportional to its capacity; each row's dot crosses at a speed set by that level's access time. The top three rows tick almost too fast to follow; the SSD dot crawls; the HDD dot is essentially still. Most of computer architecture for the past forty years — caching, prefetching, pipelining, branch prediction, parallelism — exists to hide this gap from the program.

Why caches multiply speed

The hierarchy works because programs do not access memory uniformly. They keep returning to the same regions over and over — looping over an array, calling a function repeatedly, reading the next word in a string. This is called locality of reference, and it is the reason a small fast cache speeds up a much larger slow memory by an enormous factor. The math is unforgiving but simple:

T_avg = h · t_cache + (1 − h) · t_RAM

If 95% of accesses hit L1 cache (1 ns) and only 5% miss into RAM (60 ns), the average access time is 0.95 × 1 + 0.05 × 60 = 3.95 ns. Without the cache, every access would cost 60 ns. The cache makes the whole system roughly fifteen times faster — and it does this with a hit rate that needs to be high but doesn't need to be perfect. The same calculation justifies every layer of the hierarchy: a small fast tier above a larger slow one, exploiting the fact that the next thing a program needs is usually close to the last thing it needed.
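
Locality is also something you can measure. The sketch below reads the same 64 MB twice, once sequentially and once hopping 4,096 bytes at a time, so nearly every read in the second pass lands on a cold cache line. The buffer size, stride, and timings are illustrative; the ratio you see depends on your machine and compiler flags.

C · locality of reference, measured
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N (64L * 1024 * 1024)   /* 64 MB, far larger than any cache */

int main(void) {
    unsigned char *a = malloc(N);
    if (!a) return 1;
    memset(a, 1, N);            /* touch every page so RAM really backs it */
    long sum = 0;

    /* Pass 1, sequential: every 64-byte cache line is used fully,
       and the prefetcher can see the next line coming. */
    clock_t t0 = clock();
    for (long i = 0; i < N; i++) sum += a[i];
    clock_t t1 = clock();

    /* Pass 2, stride 4096: the same N reads in total, but almost
       every access misses the cache and pays the trip to RAM. */
    for (long off = 0; off < 4096; off++)
        for (long i = off; i < N; i += 4096) sum += a[i];
    clock_t t2 = clock();

    printf("sequential: %.0f ms\n",
           1e3 * (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("strided:    %.0f ms   (checksum %ld)\n",
           1e3 * (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    free(a);
    return 0;
}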

Peter Denning's working set theory (1968) made this rigorous, and every CPU since has been a refinement of the idea. We will see the same locality argument resurface in Part II's kernel chapter when we look at virtual memory and page caches, in Part III's web chapter when we examine DNS caching, and in Part IV's data chapter when we examine database buffer pools. The same equation, the same intuition, all the way up the stack.

Virtual memory: the lie the kernel tells

When your program asks for memory, the kernel does not give it a real physical address. It gives a virtual address — a fiction. Your program believes it has its own private address space starting at zero, with gigabytes of memory available. So does every other program. They all think they own the machine.

The CPU's Memory Management Unit (MMU), guided by tables that the kernel maintains, translates these virtual addresses to physical RAM addresses every time a program reads or writes memory. The translation tables — called page tables — divide memory into 4-kilobyte chunks called pages and map each virtual page to either a physical page in RAM or to a location on disk (if RAM is full and the page has been swapped out).

This achieves three things at once: isolation (no program can read another's memory because the translation tables for different programs map to different physical pages), flexibility (the kernel can move pages around in RAM, swap them to disk, or load them on demand from a file), and protection (the page tables also encode permissions — read, write, execute — and the MMU enforces them).
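
You can watch the fiction from inside a process. The sketch below prints the virtual addresses of its own code, globals, stack, and heap. Run it twice and, with ASLR on, the numbers change between runs; run two copies at once and they may even overlap, because each process has its own page tables mapping those addresses to different physical RAM.

C · printing the lie
#include <stdio.h>
#include <stdlib.h>

int global = 42;                       /* lives in the data segment */

int main(void) {
    int local = 7;                     /* lives on the stack */
    int *heap = malloc(sizeof *heap);  /* lives on the heap  */

    /* All four are virtual addresses, fictions maintained per-process
       by the kernel's page tables and translated by the MMU. */
    printf("code:   %p\n", (void *)main);
    printf("global: %p\n", (void *)&global);
    printf("stack:  %p\n", (void *)&local);
    printf("heap:   %p\n", (void *)heap);

    free(heap);
    return 0;
}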

💡

Buffer overflow attacks exploit the memory model directly. If a program writes more data than a buffer can hold, the extra bytes spill into adjacent memory. If that memory contains a return address — an address the CPU will jump to when the current function ends — an attacker who controls the input controls where the CPU jumps next. Stack canaries, ASLR (Address Space Layout Randomization), and DEP (Data Execution Prevention) are the layered hardware and OS defenses against this. We'll see exactly how this works at the assembly level in Chapter 3.

What you now understand

You have followed the chain from sand to software. Transistors — silicon switches controlled by voltage — combine into logic gates, then into integrated circuits, then into CPUs. The CPU runs an endless loop: fetch, decode, execute, writeback. It executes instructions encoded as binary patterns drawn from an instruction set architecture like x86 or ARM. Programs and data live together in the same memory, a design choice — Von Neumann's — that defines what a modern computer is. The kernel mediates between programs and hardware, enforcing isolation and security through privilege rings and virtual memory.

This is the substrate. Every chapter that follows builds on it. Chapter 2 goes deeper into a single layer of this stack — the layer where mathematics and electricity meet. We will look at why a computer must be binary, how George Boole invented the logic that runs on those binary signals, and how arithmetic — the thing computers fundamentally do — emerges from a few simple gates. By the end of Chapter 2, you will understand at a physical level why 0.1 + 0.2 does not equal 0.3 in any modern programming language.

Chapter 02

The Algebra of Switches

In 1847, a self-taught English schoolteacher published a book reducing logic to algebra. He died decades before the electron was discovered, never imagining a circuit. A century later, his algebra became the only mathematics that machines can do.

Topics: Boolean logic · Binary · Gates · Floats
Era covered: 1847 → 1985
[Chapter 02 hero · The Algebra of Switches — A · B + ¬A · B rendered as AND, OR, and NOT gates wired over 1s and 0s: the only language electricity speaks.]
01 — Why binary

Why a computer can only count to two

You count to ten because you have ten fingers. The Babylonians counted to sixty, tallying the twelve finger joints of one hand with the five fingers of the other: twelve times five. The Mayans used twenty, counting fingers and toes. Every counting system in human history was shaped by the body using it. Computers count to two — because they are made of switches, and a switch has only two states.

A common but wrong intuition is that computers use binary because it is somehow simpler or more elegant. The real reason is purely physical: electricity reliably represents two states, but not many more. A wire either has a voltage above some threshold (let's call this 1) or it doesn't (let's call this 0). The CPU can detect "above the threshold" and "below the threshold" with very high reliability, even when there is electrical noise, voltage fluctuations, or manufacturing variation.

Could you build a computer that used three voltage levels — low, medium, high — to represent three digits? Yes. The Soviet Union built one. The Setun, designed in 1958 at Moscow State University by Nikolay Brusentsov, used balanced ternary (digits −1, 0, +1) and was, in many mathematical respects, more elegant than binary. But it required precisely tuned voltage thresholds that were difficult to maintain reliably as components aged. Binary won not because it was theoretically superior, but because it was physically robust: with only two widely separated levels, it takes a large voltage drift to turn a 1 into a 0, whereas with three levels squeezed into the same range, a much smaller drift makes one digit look like its neighbour.

Fig 2.1 — Why binary survives noise
[Binary: two well-separated voltage levels around a single threshold · the noise margin is huge, and a small drift cannot flip the bit. Ternary: three closely-spaced levels · the noise margin is tiny, and a small drift can confuse adjacent levels.]

A binary signal has two levels with a wide gap between them — small voltage variations cannot flip a 1 into a 0. A ternary signal squeezes three levels into the same voltage range, leaving narrow tolerance bands. As components age and temperature varies, ternary fails more often. Engineering, not mathematics, killed it.

So we have a wire that can be on or off. With one wire we can express two values. With two wires, four. With three, eight. With n wires, we can express 2ⁿ values. This is a bit — short for binary digit. Eight bits make a byte, enough to represent 256 distinct values. A modern CPU works internally with words that are 64 bits wide, capable of representing 2⁶⁴ ≈ 18 quintillion values per word.

A note on counting

In any base b, an n-digit number can represent bⁿ distinct values. A 3-digit decimal number ranges from 000 to 999 — that's 10³ = 1,000 values. A 3-digit binary number ranges from 000 to 111 — 2³ = 8 values. The exponential growth in either base is the same; only the base of the exponent differs.

This is why the size of computer memory grows so quickly with each added bit. Adding one wire doubles the addressable space.

02 — Boole

The Victorian schoolteacher who reduced logic to algebra

In 1815, in Lincoln, England, a shoemaker's son was born named George Boole. He had little formal education — he taught himself Latin at twelve, Greek at fourteen, and the major European languages by his late teens. By twenty-six, with no university degree, he was running his own school. By thirty-two, he was a self-taught mathematician publishing original papers in differential equations.

In 1847, he published a short book titled The Mathematical Analysis of Logic. Seven years later, he expanded it into An Investigation of the Laws of Thought. The thesis of these two works was, at the time, almost incomprehensible: logic — the discipline of valid reasoning — was a branch of algebra. The same kind of algebra you do in school, with symbols and equations, but with different rules.

In ordinary algebra, x + y = y + x and xy = yx. In Boole's algebra, the variables represent not numbers but truth values — propositions that are either true or false. The operations are not addition and multiplication but OR (logical disjunction), AND (logical conjunction), and NOT (negation). And there are only two values a variable can take: 0 (false) and 1 (true).

"The design of the following treatise is to investigate the fundamental laws of those operations of the mind by which reasoning is performed; to give expression to them in the symbolical language of a Calculus…"

— George Boole, An Investigation of the Laws of Thought, 1854

Boole died in 1864 at age 49 from pneumonia, after walking three miles in heavy rain to give a lecture, lecturing in soaking-wet clothes, and then being put to bed by his wife who, on the principle that "like cures like," allegedly threw buckets of cold water on him. He never saw electricity used in computing; he died more than thirty years before the electron itself was discovered. His algebra was, for seventy years, considered a beautiful but useless curiosity in the history of logic.

Shannon's master's thesis: the most consequential paper of the 20th century

Then, in 1937, a 21-year-old MIT graduate student named Claude Shannon wrote his master's thesis. He was working with electrical relay circuits — telephone-switch-like devices that could be open (no current) or closed (current flowing) based on another control signal. He noticed something: the behavior of these circuits could be described exactly by Boole's algebra. An open relay was 0. A closed one was 1. Two relays in series implemented AND. Two in parallel implemented OR. A relay that opened when its control signal was on implemented NOT.

Shannon showed that you could take any logical proposition expressible in Boolean algebra and build a circuit of relays that physically computed it. And that you could simplify circuits by simplifying the corresponding Boolean expressions. Logic and electricity were the same thing.

The thesis is widely considered the most important master's thesis ever written. It is the bridge between Boole's pure mathematics and modern computing. Without it, the universal computer Turing imagined could not have been built. With it, every modern computer is a physical realization of a Boolean expression.

Fig 2.2 — Boole's algebra · Shannon's gate · the same idea, ninety years apart
[Boole, 1847, propositions as sets: A · B = A ∩ B, true only when both propositions hold. Shannon, 1937 ("the master's thesis"), propositions as relays: an AND is two relays in series, and current flows only when both are closed. Boole's set intersection · Shannon's relay AND · the same algebra, ninety years apart.]

In 1847 George Boole proposed that logical reasoning could be written as algebra over two values, and that the conjunction of two propositions — written A · B — corresponded to the intersection of their truth-sets. In 1937 Claude Shannon, working with electrical relays at MIT, proved that the very same algebra described how series-connected switches behave: current flows through both only when both are closed. The expression Boole drew as overlapping circles, Shannon drew as overlapping currents. Every logic gate in every modern CPU is, in this exact sense, a piece of mathematics from 1847 made physical.

📐

The deeper claim. Anything that can be computed — any function that takes some input and produces some output by following a finite procedure — can be expressed as a Boolean formula and built as a circuit of AND, OR, and NOT gates. This is the practical version of Turing's universality theorem. It is why the same hardware can run a calculator, a video game, and a neural network. They are all, at the bottom, Boolean expressions.

03 — Logic gates

The atoms of computation

A logic gate is a small circuit — a few transistors arranged in a particular way — that takes one or two input bits and produces one output bit according to a Boolean rule. There are only a handful of fundamental gates, and every computation a computer performs reduces to combinations of them.

The three foundational gates

AND — true only if both inputs are true

A   B   A · B
0   0   0
0   1   0
1   0   0
1   1   1

OR — true if at least one input is true

A   B   A + B
0   0   0
0   1   1
1   0   1
1   1   1

NOT — flips the input

A   ¬A
0   1
1   0

The Boolean operators are written multiple ways depending on context: AND as · or ∧, OR as + or ∨, NOT as ¬ or an overbar. In code, programmers often write &&, ||, and ! in C/Java/JavaScript, or and, or, not in Python. They all mean the same thing.

Fig 2.3 — The three fundamental gates
[The three standard gate symbols: AND (inputs A, B · Q = A·B), OR (inputs A, B · Q = A+B), NOT (input A · Q = ¬A).]

The standard symbols for the three foundational logic gates. The shape encodes the operation: AND is a flat-backed shield, OR is a curved-back arrowhead, NOT is a triangle with a small inversion bubble. These shapes have been standard since the 1960s and are recognized by every electrical engineer in the world.

Universal gates: NAND alone is enough

A remarkable result, due to Henry Sheffer in 1913: you don't actually need three different kinds of gates. A single gate type — NAND (NOT-AND) — can be combined to implement any Boolean function. NOT, AND, OR, everything. This is called functional completeness, and NAND is a universal gate.

Why this matters in practice: chip manufacturers can build their fabrication process around a single gate type and combine NAND gates in different patterns to make any other gate. Modern CPUs aren't literally made of just NANDs — there are also NOR, XOR, and other compound gates for efficiency — but the principle stands. One gate, repeated billions of times in different arrangements, is enough to compute anything computable.
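
The construction is short enough to verify directly. In the C sketch below, ints stand in for wires and one nand() function is the only primitive; NOT, AND, OR, and even XOR are derived from it alone, and the truth table it prints matches the gates above. (The trailing underscores just avoid name clashes.)

C · every gate from NAND
#include <stdio.h>

/* the single primitive — everything else is built from this */
static int nand(int a, int b) { return !(a && b); }

static int not_(int a)        { return nand(a, a); }              /* inputs tied    */
static int and_(int a, int b) { return not_(nand(a, b)); }        /* NAND, inverted */
static int or_(int a, int b)  { return nand(not_(a), not_(b)); }  /* De Morgan      */
static int xor_(int a, int b) { return and_(or_(a, b), nand(a, b)); }

int main(void) {
    printf("A B | NOT A  AND  OR  XOR\n");
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            printf("%d %d |   %d     %d    %d   %d\n",
                   a, b, not_(a), and_(a, b), or_(a, b), xor_(a, b));
    return 0;
}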

Fig 2.4 — NAND universality · the only gate you actually need
[NOT, AND, and OR, all from a single NAND. NOT: one NAND with its inputs tied. AND: two NANDs, the output inverted. OR: De Morgan's ¬(¬A · ¬B), three NANDs. Every truth table matches the conventional gate it replaces. Sheffer 1913 · functional completeness · one gate, anything computable.]

Three foundational gates — NOT, AND, OR — each rebuilt from the single NAND primitive. Tying both inputs of a NAND together gives a NOT. An AND is a NAND followed by a NOT (which is itself a NAND). And by De Morgan's laws, OR is the negation of "neither A nor B," which is three NANDs arranged so the inputs are individually inverted before a final NAND combines them. Every truth table on every panel matches the conventional gate it replaces. From this single primitive — and silicon really is, in its bottom layers, fields of NAND-style transistor pairs — every computation any computer has ever performed can be assembled.

De Morgan's Laws

Two fundamental identities of Boolean algebra, both due to Augustus De Morgan, a contemporary of Boole:

¬(A · B) = ¬A + ¬B

¬(A + B) = ¬A · ¬B

These laws say that you can move a NOT through an AND or OR by flipping the operator. They are how you simplify Boolean expressions, and they are how chip designers convert one kind of gate-level circuit into another. They will reappear later when we discuss filtering in databases (Chapter 13) and access control in security (Chapter 15) — Boolean algebra is the underlying logic of nearly every formal system in computing.
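
With only two values per variable, both identities can be checked by brute force over all four input combinations, as the C snippet below does. For Boolean algebra, exhaustive checking is a proof.

C · checking De Morgan exhaustively
#include <stdio.h>

int main(void) {
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++) {
            int law1 = (!(a && b)) == (!a || !b);   /* ¬(A·B) = ¬A + ¬B */
            int law2 = (!(a || b)) == (!a && !b);   /* ¬(A+B) = ¬A · ¬B */
            printf("A=%d B=%d   law1: %s   law2: %s\n",
                   a, b, law1 ? "holds" : "FAILS",
                         law2 ? "holds" : "FAILS");
        }
    return 0;
}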

04 — Arithmetic from logic

How a circuit learns to add

A computer is fundamentally a machine for arithmetic — Charles Babbage called his 19th-century mechanical proto-computer the Analytical Engine for a reason. But at the level of voltages and gates, the machine has no concept of "five" or "twelve." It only knows ones and zeros and Boolean operations on them. So how do we get from NAND to 5 + 3?

The answer is constructive. We build it up, one bit at a time. Let's start with adding two single bits.

The half adder

In binary, single-bit addition has four cases:

Adding two single bits

A   B   Sum   Carry
0   0   0     0
0   1   1     0
1   0   1     0
1   1   0     1

Look closely at those output columns. The Sum column matches an operation we don't yet have a name for: it's true when exactly one input is true. This is called XOR (exclusive OR). The Carry column is exactly an AND. So a single-bit adder — called a half adder — is just an XOR combined with an AND, sharing inputs. Two gates produce two output bits: the sum and the carry.

The full adder, and adding wider numbers

To add multi-bit numbers, you need to handle the carry that propagates between bit positions. This requires a full adder: a circuit that takes three input bits — A, B, and a carry-in from the previous position — and produces a sum bit and a carry-out. A full adder is built from two half adders and an OR gate.

To add two 8-bit numbers, you chain together eight full adders. The carry-out of each becomes the carry-in of the next. To add 64-bit numbers, you chain 64 of them. This design is called a ripple-carry adder, and while modern CPUs use faster variants (carry-lookahead adders), the principle is the same: arithmetic emerges from chaining simple gates.

The Arithmetic Logic Unit — the ALU you saw inside the CPU diagram in Chapter 1 — is, fundamentally, a circuit of adders, plus some additional logic for subtraction, multiplication, and bitwise operations. Every numerical operation a computer performs is a Boolean expression.

Fig 2.5 — Adding 5 + 3 = 8 in binary
[A = 0101 (5) and B = 0011 (3) enter four chained full adders, bit by bit; the output is SUM = 1000 (8).]

Adding 5 + 3 in binary. The four-bit numbers 0101 and 0011 are summed bit by bit, with each full adder handling one bit position and passing carries forward. The result is 1000 — eight in decimal. The same principle scales to 64-bit numbers; you just need 64 full adders.
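
The same chain, written out in C and mirroring the gate diagrams of Fig 2.6 below: a half-adder is one XOR and one AND, a full-adder is two half-adders and an OR, and the loop in main is the ripple-carry chain, one full-adder per bit position.

C · a ripple-carry adder from gates
#include <stdio.h>

/* half-adder: Sum = A XOR B, Carry = A AND B */
static void half_add(int a, int b, int *sum, int *carry) {
    *sum   = a ^ b;
    *carry = a & b;
}

/* full-adder: two half-adders plus an OR for the carry-out */
static void full_add(int a, int b, int cin, int *sum, int *cout) {
    int s1, c1, c2;
    half_add(a, b, &s1, &c1);       /* first half-adder: A and B        */
    half_add(s1, cin, sum, &c2);    /* second: fold in the carry-in     */
    *cout = c1 | c2;                /* OR of the two intermediate carries */
}

int main(void) {
    int a = 5, b = 3, sum = 0, carry = 0;

    /* ripple-carry: each stage's carry-out feeds the next carry-in */
    for (int i = 0; i < 8; i++) {
        int s;
        full_add((a >> i) & 1, (b >> i) & 1, carry, &s, &carry);
        sum |= s << i;
    }
    printf("%d + %d = %d\n", a, b, sum);   /* prints 5 + 3 = 8 */
    return 0;
}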

Fig 2.6 — Inside one full-adder · the half-adder primitive, twice
[XOR + AND → half-adder → full-adder → ripple chain. The half-adder takes two inputs, no carry-in: XOR produces Sum, AND produces Carry. The full-adder is two half-adders plus an OR: HA₁ combines A and B (A ⊕ B, A·B), HA₂ folds in Cin, and the OR produces Cout. Ripple-carry adder: a chain of full-adders, FA₀ → FA₁ → FA₂ → FA₃, each carry-out feeding the next carry-in.]

A full-adder isn't a new primitive — it is two half-adders and an OR gate, in a specific arrangement. The first half-adder combines A and B; the second combines that result with the incoming carry from the previous bit position; the OR gate produces the outgoing carry from either of the two intermediate carries. Stacking them in series builds an n-bit ripple-carry adder, where each stage's carry-out feeds the next stage's carry-in. The four-stage chain at the bottom is exactly what produced the 5 + 3 = 8 result above. Modern CPUs use carry-lookahead and carry-select adders that compute carries in parallel rather than rippling them, but the gate-level primitives are unchanged.

05 — Negative numbers

Two's complement, or how to subtract by adding

Computers don't natively understand negative numbers. There is no transistor that represents "minus." So how does your laptop store the value −5?

The naive approach — sometimes called signed magnitude — is to reserve one bit (usually the leftmost) as a sign indicator. 0 means positive, 1 means negative; the remaining bits represent the magnitude. So in 8 bits, 00000101 is +5 and 10000101 is −5. Simple, intuitive, and almost never used.

Why not? Two reasons. First, you end up with two zeros — 00000000 and 10000000, "positive zero" and "negative zero" — which makes equality testing awkward. Second, and far worse, you cannot use the same adder circuit to subtract. Subtraction would require a separate, more complex circuit. Hardware engineers hate this: every extra circuit is more transistors, more area, more heat, more cost.

The trick: complement

The solution, used by every modern computer, is called two's complement. To represent a negative number in n bits:

Step 1 — Take the binary representation of the positive number.
Step 2 — Flip every bit (0→1, 1→0). This is "one's complement."
Step 3 — Add 1 to the result. This is "two's complement."

So in 8 bits, +5 is 00000101. To get −5: flip the bits to get 11111010, then add 1 to get 11111011. That's −5 in two's complement.

The magic is that with this representation, subtraction is just addition. To compute 7 − 5, the CPU computes 7 + (−5) using the exact same adder circuit it uses for any other addition. The bits work out — the carries and overflows cancel in exactly the right way to produce the correct answer.
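
In C the whole trick fits in a few lines. Here uint8_t plays the role of an 8-bit register: the cast after the addition is where the ninth bit silently falls off, exactly as in Fig 2.7 below.

C · subtraction by addition
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t five = 5;                       /* 00000101 */
    uint8_t neg5 = (uint8_t)(~five + 1);    /* flip, add one: 11111011 */

    printf("-5 in two's complement: ");
    for (int i = 7; i >= 0; i--) printf("%d", (neg5 >> i) & 1);
    printf("\n");

    /* 7 - 5 computed by the ordinary adder; the carry out of bit 7
       overflows the 8-bit register and is discarded by the cast. */
    uint8_t result = (uint8_t)(7 + neg5);
    printf("7 + (-5) = %d\n", result);      /* prints 2 */
    return 0;
}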

Fig 2.7 — Computing 7 − 5 = 2 with two's complement
[7 = 00000111, plus (−5) = 11111011 (the two's complement of 5), gives 1 00000010 · nine bits! The leading 1 is overflow: it falls off the register and is silently dropped, leaving 00000010 · the result is 2 ✓.]

Subtraction by addition. The CPU adds two numbers using the same adder circuit it always uses; the magic of two's complement is that the leading 1 (the overflow) ends up being precisely the bit that needs to be discarded for the math to work. There is no separate subtractor in any modern CPU.

In an n-bit two's complement representation, the range of values is −2ⁿ⁻¹ to 2ⁿ⁻¹ − 1. For 32-bit signed integers, that's roughly −2.1 billion to +2.1 billion. The asymmetry — one more negative number than positive — is because 0 takes up one of the "positive" slots.

Fig 2.8 — The two's-complement wheel · where addition wraps

Two's complement is a wheel, not a line. The 256 possible 8-bit values arrange around the dial: 0 at the top, positive numbers to the right, negative numbers to the left, and a single discontinuity at the bottom where +127 meets −128. Adding 1 rotates the pointer one notch clockwise; subtracting 1 rotates it counter-clockwise. Most of the time this matches our ordinary intuition about numbers — a small positive plus a small positive gives a larger positive — but at the red edge, the wheel betrays it. +127 + 1 silently becomes −128. The same wrap that makes subtraction free in Fig 2.7 makes integer overflow inevitable. 32-bit and 64-bit signed integers have the same wheel; the slots are just exponentially more numerous.
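
The red edge can be reproduced in a few lines. One caveat: overflowing a signed type directly is undefined behavior in C, so this sketch does the arithmetic in unsigned form and reinterprets the bits; the conversion back to int8_t is implementation-defined in older C standards, but every mainstream compiler wraps, and C23 mandates two's complement.

C — the wrap at the edge
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* +127 + 1 in eight bits: the pointer crosses the red edge. */
    uint8_t raw = (uint8_t)(127 + 1);        /* 10000000 */
    printf("127 + 1 = %d\n", (int8_t)raw);   /* prints -128 */

    /* -1 + 1: the carry out of bit 7 is dropped, leaving zero. */
    raw = (uint8_t)(0xFF + 1);               /* 1 00000000 -> 00000000 */
    printf("-1 + 1  = %d\n", (int8_t)raw);   /* prints 0 */
    return 0;
}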

⚠️

Integer overflow vulnerabilities. Because the range is finite, adding two large positive numbers can produce a negative result — the carry overflows the most significant bit and changes the sign. In 1996, an Ariane 5 rocket exploded 37 seconds after launch because a 64-bit floating-point velocity value was converted to a 16-bit signed integer that overflowed. Cost: $370 million. In security, integer overflow bugs are a classic source of buffer overflows: a length check using a signed integer can be tricked into accepting an enormous "negative" length that bypasses bounds checking. We will see exact exploits of this in Part IV's security chapter.

Fig 2.9 — How a signed overflow becomes a buffer overflow

A canonical integer-overflow exploit. The function checks if (n > 256) return -1 — apparently a safe upper bound. But n is a signed int, and the attacker passes -1. The check evaluates -1 > 256 as false (because -1 is less than 256 in the signed comparison), so the early return is skipped. Then memcpy's third argument is size_t — unsigned — and the conversion turns -1 into the largest value the type can hold: 4 294 967 295 when size_t is 32 bits, or 18 446 744 073 709 551 615 after sign-extension to 64 bits. The library function happily copies that many bytes into a 256-byte buffer, smashing every adjacent allocation. Two's complement makes this elegant; type confusion makes it lethal. The same class of bug appears in production systems every year and accounts for a substantial share of memory-corruption CVEs.
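
Written out as compilable C, the vulnerable function from Fig 2.9 is three lines; the comments trace the attacker's three steps.

C — the vulnerable length check
#include <string.h>

int copy(char *dst, char *src, int n) {
    if (n > 256) return -1;  /* step 1: the signed compare -1 > 256 is
                                false, so the early return is skipped */
    memcpy(dst, src, n);     /* steps 2-3: n converts to size_t; the same
                                bits now mean 18 446 744 073 709 551 615,
                                and memcpy writes far past the buffer */
    return 0;
}

/* The attacker's call: copy(buf, evil, -1); where buf is 256 bytes. */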

06 — Floating point

Why 0.1 + 0.2 ≠ 0.3

Open any programming language. Type 0.1 + 0.2. The answer you get will not be 0.3. It will be 0.30000000000000004. This is not a bug in any specific language. It is a fundamental consequence of how computers represent fractional numbers. The same answer comes out of Python, JavaScript, C, Java, Rust, your calculator app, and the Mars rover.

The cause is binary representation of fractions. In decimal, 0.1 is a clean, finite digit string. In binary, it is not. 0.1 in binary is 0.0001100110011001100... — the pattern 0011 repeating forever. Just like 1/3 in decimal is 0.333... repeating forever, certain decimal fractions become repeating binary fractions. And computer memory is finite, so the binary representation has to be cut off. The remainder becomes a tiny error.

IEEE 754: the universal floating-point standard

In 1985, the IEEE published standard 754, defining how all modern processors represent fractional numbers. Before 1985, every CPU did it differently, and numerical code was a portability nightmare. The standard solved that.

A 64-bit "double precision" float divides its 64 bits into three fields, in a structure that mimics scientific notation:

Fig 2.10 — IEEE 754 double-precision layout

An IEEE 754 double divides 64 bits into a sign bit, an 11-bit exponent (with a bias of 1023, so it can represent both positive and negative exponents), and a 52-bit mantissa. The actual value is computed using the formula at the bottom — essentially binary scientific notation. The same structure with smaller fields gives the 32-bit "single precision" float (1, 8, 23 bits).

The 52 bits of mantissa give about 15 to 17 decimal digits of precision. The 11-bit exponent gives a dynamic range of about 10⁻³⁰⁸ to 10³⁰⁸. You can represent extremely small and extremely large numbers — but only with finite precision in their digits. Most decimal fractions cannot be represented exactly. They are stored as the nearest representable binary fraction, with a tiny rounding error.
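
The three fields are easy to pull apart in a dozen lines of C — a sketch that uses memcpy to reinterpret the double's 64 bits as an integer (the portable way to inspect them):

C — unpacking a double's three fields
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    double d = 0.1;
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);                 /* same 64 bits, integer view */

    uint64_t sign     = bits >> 63;                 /* 1 bit */
    uint64_t exponent = (bits >> 52) & 0x7FF;       /* 11 bits, biased by 1023 */
    uint64_t mantissa = bits & 0xFFFFFFFFFFFFFULL;  /* 52 bits */

    /* 0.1 is stored as 1.6 x 2^-4, so the unbiased exponent prints as -4 */
    printf("sign=%llu exponent=%lld mantissa=0x%013llX\n",
           (unsigned long long)sign,
           (long long)exponent - 1023,
           (unsigned long long)mantissa);
    return 0;
}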

When you compute 0.1 + 0.2, the CPU adds two slightly-imprecise binary representations of 0.1 and 0.2, producing a slightly-imprecise binary representation of 0.3 — but not the same one you'd get if you stored 0.3 directly. The errors don't cancel. Hence the famous result.

Python (or any language)
>>> 0.1 + 0.2
0.30000000000000004

>>> 0.1 + 0.2 == 0.3
False

# In binary, 0.1 is actually approximately:
>>> format(0.1, '.20f')
'0.10000000000000000555'

# The "extra" 555... is the rounding error from cutting off
# the infinite repeating binary fraction.
Fig 2.11 — 0.1 + 0.2 in binary · where the famous error lives

Decimal 0.1 has no exact binary representation — its binary expansion repeats forever, just as 1/3 in decimal is 0.3333… The IEEE-754 double rounds it to 52 mantissa bits, introducing a tiny error in the last digit. The same happens to 0.2. When the CPU adds the two rounded values, both errors carry through into the result. The output is not 0.3 — it is the next representable binary fraction above 0.3, which prints as 0.30000000000000004. The bug is not in the addition; the bug was already there in the inputs. Every floating-point computation, on every modern CPU, accumulates these tiny rounding errors at every step. The discipline of numerical analysis is the study of how to keep them from compounding into something dangerous.

Special values: ±∞ and NaN

IEEE 754 reserves certain bit patterns for special values. When the exponent is all ones and the mantissa is zero, the value is infinity (positive or negative depending on the sign bit). When the exponent is all ones and the mantissa is non-zero, the value is NaN — Not a Number, the result of operations like 0.0 / 0.0 or sqrt(-1).

NaN has a remarkable property: it is not equal to anything, including itself. NaN == NaN evaluates to false in every IEEE 754-compliant language. This is the standard test to detect a NaN. It is mathematically odd and has caught countless programmers off guard.
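
Both special values fall out of ordinary arithmetic, as this small C program shows (isnan and isinf are the standard math.h tests):

C — infinity and NaN
#include <stdio.h>
#include <math.h>

int main(void) {
    double zero = 0.0;

    double inf = 1.0 / zero;     /* exponent all ones, mantissa zero */
    printf("1.0/0.0 = %f (isinf: %d)\n", inf, isinf(inf));

    double n = zero / zero;      /* exponent all ones, mantissa non-zero */
    printf("0.0/0.0 = %f (isnan: %d)\n", n, isnan(n));
    printf("n == n -> %s\n", n == n ? "true" : "false");  /* false! */
    return 0;
}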

💸

Real-world consequences. Financial software cannot use floating point for currency, because rounding errors compound across millions of transactions. Banks use either integer cents or special decimal types instead. The 1991 Patriot missile failure during the Gulf War — which killed 28 American soldiers — was traced to a floating-point timing error that accumulated over 100 hours of operation, causing the missile to miscalculate the position of an incoming Scud. Floating point is precise enough for almost everything, but its errors are real, and they compound. Knowing when to use floats and when not to is a basic competency in software engineering.

Fig 2.12 — The Patriot missile · 0.0000000953 s × 100 hours = 28 lives

A Patriot air-defence battery near Dhahran, Saudi Arabia, had been powered on continuously for about 100 hours. Its targeting code stored elapsed time as a count of 0.1-second ticks — but 0.1 has no exact binary representation, and the on-board floating-point format truncated it to 24 bits. Each tick was thus shorter than 0.1 s by a tiny amount (about 0.0000000953 s). After 100 hours that error had compounded to roughly 0.34 s. An incoming SCUD missile travelling at Mach 5 covers about 600 m in 0.34 s — well outside the Patriot's tracking gate. The radar looked where the SCUD should have been, found nothing, and concluded there was no target. The SCUD struck a US Army barracks and killed 28 soldiers. Floating-point error is not a curiosity; under the wrong circumstances it is a weapon.

What you now understand

You have followed the layered chain from electricity to mathematics. A wire that can be on or off gives you a bit. Combinations of bits give you binary numbers. Boolean algebra — invented in 1847 by Boole, applied to circuits in 1937 by Shannon — describes operations on those bits. Logic gates are physical implementations of Boolean operations: AND, OR, NOT, XOR. Combinations of gates implement arithmetic — half adders, full adders, ALUs. Two's complement makes subtraction free. IEEE 754 extends the system to fractional numbers, with the unavoidable cost of finite precision.

Every higher-level abstraction in computing — from a Python list to a TLS handshake to a neural network — is built on top of these primitives. Chapter 3 takes us up one level: from the gate-and-wire view to the instruction view. We will look at how the CPU actually executes a program, how function calls work at the metal, what a "stack" really is, and how attackers exploit the gap between what programmers intend and what their compiled code actually does.

Chapter 03

The Language
the CPU
Speaks

Below every Python script, every JavaScript framework, every neural network, there is a sequence of instructions written in the only language a processor truly understands. This chapter is about what your code becomes when it finally meets the machine.

TopicsAssembly · Stack · ABI · Exploits
Era covered1949 → present
Chapter 03 hero · The Language the CPU Speaks
01 — ISA

An instruction is a contract

In 1978, Intel released the 8086 — a 16-bit chip whose instruction set has not been removed from any consumer CPU since. Every x86-64 processor sold today, from a laptop to a data-centre rack, still recognises the same core binary patterns Intel encoded that year. A processor doesn't run C, or Python, or JavaScript. It runs a sequence of binary patterns drawn from a specific menu — the instruction set architecture, or ISA. The ISA is the contract between hardware and software: hardware promises to execute every pattern in the menu correctly; software promises to use only patterns from the menu. Everything else — every compiler, every operating system, every program — sits on top of this contract.

Three ISAs dominate the world. x86-64 — the direct descendant of that 1978 Intel line — runs essentially every PC and most servers. It is famously complex, with thousands of instructions accumulated over four decades of backward compatibility. ARM, designed by Acorn in the 1980s and now licensed to essentially every phone manufacturer on Earth, is the modern alternative — cleaner, more efficient, and the architecture of every iPhone, every Apple Silicon Mac, and every Android device. RISC-V, from Berkeley in 2010, is an open standard ISA gaining ground in academia, embedded systems, and increasingly in industry — anyone can build a RISC-V chip without paying licensing fees.

Fig 3.1 — Same operation · three ISAs · three encodings
Fig 3.1 — Same operation · three ISAs · three encodings "ADD 5 TO A REGISTER" — DIFFERENT BITS, SAME EFFECT x86-64 Intel · 1978 → present add rax, 5 7 bytes · variable length REX.W 48 opcode 81 ModR/M C0 imm32 05 00 00 00 prefix · opcode · operand byte · 4-byte little-endian immediate ARM64 (AArch64) Acorn 1985 · phones, M-series Macs add x0, x0, #5 4 bytes · always opcode 10010001 8 bits shift 00 2 bits imm12 000000000101 12 bits = 5 Rn (src) 00000 5 bits = x0 Rd (dst) 00000 5 bits = x0 fixed 32-bit format · always opcode | shift | immediate | source | destination RISC-V (RV64I) Berkeley 2010 · open standard addi a0, a0, 5 4 bytes · always · ~50 instructions in base ISA imm12 000000000101 12 bits = 5 rs1 (src) 01010 5 bits = a0 funct3 000 3 bits rd (dst) 01010 5 bits = a0 opcode 0010011 7 bits "I-type" format · immediate | rs1 | funct3 | rd | opcode opcode register field immediate value misc / format-specific

The same logical operation — add 5 to a register — encoded three different ways. x86-64 uses a variable-length instruction with a one-byte opcode-prefix (REX.W marks 64-bit), a one-byte main opcode, an operand byte, and a four-byte little-endian immediate value. ARM64 packs everything into a fixed 32-bit word with carefully laid-out fields. RISC-V also uses a fixed 32-bit word, but its field layout is different again — and there are only six instruction formats total in the entire base ISA. Three architectures, three philosophies; each set of bits is meaningless on the other two CPUs. The contract between hardware and software is exactly which patterns of bits mean what.

They are not interchangeable. An x86 binary will not run on an ARM CPU and vice versa. The shift from Intel x86 to Apple's ARM-based M1 in 2020 required Apple to ship Rosetta 2, a translation layer that converts x86 instructions to ARM on the fly. That such a translator is even possible is itself remarkable.

Fig 3.2 — Rosetta 2 · how a chip pretends to be a different chip

Rosetta 2 is what made the 2020 Intel-to-Apple-Silicon transition feel painless. When a Mac launches an x86 binary on an ARM-based M-series chip, Rosetta 2 — once, on first launch — reads the entire executable, translates each x86-64 instruction into a sequence of ARM64 instructions that produces the same observable result, and saves the translated binary to a cache. Subsequent launches skip the translation step. The translated code runs at roughly 80% the speed of native ARM, sometimes faster than the same binary ran on the Intel hardware it was built for. Rosetta 2 is possible because every x86 instruction's effect can be reproduced by some sequence of ARM instructions, and a careful translator can find that sequence statically; Apple Silicon even provides a hardware total-store-ordering mode so that translated code sees the memory ordering x86 programs expect.

RISC vs CISC, and why the war ended in a draw

In the 1980s there was a vigorous debate. CISC — Complex Instruction Set Computers, like x86 — had hundreds of specialized instructions, including multi-step operations like "load from memory, add, and store back" in a single instruction. RISC — Reduced Instruction Set Computers, like ARM, MIPS, SPARC — had a small set of simple, uniform instructions. RISC machines could clock faster because each instruction was simple to decode.

The debate was real. By the 2000s, it was over — and both sides had quietly converged. Modern x86 CPUs internally translate complex CISC instructions into simpler RISC-like micro-operations (μops) and execute those. Modern ARM chips have grown more complex over time, with specialized instructions for cryptography, vector math, and machine learning. The clean ideological lines have blurred. What remains is the binary incompatibility — the fact that the patterns of bits mean different things on different chips.

02 — Registers

The CPU's tiny notebook

Recall from Chapter 1 that registers are the smallest, fastest level of the memory hierarchy — a few dozen tiny storage cells inside the CPU core itself. Every instruction the CPU executes operates primarily on registers. To add two numbers from memory, the CPU must first load them from memory into registers, add the registers, and store the result back to memory. Memory is too slow to operate on directly.

x86-64 has 16 general-purpose registers, each 64 bits wide. Their names carry historical baggage: in 1978 the original 8086 had 16-bit registers named AX, BX, CX, DX. When Intel went to 32 bits in 1985, they prepended an "E" for extended — EAX, EBX, etc. When AMD pushed to 64 bits in 2003, they replaced "E" with "R" — RAX, RBX, RCX, RDX. The names persist; the chip remembers.

Fig 3.3 — The x86-64 register file · 16 + 2 + 45 years of names

x86-64 has sixteen 64-bit general-purpose registers — the original "A, B, C, D" set from the 8086, the "source/destination index" pair RSI/RDI, the stack-frame pair RBP/RSP, and the eight registers R8–R15 added by AMD64 in 2003. Plus two non-general-purpose registers: RIP, the instruction pointer, and RFLAGS, which holds the condition codes after every arithmetic operation. The right panel zooms into RAX. Because every generation of x86 had to remain backward-compatible with the previous one, RAX still contains the bits that were called EAX in 1985, AX in 1978, and AH/AL when split into bytes. Writing AL changes only the bottom 8 bits of RAX, leaving the rest untouched — a quirk that has shaped C compilers, assembly-language style, and the encoding of the entire ISA.

Register · Conventional purpose · Why
RAX · Accumulator — holds return values from function calls · Historical: x86's earliest predecessor used "A" for the arithmetic accumulator
RBX · Base — historically pointed to data segments · Now mostly general-purpose; preserved across function calls
RCX · Counter — used for loop counts and shift amounts · The "rep" loop instruction implicitly uses RCX
RDX · Data — second half of multiplication results · "mul" stores 128-bit results split between RAX and RDX
RSP · Stack Pointer — points to the top of the call stack · Modified by push/pop; we'll see this next section
RBP · Base Pointer — points to current function's stack frame · Lets you find local variables at known offsets
RDI, RSI, RDX, RCX, R8, R9 · First six function arguments (Linux/Mac) · System V ABI — the calling convention
RIP · Instruction Pointer — address of next instruction · Cannot be modified directly; only by jump/call/return instructions

Notice that RIP — the Program Counter, the address of the next instruction to execute — is itself a register. Changing what RIP points to means changing what the CPU does next. This is the entire mechanism behind a function call, a loop, an if/else branch, an interrupt, and (as we'll see) a memory corruption exploit. If you control RIP, you control the program.

Fig 3.4 — System V calling convention · which register holds which argument

The System V AMD64 ABI (used by Linux, macOS, BSD) lays down which physical registers carry which arguments. The first six integer or pointer arguments go into RDI, RSI, RDX, RCX, R8, R9 — in that order. The seventh argument and beyond are pushed onto the stack. The return value comes back in RAX. Floating-point arguments use a different set of registers (XMM0–XMM7), and structures larger than 16 bytes are passed by hidden pointer. The convention is what makes a compiled function callable from any other compiler, any other language, any other source file — without it, every library linkage on every system would break.

03 — The stack

Memory that grows down

When a program starts, the operating system gives it a chunk of memory laid out in a specific pattern — its virtual address space. Among other regions, the OS reserves an area called the stack: a last-in-first-out buffer that grows automatically as functions are called and shrinks as they return. By convention on x86-64, the stack lives at a high address and grows downward — toward lower addresses — as items are pushed.

Fig 3.5 — A process's virtual address space

A typical x86-64 Linux process address space. The kernel reserves the top half. The user-space portion contains, from high addresses down: the stack (function calls), unused space, the heap (dynamic allocations), the data section (globals), and the text section (read-only executable code). The stack and heap grow toward each other.

The stack is managed automatically by the CPU using two instructions: push first decrements RSP by 8 (because we're storing 8-byte values on a 64-bit machine), then writes its value to the address RSP now points to. pop reads the value at RSP, then increments RSP back by 8. The hardware treats the stack as a hardware-supported data structure.
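
In C terms, the two instructions behave like the sketch below, with RSP and the stack modeled as ordinary variables (the names are illustrative; the real thing is a register and RAM, not a C array):

C — what push and pop do to RSP
#include <stdint.h>

uint64_t stack_mem[64];
uint64_t *rsp = &stack_mem[64];  /* starts past the high end, grows down */

void push(uint64_t value) {
    rsp -= 1;       /* RSP drops by 8 bytes (one uint64_t)... */
    *rsp = value;   /* ...then the value is stored where RSP now points */
}

uint64_t pop(void) {
    uint64_t value = *rsp;  /* read the value at RSP... */
    rsp += 1;               /* ...then RSP climbs back up by 8 */
    return value;
}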

Each function call creates a stack frame: a contiguous region of the stack containing that function's local variables, saved registers, and bookkeeping data. When the function returns, its frame is discarded by simply moving RSP back up. There is no garbage to clean up — local variables vanish as soon as the function exits.

04 — Function calls

What "calling a function" actually means

A function call is one of the most ordinary operations in programming, and one of the most consequential at the hardware level. Let's trace exactly what happens when you write int result = add(5, 3); in C.

First, the compiler has to follow a calling convention — a contract between caller and callee about who puts arguments where, who saves which registers, and how return values are delivered. On Linux/macOS x86-64, the convention is called System V AMD64 ABI. On Windows, it's the Microsoft x64 calling convention. They are slightly different. Both work, but a function compiled for one ABI cannot be called from a program compiled for the other without a wrapper.

Under System V, the first six integer arguments go in registers RDI, RSI, RDX, RCX, R8, R9, in that order. Additional arguments spill onto the stack. Return values come back in RAX. The caller is responsible for saving any volatile registers it cares about; the callee preserves RBX, RBP, R12-R15.

C source
int add(int a, int b) {
    return a + b;
}

int main() {
    int result = add(5, 3);
    return result;
}
x86-64 assembly (System V)
; ---- function: add(int a, int b) ----
add:
    push   rbp             ; save caller's frame pointer onto stack
    mov    rbp, rsp        ; new frame pointer = current stack top
    mov    eax, edi        ; eax ← first arg (a, was in edi)
    add    eax, esi        ; eax += second arg (b, was in esi)
    pop    rbp             ; restore caller's frame pointer
    ret                    ; pop return address from stack, jump there

; ---- function: main() ----
main:
    push   rbp
    mov    rbp, rsp
    mov    esi, 3          ; second arg goes in esi
    mov    edi, 5          ; first arg goes in edi
    call   add                ; push return addr, jump to "add"
    ; --- after add returns, result is in eax ---
    pop    rbp
    ret

Two instructions are doing magic here: call and ret. call add does two things atomically: it pushes the address of the next instruction (the address of pop rbp in main) onto the stack, and then sets RIP to the address of add. The CPU now begins executing add's instructions. When add eventually reaches ret, the inverse happens: ret pops the saved return address off the stack into RIP, and execution resumes in main at exactly the next instruction. The handoff is invisible to either function — each one experiences a continuous sequence of instructions, and the stack quietly remembers the call hierarchy beneath them.

Fig 3.6 — The stack during a function call

Three snapshots of the stack during a call. Each call instruction pushes a return address onto the stack; ret pops it. The pattern of saved frame pointers forms a linked list back through the entire call hierarchy — which is why a debugger can show you a "stack trace" listing every nested function call leading to the current point.

🔗

Why this matters. The return address is just a number stored in memory. The CPU has no way to verify it is the same number that was pushed there originally. If anything overwrites it between the call and the ret — anything at all — the CPU will obediently jump wherever the new value points. This is the seam from which an entire era of computer security emerged.

05 — Buffer overflow

The bug that defined three decades of security

In November 1988, a 23-year-old Cornell graduate student named Robert Tappan Morris released a small program onto the early internet. It was meant to count machines. Within hours it had brought down roughly ten percent of the computers connected to the network — about six thousand machines, at a time when that was most of the internet. The vulnerability it exploited was a buffer overflow in the UNIX fingerd daemon. Morris became the first person convicted under the Computer Fraud and Abuse Act. The attack class he made famous is still, decades later, among the most exploited bugs in computing.

To see why, we need to look at a tiny C function and trace what happens when its input doesn't fit.

C — vulnerable function
// A function that greets the user. Looks innocuous.
void greet(char *name) {
    char buffer[16];        // 16 bytes on the stack
    strcpy(buffer, name);   // copies until it hits a null byte
    printf("Hello, %s\n", buffer);
}

strcpy is the culprit. It copies bytes from name into buffer one at a time, stopping only when it encounters a null byte (a zero). It does not check whether buffer has room. If name is 17 characters long, the 17th byte gets written one byte past the end of buffer — into whatever happens to live there in memory. And what lives there, on the stack, is the saved RBP. And just past that is the return address.

Fig 3.7 — Stack frame of greet(): normal vs overflow

On the left: normal use — the buffer holds the input, the saved RBP and return address are intact, the function returns to its caller. On the right: an oversized input keeps writing past the buffer, eventually overwriting the saved RBP and the return address. When ret executes, RIP gets loaded with whatever the attacker placed there.

From overwrite to code execution

Overwriting the return address is only half the attack. The other half is choosing what to overwrite it with. The classic technique, perfected in the 1990s, is called shellcode injection: place machine code for a small payload (traditionally one that spawns a shell — hence "shellcode") into the same buffer being overflowed, and overwrite the return address with the address of that shellcode. When the function returns, the CPU jumps into the buffer and starts executing the attacker's instructions.

The whole technique was popularized — turned from rare expert knowledge into a cookbook — by an essay published in 1996 in the underground hacking magazine Phrack. It changed the landscape of computer security.

"The objective of this paper is to show how to write buffer overflow exploits, using as an example a vulnerability in a real program. … The reader is expected to be familiar with C and assembly under x86 systems."

— Aleph One, "Smashing the Stack for Fun and Profit," Phrack 49, 1996

Why C lets this happen

C, the language we will spend an entire chapter on later, was designed in 1972 as a portable assembler. It exposes raw memory, raw pointers, and gives the programmer complete control. It does not, by default, check that array accesses stay within bounds. strcpy, gets, scanf("%s"), and many other standard-library functions assume the programmer has guaranteed the buffer is large enough. When they're wrong, the corruption is silent — bytes get overwritten, the program keeps running, and the consequence may not appear until a function returns into junk.
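
The fix, when staying in C, is to make the bound explicit. A minimal rewrite of greet using snprintf, which never writes more than the size it is given and always null-terminates:

C — the bounded version
#include <stdio.h>

void greet(const char *name) {
    char buffer[16];
    /* snprintf writes at most 15 bytes of name plus a null terminator;
       oversized input is truncated instead of overflowing */
    snprintf(buffer, sizeof buffer, "%s", name);
    printf("Hello, %s\n", buffer);
}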

This is the price of memory unsafety. C and C++ are memory-unsafe by design — they trade safety for performance and control. Modern languages like Rust, Java, Python, and Go are memory-safe: their runtimes or compilers check bounds at every access. They cannot have classical buffer overflows. That protection is the central reason newer languages exist.

📅

Famous buffer overflow incidents. The Morris Worm (1988, fingerd). The Code Red worm (2001, IIS, infected 359,000 hosts in 14 hours). The SQL Slammer worm (2003, infected 75,000 servers in 10 minutes). The Blaster worm (2003, RPC). Heartbleed (2014, OpenSSL — technically a buffer over-read rather than overflow, exposed private keys of millions of HTTPS servers). For decades, memory-corruption bugs accounted for the majority of critical CVEs (Common Vulnerabilities and Exposures). Microsoft, Google, and others have published data showing that around 70% of severe security bugs in their large C/C++ codebases trace back to memory unsafety. This is why Rust adoption is growing inside operating system kernels — including, as of 2022, Linux itself.

06 — Defenses

The arms race the OS fought back

Operating systems and CPUs did not sit still. Beginning in the late 1990s, a series of layered defenses were added — each one closing a class of attack, each one in turn worked around by attackers, each one followed by a more sophisticated defense. The history of memory-corruption defense is the clearest example in computing of an arms race played out in software.

Defense one — stack canaries

In 1998, Crispin Cowan introduced StackGuard, a compiler modification that placed a random value — a "canary," named after the birds carried into coal mines to detect poison gas — between local variables and the saved return address. Before any function returns, the compiler inserts code that checks whether the canary is still its original value. If it has changed — because a buffer overflow trampled it on the way to the return address — the program aborts immediately.

This works as long as the attacker cannot guess the canary value. Modern compilers generate a random canary at program startup, so guessing is statistically infeasible. An attacker who can read process memory, however — through a separate information leak — can sometimes recover it.

Fig 3.8 — The stack canary · a tripwire between buffer and return address

A stack canary is a randomly-chosen value placed by the compiler between local variables and the saved return address. The function's prologue writes the value; the function's epilogue, just before ret, checks that it has not changed. A buffer overflow that wants to overwrite the saved RIP must walk upward through memory, which means it must overwrite the canary first. When the epilogue's check fails, the program calls __stack_chk_fail and aborts before ret can transfer control. The defense rests on one assumption — that the attacker cannot leak the canary value through some other channel — and is one of several reasons modern programs make information disclosure a serious vulnerability class on its own.
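
GCC and Clang implement this as -fstack-protector and its -strong and -all variants. Below is roughly what the instrumented greet looks like, rendered as a C sketch: __stack_chk_guard and __stack_chk_fail are the real glibc symbols, but the actual check is emitted as assembly, and the compiler places the canary precisely between the locals and the saved return address, which plain C cannot guarantee.

C — the canary check, as a sketch
#include <string.h>
#include <stdint.h>

extern uintptr_t __stack_chk_guard;    /* random value, set at startup */
extern void __stack_chk_fail(void);    /* logs and aborts */

void greet(char *name) {
    uintptr_t canary = __stack_chk_guard;  /* prologue: place the tripwire */
    char buffer[16];
    strcpy(buffer, name);              /* an overflow heading for the return
                                          address must trample the canary */
    if (canary != __stack_chk_guard)   /* epilogue: check just before ret */
        __stack_chk_fail();
}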

Defense two — non-executable memory (DEP / NX bit)

Classic shellcode injection writes attacker code into the buffer (which lives on the stack) and jumps there. The defense is simple and elegant: mark stack pages as non-executable. The CPU's MMU will refuse to execute instructions from a page marked NX, raising a hardware exception instead. AMD added this to x86 in 2003 as the "NX bit" (No-eXecute); Intel followed; Windows exposes the feature as DEP (Data Execution Prevention), while Linux keeps the NX name. Stack-injected shellcode immediately stopped working on systems with the protection enabled.

Defense three — ASLR

Even without injecting shellcode, attackers could redirect execution by overwriting the return address with an existing function address — say, the C library function system() with a string argument like "/bin/sh". This is called a return-to-libc attack. The countermeasure is ASLR — Address Space Layout Randomization. At each program launch, the operating system loads code, libraries, the heap, and the stack at random addresses. The attacker no longer knows where to point the overwritten return address. ASLR debuted in PaX (a Linux patch) in 2001, and became default on most operating systems by the late 2000s.

Fig 3.9 — ASLR · the same binary, three random load layouts

Three launches of the same executable, on the same machine, within seconds of each other. The .text segment, the heap, the libraries, and the stack each load at a different random address every time. An attacker who has overwritten a return address must guess where to redirect it — and on 64-bit Linux, the entropy is high enough (typically 28–30 bits for libraries, 24–28 bits for the heap, 30–32 bits for the stack) that brute-force guessing crashes the program long before a successful guess. ASLR is not perfect — small entropy pockets, leaks of address fragments, partial overwrites can sometimes be exploited — but it raises the cost of every memory-corruption attack. It is, in a real sense, the reason modern software is roughly survivable.
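
ASLR is easy to observe. Compile the sketch below as a position-independent executable (the default on modern Linux) and run it twice; every address changes between runs.

C — watching ASLR move the regions
#include <stdio.h>
#include <stdlib.h>

static int global;          /* lives in the data section */

int main(void) {
    int local;              /* lives on the stack */
    void *heap = malloc(16);

    printf(".text: %p\n", (void *)main);
    printf("data : %p\n", (void *)&global);
    printf("heap : %p\n", heap);
    printf("stack: %p\n", (void *)&local);

    free(heap);
    return 0;
}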

Defense four — and the attack that still beat it: ROP

Hovav Shacham showed in 2007 that DEP and ASLR were not enough. He introduced Return-Oriented Programming (ROP). Instead of injecting new code, the attacker chains together short fragments of existing code in the program — fragments ending in a ret instruction. Each fragment, called a "gadget," does some tiny operation (load a register, increment a counter, write a byte). By stacking up the right sequence of return addresses on the stack, the attacker can build arbitrary computation out of the program's own code, never executing anything new — defeating DEP. And given large enough libraries, useful gadgets always exist somewhere.

ROP forced another layer: Control Flow Integrity (CFI), shadow stacks (a separate, protected stack that mirrors the call stack and is checked on return), and ARM Pointer Authentication (which cryptographically signs return addresses, making forgery infeasible). Apple Silicon and recent Intel CPUs (CET — Control-flow Enforcement Technology) ship these defenses in hardware.

Fig 3.10 — Return-Oriented Programming · arbitrary computation from existing code

Return-Oriented Programming, demonstrated by Hovav Shacham in 2007. The attacker still smashes the stack, but instead of injecting new code (DEP would forbid it), the new "program" is a sequence of return addresses — each pointing to a tiny gadget: a one-or-two-instruction snippet ending in ret, somewhere in the existing libraries. The first ret after the overflow pops the first gadget address into RIP. That gadget executes its tiny op, then its own ret pops the next gadget address from the stack, and so on. With a sufficiently large library — every program links libc, which has hundreds of thousands of bytes of code — gadgets for any required operation can be found. ROP turned an entire field of defense (mark code non-executable) into a starting point. Modern hardware mitigations (CFI, shadow stacks, ARM PAC, Intel CET) are the response.

Fig 3.11 — The arms race, in one diagram

Each red dot is an attack technique that broke the previous defense. Each green dot is a hardware or compiler change that responded. The arms race continues — and at each step, the cost of exploitation rises while the cost of a single mistake by a programmer in C or C++ falls more slowly.

The deeper lesson

Buffer overflows are not a bug in any one program. They are a structural consequence of a particular language model — one in which memory is raw, addresses are first-class values, and the programmer is responsible for every check. The defenses we just walked through are real and valuable, but each is a layer of mitigation, not a cure. The cure is to use a memory-safe language wherever possible. The fact that enormous amounts of critical infrastructure — operating system kernels, web browsers, database engines, network stacks — are still written in C and C++ is simultaneously a tribute to those languages' performance and a permanent source of risk. We will return to this when we discuss C, C++, and Rust in later chapters.

What you now understand

You have followed the chain from gates to assembly to exploits. The CPU executes a fixed menu of binary instructions defined by its ISA — x86, ARM, RISC-V — and everything else is a translation onto that menu. Registers are the CPU's tiny, fast notebook; the stack is a region of memory that grows downward and stores the bookkeeping of every function call. Calling conventions are the contract between caller and callee. Function calls work by pushing a return address onto the stack and jumping; they return by popping it back. And because the return address is just a number that lives in writable memory, an unchecked write — a buffer overflow — can replace it with anything, handing control of the machine to whoever supplied the input. The defenses against this — canaries, DEP, ASLR, CFI, pointer authentication — have evolved in lockstep with the attacks for thirty years.

What comes next is the seam. We have walked the substrate from voltage to instruction — the transistor, the gate, the adder, the CPU, the stack — and at every step the program has been free to do anything the silicon supports. Nothing yet stops a function from reading another program's memory, from writing to a disk it does not own, from monopolising the CPU until the user gives up. On any real machine, those things are not allowed. The Bridge that closes Part I is about why they are not allowed — what hardware features the silicon must already provide before any program can be made to behave, and how those few features combine into the conditions under which one program can be made to own the machine. We do not yet open the kernel and look inside it. That waits for Part II. We only show what must already be true of the machine before a kernel of any kind can exist.

Bridge — The Boundary

What the Silicon
Must Provide

Five small pieces of hardware are what separate a computer from a calculator. A privilege bit. A trap. A page-table walker. A timer. A way to talk to the world. Together they are the conditions under which one program can be made to own the machine — and Part I cannot end without naming them.

TopicsPrivilege · Traps · MMU · Timer · MMIO/DMA
Era covered1965 → present
Bridge hero · What the Silicon Must Provide
01 — User vs Kernel

One bit, two worlds

Of all the millions of bits inside a modern CPU, exactly one is special enough to decide whether the next instruction runs at all. It is not set by software in the ordinary sense; it is set by the hardware itself, in response to specific events, and it cannot be flipped by a program in the lower mode. This single bit is the seam at which all of operating-system security begins.

The bit goes by different names on different chips. On x86 it is the bottom two bits of the CS selector — the Current Privilege Level, CPL — with values 0 (most privileged) through 3 (least). On ARM it is the Exception Level, EL0 through EL3. On RISC-V it is the privilege mode field — U, S, M. The number of available levels varies from architecture to architecture, but in practice every operating system we use today reduces them to two: a privileged mode for the kernel, and an unprivileged mode for everything else. Books and conversations call these Ring 0 and Ring 3, after the original Multics and Intel terminology, even on architectures whose hardware has no rings at all.

The hardware enforces, not the operating system

The crucial property — the property that makes this whole arrangement work — is that the privilege bit is enforced by the silicon, not by software. When a program in Ring 3 tries to execute a privileged instruction, the CPU does not ask politely whether that is allowed. It does not consult a kernel table. It does not defer to anyone. It traps. The instruction does not execute; control is yanked from the program and handed to a piece of code at a fixed address, in Ring 0, that the kernel set up at boot. From the program's point of view, the next instruction simply did not happen — though it may now be in the process of being terminated.

Roughly thirty instructions are privileged on x86-64. They are exactly the instructions that would let a program break the operating system's abstractions if run from user space: instructions that load the page-table base register (MOV CR3), the interrupt descriptor table (LIDT), the global descriptor table (LGDT); instructions that read or write model-specific registers (RDMSR, WRMSR); instructions that talk directly to physical I/O ports (IN, OUT); instructions that flush the translation cache (INVLPG); instructions that halt the processor (HLT). User mode is defined by exclusion: it can do everything except these. The diagram below shows the division.

Fig BR.1 — The privilege bit · two worlds, one chip

The CPL field of the CS register holds the current privilege level. With CPL=0 (kernel mode) every instruction in the ISA executes normally. With CPL=3 (user mode) about thirty instructions are forbidden — the ones that would let a program reach past the operating system's abstractions. Attempting any of them does not return an error code; it raises a #GP fault in the hardware. The CPU stops the offending instruction mid-flight and jumps to a kernel handler whose address was registered in advance. The decision is made before the instruction takes effect, by the silicon itself.
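
You can watch the trap fire from a normal Linux process. HLT is on the forbidden list; the inline assembly below (GCC/Clang syntax, x86-64) attempts it from Ring 3, the silicon raises #GP before the instruction takes effect, and the kernel's handler kills the process with SIGSEGV.

C — asking Ring 3 to do a Ring 0 job
#include <stdio.h>

int main(void) {
    printf("about to execute HLT from user mode...\n");
    __asm__ volatile ("hlt");   /* privileged: #GP fires, the kernel's
                                   handler runs, the process receives
                                   SIGSEGV and dies here */
    printf("never reached\n");
    return 0;
}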

How does the CPL get set in the first place? Boot starts in the most privileged mode the chip has — there is no operating system yet, so there is no one to grant privileges. The bootloader hands control to the kernel, which is therefore born in Ring 0. The kernel then sets up the data structures the CPU will need to lower privilege when it eventually launches the first user-space program: page tables that map user addresses, an interrupt descriptor table that handles future traps, a Global Descriptor Table whose code and stack segments mark Ring 3. Only after all of this is in place does the kernel execute the very specific instruction — SYSRET on x86-64, ERET on ARM — that drops the CPU into user mode. From that moment on, getting back to Ring 0 requires going through the single mechanism the next section is about: the trap.

One more property is worth noting before we move on. The privilege bit is not just a switch on instructions; it is a switch on memory access. Page-table entries are tagged with a U/S bit (user/supervisor), and pages tagged S are readable only when CPL=0. This is what stops a user program from poking around the kernel's data structures even by accident: the same hardware that enforces virtual-to-physical translation also refuses to translate a user-mode access into a supervisor-only page. We will see the page-table walker that does this in Section 03.

⚙️

Why this is the seam. Every protection the operating system claims to enforce — process isolation, file permissions, network sandboxing, container boundaries — bottoms out in this one bit. If a user-mode program could clear the bit by itself, the entire stack would collapse. It cannot, because the bit lives in silicon: it is set only by a small set of instructions that themselves require Ring 0 to execute. The whole tower of operating-system security stands on the bit's circular self-protection.

02 — The Trap

The one door between the two worlds

A user program eventually needs to do things only the kernel can do. Read a file. Open a socket. Allocate memory. Print to the screen. None of these instructions exist in the user-mode ISA — there is no READ_FILE opcode. What exists is one specific instruction that asks the kernel, on behalf of the user program, to do the thing. The instruction is the system call, and the mechanism that delivers it across the boundary is the trap.

The word "trap" is precise. Unlike a function call — which jumps to an address the caller chose — a trap jumps to an address the kernel registered in advance, and which user code cannot change. The user program does not know where in the kernel its system call will be handled. It only knows the way through is to fire a particular instruction (SYSCALL on x86-64, SVC on ARM, ECALL on RISC-V), and the silicon handles the rest: switch CPL to 0, save user-mode registers, jump to the kernel entry point, run the handler, then return to user mode at the next user instruction. From user space's perspective the syscall looks like any other instruction that took an unusually long time. From the kernel's perspective an entire round trip across the privilege boundary has just happened.

The path of one syscall

Fig BR.2 — The trap mechanism · one round trip across the boundary

A single system call, traced across the boundary. The user program runs ordinary instructions until it executes SYSCALL with the syscall number in RAX and arguments in the standard ABI registers. The CPU does five things atomically: it sets the privilege bit to 0, saves the user-mode RIP and RFLAGS, loads RIP from a model-specific register (MSR_LSTAR) the kernel set at boot, masks interrupts, and starts running. The kernel's entry stub looks up RAX in a syscall dispatch table, runs the handler, places the return value back in RAX, and executes SYSRET — which restores user state and jumps back to the instruction after the original SYSCALL. The whole round trip is typically under a microsecond. A typical desktop performs millions of these per second.
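
The round trip can be made by hand. This sketch (x86-64 Linux, GCC/Clang inline assembly) loads syscall number 1, write, into RAX, places the arguments in RDI/RSI/RDX, and fires SYSCALL directly with no libc in between. RCX and R11 are listed as clobbered because the hardware uses them to stash the return RIP and RFLAGS.

C — one SYSCALL, by hand
#include <unistd.h>

int main(void) {
    const char msg[] = "hello from ring 3\n";
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a" (ret)                    /* RAX: return value (bytes written) */
        : "a" (1L),                     /* RAX: syscall #1 = write           */
          "D" ((long)STDOUT_FILENO),    /* RDI: arg 1 (file descriptor)      */
          "S" (msg),                    /* RSI: arg 2 (buffer)               */
          "d" ((long)(sizeof msg - 1))  /* RDX: arg 3 (length)               */
        : "rcx", "r11", "memory");
    return ret < 0;
}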

Traps, interrupts, exceptions — the same mechanism, three names

The system call is one example of a much more general phenomenon. Whenever the CPU needs to stop running the current code and run something the kernel chose instead, it goes through the same path. The names depend only on what triggered it. A system call is a synchronous trap initiated by the program. An exception is a synchronous trap caused by something the program did wrong: division by zero, an invalid opcode, a page fault, a #GP from Section 01. An interrupt is an asynchronous trap from a hardware device: a network card receiving a packet, a disk finishing a read, a key being pressed, the timer firing. All three use the same machinery — save state, raise privilege, jump to a kernel-registered address, run, return — and all three are dispatched through one data structure: the interrupt descriptor table.

Fig BR.3 — The interrupt descriptor table · 256 doors into the kernel

When any trap fires, the CPU reads the IDT base address from its IDTR register, indexes by vector number, and jumps to the handler the kernel registered for that vector. The same table dispatches CPU exceptions (vectors 0–31), hardware interrupts from devices (vectors 32 and up), and on legacy x86 the int 0x80 system call. Exactly the same machinery underlies every involuntary entry into the kernel.

A subtle but crucial property: each IDT entry carries a DPL — Descriptor Privilege Level — that controls whether user code can raise that vector from software. The kernel sets DPL=3 only on the syscall vector; every other vector has DPL=0, which means a user-mode int 13 instruction does not deliver a fake general-protection fault; it raises a real one. Without this, user code could spoof any kernel handler invocation simply by raising the corresponding interrupt by hand. The hardware enforces who is allowed to ring which doorbell.


The kernel's authority is registered, not assumed. The CPU does not know what the kernel is. It knows where to jump on each kind of trap, because the kernel told it at boot — by writing the IDT into memory and pointing IDTR at it. After boot, that pointer is in a privileged register that user code cannot change. The kernel's authority is an arrangement the kernel made with the silicon while it was still alone with the machine.
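What the kernel writes at boot is concrete enough to sketch. The 16-byte gate layout below is fixed by the architecture; the field and function names are ours, not any particular kernel's.

```c
#include <stdint.h>

/* Sketch of an x86-64 interrupt-gate descriptor and the boot-time setup
   described above. The layout is architectural; names are illustrative. */
struct idt_gate {
    uint16_t offset_low;      /* handler address, bits 0..15       */
    uint16_t selector;        /* kernel code segment               */
    uint8_t  ist;             /* interrupt-stack-table index       */
    uint8_t  type_attr;       /* present bit | DPL | gate type     */
    uint16_t offset_mid;      /* handler address, bits 16..31      */
    uint32_t offset_high;     /* handler address, bits 32..63      */
    uint32_t reserved;
} __attribute__((packed));

static struct idt_gate idt[256];

/* dpl = 0: user code cannot raise this vector from software.
   dpl = 3: it can (the legacy int 0x80 syscall gate). */
static void set_gate(int vector, void (*handler)(void), int dpl)
{
    uint64_t a = (uint64_t)handler;
    idt[vector] = (struct idt_gate){
        .offset_low  = a & 0xffff,
        .selector    = 0x08,                      /* kernel CS          */
        .type_attr   = 0x80 | (dpl << 5) | 0x0e,  /* present, DPL, gate */
        .offset_mid  = (a >> 16) & 0xffff,
        .offset_high = a >> 32,
    };
}

/* Point IDTR at the table. From this instruction on, the arrangement
   with the silicon is in force and user code cannot undo it. */
static void load_idt(void)
{
    struct { uint16_t limit; uint64_t base; } __attribute__((packed))
        idtr = { sizeof(idt) - 1, (uint64_t)idt };
    asm volatile("lidt %0" : : "m"(idtr));
}
```

A kernel would call something like set_gate(14, page_fault_stub, 0) and set_gate(128, syscall_stub, 3), then load_idt(): exactly the asymmetry Fig BR.3 shows.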

03 — The MMU

The hardware that draws the walls between processes

Memory isolation is not a software policy. If process A could read address 0x4000 in process B's space simply by issuing a load, no firewall and no kernel subroutine could stop it — the read would already be done by the time the kernel found out. Isolation has to happen before any memory access completes. It does, because between the CPU and the RAM sits a small piece of silicon called the memory management unit, and every memory access in the modern world flows through it.

Chapter 1 introduced virtual memory as a high-level idea: each process believes it owns the entire address space, and the kernel maintains the illusion. The Bridge shows the part of the illusion that lives in hardware. The MMU is the hardware. It does not run kernel code; it runs nothing that looks like code at all. It is a circuit. Given a virtual address, it produces — in nanoseconds — a physical address, or it produces a fault.

The page-table walker as a hardware unit

Fig BR.4 — The MMU · silicon between the CPU and the RAM

The MMU is a piece of silicon between the CPU and the rest of the machine. On every memory access — load, store, or instruction fetch — the CPU hands the MMU a 64-bit virtual address and waits for a physical address back. The MMU first checks the TLB (a small associative cache of recent translations); on a hit, it returns immediately. On a miss, the page-table walker — a hardware finite-state machine, not the kernel — descends four levels of page tables in main memory, starting from the address in CR3, and produces both a physical address and a set of permission bits, which are checked on the spot. If anything is wrong — page not present, kernel-only page touched from user mode, write to a read-only page, NX-tagged page executed — the MMU raises a #PF (page fault, vector 14) and the kernel takes over.
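The walk itself is simple enough to model in a few lines of C. This is only an illustration of the algorithm the circuit implements, not code any kernel runs: phys_to_virt() and raise_page_fault() are hypothetical helpers standing in for machinery we have not built, and the permission-bit checks are elided.

```c
#include <stdint.h>

#define PRESENT (1ULL << 0)

/* Hypothetical helpers: read physical memory; deliver #PF (vector 14). */
extern uint64_t *phys_to_virt(uint64_t paddr);
extern uint64_t  raise_page_fault(uint64_t vaddr);

/* Software model of the hardware 4-level walk (x86-64, 4 KiB pages).
   Each level consumes 9 bits of the virtual address: 39, 30, 21, 12. */
uint64_t translate(uint64_t cr3, uint64_t vaddr)
{
    uint64_t table = cr3 & ~0xfffULL;              /* PML4 physical base */
    for (int level = 3; level >= 0; level--) {
        unsigned idx   = (vaddr >> (12 + 9 * level)) & 0x1ff;
        uint64_t entry = phys_to_virt(table)[idx];
        if (!(entry & PRESENT))
            return raise_page_fault(vaddr);        /* the fault path     */
        table = entry & 0x000ffffffffff000ULL;     /* next table / frame */
    }
    return table | (vaddr & 0xfff);                /* frame + offset     */
}
```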

The kernel never runs the walk itself. It only writes the page tables — the data structures the walker reads. This division of labour is essential. Translation happens billions of times per second; if the kernel had to be involved in every access, the machine would not run at all. Instead the kernel pre-arranges the tables, points CR3 at them, and lets the silicon do the lookup. When the kernel switches between processes, it changes one register — CR3 — and from that nanosecond on, every memory access made by the CPU resolves through a different process's tables. This is what process isolation actually is: a different value in CR3 means a different graph of page tables means different physical pages reachable from the same virtual addresses.
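Seen from code, the memory half of a context switch is startlingly small. A sketch, assuming kernel-mode C with inline assembly:

```c
#include <stdint.h>

/* One privileged MOV. Writing CR3 also flushes the non-global TLB
   entries, so stale translations from the old process cannot leak
   into the new one. */
static inline void switch_address_space(uint64_t pml4_phys)
{
    asm volatile("mov %0, %%cr3" : : "r"(pml4_phys) : "memory");
}
```

Everything else a scheduler saves and restores is bookkeeping; this one store is what makes the next process unable to see the last one's memory.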

The TLB — a cache for translations

A four-level walk costs four memory accesses on top of the one the program asked for: a 5× tax on every load and store. No machine could afford that. The MMU therefore includes a tiny associative cache called the translation lookaside buffer, the TLB, which remembers recent virtual-to-physical translations. The TLB is small (perhaps 64 to a few thousand entries on modern cores), but locality of reference means a hit rate above 99% is normal. The walker runs only on misses.

Fig BR.5 — The TLB · why the walk usually doesn't happen

A virtual address arrives at the MMU. The TLB is checked first; on a hit, translation is essentially free. The page-table walker runs only on a miss. Modern CPUs achieve TLB hit rates above 99% on typical workloads, which means the four-level walk — apparently a 5× tax on memory access — actually happens less than 1% of the time. The TLB is what makes paging affordable.

One final detail. When the kernel changes a page-table entry — because a page was swapped out, a permission was revoked, or a process was killed — the TLB must be told. There is a privileged instruction for exactly this: INVLPG, which invalidates a single TLB entry. After INVLPG, the next access to that virtual page misses the TLB, forcing a fresh walk through the now-updated page tables. Extending the invalidation across every CPU that shares the page tables is called a TLB shootdown, and it is one of the most expensive ordinary operations a kernel performs. INVLPG is, of course, privileged — Section 01's #GP applies. A user process cannot tamper with the TLB, even to mount a denial of service. The whole arrangement is enforced below, not above, the operating system.
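In code, the local half of the operation is one instruction; the expensive part is persuading every other core to do the same. A sketch:

```c
/* Invalidate one stale translation on this CPU. Privileged: executed
   from ring 3 this raises #GP, exactly as Section 01 described. */
static inline void flush_one_page(void *vaddr)
{
    asm volatile("invlpg (%0)" : : "r"(vaddr) : "memory");
}

/* A cross-CPU shootdown is this plus an inter-processor interrupt to
   every core sharing the tables; each core runs flush_one_page() and
   acknowledges while the initiator waits. That wait is the cost. */
```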

04 — The Timer

The piece of silicon that makes multitasking possible

Imagine for a moment a CPU exactly like a modern one — privilege bit, traps, MMU, all of it — but with no clock that fires interrupts. A cooperative kernel could still exist: programs would yield voluntarily, the kernel would schedule the next one, and the world would mostly work. Until one program — an infinite loop, a pathological algorithm, an attacker — refused to yield. There would be no way to take the CPU back. The machine would freeze, not because anything was broken, but because the kernel had no way to interrupt a program that did not want to be interrupted.

The piece of silicon that solves this is the hardware timer — a free-running counter wired directly into the interrupt controller. Every modern CPU has at least one. On x86 the local APIC timer ticks at the bus frequency and fires interrupts at a programmable rate; on ARM a generic timer does the same. The kernel programs it once, at boot — say, "interrupt me every millisecond" — and from that moment forward, no matter what user code is doing, the timer fires. Control returns to the kernel. The kernel decides what to do next. Without this one piece of hardware, preemptive multitasking is impossible — and every modern operating system, from Linux to Windows to macOS to a phone OS, depends on it.
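Programming the tick is a handful of MMIO stores. The register offsets below are the architectural LAPIC layout; the count value is illustrative, since a real kernel first maps the LAPIC (physical 0xFEE0_0000 on typical x86 systems) and calibrates the count against a known clock.

```c
#include <stdint.h>

#define LVT_TIMER  (0x320 / 4)   /* timer local-vector-table entry  */
#define INIT_CNT   (0x380 / 4)   /* countdown start value           */
#define DIV_CFG    (0x3E0 / 4)   /* divider for the timer clock     */

/* Kernel-mapped virtual address of the LAPIC register block.
   volatile: every store must actually reach the silicon. */
extern volatile uint32_t *lapic;

void start_tick(uint32_t count)
{
    lapic[DIV_CFG]   = 0x3;               /* divide clock by 16       */
    lapic[LVT_TIMER] = (1u << 17) | 32;   /* periodic mode, vector 32 */
    lapic[INIT_CNT]  = count;             /* fire every `count` ticks */
}
```

Three stores at boot, and the kernel is guaranteed a hearing on every tick for as long as the machine is powered.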

Fig BR.6 — The timer interrupt loop · how the kernel takes the CPU back

A typical second of CPU time, sliced by timer interrupts. The timer fires at a programmed rate (often 100–1000 Hz). Each tick raises an interrupt at vector 32, which dispatches into the kernel's scheduler. The scheduler chooses what runs next — possibly the same process, possibly a different one. Process C is in an infinite loop and will never yield voluntarily; the timer is the reason that does not freeze the machine. Without this single piece of silicon, the kernel could only ask processes to yield, and any program that didn't would own the CPU forever.

Modern Linux can run its timer tick in three modes: HZ_PERIODIC (a tick at a fixed interval, the historical default), NO_HZ_IDLE (skip ticks while the CPU is idle, to save power), and NO_HZ_FULL (skip ticks even while running, used for high-frequency trading and scientific computing, where every tick is an unwanted disruption). The tick period is a tunable at the millisecond scale; the principle is unchanged. Whatever the rate, the underlying mechanism is hardware that interrupts the CPU independently of the running program.

Why the timer is irreducible. Privilege rings, traps, and the MMU all give the kernel authority over what user code can do. The timer is what gives the kernel a chance to exercise that authority. Without it, the kernel has the right to act but no way to ever begin acting if the user code refuses to call it. Multitasking, time-sharing, fairness, deadline scheduling, even the ability to kill a misbehaving program — all rest on this one circuit firing on a schedule the user cannot disable.

05 — Talking to the World

How the CPU reaches beyond itself

The CPU on its own can compute, but it cannot type, see, hear, or speak. To do anything observable in the world, it has to communicate with devices: keyboards, screens, disks, network cards, audio chips, GPUs. The path from CPU to device is hardware too — and like everything else in the Bridge, the kernel is the only software that gets to use it.

Two channels carry every conversation between CPU and device. The first is memory-mapped I/O: device registers are placed at fixed addresses in the physical address space, and reading or writing those addresses is how the CPU talks to the device. The second is direct memory access: the device, on its own initiative, copies data into or out of RAM without asking the CPU on every byte. Both are coordinated by interrupts — the device tells the CPU "I'm done" by raising an IRQ, which goes through the same trap machinery from Section 02.

Memory-mapped I/O — devices live at memory addresses

Fig BR.7 — The physical address space · devices among the RAM
The physical address space is a single flat 64-bit map. Representative regions, top down:

0xFFFE_0000   BIOS / UEFI firmware ROM (read-only at runtime)
0xFEC0_0000   IO-APIC — writing here programs interrupt routing
0xFE00_0000   LAPIC · timer — interrupt controllers
0xA000_0000   PCIe MMIO — NIC, NVMe, USB controllers · GPU frame buffer
0x0000_0000   DRAM — your program's memory, page tables, kernel, all processes

Writing to the frame buffer puts pixels on screen at the next refresh; writing to PCIe MMIO is a device register write (the NIC sends a packet, the NVMe queues an op); writing to DRAM is an ordinary store and nothing is sent anywhere. Same instruction, different effect entirely, determined by the address.

The CPU sees one flat physical address space. Some of those addresses are RAM; some are a video framebuffer; some are device registers on PCIe controllers; some are interrupt-routing chips; some are firmware ROM. The CPU doesn't know which is which — it just issues a load or a store. The chipset routes the access to the right place. To send a packet, the kernel writes to the NIC's MMIO region. To draw a pixel, the kernel writes to the framebuffer. To program the timer's tick rate, the kernel writes to the LAPIC's MMIO region. The same MOV instruction does all of these — only the address differs. And only the kernel can map any of these regions into a process's address space (Section 03), so only the kernel can reach them.
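That point, that only the address distinguishes a memory store from a device command, is visible in two near-identical C functions. The NIC register offset here is hypothetical; the volatile qualifier is the real content, forcing the compiler to emit exactly one store to exactly that address.

```c
#include <stdint.h>

void store_to_ram(uint32_t *buf, uint32_t v)
{
    *buf = v;                      /* ordinary store: data sits in DRAM */
}

void ring_nic_doorbell(volatile uint32_t *nic_mmio, uint32_t tail)
{
    nic_mmio[0x18 / 4] = tail;     /* same MOV, but the chipset routes
                                      it to the NIC: a packet goes out */
}
```

Without volatile, the compiler could coalesce or delete the device store as a "redundant" write, which is correct for RAM and catastrophic for a doorbell register.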

DMA — when the device copies on its own

MMIO is fine for sending small commands ("read sector 4096 into the buffer at 0x4a200000"). It is not fine for the data itself. A modern NVMe SSD reads gigabytes per second; if the CPU had to copy every byte one MMIO load at a time, the CPU would do nothing else, and the SSD would still be the slow part. The hardware solution is direct memory access: the device is a bus master in its own right, and once the CPU has told it where to read and where to write, it shovels the data on its own. The CPU is free to run something else; when the device is done it raises an interrupt and the kernel takes over to clean up.

Fig BR.8 — Direct memory access · the device writes RAM by itself

A typical disk read with DMA. Step 1: the kernel does a small MMIO write to the SSD's command register, telling it where to put the result. Step 2: the SSD reads its sectors and writes them straight into RAM, on the memory bus, without involving the CPU. The CPU is free to run other processes during this transfer. Step 3: when the SSD is done, it raises an IRQ; the trap machinery from Section 02 dispatches the kernel's interrupt handler; the kernel marks the I/O complete and wakes whichever process was waiting. For a multi-megabyte read this is the difference between using one core's full attention and using essentially none of it.
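As code, the three steps look like this. The controller's register block, the field names, and the polling loop are all hypothetical (a real driver sleeps instead of spinning, and a real device has submission queues); the shape is the real pattern: one small command, zero CPU copies, one completion interrupt.

```c
#include <stdint.h>

/* Imaginary disk controller: one MMIO-mapped command block. */
struct dma_cmd {
    uint64_t sector;      /* where the device starts reading   */
    uint64_t dst_phys;    /* physical RAM address it writes    */
    uint32_t count;       /* bytes to transfer                 */
    uint32_t go;          /* writing 1 starts the transfer     */
};

extern volatile struct dma_cmd *disk;   /* mapped by the kernel   */
static volatile int io_done;            /* set by the IRQ handler */

void read_sector_dma(uint64_t sector, uint64_t buf_phys, uint32_t len)
{
    io_done        = 0;
    disk->sector   = sector;      /* step 1: small MMIO command  */
    disk->dst_phys = buf_phys;
    disk->count    = len;
    disk->go       = 1;           /* device begins its own copy  */

    while (!io_done)              /* step 2: stand-in for sleeping;
                                     the CPU would run other work */
        ;
    /* step 3 already happened: the IRQ handler below ran, and the
       data is in RAM without the CPU touching a byte of it */
}

void disk_irq_handler(void)       /* dispatched through the IDT   */
{
    io_done = 1;
}
```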

DMA introduces a security wrinkle. A device that can write any address in RAM is a dangerous device — a malicious or buggy network card could overwrite kernel code. Modern systems address this with another piece of silicon, the IOMMU: a second MMU that translates and bounds-checks the addresses devices use, exactly as the regular MMU does for the CPU. With the IOMMU, a NIC can be told it may DMA only into a specific buffer; an attempt to reach kernel memory raises a fault. The IOMMU is the reason VFIO and PCI passthrough into virtual machines is safe. We mention it once here; the rest of the kernel-as-program story belongs to Part II.

06 — Synthesis

What the kernel must therefore be

We can now answer the question Part I has been building toward — quietly, since Chapter 1 — without yet having looked at a single line of kernel source. Given everything the silicon provides, what shape must the kernel be? What is the minimum program that turns five hardware features into the abstractions every modern computer offers?

Each abstraction we use without thinking corresponds, almost one to one, to a hardware feature the previous five sections introduced. There is no abstraction without a corresponding piece of silicon to enforce it; and there is no piece of silicon that becomes useful until a program — the kernel — sets it up correctly at boot. The diagram below maps the chain.

Fig BR.9 — Hardware features → kernel → abstractions

Read top to bottom: each hardware feature in the upper row enables one (or more) of the abstractions in the lower row. The kernel is the middle layer — the program that takes the raw silicon primitives and arranges them into something programs can use. Without the privilege bit, there is no root vs user. Without the trap, there is no system call API. Without the MMU, processes share one memory space. Without the timer, multitasking is impossible. Without MMIO and DMA, the CPU cannot reach beyond itself. The whole tower of modern operating systems sits on these five features and the few thousand lines of kernel that bind them together at boot.

The kernel as the smallest program that arranges the silicon

Note what this argument does not say. It does not say the kernel is small, or that it does only what hardware demands. Linux is more than thirty million lines of code; macOS's XNU and Windows's NT are smaller but still vast. They contain filesystems, network stacks, GPU drivers, audio, USB, Bluetooth, virtual machines, container runtimes, schedulers, security frameworks, and the long tail of a thousand specialised subsystems. None of that is required by the silicon. What the silicon requires is much smaller: a kernel that sets up page tables, registers a trap handler, programs a timer, and gets out of the way.

The rest is engineering — and that engineering is what Part II is about. Part II takes the kernel as a real program and asks the questions only a piece of code can answer: how should processes be scheduled? How should virtual memory be organised in software, on top of the hardware tables we now know about? What data structure connects a filename to a sequence of disk blocks? How do processes talk to each other? What goes wrong when the kernel itself has bugs? None of those questions live in silicon. They live in code.

But every one of them assumes — silently, as a matter of course — that the bridge we have just walked across already exists. That somewhere underneath the schedule() function is a timer that called it. That somewhere underneath the read() system call is a SYSCALL instruction and an IDT entry. That somewhere underneath every memory dereference is a four-level page-table walker doing the translation in nanoseconds. The kernel chapter that opens Part II will not stop to mention these things. It does not have to. They are what Part I has built. They are what the Bridge is for.

The kernel is software. The reason the kernel can exist at all is hardware.

— the lesson Part I has been building to

Part I closes here. From sand to gates, from gates to instructions, from instructions to the boundary that makes a kernel possible. Part II opens the kernel and looks inside.

End of Part I

The Substrate is built.

From sand to gates, from gates to instructions. Part II opens the program that owns this hardware — the kernel — and then turns to the languages people write in.