Part One

The Physical World

Before software, there is sand. Before logic, there is voltage. Before any program runs, a billion tiny switches must already agree on what "true" and "false" mean — and one program, the kernel, must already be standing watch. This is where the story has to begin — in the place most books skip.

CHAPTER 01
The Machine Beneath Everything
Transistors, Von Neumann's insight, the CPU, and why the kernel had to be invented.
CHAPTER 02
The Algebra of Switches
Boolean logic, binary numbers, how arithmetic emerges from electricity, and IEEE 754.
CHAPTER 03
The Language the CPU Speaks
Assembly, registers, the stack, calling conventions, and how buffer overflows actually work.
BRIDGE
The Boundary
The hardware face of the kernel. Privilege rings, the trap mechanism, the MMU, the timer, MMIO, and DMA — what the silicon must provide before any kernel can exist.
Chapter 01

The Machine Beneath Everything

Before there was software, there was sand. Before there was the internet, there was a mathematician who died before seeing what he built. This is the story of computer architecture — told in the order civilization needed it.

Topics: CPU · Memory · Kernel · OS
Era covered: 1936 → 1974
[Chapter 01 hero · The Machine Beneath Everything — CPU core (ALU · control unit · registers · L1 cache) joined by data, address, and control buses to RAM (main memory), I/O devices, disk storage, and the kernel (OS core).]
01 — Context

Why does any of this exist?

To understand a modern computer, you have to understand that it was never inevitable. For most of human history, calculation was something humans did. The word computer itself originally referred to a person — usually a woman — employed to perform arithmetic by hand. The machine you are reading this on replaced an entire profession.

The transition from human computer to machine computer required a particular kind of insight: that thinking itself could be mechanized. Not all thinking — Alan Turing was careful about this — but a specific kind. The kind that follows rules. The kind that, given the same input, always produces the same output. Mathematics, in other words. And once you have a machine that can do mathematics, you have a machine that can do anything mathematics can describe. And mathematics, it turns out, can describe a remarkable amount of the world.

The 1936 paper that changed everything

In 1936, a 24-year-old Cambridge mathematician named Alan Turing published a paper titled On Computable Numbers, with an Application to the Entscheidungsproblem. It had nothing to do with machines, ostensibly. It was about a question David Hilbert had posed in 1928: is there a mechanical procedure that, given any mathematical statement, will decide whether the statement is true or false?

Turing's answer was no — but the way he proved it was extraordinary. To show that no such procedure could exist, he had to first define what "mechanical procedure" meant precisely. He invented an imaginary device: an infinite tape divided into cells, a read/write head that could move along the tape, and a finite set of rules that said what to do based on the current symbol and the machine's current state.

He called it an a-machine — automatic machine. We now call it a Turing machine. And he proved something that took decades to fully appreciate: a single, sufficiently complex Turing machine could simulate any other Turing machine, given the right rules on its tape. He called this a universal machine. It is the first formal description of what we now call a computer.

Fig 1.1 — A Turing machine, in motion
[Tape: 1 0 1 1 0 1 1 0 _ · head in state q3. Rules: (q3, 0) → write 1, move right, enter q4 · (q4, 1) → write 0, move right, enter q5 · (q5, _) → write 1, halt and accept.]

A Turing machine has three parts: an infinite tape of symbols, a head that reads and writes one cell at a time, and a finite table of rules. Each rule says: given the current state and the symbol under the head, write a new symbol, move left or right, and change to a new state. Watch the head crawl along the tape, rewrite cells, and hand off to the next state — three transitions on a nine-second loop. Anything that can be computed, can be computed by a machine like this.
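
To make the rule table concrete, here is a minimal simulator in C, hard-coding the three rules from Fig 1.1 onto a tiny tape. Everything here (the rule struct, the three-cell tape window, the state numbering) is invented for this sketch; real formalisms differ only in bookkeeping.

C · a three-rule Turing machine
#include <stdio.h>

/* One rule: in `state`, reading `read`, write `write`, move the head
   (+1 = right), and enter `next` (-1 means halt and accept). */
struct rule { int state; char read; char write; int move; int next; };

/* The three transitions from Fig 1.1. */
static const struct rule rules[] = {
    {3, '0', '1', +1,  4},
    {4, '1', '0', +1,  5},
    {5, '_', '1', +1, -1},
};

int main(void) {
    char tape[] = "01_";        /* a finite window of the infinite tape */
    int head = 0, state = 3;    /* start in q3, head on the first cell  */

    while (state != -1) {
        const struct rule *r = 0;
        for (unsigned i = 0; i < sizeof rules / sizeof rules[0]; i++)
            if (rules[i].state == state && rules[i].read == tape[head])
                r = &rules[i];
        if (!r) break;              /* no rule applies: machine rejects */
        tape[head] = r->write;      /* write the new symbol             */
        head += r->move;            /* move the head                    */
        state = r->next;            /* change state                     */
        if (state != -1)
            printf("tape: %s   state: q%d\n", tape, state);
    }
    printf("halted · final tape: %s\n", tape);   /* prints 101 */
    return 0;
}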

"We can only see a short distance ahead, but we can see plenty there that needs to be done."

— Alan Turing, 1950

The war that built the first computers

Theory met urgency in 1939. Britain needed to break German naval codes — the Enigma cipher — faster than humans could. At Bletchley Park, Turing built the Bombe, an electromechanical machine that searched the Enigma key space mechanically. Later, the engineer Tommy Flowers built Colossus, the first programmable electronic digital computer, used to break the higher-level Lorenz cipher used by Hitler's high command. These machines were not Turing machines in his theoretical sense — they were special-purpose. But they proved that mechanical computation worked at scale.

Fig 1.2 — Enigma · why it had to be broken by machine
[Enigma, the cipher that built the computer: 3 rotors × 26 positions each × plugboard × daily setting ≈ 158 quintillion possible keys. Each rotor steps after every key press, so the same key gives a different output: press 1, A → R · press 2, A → Y · press 3, A → F. The Bombe, Turing's electromechanical brute-force at Bletchley Park: ~210 Bombes ran 24/7 by 1945, ruling out millions of rotor settings per day using guessed plaintext "cribs".]

Why Enigma was so hard: the same key pressed three times in a row gives three different letters because the rotors step between presses, changing the internal wiring path. Multiplied across the rotor stack, plugboard pairings, and daily reflector settings, the keyspace was about 158 × 10¹⁸ possibilities. No pencil-and-paper attack could keep up. Turing's Bombe — the gold-tinted strip below the rotors — automated the search by exploiting structural weaknesses in the encryption: it tested rotor positions in parallel, stopping when the math became consistent. It was the first time a machine was built specifically to think through a problem human minds could not.

Historians have estimated that the Bombe and Colossus shortened the war by as much as two years, saving perhaps fourteen million lives. The machines worked. The question, after the war, became: how do you build one that isn't purpose-built for a single problem? How do you build a universal machine — Turing's theoretical idea, in actual hardware?

Historical note. Turing was prosecuted for homosexuality in 1952, sentenced to chemical castration by court order, and died in 1954 — officially by cyanide poisoning, likely suicide. He never saw the computer revolution he made possible. Britain issued a formal apology in 2009. He now appears on the £50 note.

02 — The Transistor

Everything is sand

The first general-purpose electronic computer, ENIAC (1945), was built from vacuum tubes — glass bulbs that controlled electrical current using a heated filament inside a vacuum. ENIAC had 17,468 of them. It filled a room thirty meters long, weighed thirty tons, consumed 150 kilowatts of power (by popular legend, enough to dim the lights of Philadelphia when it switched on), and broke down on average every two days because a tube would burn out. Its mean time between failures was measured in hours, not years.

This was unsustainable. To make computers smaller, faster, more reliable, and affordable, the vacuum tube had to be replaced. The replacement was invented in 1947 at Bell Labs in New Jersey by three physicists — William Shockley, John Bardeen, and Walter Brattain. They called it the transistor, a contraction of transfer resistor.

Fig 1.3 — The replacement that made everything possible
[Vacuum tube, 1906: plate, grid, heated filament · ~10 cm tall · ~5 W of heat · fragile glass. Replaced 41 years later by the transistor, 1947: emitter, base, collector on a silicon die · ~5 mm · solid state · ~5 mW · works for decades. Roughly 1,000× smaller, 1,000× less power, effectively unbreakable.]

Both devices are amplifiers and switches — both can take a small input signal and use it to control a much larger one. The vacuum tube does it by boiling electrons off a heated filament inside an evacuated glass bulb, then steering them toward a positively charged plate using a control grid in between. The transistor does the same job by gating electrons through a sliver of doped silicon — no glass, no filament, no heat, no wear. Once you can build one, you can build a billion of them; once you can build a billion, you have a CPU.

What a transistor actually is

A transistor is a switch. Specifically, it is a switch with no moving parts, controlled by electricity rather than by a finger or a relay arm. It has three terminals: a base (or gate, in the modern field-effect design), a collector (or drain), and an emitter (or source). When you apply a small voltage to the base, it lets a much larger current flow between collector and emitter. No voltage at the base, no flow.

That's it. That is the entire foundational mechanism of every computer ever built. Everything else — every program, every webpage, every video game, every neural network — is layers of abstraction built on top of switches turning on and off.

Fig 1.4 — Transistor as a switch
[Base at 0 V (no signal): switch OPEN, no current flows. Base at 0.7 V (signal): switch CLOSED, current flows.]

A transistor with no voltage at the base is open: no current flows from collector to emitter. With a small voltage at the base, the channel opens and current flows — the green dots above are electrons, drifting through the channel as long as the gate signal is present. A modern CPU contains roughly fifty billion of these switching at gigahertz speeds.

Why semiconductors

The transistor works because it is built from a semiconductor — a material whose conductivity sits between a conductor (like copper) and an insulator (like glass), and which can be precisely tuned by adding tiny amounts of impurities. Almost all modern transistors are built from silicon, which is the second most abundant element in Earth's crust. Sand is mostly silicon dioxide. The entire digital economy is, in a literal sense, built on purified sand.

Fig 1.5 — Inside the switch · isometric cross-section
[Isometric cross-section: P-type silicon substrate · n⁺ source and drain regions · gate oxide (SiO₂) · poly-silicon gate · an inversion channel forms under the oxide when V_gate rises from 0 V to 0.7 V. Scale: a modern gate length is ~3 nanometres, roughly 15 silicon atoms wide.]

Underneath the schematic symbol is a stack of materials. The substrate is silicon doped with electron-acceptor atoms (P-type). Two regions on either side are doped with electron donors (N⁺) — these are the source and drain. Between them, an insulating layer of silicon dioxide separates the substrate from a conducting gate. With no voltage on the gate, no path exists between source and drain. Apply a small positive voltage and the substrate just under the oxide inverts — a thin layer of free electrons forms a conducting channel, and current flows. This effect — the field effect — is why these are called field-effect transistors, or MOSFETs (Metal-Oxide-Semiconductor Field-Effect Transistors). Every modern CPU is built from billions of these, with gate lengths of only a few nanometres.

In 1958, Jack Kilby at Texas Instruments etched multiple transistors onto a single piece of germanium, creating the first integrated circuit. Within months, Robert Noyce at Fairchild Semiconductor independently did the same on silicon, with a more manufacturable process. Both filed patents. Both were right. Noyce's process became the basis of the entire industry.

In 1965, Gordon Moore — co-founder of Intel and a former colleague of Noyce — published an observation: the number of transistors that could fit on a single integrated circuit was doubling roughly every twelve months, and would likely continue to do so. He later revised this to every two years. This pattern, Moore's Law, held with remarkable accuracy for fifty years and drove the entire semiconductor industry's roadmap. It is why your phone, with perhaps fifteen billion transistors, exists at all.

Fig 1.6 — Moore's Law · transistors per chip, 1971—2024
[Log scale, doubling every ~24 months: 4004 · 2,300 (1971) → 8086 · 29 K → 386 · 275 K → Pentium · 3.1 M → P4 · 42 M → i7 · 731 M → Epyc · 19.2 B → M4 · 28 B (2024). Each step on the y-axis is a 10× jump; a straight line on this chart means exponential growth in reality. From 2,300 transistors to 28 billion in 53 years: twelve million times more, on chips that fit on a fingernail.]

Moore's Law plotted on a logarithmic y-axis. The points fall almost on a straight line, which is what exponential growth looks like in log space — each gridline is ten times the previous. The slope corresponds to a doubling roughly every two years. The pattern began to slow after 2015 as transistor features shrank toward atomic scales — a modern gate is only tens of atoms across — but it has lasted longer than nearly any technological prediction in history.

1947
The transistor invented at Bell Labs

Shockley, Bardeen, and Brattain demonstrate the first working transistor on December 23, 1947. Nobel Prize 1956. The replacement of vacuum tubes makes miniaturization possible.

1958
First integrated circuit

Kilby (TI) and Noyce (Fairchild) independently put multiple transistors on one chip. Kilby uses germanium, Noyce uses silicon. Silicon wins because it forms a superior native oxide insulator.

1965
Moore's Law stated

Gordon Moore predicts transistor density doubles every year (later revised to two years). The prediction becomes a self-fulfilling industrial roadmap.

1971
Intel 4004 — first commercial microprocessor

2,300 transistors on a 10μm process. 740 kHz clock. Originally designed for a Japanese calculator manufacturer. Marked the transition from "computers fill rooms" to "computers fit in pockets."

2024
Apple M4 chip

28 billion transistors. 3-nanometer process. Each transistor smaller than the width of a flu virus. Moore's Law is slowing — physical limits are being reached — but it has lasted over half a century.

Fig 1.7 — The same idea, 79 years apart
[ENIAC, 1945: ~30 metres long · 17,468 vacuum tubes · 30 tons · 150 kW · fails every two days. 79 years later, the Apple M4, 2024: ~13 mm, drawn beside a fingertip for scale · 28,000,000,000 transistors · 3 g · 8 W · runs for years uninterrupted.]

Both machines compute. ENIAC filled a thirty-metre hall; the M4 fits under your thumbnail and contains roughly twelve million times as many switches. Both fundamentally do the same thing — flip switches according to instructions — but the engineering distance between them is the entire history of the integrated circuit. The figures and silhouettes above are drawn at the same on-screen size, but the actual size ratio is closer to 2,000 to 1.

03 — The Architecture

Von Neumann's insight: instructions are data

Early computers were hardwired. ENIAC, to be reprogrammed, had to be physically rewired — operators (almost all women, called "the ENIAC girls") would unplug and replug cables for days to set up a new calculation. The hardware encoded the program. To change what the machine did, you changed the machine.

In 1945, John von Neumann — a Hungarian-American polymath who had worked on the Manhattan Project and had a hand in nearly every important mathematical development of the mid-20th century — circulated a paper called the First Draft of a Report on the EDVAC. It described a different kind of architecture, one that would become the dominant design for every computer built afterward.

His insight, distilled: store the program itself in memory, alongside the data it operates on. Instructions are just patterns of bits. Data is just patterns of bits. They can live in the same memory. The CPU reads instructions from memory exactly the same way it reads data — which means a program can read another program, modify it, and then run it. A machine can write programs.

This is not a small idea. It is the difference between a machine that does one thing very well and a machine that, given the right instructions, can do anything. The compiler that turns C code into machine code is a program. The web browser running on your computer is a program. The kernel that runs the browser is a program. They all live as data in memory until the CPU reads them as instructions.

Fig 1.8 — Von Neumann architecture, in motion
[One shared memory holds instructions and data in a single address space. Inside the CPU: the ALU (+ − × ÷, AND OR XOR NOT), the control unit (fetch, decode/execute), registers (PC, SP, AX, BX, CX, DX, …), and the L1 cache, joined by an internal bus. Beyond the bus, the outside world: I/O peripherals · keyboard · display · network · disk/SSD · GPU.]

Von Neumann architecture in its essence: one shared memory holds both program instructions and data. The CPU fetches an instruction (gold packet flowing right on the bus), decodes it, executes it on the ALU, and writes results back to memory or registers (blue packet flowing left). Then it fetches the next instruction. Nearly every computer built since 1945 follows this design — and the diagram above plays out one cycle of it on a continuous loop.

The Von Neumann bottleneck

The architecture has one famous weakness, and it is a consequence of its core design choice. Because instructions and data share the same memory, they share the same bus — the same physical wires between the CPU and main memory. The CPU can only do one transfer at a time. And while CPUs got dramatically faster over the decades (clock speeds rose from kilohertz to gigahertz), main memory became faster much more slowly. Today, a modern CPU can perform a basic operation in under a nanosecond. A round trip to main RAM takes about a hundred nanoseconds. The CPU spends most of its time waiting.

The solution is a hierarchy of caches — small, fast memory built directly into the CPU that holds copies of recently used data. We'll cover this in detail later in this chapter. For now, it's enough to know that this bottleneck is one of the deepest design constraints of modern computing, and essentially every CPU optimization since the 1980s has been an attempt to work around it.

04 — The CPU

Fetch. Decode. Execute. Repeat.

Every CPU ever built does the same four things in a loop, billions of times per second. This loop is the heartbeat of every program you have ever run.

The instruction cycle of fetch, decode, execute, and writeback is the fundamental unit of computation at the hardware level.

Fig 1.9 — The instruction cycle, in motion
[The PC advances every cycle (0x4012 → 0x4016). Step 01 Fetch: read the instruction at memory[PC]. Step 02 Decode: opcode + operands say what to do. Step 03 Execute: the ALU performs the op · add · jump · load. Step 04 Writeback: store the result, advance the PC. The cycle restarts and loops forever.]

A modern CPU executes billions of these cycles per second on each of its cores. Watch the gold packet circulate above: it represents one instruction making its way through the four stages, then the program counter advances and the next instruction begins. Everything else — your operating system, your browser, this page — is a particular sequence of instructions threaded through this loop.
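
The loop is simple enough to sketch in C. The miniature machine below is invented for this page (four made-up opcodes, a 32-word memory), but its while loop is literally fetch, decode, execute, writeback, and its program and data share one array, exactly as Von Neumann prescribed.

C · a toy fetch-decode-execute machine
#include <stdio.h>

enum { HALT = 0, LOAD = 1, ADD = 2, STORE = 3 };   /* invented opcodes */

int main(void) {
    /* One memory for both program and data — the stored-program idea. */
    int mem[32] = {
        LOAD,  16,    /* acc = mem[16]   */
        ADD,   17,    /* acc += mem[17]  */
        STORE, 18,    /* mem[18] = acc   */
        HALT,  0,
    };
    mem[16] = 5; mem[17] = 3;         /* data lives beside the code */

    int pc = 0, acc = 0, running = 1;
    while (running) {
        int op  = mem[pc];            /* FETCH the opcode at memory[PC] */
        int arg = mem[pc + 1];        /*   ...and its operand           */
        pc += 2;                      /* advance the program counter    */
        switch (op) {                 /* DECODE, then EXECUTE           */
        case LOAD:  acc = mem[arg];   break;
        case ADD:   acc += mem[arg];  break;
        case STORE: mem[arg] = acc;   break;   /* WRITEBACK to memory   */
        case HALT:  running = 0;      break;
        }
    }
    printf("mem[18] = %d\n", mem[18]);   /* prints 8 */
    return 0;
}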

What an instruction actually is

At the hardware level, every instruction is simply a binary number — a specific pattern of ones and zeros. The CPU's control unit is a circuit that, when given a particular pattern, activates a particular sequence of internal signals. Different patterns trigger different operations. The mapping from binary patterns to operations is called the instruction set architecture, or ISA. Intel and AMD CPUs use the x86 ISA. Apple Silicon, your phone, the Raspberry Pi, and every modern Mac use ARM. They are mutually incompatible — code compiled for one will not run on the other unless translated.

Assembly language is just a human-readable label for these patterns. The mnemonic mov eax, 5 is the assembler's name for the binary instruction 10111000 00000101 00000000 00000000 00000000 on x86. They mean exactly the same thing — assembly is a one-to-one translation. We will spend all of Chapter 3 on this.

x86 assembly
; A simple addition program
; This is roughly what C's "int x = 5 + 3;" compiles to

mov  eax, 5        ; load the value 5 into register EAX
mov  ebx, 3        ; load the value 3 into register EBX
add  eax, ebx      ; EAX = EAX + EBX, result is now 8

; In memory, "add eax, ebx" is just two bytes: 01 D8
; The CPU's decoder maps this pattern to "ALU add" with
; source EBX and destination EAX. Then the ALU performs it.
Fig 1.10 — From assembly to binary · one instruction decoded
[What you write: mov eax, 5. What the CPU sees, five bytes in memory: B8 05 00 00 00. The opcode byte B8 (10111000) splits into two fields: a 5-bit opcode prefix ("MOV r32, imm32") and a 3-bit register selector (000 = EAX). The four immediate bytes 05 00 00 00 reverse into one 32-bit integer, 5, stored low-byte first; the CPU reassembles them at decode. "Take the integer 5 and put it into register EAX." Five bytes. One instruction. Roughly 0.3 nanoseconds.]

A single x86 instruction in memory, taken apart byte-by-byte. The first byte (gold) is the opcode — but it isn't atomic; its top five bits encode the operation type ("move 32-bit immediate into register"), and its bottom three bits encode which register (000 = EAX). The next four bytes (blue) are the literal value 5, stored in little-endian order — the lowest byte first, which is why 05 appears closest to the opcode and the zero-padding follows. The CPU's instruction decoder is a circuit that recognises these bit patterns and routes them to the appropriate units in microscopic time.
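
What the decoder does can be imitated in a few lines of C. The sketch below handles only this one instruction family; the real x86 decoder deals with prefixes, ModRM bytes, and hundreds of opcodes. But the bit-slicing of B8 and the little-endian reassembly are exactly as described above.

C · decoding mov eax, 5 by hand
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* the five bytes of "mov eax, 5", exactly as they sit in memory */
    uint8_t inst[5] = {0xB8, 0x05, 0x00, 0x00, 0x00};

    uint8_t opcode = inst[0] & 0xF8;   /* top five bits: the operation */
    uint8_t reg    = inst[0] & 0x07;   /* bottom three: which register */

    /* reassemble the little-endian immediate into one 32-bit value */
    uint32_t imm = (uint32_t)inst[1]
                 | (uint32_t)inst[2] << 8
                 | (uint32_t)inst[3] << 16
                 | (uint32_t)inst[4] << 24;

    if (opcode == 0xB8)                /* 10111xxx = MOV r32, imm32 */
        printf("mov r%u, %lu\n", (unsigned)reg, (unsigned long)imm);
    return 0;                          /* prints: mov r0, 5 */
}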

Pipelining and parallelism inside one core

A naive CPU would do each step of the instruction cycle one at a time, finishing one instruction completely before starting the next. Modern CPUs do not. They use pipelining — overlapping the stages of different instructions, like an assembly line. While instruction 3 is being executed, instruction 4 is being decoded, and instruction 5 is being fetched. A modern pipeline may have 14 to 20 stages.

Fig 1.11 — Pipelined execution · five instructions, five stages
[Five instructions, five stages · IF fetch · ID decode · EX execute · MEM memory access · WB writeback · each instruction staggered one cycle behind the last, across nine cycles. Without pipelining: 5 instructions × 5 cycles each = 25 cycles. With pipelining: 9 cycles. Throughput: ~1 instruction per cycle.]

Each instruction still takes five cycles to fully complete — but the CPU works on five instructions simultaneously, each in a different stage. The diagonal pattern is the signature of a pipeline: a new instruction enters the IF stage every cycle, and a finished instruction leaves the WB stage every cycle. Modern CPUs may have 14 to 20 pipeline stages, multiple of each kind, and execute several instructions in parallel per cycle. The cursor above sweeps across one cycle at a time so you can see what's happening inside the chip at each tick.

They also use out-of-order execution — if instruction 5 doesn't depend on instruction 4's result, the CPU may execute 5 first while 4 waits for memory. And speculative execution: if there's a branch (an if statement), the CPU guesses which way it will go and starts executing that path before the condition has been computed. If it guesses right, it saves time. If it guesses wrong, it discards the work. Modern CPUs guess right about 95% of the time.
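
Branch prediction is one of the few microarchitectural effects you can observe from ordinary code. The classic demonstration below sums the same data twice, once shuffled and once sorted: the instruction counts are identical, but after sorting the branch becomes predictable. It is a sketch; exact numbers vary by machine, and an optimizing compiler may replace the branch with a branchless select and erase the difference.

C · making the branch predictor visible
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)

/* Sum only the elements >= 128; the `if` is the branch being guessed. */
static long sum_big(const int *a) {
    long s = 0;
    for (int i = 0; i < N; i++)
        if (a[i] >= 128)
            s += a[i];
    return s;
}

static int cmp(const void *x, const void *y) {
    return *(const int *)x - *(const int *)y;
}

int main(void) {
    static int a[N];
    for (int i = 0; i < N; i++) a[i] = rand() % 256;

    clock_t t0 = clock();
    long s1 = sum_big(a);             /* random data: ~50% mispredicts    */
    clock_t t1 = clock();

    qsort(a, N, sizeof a[0], cmp);    /* now the branch is two long runs  */

    clock_t t2 = clock();
    long s2 = sum_big(a);             /* sorted data: near-perfect guesses */
    clock_t t3 = clock();

    printf("unsorted: sum=%ld  %.0f ms\n",
           s1, 1e3 * (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("sorted:   sum=%ld  %.0f ms\n",
           s2, 1e3 * (double)(t3 - t2) / CLOCKS_PER_SEC);
    return 0;
}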

Fig 1.12 — Speculative execution · the CPU guesses, then checks
[Every branch is a fork in time: at a conditional (jne loop_start) whose condition is not yet evaluated, the CPU starts down the predicted path (add eax, 1 · load [ebx], ecx · cmp ecx, 0), ignoring the alternate path (mov eax, 0 · jmp exit) unless the guess was wrong. ✓ guess correct (~95%): commit. ✗ guess wrong (~5%): discard. But even discarded speculative work leaves traces in the cache · Spectre and Meltdown exploit this.]

When the CPU reaches a conditional branch, it can't afford to wait for the condition to be computed — that would idle the pipeline for many cycles. So it guesses, based on the branch's history, and starts executing one path immediately. If the guess turns out right, the speculative work is committed and time was saved. If wrong, the work is discarded and the alternate path begins. Modern CPUs guess right about 95% of the time. The catch — exposed in 2018 — is that "discarded" only means the architectural state is rolled back. Microarchitectural side effects, especially in caches, persist. Memory read by code that "shouldn't have run" leaves a fingerprint in the cache, and a careful attacker can measure that fingerprint.

🔬

Spectre and Meltdown (2018). Two catastrophic CPU vulnerabilities discovered in nearly every processor made since 1995. They exploited speculative execution: when the CPU guessed wrong and discarded the work, traces of that work remained in the cache — traces that an attacker could measure to read memory they should not have access to. Hardware has bugs too, and they are far harder to fix than software bugs. We will return to this in Chapter 15.

05 — The Kernel

Why the kernel had to be invented

In the early 1950s, running a program meant booking the entire computer for yourself. You brought your stack of punched cards to the machine room, the operator loaded them, the machine ran your program, printed the output, and you came back an hour later to read it. One program at a time. No sharing.

Fig 1.13 — Running a program in 1955 · one job, one user, one machine
[Punch-card era, batch processing: your program on cards (~80 columns × 12 rows, one card ≈ one line) → submitted to the operator → the mainframe runs it through the card reader, CPU 100% on your job, one user at a time → your printout, about an hour later. If your program had a bug, you wouldn't know for an hour. Then you'd punch new cards and queue again.]

For nearly two decades, this was the entire user experience of computing. The machine was a precious shared resource; the user — a programmer, scientist, or engineer — was a supplicant who handed over a stack of cards and waited. The CPU often sat idle while the operator loaded the next deck.

As computers got faster and programs got longer, this became absurd. The CPU sat idle most of the time, waiting for slow input/output devices like punch card readers or magnetic tape drives. Universities started asking the obvious question: can multiple programs share a computer? Can one program run while another waits for I/O? Can different users be logged in simultaneously?

The answer was yes — but only if something managed the sharing. That something became the operating system kernel. The kernel is the one program that always runs. It owns the hardware. Every other program must ask it for permission to do anything that touches the outside world.

Fig 1.14 — Privilege rings · syscalls cross the boundary
[Bottom: hardware · CPU · RAM · disk · GPU · network interface. Middle: the kernel, Ring 0, privileged · memory management · process scheduling · device drivers · filesystems · interrupt handling · system calls · networking stack. Top: user space, Ring 3, unprivileged · your program · browser · database · SSH daemon. Protection rings are hardware-enforced privilege levels, physically built into the CPU: user code running in Ring 3 executes a syscall instruction → the CPU switches to Ring 0 → the kernel handler runs, checks permissions, talks to hardware → return to user, result delivered.]

x86 CPUs have four privilege levels (rings 0–3), but in practice operating systems only use two: Ring 0 (kernel mode) and Ring 3 (user mode). Code in Ring 0 can execute any CPU instruction and access any memory. Code in Ring 3 cannot. The boundary is enforced by the CPU hardware itself, not by software. Watch the gold packet: an ordinary user program executes a syscall instruction, the CPU traps into Ring 0, the kernel handles the request, and control returns to user space — typically in less than a microsecond, and a typical desktop performs millions of these per second.

The system call: crossing the boundary

When your Python script opens a file, it doesn't access the disk directly. Your program — running in Ring 3 — has no permission to talk to the disk controller. Instead, it calls open() in Python, which calls down into the C library, which executes a special CPU instruction (syscall on modern x86-64). The CPU saves the program's state, switches to Ring 0, and jumps to a fixed kernel entry point. The kernel then checks whether your process has permission to read that file, finds it on disk, and returns a numeric handle (a file descriptor) to your program.

Your program never touched the hardware. It asked the kernel, the kernel decided. This is the entire foundation of operating system security. Without this boundary, every program would have full access to every other program's memory, every file, every network packet. With it, programs are isolated from each other by hardware-enforced rules.

C → kernel
// What you write in C:
FILE *f = fopen("data.csv", "r");

// What the C library does internally:
int fd = syscall(SYS_open, "data.csv", O_RDONLY, 0);

// What happens at the CPU level:
//   1. The syscall instruction triggers a switch from Ring 3 to Ring 0
//   2. The kernel reads the syscall number (SYS_open == 2 on Linux)
//   3. It dispatches to sys_open() inside the kernel
//   4. sys_open checks process permissions against file ownership
//   5. It walks the filesystem (directory tree) to find the file
//   6. It allocates a file descriptor in this process's table
//   7. It switches back to Ring 3 and returns the descriptor (e.g. 3)
//
// Your program sees: fd == 3. It never touched the disk hardware.

UNIX and the kernels we still use

In 1969, at Bell Labs, Ken Thompson and Dennis Ritchie designed an operating system called UNIX. Its design was austere and elegant: everything is a file. A regular file is a file. A keyboard is a file. A network connection is a file. A running process exposes a directory of files describing it. All of them accessed through the same system call interface: open, read, write, close.
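
The uniformity is easy to see from C. The sketch below is Linux-specific, and data.csv stands in for any ordinary file you have lying around; it reads a regular file, a device, and the kernel's description of the running process itself through the identical open/read/close interface.

C · everything is a file
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* a regular file, a device, and a view into this very process */
    const char *paths[] = { "data.csv", "/dev/urandom", "/proc/self/status" };
    char buf[64];

    for (int i = 0; i < 3; i++) {
        int fd = open(paths[i], O_RDONLY);       /* same syscall for all */
        if (fd < 0) { perror(paths[i]); continue; }
        ssize_t n = read(fd, buf, sizeof buf);   /* same read() for all  */
        printf("%-18s  fd=%d  read %zd bytes\n", paths[i], fd, n);
        close(fd);                               /* same close() for all */
    }
    return 0;
}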

UNIX became, directly or indirectly, the ancestor of nearly every operating system in current use. Linux is a UNIX-like kernel written from scratch by Linus Torvalds in 1991. macOS uses a kernel called XNU, derived from a hybrid of Mach and BSD (a UNIX descendant). iOS and Android are both built on UNIX-derived kernels. Even Windows, originally not UNIX-based, now ships with a Linux subsystem. The architectural ideas in UNIX — processes, file descriptors, the system call interface — define what an operating system is.

🛡️

Why this matters for security. A privilege escalation attack is one in which an unprivileged process — your malicious program running in Ring 3 — finds a way to gain Ring 0 access. If it succeeds, it has full control of the machine: it can read any file, watch any keypress, install any rootkit. Every major operating system has had privilege escalation vulnerabilities. The kernel is the most security-critical code on any computer. We will spend significant time on this in Part II's kernel chapter and again in Part IV's unified security chapter.

06 — Memory

The hierarchy of forgetting

Memory in a computer is not one thing. It is a hierarchy of increasingly larger, slower, cheaper storage, managed at different levels by the CPU and the kernel. Each level holds a copy of part of the level below it. Closer to the CPU means faster but smaller. Further away means larger but slower.

The numbers below tell the entire story of why optimizing software is hard. The CPU performs an arithmetic operation in roughly 0.3 nanoseconds. A round trip to main memory takes 60 nanoseconds — two hundred times longer. A read from a fast SSD takes 50,000 nanoseconds — 167,000 times longer than a register access. Most of computer architecture for the past forty years has been about hiding this gap.

Level · Location · Typical size · Access time · Managed by
Registers · inside the CPU core · ~1 KB · 1 cycle (~0.3 ns) · compiler, CPU
L1 cache · on the CPU die · 32–64 KB · 4 cycles (~1 ns) · CPU hardware
L2 cache · on the CPU die · 256 KB – 1 MB · 12 cycles (~4 ns) · CPU hardware
L3 cache · on the CPU package · 8–64 MB · 40 cycles (~13 ns) · CPU hardware
RAM (DRAM) · on the motherboard · 8–128 GB · ~100 cycles (~60 ns) · OS kernel
SSD (NVMe) · on the PCIe bus · 256 GB – 4 TB · ~50,000 ns · OS + filesystem
HDD (spinning disk) · SATA bus · 1–20 TB · ~5,000,000 ns · OS + filesystem
Fig 1.15 — The hierarchy of forgetting · width = capacity, dot speed = latency
[Closer to the CPU = faster; further away = bigger. Registers ~1 KB · 0.3 ns → L1 cache ~32 KB · 1 ns → L2 cache ~512 KB · 4 ns → L3 cache ~32 MB · 13 ns → RAM (DRAM) ~16 GB · 60 ns → SSD (NVMe) ~1 TB · 50,000 ns → HDD (spinning disk). A register is 17 million times faster than a hard drive. The whole job of modern memory architecture is to hide that gap from you.]

The same data as the table above, drawn so the gaps are visible. Each row's width is roughly proportional to its capacity; each row's dot crosses at a speed set by that level's access time. The top three rows tick almost too fast to follow; the SSD dot crawls; the HDD dot is essentially still. Most of computer architecture for the past forty years — caching, prefetching, pipelining, branch prediction, parallelism — exists to hide this gap from the program.

Why caches multiply speed

The hierarchy works because programs do not access memory uniformly. They keep returning to the same regions over and over — looping over an array, calling a function repeatedly, reading the next word in a string. This is called locality of reference, and it is the reason a small fast cache speeds up a much larger slow memory by an enormous factor. The math is unforgiving but simple:

T_avg = h · t_cache + (1 − h) · t_RAM

If 95% of accesses hit L1 cache (1 ns) and only 5% miss into RAM (60 ns), the average access time is 0.95 × 1 + 0.05 × 60 = 3.95 ns. Without the cache, every access would cost 60 ns. The cache makes the whole system roughly fifteen times faster — and it does this with a hit rate that needs to be high but doesn't need to be perfect. The same calculation justifies every layer of the hierarchy: a small fast tier above a larger slow one, exploiting the fact that the next thing a program needs is usually close to the last thing it needed.
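
Locality is also something you can measure. The sketch below reads the same 64 MB twice, once sequentially and once hopping 4,096 bytes at a time, so nearly every read in the second pass lands on a cold cache line. The buffer size, stride, and timings are illustrative; the ratio you see depends on your machine and compiler flags.

C · locality of reference, measured
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N (64L * 1024 * 1024)   /* 64 MB, far larger than any cache */

int main(void) {
    unsigned char *a = malloc(N);
    if (!a) return 1;
    memset(a, 1, N);            /* touch every page so RAM really backs it */
    long sum = 0;

    /* Pass 1, sequential: every 64-byte cache line is used fully,
       and the prefetcher can see the next line coming. */
    clock_t t0 = clock();
    for (long i = 0; i < N; i++) sum += a[i];
    clock_t t1 = clock();

    /* Pass 2, stride 4096: the same N reads in total, but almost
       every access misses the cache and pays the trip to RAM. */
    for (long off = 0; off < 4096; off++)
        for (long i = off; i < N; i += 4096) sum += a[i];
    clock_t t2 = clock();

    printf("sequential: %.0f ms\n",
           1e3 * (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("strided:    %.0f ms   (checksum %ld)\n",
           1e3 * (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    free(a);
    return 0;
}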

Peter Denning's working set theory (1968) made this rigorous, and every CPU since has been a refinement of the idea. We will see the same locality argument resurface in Part II's kernel chapter when we look at virtual memory and page caches, in Part III's web chapter when we examine DNS caching, and in Part IV's data chapter when we examine database buffer pools. The same equation, the same intuition, all the way up the stack.

Virtual memory: the lie the kernel tells

When your program asks for memory, the kernel does not give it a real physical address. It gives a virtual address — a fiction. Your program believes it has its own private address space starting at zero, with gigabytes of memory available. So does every other program. They all think they own the machine.

The CPU's Memory Management Unit (MMU), guided by tables that the kernel maintains, translates these virtual addresses to physical RAM addresses every time a program reads or writes memory. The translation tables — called page tables — divide memory into 4-kilobyte chunks called pages and map each virtual page to either a physical page in RAM or to a location on disk (if RAM is full and the page has been swapped out).

This achieves three things at once: isolation (no program can read another's memory because the translation tables for different programs map to different physical pages), flexibility (the kernel can move pages around in RAM, swap them to disk, or load them on demand from a file), and protection (the page tables also encode permissions — read, write, execute — and the MMU enforces them).
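
You can watch the fiction from inside a process. The sketch below prints the virtual addresses of its own code, globals, stack, and heap. Run it twice and, with ASLR on, the numbers change between runs; run two copies at once and they may even overlap, because each process has its own page tables mapping those addresses to different physical RAM.

C · printing the lie
#include <stdio.h>
#include <stdlib.h>

int global = 42;                       /* lives in the data segment */

int main(void) {
    int local = 7;                     /* lives on the stack */
    int *heap = malloc(sizeof *heap);  /* lives on the heap  */

    /* All four are virtual addresses, fictions maintained per-process
       by the kernel's page tables and translated by the MMU. */
    printf("code:   %p\n", (void *)main);
    printf("global: %p\n", (void *)&global);
    printf("stack:  %p\n", (void *)&local);
    printf("heap:   %p\n", (void *)heap);

    free(heap);
    return 0;
}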

💡

Buffer overflow attacks exploit the memory model directly. If a program writes more data than a buffer can hold, the extra bytes spill into adjacent memory. If that memory contains a return address — an address the CPU will jump to when the current function ends — an attacker who controls the input controls where the CPU jumps next. Stack canaries, ASLR (Address Space Layout Randomization), and DEP (Data Execution Prevention) are the layered hardware and OS defenses against this. We'll see exactly how this works at the assembly level in Chapter 3.

What you now understand

You have followed the chain from sand to software. Transistors — silicon switches controlled by voltage — combine into logic gates, then into integrated circuits, then into CPUs. The CPU runs an endless loop: fetch, decode, execute, writeback. It executes instructions encoded as binary patterns drawn from an instruction set architecture like x86 or ARM. Programs and data live together in the same memory, a design choice — Von Neumann's — that defines what a modern computer is. The kernel mediates between programs and hardware, enforcing isolation and security through privilege rings and virtual memory.

This is the substrate. Every chapter that follows builds on it. Chapter 2 goes deeper into a single layer of this stack — the layer where mathematics and electricity meet. We will look at why a computer must be binary, how George Boole invented the logic that runs on those binary signals, and how arithmetic — the thing computers fundamentally do — emerges from a few simple gates. By the end of Chapter 2, you will understand at a physical level why 0.1 + 0.2 does not equal 0.3 in any modern programming language.

Chapter 02

The Algebra of Switches

In 1847, a self-taught English schoolteacher published a book reducing logic to algebra. He died decades before the electron was discovered, never imagining a circuit. A century later, his algebra became the only mathematics that machines can do.

Topics: Boolean logic · Binary · Gates · Floats
Era covered: 1847 → 1985
[Chapter 02 hero · The Algebra of Switches — A · B + ¬A · B rendered as AND, OR, and NOT gates wired over 1s and 0s: the only language electricity speaks.]
01 — Why binary

Why a computer can only count to two

You count to ten because you have ten fingers. The Babylonians counted to sixty, tallying the twelve finger joints of one hand with the five fingers of the other: twelve times five. The Mayans used twenty, counting fingers and toes. Every counting system in human history was shaped by the body using it. Computers count to two — because they are made of switches, and a switch has only two states.

A common but wrong intuition is that computers use binary because it is somehow simpler or more elegant. The real reason is purely physical: electricity reliably represents two states, but not many more. A wire either has a voltage above some threshold (let's call this 1) or it doesn't (let's call this 0). The CPU can detect "above the threshold" and "below the threshold" with very high reliability, even when there is electrical noise, voltage fluctuations, or manufacturing variation.

Could you build a computer that used three voltage levels — low, medium, high — to represent three digits? Yes. The Soviet Union built one. The Setun, designed in 1958 at Moscow State University by Nikolay Brusentsov, used balanced ternary (digits −1, 0, +1) and was, in many mathematical respects, more elegant than binary. But it required precisely tuned voltage thresholds that were difficult to maintain reliably as components aged. Binary won not because it was theoretically superior, but because it was physically robust: with only two widely separated levels, it takes a large voltage drift to turn a 1 into a 0, whereas with three levels squeezed into the same range, a much smaller drift makes one digit look like its neighbour.

Fig 2.1 — Why binary survives noise
[Binary: two well-separated voltage levels around a single threshold · the noise margin is huge, and a small drift cannot flip the bit. Ternary: three closely-spaced levels · the noise margin is tiny, and a small drift can confuse adjacent levels.]

A binary signal has two levels with a wide gap between them — small voltage variations cannot flip a 1 into a 0. A ternary signal squeezes three levels into the same voltage range, leaving narrow tolerance bands. As components age and temperature varies, ternary fails more often. Engineering, not mathematics, killed it.

So we have a wire that can be on or off. With one wire we can express two values. With two wires, four. With three, eight. With n wires, we can express 2ⁿ values. This is a bit — short for binary digit. Eight bits make a byte, enough to represent 256 distinct values. A modern CPU works internally with words that are 64 bits wide, capable of representing 2⁶⁴ ≈ 18 quintillion values per word.

A note on counting

In any base b, an n-digit number can represent bⁿ distinct values. A 3-digit decimal number ranges from 000 to 999 — that's 10³ = 1,000 values. A 3-digit binary number ranges from 000 to 111 — 2³ = 8 values. The exponential growth in either base is the same; only the base of the exponent differs.

This is why the size of computer memory grows so quickly with each added bit. Adding one wire doubles the addressable space.

02 — Boole

The Victorian schoolteacher who reduced logic to algebra

In 1815, in Lincoln, England, a shoemaker's son was born named George Boole. He had little formal education — he taught himself Latin at twelve, Greek at fourteen, and the major European languages by his late teens. By twenty-six, with no university degree, he was running his own school. By thirty-two, he was a self-taught mathematician publishing original papers in differential equations.

In 1847, he published a short book titled The Mathematical Analysis of Logic. Seven years later, he expanded it into An Investigation of the Laws of Thought. The thesis of these two works was, at the time, almost incomprehensible: logic — the discipline of valid reasoning — was a branch of algebra. The same kind of algebra you do in school, with symbols and equations, but with different rules.

In ordinary algebra, x + y = y + x and xy = yx. In Boole's algebra, the variables represent not numbers but truth values — propositions that are either true or false. The operations are not addition and multiplication but OR (logical disjunction), AND (logical conjunction), and NOT (negation). And there are only two values a variable can take: 0 (false) and 1 (true).

"The design of the following treatise is to investigate the fundamental laws of those operations of the mind by which reasoning is performed; to give expression to them in the symbolical language of a Calculus…"

— George Boole, An Investigation of the Laws of Thought, 1854

Boole died in 1864 at age 49 from pneumonia, after walking three miles in heavy rain to give a lecture, lecturing in soaking-wet clothes, and then being put to bed by his wife who, on the principle that "like cures like," allegedly threw buckets of cold water on him. He never saw electricity used in computing; he died more than thirty years before the electron itself was discovered. His algebra was, for seventy years, considered a beautiful but useless curiosity in the history of logic.

Shannon's master's thesis: the most consequential paper of the 20th century

Then, in 1937, a 21-year-old MIT graduate student named Claude Shannon wrote his master's thesis. He was working with electrical relay circuits — telephone-switch-like devices that could be open (no current) or closed (current flowing) based on another control signal. He noticed something: the behavior of these circuits could be described exactly by Boole's algebra. An open relay was 0. A closed one was 1. Two relays in series implemented AND. Two in parallel implemented OR. A relay that opened when its control signal was on implemented NOT.

Shannon showed that you could take any logical proposition expressible in Boolean algebra and build a circuit of relays that physically computed it. And that you could simplify circuits by simplifying the corresponding Boolean expressions. Logic and electricity were the same thing.

The thesis is widely considered the most important master's thesis ever written. It is the bridge between Boole's pure mathematics and modern computing. Without it, the universal computer Turing imagined could not have been built. With it, every modern computer is a physical realization of a Boolean expression.

Fig 2.2 — Boole's algebra · Shannon's gate · the same idea, ninety years apart
[Boole, 1847, propositions as sets: A · B = A ∩ B, true only when both propositions hold. Shannon, 1937 ("the master's thesis"), propositions as relays: an AND is two relays in series, and current flows only when both are closed. Boole's set intersection · Shannon's relay AND · the same algebra, ninety years apart.]

In 1847 George Boole proposed that logical reasoning could be written as algebra over two values, and that the conjunction of two propositions — written A · B — corresponded to the intersection of their truth-sets. In 1937 Claude Shannon, working with electrical relays at MIT, proved that the very same algebra described how series-connected switches behave: current flows through both only when both are closed. The expression Boole drew as overlapping circles, Shannon drew as overlapping currents. Every logic gate in every modern CPU is, in this exact sense, a piece of mathematics from 1847 made physical.

📐

The deeper claim. Anything that can be computed — any function that takes some input and produces some output by following a finite procedure — can be expressed as a Boolean formula and built as a circuit of AND, OR, and NOT gates. This is the practical version of Turing's universality theorem. It is why the same hardware can run a calculator, a video game, and a neural network. They are all, at the bottom, Boolean expressions.

03 — Logic gates

The atoms of computation

A logic gate is a small circuit — a few transistors arranged in a particular way — that takes one or two input bits and produces one output bit according to a Boolean rule. There are only a handful of fundamental gates, and every computation a computer performs reduces to combinations of them.

The three foundational gates

AND — true only if both inputs are true

A   B   A · B
0   0   0
0   1   0
1   0   0
1   1   1

OR — true if at least one input is true

A   B   A + B
0   0   0
0   1   1
1   0   1
1   1   1

NOT — flips the input

A   ¬A
0   1
1   0

The Boolean operators are written multiple ways depending on context: AND as · or ∧, OR as + or ∨, NOT as ¬ or an overbar. In code, programmers often write &&, ||, and ! in C/Java/JavaScript, or and, or, not in Python. They all mean the same thing.

Fig 2.3 — The three fundamental gates
[The three standard gate symbols: AND (inputs A, B · Q = A·B), OR (inputs A, B · Q = A+B), NOT (input A · Q = ¬A).]

The standard symbols for the three foundational logic gates. The shape encodes the operation: AND is a flat-backed shield, OR is a curved-back arrowhead, NOT is a triangle with a small inversion bubble. These shapes have been standard since the 1960s and are recognized by every electrical engineer in the world.

Universal gates: NAND alone is enough

A remarkable result, due to Henry Sheffer in 1913: you don't actually need three different kinds of gates. A single gate type — NAND (NOT-AND) — can be combined to implement any Boolean function. NOT, AND, OR, everything. This is called functional completeness, and NAND is a universal gate.

Why this matters in practice: chip manufacturers can build their fabrication process around a single gate type and combine NAND gates in different patterns to make any other gate. Modern CPUs aren't literally made of just NANDs — there are also NOR, XOR, and other compound gates for efficiency — but the principle stands. One gate, repeated billions of times in different arrangements, is enough to compute anything computable.
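
The construction is short enough to verify directly. In the C sketch below, ints stand in for wires and one nand() function is the only primitive; NOT, AND, OR, and even XOR are derived from it alone, and the truth table it prints matches the gates above. (The trailing underscores just avoid name clashes.)

C · every gate from NAND
#include <stdio.h>

/* the single primitive — everything else is built from this */
static int nand(int a, int b) { return !(a && b); }

static int not_(int a)        { return nand(a, a); }              /* inputs tied    */
static int and_(int a, int b) { return not_(nand(a, b)); }        /* NAND, inverted */
static int or_(int a, int b)  { return nand(not_(a), not_(b)); }  /* De Morgan      */
static int xor_(int a, int b) { return and_(or_(a, b), nand(a, b)); }

int main(void) {
    printf("A B | NOT A  AND  OR  XOR\n");
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            printf("%d %d |   %d     %d    %d   %d\n",
                   a, b, not_(a), and_(a, b), or_(a, b), xor_(a, b));
    return 0;
}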

Fig 2.4 — NAND universality · the only gate you actually need
[NOT, AND, and OR, all from a single NAND. NOT: one NAND with its inputs tied. AND: two NANDs, the output inverted. OR: De Morgan's ¬(¬A · ¬B), three NANDs. Every truth table matches the conventional gate it replaces. Sheffer 1913 · functional completeness · one gate, anything computable.]

Three foundational gates — NOT, AND, OR — each rebuilt from the single NAND primitive. Tying both inputs of a NAND together gives a NOT. An AND is a NAND followed by a NOT (which is itself a NAND). And by De Morgan's laws, OR is the negation of "neither A nor B," which is three NANDs arranged so the inputs are individually inverted before a final NAND combines them. Every truth table on every panel matches the conventional gate it replaces. From this single primitive — and silicon really is, in its bottom layers, fields of NAND-style transistor pairs — every computation any computer has ever performed can be assembled.

De Morgan's Laws

Two fundamental identities of Boolean algebra, both due to Augustus De Morgan, a contemporary of Boole:

¬(A · B) = ¬A + ¬B

¬(A + B) = ¬A · ¬B

These laws say that you can move a NOT through an AND or OR by flipping the operator. They are how you simplify Boolean expressions, and they are how chip designers convert one kind of gate-level circuit into another. They will reappear later when we discuss filtering in databases (Chapter 13) and access control in security (Chapter 15) — Boolean algebra is the underlying logic of nearly every formal system in computing.
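
With only two values per variable, both identities can be checked by brute force over all four input combinations, as the C snippet below does. For Boolean algebra, exhaustive checking is a proof.

C · checking De Morgan exhaustively
#include <stdio.h>

int main(void) {
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++) {
            int law1 = (!(a && b)) == (!a || !b);   /* ¬(A·B) = ¬A + ¬B */
            int law2 = (!(a || b)) == (!a && !b);   /* ¬(A+B) = ¬A · ¬B */
            printf("A=%d B=%d   law1: %s   law2: %s\n",
                   a, b, law1 ? "holds" : "FAILS",
                         law2 ? "holds" : "FAILS");
        }
    return 0;
}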

04 — Arithmetic from logic

How a circuit learns to add

A computer is fundamentally a machine for arithmetic — Charles Babbage called his 19th-century mechanical proto-computer the Analytical Engine for a reason. But at the level of voltages and gates, the machine has no concept of "five" or "twelve." It only knows ones and zeros and Boolean operations on them. So how do we get from NAND to 5 + 3?

The answer is constructive. We build it up, one bit at a time. Let's start with adding two single bits.

The half adder

In binary, single-bit addition has four cases:

Adding two single bits

A   B   Sum   Carry
0   0   0     0
0   1   1     0
1   0   1     0
1   1   0     1

Look closely at those output columns. The Sum column matches an operation we don't yet have a name for: it's true when exactly one input is true. This is called XOR (exclusive OR). The Carry column is exactly an AND. So a single-bit adder — called a half adder — is just an XOR combined with an AND, sharing inputs. Two gates produce two output bits: the sum and the carry.

The full adder, and adding wider numbers

To add multi-bit numbers, you need to handle the carry that propagates between bit positions. This requires a full adder: a circuit that takes three input bits — A, B, and a carry-in from the previous position — and produces a sum bit and a carry-out. A full adder is built from two half adders and an OR gate.

To add two 8-bit numbers, you chain together eight full adders. The carry-out of each becomes the carry-in of the next. To add 64-bit numbers, you chain 64 of them. This design is called a ripple-carry adder, and while modern CPUs use faster variants (carry-lookahead adders), the principle is the same: arithmetic emerges from chaining simple gates.

The Arithmetic Logic Unit — the ALU you saw inside the CPU diagram in Chapter 1 — is, fundamentally, a circuit of adders, plus some additional logic for subtraction, multiplication, and bitwise operations. Every numerical operation a computer performs is a Boolean expression.

Fig 2.5 — Adding 5 + 3 = 8 in binary
[A = 0101 (5) and B = 0011 (3) enter four chained full adders, bit by bit; the output is SUM = 1000 (8).]

Adding 5 + 3 in binary. The four-bit numbers 0101 and 0011 are summed bit by bit, with each full adder handling one bit position and passing carries forward. The result is 1000 — eight in decimal. The same principle scales to 64-bit numbers; you just need 64 full adders.
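
The same chain, written out in C and mirroring the gate diagrams of Fig 2.6 below: a half-adder is one XOR and one AND, a full-adder is two half-adders and an OR, and the loop in main is the ripple-carry chain, one full-adder per bit position.

C · a ripple-carry adder from gates
#include <stdio.h>

/* half-adder: Sum = A XOR B, Carry = A AND B */
static void half_add(int a, int b, int *sum, int *carry) {
    *sum   = a ^ b;
    *carry = a & b;
}

/* full-adder: two half-adders plus an OR for the carry-out */
static void full_add(int a, int b, int cin, int *sum, int *cout) {
    int s1, c1, c2;
    half_add(a, b, &s1, &c1);       /* first half-adder: A and B        */
    half_add(s1, cin, sum, &c2);    /* second: fold in the carry-in     */
    *cout = c1 | c2;                /* OR of the two intermediate carries */
}

int main(void) {
    int a = 5, b = 3, sum = 0, carry = 0;

    /* ripple-carry: each stage's carry-out feeds the next carry-in */
    for (int i = 0; i < 8; i++) {
        int s;
        full_add((a >> i) & 1, (b >> i) & 1, carry, &s, &carry);
        sum |= s << i;
    }
    printf("%d + %d = %d\n", a, b, sum);   /* prints 5 + 3 = 8 */
    return 0;
}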

Fig 2.6 — Inside one full-adder · the half-adder primitive, twice
[XOR + AND → half-adder → full-adder → ripple chain. The half-adder takes two inputs, no carry-in: XOR produces Sum, AND produces Carry. The full-adder is two half-adders plus an OR: HA₁ combines A and B (A ⊕ B, A·B), HA₂ folds in Cin, and the OR produces Cout. Ripple-carry adder: a chain of full-adders, FA₀ → FA₁ → FA₂ → FA₃, each carry-out feeding the next carry-in.]

A full-adder isn't a new primitive — it is two half-adders and an OR gate, in a specific arrangement. The first half-adder combines A and B; the second combines that result with the incoming carry from the previous bit position; the OR gate produces the outgoing carry from either of the two intermediate carries. Stacking them in series builds an n-bit ripple-carry adder, where each stage's carry-out feeds the next stage's carry-in. The four-stage chain at the bottom is exactly what produced the 5 + 3 = 8 result above. Modern CPUs use carry-lookahead and carry-select adders that compute carries in parallel rather than rippling them, but the gate-level primitives are unchanged.

05 — Negative numbers

Two's complement, or how to subtract by adding

Computers don't natively understand negative numbers. There is no transistor that represents "minus." So how does your laptop store the value −5?

The naive approach — sometimes called signed magnitude — is to reserve one bit (usually the leftmost) as a sign indicator. 0 means positive, 1 means negative; the remaining bits represent the magnitude. So in 8 bits, 00000101 is +5 and 10000101 is −5. Simple, intuitive, and almost never used.

Why not? Two reasons. First, you end up with two zeros — 00000000 and 10000000, "positive zero" and "negative zero" — which makes equality testing awkward. Second, and far worse, you cannot use the same adder circuit to subtract. Subtraction would require a separate, more complex circuit. Hardware engineers hate this: every extra circuit is more transistors, more area, more heat, more cost.

The trick: complement

The solution, used by every modern computer, is called two's complement. To represent a negative number in n bits:

Step 1 — Take the binary representation of the positive number.
Step 2 — Flip every bit (0→1, 1→0). This is "one's complement."
Step 3 — Add 1 to the result. This is "two's complement."

So in 8 bits, +5 is 00000101. To get −5: flip the bits to get 11111010, then add 1 to get 11111011. That's −5 in two's complement.

The magic is that with this representation, subtraction is just addition. To compute 7 − 5, the CPU computes 7 + (−5) using the exact same adder circuit it uses for any other addition. The bits work out — the carries and overflows cancel in exactly the right way to produce the correct answer.
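
In C the whole trick fits in a few lines. Here uint8_t plays the role of an 8-bit register: the cast after the addition is where the ninth bit silently falls off, exactly as in Fig 2.7 below.

C · subtraction by addition
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t five = 5;                       /* 00000101 */
    uint8_t neg5 = (uint8_t)(~five + 1);    /* flip, add one: 11111011 */

    printf("-5 in two's complement: ");
    for (int i = 7; i >= 0; i--) printf("%d", (neg5 >> i) & 1);
    printf("\n");

    /* 7 - 5 computed by the ordinary adder; the carry out of bit 7
       overflows the 8-bit register and is discarded by the cast. */
    uint8_t result = (uint8_t)(7 + neg5);
    printf("7 + (-5) = %d\n", result);      /* prints 2 */
    return 0;
}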

Fig 2.7 — Computing 7 − 5 = 2 with two's complement
[7 = 00000111, plus (−5) = 11111011 (the two's complement of 5), gives 1 00000010 · nine bits! The leading 1 is overflow: it falls off the register and is silently dropped, leaving 00000010 · the result is 2 ✓.]

Subtraction by addition. The CPU adds two numbers using the same adder circuit it always uses; the magic of two's complement is that the leading 1 (the overflow) ends up being precisely the bit that needs to be discarded for the math to work. There is no separate subtractor in any modern CPU.

In an n-bit two's complement representation, the range of values is −2ⁿ⁻¹ to 2ⁿ⁻¹ − 1. For 32-bit signed integers, that's roughly −2.1 billion to +2.1 billion. The asymmetry — one more negative number than positive — is because 0 takes up one of the "positive" slots.

Fig 2.8 — The two's-complement wheel · where addition wraps

Two's complement is a wheel, not a line. The 256 possible 8-bit values arrange around the dial: 0 at the top, positive numbers to the right, negative numbers to the left, and a single discontinuity at the bottom where +127 meets −128. Adding 1 rotates the pointer one notch clockwise; subtracting 1 rotates it counter-clockwise. Most of the time this matches our ordinary intuition about numbers — a small positive plus a small positive gives a larger positive — but at the red edge, the wheel betrays it. +127 + 1 silently becomes −128. The same wrap that makes subtraction free in Fig 2.7 makes integer overflow inevitable. 32-bit and 64-bit signed integers have the same wheel; the slots are just exponentially more numerous.
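
The red edge can be reproduced in a few lines. One caveat: overflowing a signed type directly is undefined behavior in C, so this sketch does the arithmetic in unsigned form and reinterprets the bits; the conversion back to int8_t is implementation-defined in older C standards, but every mainstream compiler wraps, and C23 mandates two's complement.

C — the wrap at the edge
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* +127 + 1 in eight bits: the pointer crosses the red edge. */
    uint8_t raw = (uint8_t)(127 + 1);        /* 10000000 */
    printf("127 + 1 = %d\n", (int8_t)raw);   /* prints -128 */

    /* -1 + 1: the carry out of bit 7 is dropped, leaving zero. */
    raw = (uint8_t)(0xFF + 1);               /* 1 00000000 -> 00000000 */
    printf("-1 + 1  = %d\n", (int8_t)raw);   /* prints 0 */
    return 0;
}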

⚠️

Integer overflow vulnerabilities. Because the range is finite, adding two large positive numbers can produce a negative result — the carry overflows the most significant bit and changes the sign. In 1996, an Ariane 5 rocket exploded 37 seconds after launch because a 64-bit floating-point velocity value was converted to a 16-bit signed integer that overflowed. Cost: $370 million. In security, integer overflow bugs are a classic source of buffer overflows: a length check using a signed integer can be tricked into accepting an enormous "negative" length that bypasses bounds checking. We will see exact exploits of this in Part IV's security chapter.

Fig 2.9 — How a signed overflow becomes a buffer overflow

A canonical integer-overflow exploit. The function checks if (n > 256) return -1 — apparently a safe upper bound. But n is a signed int, and the attacker passes -1. The check evaluates -1 > 256 as false (because -1 is less than 256 in the signed comparison), so the early return is skipped. Then memcpy's third argument is size_t — unsigned — and the conversion turns -1 into the largest value the type can hold: 4 294 967 295 when size_t is 32 bits, or 18 446 744 073 709 551 615 after sign-extension to 64 bits. The library function happily copies that many bytes into a 256-byte buffer, smashing every adjacent allocation. Two's complement makes this elegant; type confusion makes it lethal. The same class of bug appears in production systems every year and accounts for a substantial share of memory-corruption CVEs.
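
Written out as compilable C, the vulnerable function from Fig 2.9 is three lines; the comments trace the attacker's three steps.

C — the vulnerable length check
#include <string.h>

int copy(char *dst, char *src, int n) {
    if (n > 256) return -1;  /* step 1: the signed compare -1 > 256 is
                                false, so the early return is skipped */
    memcpy(dst, src, n);     /* steps 2-3: n converts to size_t; the same
                                bits now mean 18 446 744 073 709 551 615,
                                and memcpy writes far past the buffer */
    return 0;
}

/* The attacker's call: copy(buf, evil, -1); where buf is 256 bytes. */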

06 — Floating point

Why 0.1 + 0.2 ≠ 0.3

Open any programming language. Type 0.1 + 0.2. The answer you get will not be 0.3. It will be 0.30000000000000004. This is not a bug in any specific language. It is a fundamental consequence of how computers represent fractional numbers. The same answer comes out of Python, JavaScript, C, Java, Rust, your calculator app, and the Mars rover.

The cause is binary representation of fractions. In decimal, 0.1 is a clean, finite digit string. In binary, it is not. 0.1 in binary is 0.0001100110011001100... — the pattern 0011 repeating forever. Just like 1/3 in decimal is 0.333... repeating forever, certain decimal fractions become repeating binary fractions. And computer memory is finite, so the binary representation has to be cut off. The remainder becomes a tiny error.

IEEE 754: the universal floating-point standard

In 1985, the IEEE published standard 754, defining how all modern processors represent fractional numbers. Before 1985, every CPU did it differently, and numerical code was a portability nightmare. The standard solved that.

A 64-bit "double precision" float divides its 64 bits into three fields, in a structure that mimics scientific notation:

Fig 2.10 — IEEE 754 double-precision layout

An IEEE 754 double divides 64 bits into a sign bit, an 11-bit exponent (with a bias of 1023, so it can represent both positive and negative exponents), and a 52-bit mantissa. The actual value is computed using the formula at the bottom — essentially binary scientific notation. The same structure with smaller fields gives the 32-bit "single precision" float (1, 8, 23 bits).

The 52 bits of mantissa give about 15 to 17 decimal digits of precision. The 11-bit exponent gives a dynamic range of about 10⁻³⁰⁸ to 10³⁰⁸. You can represent extremely small and extremely large numbers — but only with finite precision in their digits. Most decimal fractions cannot be represented exactly. They are stored as the nearest representable binary fraction, with a tiny rounding error.
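
The three fields are easy to pull apart in a dozen lines of C — a sketch that uses memcpy to reinterpret the double's 64 bits as an integer (the portable way to inspect them):

C — unpacking a double's three fields
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    double d = 0.1;
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);                 /* same 64 bits, integer view */

    uint64_t sign     = bits >> 63;                 /* 1 bit */
    uint64_t exponent = (bits >> 52) & 0x7FF;       /* 11 bits, biased by 1023 */
    uint64_t mantissa = bits & 0xFFFFFFFFFFFFFULL;  /* 52 bits */

    /* 0.1 is stored as 1.6 x 2^-4, so the unbiased exponent prints as -4 */
    printf("sign=%llu exponent=%lld mantissa=0x%013llX\n",
           (unsigned long long)sign,
           (long long)exponent - 1023,
           (unsigned long long)mantissa);
    return 0;
}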

When you compute 0.1 + 0.2, the CPU adds two slightly-imprecise binary representations of 0.1 and 0.2, producing a slightly-imprecise binary representation of 0.3 — but not the same one you'd get if you stored 0.3 directly. The errors don't cancel. Hence the famous result.

Python (or any language)
>>> 0.1 + 0.2
0.30000000000000004

>>> 0.1 + 0.2 == 0.3
False

# In binary, 0.1 is actually approximately:
>>> format(0.1, '.20f')
'0.10000000000000000555'

# The "extra" 555... is the rounding error from cutting off
# the infinite repeating binary fraction.
Fig 2.11 — 0.1 + 0.2 in binary · where the famous error lives

Decimal 0.1 has no exact binary representation — its binary expansion repeats forever, just as 1/3 in decimal is 0.3333… The IEEE-754 double rounds it to 52 mantissa bits, introducing a tiny error in the last digit. The same happens to 0.2. When the CPU adds the two rounded values, both errors carry through into the result. The output is not 0.3 — it is the next representable binary fraction above 0.3, which prints as 0.30000000000000004. The bug is not in the addition; the bug was already there in the inputs. Every floating-point computation, on every modern CPU, accumulates these tiny rounding errors at every step. The discipline of numerical analysis is the study of how to keep them from compounding into something dangerous.

Special values: ±∞ and NaN

IEEE 754 reserves certain bit patterns for special values. When the exponent is all ones and the mantissa is zero, the value is infinity (positive or negative depending on the sign bit). When the exponent is all ones and the mantissa is non-zero, the value is NaN — Not a Number, the result of operations like 0.0 / 0.0 or sqrt(-1).

NaN has a remarkable property: it is not equal to anything, including itself. NaN == NaN evaluates to false in every IEEE 754-compliant language. This is the standard test to detect a NaN. It is mathematically odd and has caught countless programmers off guard.
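
Both special values fall out of ordinary arithmetic, as this small C program shows (isnan and isinf are the standard math.h tests):

C — infinity and NaN
#include <stdio.h>
#include <math.h>

int main(void) {
    double zero = 0.0;

    double inf = 1.0 / zero;     /* exponent all ones, mantissa zero */
    printf("1.0/0.0 = %f (isinf: %d)\n", inf, isinf(inf));

    double n = zero / zero;      /* exponent all ones, mantissa non-zero */
    printf("0.0/0.0 = %f (isnan: %d)\n", n, isnan(n));
    printf("n == n -> %s\n", n == n ? "true" : "false");  /* false! */
    return 0;
}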

💸

Real-world consequences. Financial software cannot use floating point for currency, because rounding errors compound across millions of transactions. Banks use either integer cents or special decimal types instead. The 1991 Patriot missile failure during the Gulf War — which killed 28 American soldiers — was traced to a floating-point timing error that accumulated over 100 hours of operation, causing the missile to miscalculate the position of an incoming Scud. Floating point is precise enough for almost everything, but its errors are real, and they compound. Knowing when to use floats and when not to is a basic competency in software engineering.

Fig 2.12 — The Patriot missile · 0.0000000953 s × 100 hours = 28 lives

A Patriot air-defence battery near Dhahran, Saudi Arabia, had been powered on continuously for about 100 hours. Its targeting code stored elapsed time as a count of 0.1-second ticks — but 0.1 has no exact binary representation, and the on-board floating-point format truncated it to 24 bits. Each tick was thus shorter than 0.1 s by a tiny amount (about 0.0000000953 s). After 100 hours that error had compounded to roughly 0.34 s. An incoming SCUD missile travelling at Mach 5 covers about 600 m in 0.34 s — well outside the Patriot's tracking gate. The radar looked where the SCUD should have been, found nothing, and concluded there was no target. The SCUD struck a US Army barracks and killed 28 soldiers. Floating-point error is not a curiosity; under the wrong circumstances it is a weapon.

What you now understand

You have followed the layered chain from electricity to mathematics. A wire that can be on or off gives you a bit. Combinations of bits give you binary numbers. Boolean algebra — invented in 1847 by Boole, applied to circuits in 1937 by Shannon — describes operations on those bits. Logic gates are physical implementations of Boolean operations: AND, OR, NOT, XOR. Combinations of gates implement arithmetic — half adders, full adders, ALUs. Two's complement makes subtraction free. IEEE 754 extends the system to fractional numbers, with the unavoidable cost of finite precision.

Every higher-level abstraction in computing — from a Python list to a TLS handshake to a neural network — is built on top of these primitives. Chapter 3 takes us up one level: from the gate-and-wire view to the instruction view. We will look at how the CPU actually executes a program, how function calls work at the metal, what a "stack" really is, and how attackers exploit the gap between what programmers intend and what their compiled code actually does.

Chapter 03

The Language
the CPU
Speaks

Below every Python script, every JavaScript framework, every neural network, there is a sequence of instructions written in the only language a processor truly understands. This chapter is about what your code becomes when it finally meets the machine.

TopicsAssembly · Stack · ABI · Exploits
Era covered1949 → present
Chapter 03 hero · The Language the CPU Speaks
01 — ISA

An instruction is a contract

In 1978, Intel released the 8086 — a 16-bit chip whose instruction set has not been removed from any consumer CPU since. Every x86-64 processor sold today, from a laptop to a data-centre rack, still recognises the same core binary patterns Intel encoded that year. A processor doesn't run C, or Python, or JavaScript. It runs a sequence of binary patterns drawn from a specific menu — the instruction set architecture, or ISA. The ISA is the contract between hardware and software: hardware promises to execute every pattern in the menu correctly; software promises to use only patterns from the menu. Everything else — every compiler, every operating system, every program — sits on top of this contract.

Three ISAs dominate the world. x86-64 — the direct descendant of that 1978 Intel line — runs essentially every PC and most servers. It is famously complex, with thousands of instructions accumulated over four decades of backward compatibility. ARM, designed by Acorn in the 1980s and now licensed to essentially every phone manufacturer on Earth, is the modern alternative — cleaner, more efficient, and the architecture of every iPhone, every Apple Silicon Mac, and every Android device. RISC-V, from Berkeley in 2010, is an open standard ISA gaining ground in academia, embedded systems, and increasingly in industry — anyone can build a RISC-V chip without paying licensing fees.

Fig 3.1 — Same operation · three ISAs · three encodings
Fig 3.1 — Same operation · three ISAs · three encodings "ADD 5 TO A REGISTER" — DIFFERENT BITS, SAME EFFECT x86-64 Intel · 1978 → present add rax, 5 7 bytes · variable length REX.W 48 opcode 81 ModR/M C0 imm32 05 00 00 00 prefix · opcode · operand byte · 4-byte little-endian immediate ARM64 (AArch64) Acorn 1985 · phones, M-series Macs add x0, x0, #5 4 bytes · always opcode 10010001 8 bits shift 00 2 bits imm12 000000000101 12 bits = 5 Rn (src) 00000 5 bits = x0 Rd (dst) 00000 5 bits = x0 fixed 32-bit format · always opcode | shift | immediate | source | destination RISC-V (RV64I) Berkeley 2010 · open standard addi a0, a0, 5 4 bytes · always · ~50 instructions in base ISA imm12 000000000101 12 bits = 5 rs1 (src) 01010 5 bits = a0 funct3 000 3 bits rd (dst) 01010 5 bits = a0 opcode 0010011 7 bits "I-type" format · immediate | rs1 | funct3 | rd | opcode opcode register field immediate value misc / format-specific

The same logical operation — add 5 to a register — encoded three different ways. x86-64 uses a variable-length instruction with a one-byte opcode-prefix (REX.W marks 64-bit), a one-byte main opcode, an operand byte, and a four-byte little-endian immediate value. ARM64 packs everything into a fixed 32-bit word with carefully laid-out fields. RISC-V also uses a fixed 32-bit word, but its field layout is different again — and there are only six instruction formats total in the entire base ISA. Three architectures, three philosophies; each set of bits is meaningless on the other two CPUs. The contract between hardware and software is exactly which patterns of bits mean what.

They are not interchangeable. An x86 binary will not run on an ARM CPU and vice versa. The shift from Intel x86 to Apple's ARM-based M1 in 2020 required Apple to ship Rosetta 2, a translation layer that converts x86 instructions to ARM on the fly. That such a translator is even possible is itself remarkable.

Fig 3.2 — Rosetta 2 · how a chip pretends to be a different chip

Rosetta 2 is what made the 2020 Intel-to-Apple-Silicon transition feel painless. When a Mac launches an x86 binary on an ARM-based M-series chip, Rosetta 2 — once, on first launch — reads the entire executable, translates each x86-64 instruction into a sequence of ARM64 instructions that produces the same observable result, and saves the translated binary to a cache. Subsequent launches skip the translation step. The translated code runs at roughly 80% the speed of native ARM, sometimes faster than the same binary ran on the Intel hardware it was built for. Rosetta 2 is possible because every x86 instruction's effect can be reproduced by some sequence of ARM instructions, and a careful translator can find that sequence statically; Apple Silicon even provides a hardware total-store-ordering mode so that translated code sees the memory ordering x86 programs expect.

RISC vs CISC, and why the war ended in a draw

In the 1980s there was a vigorous debate. CISC — Complex Instruction Set Computers, like x86 — had hundreds of specialized instructions, including multi-step operations like "load from memory, add, and store back" in a single instruction. RISC — Reduced Instruction Set Computers, like ARM, MIPS, SPARC — had a small set of simple, uniform instructions. RISC machines could clock faster because each instruction was simple to decode.

The debate was real. By the 2000s, it was over — and both sides had quietly converged. Modern x86 CPUs internally translate complex CISC instructions into simpler RISC-like micro-operations (μops) and execute those. Modern ARM chips have grown more complex over time, with specialized instructions for cryptography, vector math, and machine learning. The clean ideological lines have blurred. What remains is the binary incompatibility — the fact that the patterns of bits mean different things on different chips.

02 — Registers

The CPU's tiny notebook

Recall from Chapter 1 that registers are the smallest, fastest level of the memory hierarchy — a few dozen tiny storage cells inside the CPU core itself. Every instruction the CPU executes operates primarily on registers. To add two numbers from memory, the CPU must first load them from memory into registers, add the registers, and store the result back to memory. Memory is too slow to operate on directly.

x86-64 has 16 general-purpose registers, each 64 bits wide. Their names carry historical baggage: in 1978 the original 8086 had 16-bit registers named AX, BX, CX, DX. When Intel went to 32 bits in 1985, they prepended an "E" for extended — EAX, EBX, etc. When AMD pushed to 64 bits in 2003, they replaced "E" with "R" — RAX, RBX, RCX, RDX. The names persist; the chip remembers.

Fig 3.3 — The x86-64 register file · 16 + 2 + 45 years of names

x86-64 has sixteen 64-bit general-purpose registers — the original "A, B, C, D" set from the 8086, the "source/destination index" pair RSI/RDI, the stack-frame pair RBP/RSP, and the eight registers R8–R15 added by AMD64 in 2003. Plus two non-general-purpose registers: RIP, the instruction pointer, and RFLAGS, which holds the condition codes after every arithmetic operation. The right panel zooms into RAX. Because every generation of x86 had to remain backward-compatible with the previous one, RAX still contains the bits that were called EAX in 1985, AX in 1978, and AH/AL when split into bytes. Writing AL changes only the bottom 8 bits of RAX, leaving the rest untouched — a quirk that has shaped C compilers, assembly-language style, and the encoding of the entire ISA.

Register · Conventional purpose · Why
RAX · Accumulator — holds return values from function calls · Historical: x86's earliest predecessor used "A" for the arithmetic accumulator
RBX · Base — historically pointed to data segments · Now mostly general-purpose; preserved across function calls
RCX · Counter — used for loop counts and shift amounts · The "rep" loop instruction implicitly uses RCX
RDX · Data — second half of multiplication results · "mul" stores 128-bit results split between RAX and RDX
RSP · Stack Pointer — points to the top of the call stack · Modified by push/pop; we'll see this next section
RBP · Base Pointer — points to current function's stack frame · Lets you find local variables at known offsets
RDI, RSI, RDX, RCX, R8, R9 · First six function arguments (Linux/Mac) · System V ABI — the calling convention
RIP · Instruction Pointer — address of next instruction · Cannot be modified directly; only by jump/call/return instructions

Notice that RIP — the Program Counter, the address of the next instruction to execute — is itself a register. Changing what RIP points to means changing what the CPU does next. This is the entire mechanism behind a function call, a loop, an if/else branch, an interrupt, and (as we'll see) a memory corruption exploit. If you control RIP, you control the program.

Fig 3.4 — System V calling convention · which register holds which argument

The System V AMD64 ABI (used by Linux, macOS, BSD) lays down which physical registers carry which arguments. The first six integer or pointer arguments go into RDI, RSI, RDX, RCX, R8, R9 — in that order. The seventh argument and beyond are pushed onto the stack. The return value comes back in RAX. Floating-point arguments use a different set of registers (XMM0–XMM7), and structures larger than 16 bytes are passed by hidden pointer. The convention is what makes a compiled function callable from any other compiler, any other language, any other source file — without it, every library linkage on every system would break.

03 — The stack

Memory that grows down

When a program starts, the operating system gives it a chunk of memory laid out in a specific pattern — its virtual address space. Among other regions, the OS reserves an area called the stack: a last-in-first-out buffer that grows automatically as functions are called and shrinks as they return. By convention on x86-64, the stack lives at a high address and grows downward — toward lower addresses — as items are pushed.

Fig 3.5 — A process's virtual address space

A typical x86-64 Linux process address space. The kernel reserves the top half. The user-space portion contains, from high addresses down: the stack (function calls), unused space, the heap (dynamic allocations), the data section (globals), and the text section (read-only executable code). The stack and heap grow toward each other.

The stack is managed automatically by the CPU using two instructions: push first decrements RSP by 8 (because we're storing 8-byte values on a 64-bit machine), then writes its value to the address RSP now points to. pop reads the value at RSP, then increments RSP back by 8. The hardware treats the stack as a hardware-supported data structure.
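
In C terms, the two instructions behave like the sketch below, with RSP and the stack modeled as ordinary variables (the names are illustrative; the real thing is a register and RAM, not a C array):

C — what push and pop do to RSP
#include <stdint.h>

uint64_t stack_mem[64];
uint64_t *rsp = &stack_mem[64];  /* starts past the high end, grows down */

void push(uint64_t value) {
    rsp -= 1;       /* RSP drops by 8 bytes (one uint64_t)... */
    *rsp = value;   /* ...then the value is stored where RSP now points */
}

uint64_t pop(void) {
    uint64_t value = *rsp;  /* read the value at RSP... */
    rsp += 1;               /* ...then RSP climbs back up by 8 */
    return value;
}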

Each function call creates a stack frame: a contiguous region of the stack containing that function's local variables, saved registers, and bookkeeping data. When the function returns, its frame is discarded by simply moving RSP back up. There is no garbage to clean up — local variables vanish as soon as the function exits.

04 — Function calls

What "calling a function" actually means

A function call is one of the most ordinary operations in programming, and one of the most consequential at the hardware level. Let's trace exactly what happens when you write int result = add(5, 3); in C.

First, the compiler has to follow a calling convention — a contract between caller and callee about who puts arguments where, who saves which registers, and how return values are delivered. On Linux/macOS x86-64, the convention is called System V AMD64 ABI. On Windows, it's the Microsoft x64 calling convention. They are slightly different. Both work, but a function compiled for one ABI cannot be called from a program compiled for the other without a wrapper.

Under System V, the first six integer arguments go in registers RDI, RSI, RDX, RCX, R8, R9, in that order. Additional arguments spill onto the stack. Return values come back in RAX. The caller is responsible for saving any volatile registers it cares about; the callee preserves RBX, RBP, R12-R15.

C source
int add(int a, int b) {
    return a + b;
}

int main() {
    int result = add(5, 3);
    return result;
}
x86-64 assembly (System V)
; ---- function: add(int a, int b) ----
add:
    push   rbp             ; save caller's frame pointer onto stack
    mov    rbp, rsp        ; new frame pointer = current stack top
    mov    eax, edi        ; eax ← first arg (a, was in edi)
    add    eax, esi        ; eax += second arg (b, was in esi)
    pop    rbp             ; restore caller's frame pointer
    ret                    ; pop return address from stack, jump there

; ---- function: main() ----
main:
    push   rbp
    mov    rbp, rsp
    mov    esi, 3          ; second arg goes in esi
    mov    edi, 5          ; first arg goes in edi
    call   add                ; push return addr, jump to "add"
    ; --- after add returns, result is in eax ---
    pop    rbp
    ret

Two instructions are doing magic here: call and ret. call add does two things atomically: it pushes the address of the next instruction (the address of pop rbp in main) onto the stack, and then sets RIP to the address of add. The CPU now begins executing add's instructions. When add eventually reaches ret, the inverse happens: ret pops the saved return address off the stack into RIP, and execution resumes in main at exactly the next instruction. The handoff is invisible to either function — each one experiences a continuous sequence of instructions, and the stack quietly remembers the call hierarchy beneath them.

Fig 3.6 — The stack during a function call

Three snapshots of the stack during a call. Each call instruction pushes a return address onto the stack; ret pops it. The pattern of saved frame pointers forms a linked list back through the entire call hierarchy — which is why a debugger can show you a "stack trace" listing every nested function call leading to the current point.

🔗

Why this matters. The return address is just a number stored in memory. The CPU has no way to verify it is the same number that was pushed there originally. If anything overwrites it between the call and the ret — anything at all — the CPU will obediently jump wherever the new value points. This is the seam from which an entire era of computer security emerged.

05 — Buffer overflow

The bug that defined three decades of security

In November 1988, a 23-year-old Cornell graduate student named Robert Tappan Morris released a small program onto the early internet. It was meant to count machines. Within hours it had brought down roughly ten percent of the computers connected to the network — about six thousand machines, at a time when that was most of the internet. The vulnerability it exploited was a buffer overflow in the UNIX fingerd daemon. Morris became the first person convicted under the Computer Fraud and Abuse Act. The attack class he made famous is still, decades later, among the most exploited bugs in computing.

To see why, we need to look at a tiny C function and trace what happens when its input doesn't fit.

C — vulnerable function
// A function that greets the user. Looks innocuous.
void greet(char *name) {
    char buffer[16];        // 16 bytes on the stack
    strcpy(buffer, name);   // copies until it hits a null byte
    printf("Hello, %s\n", buffer);
}

strcpy is the culprit. It copies bytes from name into buffer one at a time, stopping only when it encounters a null byte (a zero). It does not check whether buffer has room. If name is 17 characters long, the 17th byte gets written one byte past the end of buffer — into whatever happens to live there in memory. And what lives there, on the stack, is the saved RBP. And just past that is the return address.

Fig 3.7 — Stack frame of greet(): normal vs overflow

On the left: normal use — the buffer holds the input, the saved RBP and return address are intact, the function returns to its caller. On the right: an oversized input keeps writing past the buffer, eventually overwriting the saved RBP and the return address. When ret executes, RIP gets loaded with whatever the attacker placed there.

From overwrite to code execution

Overwriting the return address is only half the attack. The other half is choosing what to overwrite it with. The classic technique, perfected in the 1990s, is called shellcode injection: place machine code for a small payload (traditionally one that spawns a shell — hence "shellcode") into the same buffer being overflowed, and overwrite the return address with the address of that shellcode. When the function returns, the CPU jumps into the buffer and starts executing the attacker's instructions.

The whole technique was popularized — turned from rare expert knowledge into a cookbook — by an essay published in 1996 in the underground hacking magazine Phrack. It changed the landscape of computer security.

"The objective of this paper is to show how to write buffer overflow exploits, using as an example a vulnerability in a real program. … The reader is expected to be familiar with C and assembly under x86 systems."

— Aleph One, "Smashing the Stack for Fun and Profit," Phrack 49, 1996

Why C lets this happen

C, the language we will spend an entire chapter on later, was designed in 1972 as a portable assembler. It exposes raw memory, raw pointers, and gives the programmer complete control. It does not, by default, check that array accesses stay within bounds. strcpy, gets, scanf("%s"), and many other standard-library functions assume the programmer has guaranteed the buffer is large enough. When they're wrong, the corruption is silent — bytes get overwritten, the program keeps running, and the consequence may not appear until a function returns into junk.
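
The fix, when staying in C, is to make the bound explicit. A minimal rewrite of greet using snprintf, which never writes more than the size it is given and always null-terminates:

C — the bounded version
#include <stdio.h>

void greet(const char *name) {
    char buffer[16];
    /* snprintf writes at most 15 bytes of name plus a null terminator;
       oversized input is truncated instead of overflowing */
    snprintf(buffer, sizeof buffer, "%s", name);
    printf("Hello, %s\n", buffer);
}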

This is the price of memory unsafety. C and C++ are memory-unsafe by design — they trade safety for performance and control. Modern languages like Rust, Java, Python, and Go are memory-safe: their runtimes or compilers check bounds at every access. They cannot have classical buffer overflows. That protection is the central reason newer languages exist.

📅

Famous buffer overflow incidents. The Morris Worm (1988, fingerd). The Code Red worm (2001, IIS, infected 359,000 hosts in 14 hours). The SQL Slammer worm (2003, infected 75,000 servers in 10 minutes). The Blaster worm (2003, RPC). Heartbleed (2014, OpenSSL — technically a buffer over-read rather than overflow, exposed private keys of millions of HTTPS servers). For decades, memory-corruption bugs accounted for the majority of critical CVEs (Common Vulnerabilities and Exposures). Microsoft, Google, and others have published data showing that around 70% of severe security bugs in their large C/C++ codebases trace back to memory unsafety. This is why Rust adoption is growing inside operating system kernels — including, as of 2022, Linux itself.

06 — Defenses

The arms race the OS fought back

Operating systems and CPUs did not sit still. Beginning in the late 1990s, a series of layered defenses were added — each one closing a class of attack, each one in turn worked around by attackers, each one followed by a more sophisticated defense. The history of memory-corruption defense is the clearest example in computing of an arms race played out in software.

Defense one — stack canaries

In 1998, Crispin Cowan introduced StackGuard, a compiler modification that placed a random value — a "canary," named after the birds carried into coal mines to detect poison gas — between local variables and the saved return address. Before any function returns, the compiler inserts code that checks whether the canary is still its original value. If it has changed — because a buffer overflow trampled it on the way to the return address — the program aborts immediately.

This works as long as the attacker cannot guess the canary value. Modern compilers generate a random canary at program startup, so guessing is statistically infeasible. An attacker who can read process memory, however — through a separate information leak — can sometimes recover it.

Fig 3.8 — The stack canary · a tripwire between buffer and return address

A stack canary is a randomly-chosen value placed by the compiler between local variables and the saved return address. The function's prologue writes the value; the function's epilogue, just before ret, checks that it has not changed. A buffer overflow that wants to overwrite the saved RIP must walk upward through memory, which means it must overwrite the canary first. When the epilogue's check fails, the program calls __stack_chk_fail and aborts before ret can transfer control. The defense rests on one assumption — that the attacker cannot leak the canary value through some other channel — and is one of several reasons modern programs make information disclosure a serious vulnerability class on its own.
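
GCC and Clang implement this as -fstack-protector and its -strong and -all variants. Below is roughly what the instrumented greet looks like, rendered as a C sketch: __stack_chk_guard and __stack_chk_fail are the real glibc symbols, but the actual check is emitted as assembly, and the compiler places the canary precisely between the locals and the saved return address, which plain C cannot guarantee.

C — the canary check, as a sketch
#include <string.h>
#include <stdint.h>

extern uintptr_t __stack_chk_guard;    /* random value, set at startup */
extern void __stack_chk_fail(void);    /* logs and aborts */

void greet(char *name) {
    uintptr_t canary = __stack_chk_guard;  /* prologue: place the tripwire */
    char buffer[16];
    strcpy(buffer, name);              /* an overflow heading for the return
                                          address must trample the canary */
    if (canary != __stack_chk_guard)   /* epilogue: check just before ret */
        __stack_chk_fail();
}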

Defense two — non-executable memory (DEP / NX bit)

Classic shellcode injection writes attacker code into the buffer (which lives on the stack) and jumps there. The defense is simple and elegant: mark stack pages as non-executable. The CPU's MMU will refuse to execute instructions from a page marked NX, raising a hardware exception instead. AMD added this to x86 in 2003 as the "NX bit" (No-eXecute); Intel followed; Windows exposes the feature as DEP (Data Execution Prevention), while Linux keeps the NX name. Stack-injected shellcode immediately stopped working on systems with the protection enabled.

Defense three — ASLR

Even without injecting shellcode, attackers could redirect execution by overwriting the return address with an existing function address — say, the C library function system() with a string argument like "/bin/sh". This is called a return-to-libc attack. The countermeasure is ASLR — Address Space Layout Randomization. At each program launch, the operating system loads code, libraries, the heap, and the stack at random addresses. The attacker no longer knows where to point the overwritten return address. ASLR debuted in PaX (a Linux patch) in 2001, and became default on most operating systems by the late 2000s.

Fig 3.9 — ASLR · the same binary, three random load layouts

Three launches of the same executable, on the same machine, within seconds of each other. The .text segment, the heap, the libraries, and the stack each load at a different random address every time. An attacker who has overwritten a return address must guess where to redirect it — and on 64-bit Linux, the entropy is high enough (typically 28–30 bits for libraries, 24–28 bits for the heap, 30–32 bits for the stack) that brute-force guessing crashes the program long before a successful guess. ASLR is not perfect — small entropy pockets, leaks of address fragments, partial overwrites can sometimes be exploited — but it raises the cost of every memory-corruption attack. It is, in a real sense, the reason modern software is roughly survivable.
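
ASLR is easy to observe. Compile the sketch below as a position-independent executable (the default on modern Linux) and run it twice; every address changes between runs.

C — watching ASLR move the regions
#include <stdio.h>
#include <stdlib.h>

static int global;          /* lives in the data section */

int main(void) {
    int local;              /* lives on the stack */
    void *heap = malloc(16);

    printf(".text: %p\n", (void *)main);
    printf("data : %p\n", (void *)&global);
    printf("heap : %p\n", heap);
    printf("stack: %p\n", (void *)&local);

    free(heap);
    return 0;
}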

Defense four — and the attack that still beat it: ROP

Hovav Shacham showed in 2007 that DEP and ASLR were not enough. He introduced Return-Oriented Programming (ROP). Instead of injecting new code, the attacker chains together short fragments of existing code in the program — fragments ending in a ret instruction. Each fragment, called a "gadget," does some tiny operation (load a register, increment a counter, write a byte). By stacking up the right sequence of return addresses on the stack, the attacker can build arbitrary computation out of the program's own code, never executing anything new — defeating DEP. And given large enough libraries, useful gadgets always exist somewhere.

ROP forced another layer: Control Flow Integrity (CFI), shadow stacks (a separate, protected stack that mirrors the call stack and is checked on return), and ARM Pointer Authentication (which cryptographically signs return addresses, making forgery infeasible). Apple Silicon and recent Intel CPUs (CET — Control-flow Enforcement Technology) ship these defenses in hardware.

Fig 3.10 — Return-Oriented Programming · arbitrary computation from existing code

Return-Oriented Programming, demonstrated by Hovav Shacham in 2007. The attacker still smashes the stack, but instead of injecting new code (DEP would forbid it), the new "program" is a sequence of return addresses — each pointing to a tiny gadget: a one-or-two-instruction snippet ending in ret, somewhere in the existing libraries. The first ret after the overflow pops the first gadget address into RIP. That gadget executes its tiny op, then its own ret pops the next gadget address from the stack, and so on. With a sufficiently large library — every program links libc, which has hundreds of thousands of bytes of code — gadgets for any required operation can be found. ROP turned an entire field of defense (mark code non-executable) into a starting point. Modern hardware mitigations (CFI, shadow stacks, ARM PAC, Intel CET) are the response.

Fig 3.11 — The arms race, in one diagram

Each red dot is an attack technique that broke the previous defense. Each green dot is a hardware or compiler change that responded. The arms race continues — and at each step, the cost of exploitation rises while the cost of a single mistake by a programmer in C or C++ falls more slowly.

The deeper lesson

Buffer overflows are not a bug in any one program. They are a structural consequence of a particular language model — one in which memory is raw, addresses are first-class values, and the programmer is responsible for every check. The defenses we just walked through are real and valuable, but each is a layer of mitigation, not a cure. The cure is to use a memory-safe language wherever possible. The fact that enormous amounts of critical infrastructure — operating system kernels, web browsers, database engines, network stacks — are still written in C and C++ is simultaneously a tribute to those languages' performance and a permanent source of risk. We will return to this when we discuss C, C++, and Rust in later chapters.

What you now understand

You have followed the chain from gates to assembly to exploits. The CPU executes a fixed menu of binary instructions defined by its ISA — x86, ARM, RISC-V — and everything else is a translation onto that menu. Registers are the CPU's tiny, fast notebook; the stack is a region of memory that grows downward and stores the bookkeeping of every function call. Calling conventions are the contract between caller and callee. Function calls work by pushing a return address onto the stack and jumping; they return by popping it back. And because the return address is just a number that lives in writable memory, an unchecked write — a buffer overflow — can replace it with anything, handing control of the machine to whoever supplied the input. The defenses against this — canaries, DEP, ASLR, CFI, pointer authentication — have evolved in lockstep with the attacks for thirty years.

What comes next is the seam. We have walked the substrate from voltage to instruction — the transistor, the gate, the adder, the CPU, the stack — and at every step the program has been free to do anything the silicon supports. Nothing yet stops a function from reading another program's memory, from writing to a disk it does not own, from monopolising the CPU until the user gives up. On any real machine, those things are not allowed. The Bridge that closes Part I is about why they are not allowed — what hardware features the silicon must already provide before any program can be made to behave, and how those few features combine into the conditions under which one program can be made to own the machine. We do not yet open the kernel and look inside it. That waits for Part II. We only show what must already be true of the machine before a kernel of any kind can exist.

Bridge — The Boundary

What the Silicon
Must Provide

Five small pieces of hardware are what separate a computer from a calculator. A privilege bit. A trap. A page-table walker. A timer. A way to talk to the world. Together they are the conditions under which one program can be made to own the machine — and Part I cannot end without naming them.

TopicsPrivilege · Traps · MMU · Timer · MMIO/DMA
Era covered1965 → present
Bridge hero · What the Silicon Must Provide
01 — User vs Kernel

One bit, two worlds

Of all the millions of bits inside a modern CPU, exactly one is special enough to decide whether the next instruction runs at all. It is not set by software in the ordinary sense; it is set by the hardware itself, in response to specific events, and it cannot be flipped by a program in the lower mode. This single bit is the seam at which all of operating-system security begins.

The bit goes by different names on different chips. On x86 it is the bottom two bits of the CS selector — the Current Privilege Level, CPL — with values 0 (most privileged) through 3 (least). On ARM it is the Exception Level, EL0 through EL3. On RISC-V it is the privilege mode field — U, S, M. The number of available levels varies from architecture to architecture, but in practice every operating system we use today reduces them to two: a privileged mode for the kernel, and an unprivileged mode for everything else. Books and conversations call these Ring 0 and Ring 3, after the original Multics and Intel terminology, even on architectures whose hardware has no rings at all.

The hardware enforces, not the operating system

The crucial property — the property that makes this whole arrangement work — is that the privilege bit is enforced by the silicon, not by software. When a program in Ring 3 tries to execute a privileged instruction, the CPU does not ask politely whether that is allowed. It does not consult a kernel table. It does not defer to anyone. It traps. The instruction does not execute; control is yanked from the program and handed to a piece of code at a fixed address, in Ring 0, that the kernel set up at boot. From the program's point of view, the next instruction simply did not happen — though it may now be in the process of being terminated.

Roughly thirty instructions are privileged on x86-64. They are exactly the instructions that would let a program break the operating system's abstractions if run from user space: instructions that load the page-table base register (MOV CR3), the interrupt descriptor table (LIDT), the global descriptor table (LGDT); instructions that read or write model-specific registers (RDMSR, WRMSR); instructions that talk directly to physical I/O ports (IN, OUT); instructions that flush the translation cache (INVLPG); instructions that halt the processor (HLT). User mode is defined by exclusion: it can do everything except these. The diagram below shows the division.

Fig BR.1 — The privilege bit · two worlds, one chip

The CPL field of the CS register holds the current privilege level. With CPL=0 (kernel mode) every instruction in the ISA executes normally. With CPL=3 (user mode) about thirty instructions are forbidden — the ones that would let a program reach past the operating system's abstractions. Attempting any of them does not return an error code; it raises a #GP fault in the hardware. The CPU stops the offending instruction mid-flight and jumps to a kernel handler whose address was registered in advance. The decision is made before the instruction takes effect, by the silicon itself.
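
You can watch the trap fire from a normal Linux process. HLT is on the forbidden list; the inline assembly below (GCC/Clang syntax, x86-64) attempts it from Ring 3, the silicon raises #GP before the instruction takes effect, and the kernel's handler kills the process with SIGSEGV.

C — asking Ring 3 to do a Ring 0 job
#include <stdio.h>

int main(void) {
    printf("about to execute HLT from user mode...\n");
    __asm__ volatile ("hlt");   /* privileged: #GP fires, the kernel's
                                   handler runs, the process receives
                                   SIGSEGV and dies here */
    printf("never reached\n");
    return 0;
}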

How does the CPL get set in the first place? Boot starts in the most privileged mode the chip has — there is no operating system yet, so there is no one to grant privileges. The bootloader hands control to the kernel, which is therefore born in Ring 0. The kernel then sets up the data structures the CPU will need to lower privilege when it eventually launches the first user-space program: page tables that map user addresses, an interrupt descriptor table that handles future traps, a Global Descriptor Table whose code and stack segments mark Ring 3. Only after all of this is in place does the kernel execute the very specific instruction — SYSRET on x86-64, ERET on ARM — that drops the CPU into user mode. From that moment on, getting back to Ring 0 requires going through the single mechanism the next section is about: the trap.

One more property is worth noting before we move on. The privilege bit is not just a switch on instructions; it is a switch on memory access. Page-table entries are tagged with a U/S bit (user/supervisor), and pages tagged S are readable only when CPL=0. This is what stops a user program from poking around the kernel's data structures even by accident: the same hardware that enforces virtual-to-physical translation also refuses to translate a user-mode access into a supervisor-only page. We will see the page-table walker that does this in Section 03.

⚙️

Why this is the seam. Every protection the operating system claims to enforce — process isolation, file permissions, network sandboxing, container boundaries — bottoms out in this one bit. If a user-mode program could clear the bit by itself, the entire stack would collapse. It cannot, because the bit lives in silicon: it is set only by a small set of instructions that themselves require Ring 0 to execute. The whole tower of operating-system security stands on the bit's circular self-protection.

02 — The Trap

The one door between the two worlds

A user program eventually needs to do things only the kernel can do. Read a file. Open a socket. Allocate memory. Print to the screen. None of these instructions exist in the user-mode ISA — there is no READ_FILE opcode. What exists is one specific instruction that asks the kernel, on behalf of the user program, to do the thing. The instruction is the system call, and the mechanism that delivers it across the boundary is the trap.

The word "trap" is precise. Unlike a function call — which jumps to an address the caller chose — a trap jumps to an address the kernel registered in advance, and which user code cannot change. The user program does not know where in the kernel its system call will be handled. It only knows the way through is to fire a particular instruction (SYSCALL on x86-64, SVC on ARM, ECALL on RISC-V), and the silicon handles the rest: switch CPL to 0, save user-mode registers, jump to the kernel entry point, run the handler, then return to user mode at the next user instruction. From user space's perspective the syscall looks like any other instruction that took an unusually long time. From the kernel's perspective an entire round trip across the privilege boundary has just happened.

The path of one syscall

Fig BR.2 — The trap mechanism · one round trip across the boundary

A single system call, traced across the boundary. The user program runs ordinary instructions until it executes SYSCALL with the syscall number in RAX and arguments in the standard ABI registers. The CPU does five things atomically: it sets the privilege bit to 0, saves the user-mode RIP and RFLAGS, loads RIP from a model-specific register (MSR_LSTAR) the kernel set at boot, masks interrupts, and starts running. The kernel's entry stub looks up RAX in a syscall dispatch table, runs the handler, places the return value back in RAX, and executes SYSRET — which restores user state and jumps back to the instruction after the original SYSCALL. The whole round trip is typically under a microsecond. A typical desktop performs millions of these per second.
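
The round trip can be made by hand. This sketch (x86-64 Linux, GCC/Clang inline assembly) loads syscall number 1, write, into RAX, places the arguments in RDI/RSI/RDX, and fires SYSCALL directly with no libc in between. RCX and R11 are listed as clobbered because the hardware uses them to stash the return RIP and RFLAGS.

C — one SYSCALL, by hand
#include <unistd.h>

int main(void) {
    const char msg[] = "hello from ring 3\n";
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a" (ret)                    /* RAX: return value (bytes written) */
        : "a" (1L),                     /* RAX: syscall #1 = write           */
          "D" ((long)STDOUT_FILENO),    /* RDI: arg 1 (file descriptor)      */
          "S" (msg),                    /* RSI: arg 2 (buffer)               */
          "d" ((long)(sizeof msg - 1))  /* RDX: arg 3 (length)               */
        : "rcx", "r11", "memory");
    return ret < 0;
}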

Traps, interrupts, exceptions — the same mechanism, three names

The system call is one example of a much more general phenomenon. Whenever the CPU needs to stop running the current code and run something the kernel chose instead, it goes through the same path. The names depend only on what triggered it. A system call is a synchronous trap initiated by the program. An exception is a synchronous trap caused by something the program did wrong: division by zero, an invalid opcode, a page fault, a #GP from Section 01. An interrupt is an asynchronous trap from a hardware device: a network card receiving a packet, a disk finishing a read, a key being pressed, the timer firing. All three use the same machinery — save state, raise privilege, jump to a kernel-registered address, run, return — and all three are dispatched through one data structure: the interrupt descriptor table.

Fig BR.3 — The interrupt descriptor table · 256 doors into the kernel

When any trap fires, the CPU reads the IDT base address from its IDTR register, indexes by vector number, and jumps to the handler the kernel registered for that vector. The same table dispatches CPU exceptions (vectors 0–31), hardware interrupts from devices (vectors 32 and up), and on legacy x86 the int 0x80 system call. Exactly the same machinery underlies every involuntary entry into the kernel.

A subtle but crucial property: each IDT entry carries a DPL — Descriptor Privilege Level — that controls whether user code can raise that vector from software. The kernel sets DPL=3 only on the syscall vector; every other vector has DPL=0, which means a user-mode int 13 instruction does not deliver a fake general-protection fault; it raises a real one. Without this, user code could spoof any kernel handler invocation simply by raising the corresponding interrupt by hand. The hardware enforces who is allowed to ring which doorbell.


The kernel's authority is registered, not assumed. The CPU does not know what the kernel is. It knows where to jump on each kind of trap, because the kernel told it at boot — by writing the IDT into memory and pointing IDTR at it. After boot, that pointer is in a privileged register that user code cannot change. The kernel's authority is an arrangement the kernel made with the silicon while it was still alone with the machine.
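What the kernel writes at boot is concrete enough to sketch. The 16-byte gate layout below is fixed by the architecture; the field and function names are ours, not any particular kernel's.

```c
#include <stdint.h>

/* Sketch of an x86-64 interrupt-gate descriptor and the boot-time setup
   described above. The layout is architectural; names are illustrative. */
struct idt_gate {
    uint16_t offset_low;      /* handler address, bits 0..15       */
    uint16_t selector;        /* kernel code segment               */
    uint8_t  ist;             /* interrupt-stack-table index       */
    uint8_t  type_attr;       /* present bit | DPL | gate type     */
    uint16_t offset_mid;      /* handler address, bits 16..31      */
    uint32_t offset_high;     /* handler address, bits 32..63      */
    uint32_t reserved;
} __attribute__((packed));

static struct idt_gate idt[256];

/* dpl = 0: user code cannot raise this vector from software.
   dpl = 3: it can (the legacy int 0x80 syscall gate). */
static void set_gate(int vector, void (*handler)(void), int dpl)
{
    uint64_t a = (uint64_t)handler;
    idt[vector] = (struct idt_gate){
        .offset_low  = a & 0xffff,
        .selector    = 0x08,                      /* kernel CS          */
        .type_attr   = 0x80 | (dpl << 5) | 0x0e,  /* present, DPL, gate */
        .offset_mid  = (a >> 16) & 0xffff,
        .offset_high = a >> 32,
    };
}

/* Point IDTR at the table. From this instruction on, the arrangement
   with the silicon is in force and user code cannot undo it. */
static void load_idt(void)
{
    struct { uint16_t limit; uint64_t base; } __attribute__((packed))
        idtr = { sizeof(idt) - 1, (uint64_t)idt };
    asm volatile("lidt %0" : : "m"(idtr));
}
```

A kernel would call something like set_gate(14, page_fault_stub, 0) and set_gate(128, syscall_stub, 3), then load_idt(): exactly the asymmetry Fig BR.3 shows.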

03 — The MMU

The hardware that draws the walls between processes

Memory isolation is not a software policy. If process A could read address 0x4000 in process B's space simply by issuing a load, no firewall and no kernel subroutine could stop it — the read would already be done by the time the kernel found out. Isolation has to happen before any memory access completes. It does, because between the CPU and the RAM sits a small piece of silicon called the memory management unit, and every memory access in the modern world flows through it.

Chapter 1 introduced virtual memory as a high-level idea: each process believes it owns the entire address space, and the kernel maintains the illusion. The Bridge shows the part of the illusion that lives in hardware. The MMU is the hardware. It does not run kernel code; it runs nothing that looks like code at all. It is a circuit. Given a virtual address, it produces — in nanoseconds — a physical address, or it produces a fault.

The page-table walker as a hardware unit

Fig BR.4 — The MMU · silicon between the CPU and the RAM

The MMU is a piece of silicon between the CPU and the rest of the machine. On every memory access — load, store, or instruction fetch — the CPU hands the MMU a 64-bit virtual address and waits for a physical address back. The MMU first checks the TLB (a small associative cache of recent translations); on a hit, it returns immediately. On a miss, the page-table walker — a hardware finite-state machine, not the kernel — descends four levels of page tables in main memory, starting from the address in CR3, and produces both a physical address and a set of permission bits, which are checked on the spot. If anything is wrong — page not present, kernel-only page touched from user mode, write to a read-only page, NX-tagged page executed — the MMU raises a #PF (page fault, vector 14) and the kernel takes over.
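The walk itself is simple enough to model in a few lines of C. This is only an illustration of the algorithm the circuit implements, not code any kernel runs: phys_to_virt() and raise_page_fault() are hypothetical helpers standing in for machinery we have not built, and the permission-bit checks are elided.

```c
#include <stdint.h>

#define PRESENT (1ULL << 0)

/* Hypothetical helpers: read physical memory; deliver #PF (vector 14). */
extern uint64_t *phys_to_virt(uint64_t paddr);
extern uint64_t  raise_page_fault(uint64_t vaddr);

/* Software model of the hardware 4-level walk (x86-64, 4 KiB pages).
   Each level consumes 9 bits of the virtual address: 39, 30, 21, 12. */
uint64_t translate(uint64_t cr3, uint64_t vaddr)
{
    uint64_t table = cr3 & ~0xfffULL;              /* PML4 physical base */
    for (int level = 3; level >= 0; level--) {
        unsigned idx   = (vaddr >> (12 + 9 * level)) & 0x1ff;
        uint64_t entry = phys_to_virt(table)[idx];
        if (!(entry & PRESENT))
            return raise_page_fault(vaddr);        /* the fault path     */
        table = entry & 0x000ffffffffff000ULL;     /* next table / frame */
    }
    return table | (vaddr & 0xfff);                /* frame + offset     */
}
```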

The kernel never runs the walk itself. It only writes the page tables — the data structures the walker reads. This division of labour is essential. Translation happens billions of times per second; if the kernel had to be involved in every access, the machine would not run at all. Instead the kernel pre-arranges the tables, points CR3 at them, and lets the silicon do the lookup. When the kernel switches between processes, it changes one register — CR3 — and from that nanosecond on, every memory access made by the CPU resolves through a different process's tables. This is what process isolation actually is: a different value in CR3 means a different graph of page tables means different physical pages reachable from the same virtual addresses.
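Seen from code, the memory half of a context switch is startlingly small. A sketch, assuming kernel-mode C with inline assembly:

```c
#include <stdint.h>

/* One privileged MOV. Writing CR3 also flushes the non-global TLB
   entries, so stale translations from the old process cannot leak
   into the new one. */
static inline void switch_address_space(uint64_t pml4_phys)
{
    asm volatile("mov %0, %%cr3" : : "r"(pml4_phys) : "memory");
}
```

Everything else a scheduler saves and restores is bookkeeping; this one store is what makes the next process unable to see the last one's memory.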

The TLB — a cache for translations

A four-level walk costs four memory accesses on top of the one the program asked for: a 5× tax on every load and store. No machine could afford that. The MMU therefore includes a tiny associative cache called the translation lookaside buffer, the TLB, which remembers recent virtual-to-physical translations. The TLB is small (perhaps 64 to a few thousand entries on modern cores), but locality of reference means a hit rate above 99% is normal. The walker runs only on misses.

Fig BR.5 — The TLB · why the walk usually doesn't happen

A virtual address arrives at the MMU. The TLB is checked first; on a hit, translation is essentially free. The page-table walker runs only on a miss. Modern CPUs achieve TLB hit rates above 99% on typical workloads, which means the four-level walk — apparently a 5× tax on memory access — actually happens less than 1% of the time. The TLB is what makes paging affordable.

One final detail. When the kernel changes a page-table entry — because a page was swapped out, a permission was revoked, or a process was killed — the TLB must be told. There is a privileged instruction for exactly this: INVLPG, which invalidates a single TLB entry. After INVLPG, the next access to that virtual page misses the TLB, forcing a fresh walk through the now-updated page tables. Extending the invalidation across every CPU that shares the page tables is called a TLB shootdown, and it is one of the most expensive ordinary operations a kernel performs. INVLPG is, of course, privileged — Section 01's #GP applies. A user process cannot tamper with the TLB, even to mount a denial of service. The whole arrangement is enforced below, not above, the operating system.
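In code, the local half of the operation is one instruction; the expensive part is persuading every other core to do the same. A sketch:

```c
/* Invalidate one stale translation on this CPU. Privileged: executed
   from ring 3 this raises #GP, exactly as Section 01 described. */
static inline void flush_one_page(void *vaddr)
{
    asm volatile("invlpg (%0)" : : "r"(vaddr) : "memory");
}

/* A cross-CPU shootdown is this plus an inter-processor interrupt to
   every core sharing the tables; each core runs flush_one_page() and
   acknowledges while the initiator waits. That wait is the cost. */
```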

04 — The Timer

The piece of silicon that makes multitasking possible

Imagine for a moment a CPU exactly like a modern one — privilege bit, traps, MMU, all of it — but with no clock that fires interrupts. A cooperative kernel could still exist: programs would yield voluntarily, the kernel would schedule the next one, and the world would mostly work. Until one program — an infinite loop, a pathological algorithm, an attacker — refused to yield. There would be no way to take the CPU back. The machine would freeze, not because anything was broken, but because the kernel had no way to interrupt a program that did not want to be interrupted.

The piece of silicon that solves this is the hardware timer — a free-running counter wired directly into the interrupt controller. Every modern CPU has at least one. On x86 the local APIC timer ticks at the bus frequency and fires interrupts at a programmable rate; on ARM a generic timer does the same. The kernel programs it once, at boot — say, "interrupt me every millisecond" — and from that moment forward, no matter what user code is doing, the timer fires. Control returns to the kernel. The kernel decides what to do next. Without this one piece of hardware, preemptive multitasking is impossible — and every modern operating system, from Linux to Windows to macOS to a phone OS, depends on it.
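Programming the tick is a handful of MMIO stores. The register offsets below are the architectural LAPIC layout; the count value is illustrative, since a real kernel first maps the LAPIC (physical 0xFEE0_0000 on typical x86 systems) and calibrates the count against a known clock.

```c
#include <stdint.h>

#define LVT_TIMER  (0x320 / 4)   /* timer local-vector-table entry  */
#define INIT_CNT   (0x380 / 4)   /* countdown start value           */
#define DIV_CFG    (0x3E0 / 4)   /* divider for the timer clock     */

/* Kernel-mapped virtual address of the LAPIC register block.
   volatile: every store must actually reach the silicon. */
extern volatile uint32_t *lapic;

void start_tick(uint32_t count)
{
    lapic[DIV_CFG]   = 0x3;               /* divide clock by 16       */
    lapic[LVT_TIMER] = (1u << 17) | 32;   /* periodic mode, vector 32 */
    lapic[INIT_CNT]  = count;             /* fire every `count` ticks */
}
```

Three stores at boot, and the kernel is guaranteed a hearing on every tick for as long as the machine is powered.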

Fig BR.6 — The timer interrupt loop · how the kernel takes the CPU back

A typical second of CPU time, sliced by timer interrupts. The timer fires at a programmed rate (often 100–1000 Hz). Each tick raises an interrupt at vector 32, which dispatches into the kernel's scheduler. The scheduler chooses what runs next — possibly the same process, possibly a different one. Process C is in an infinite loop and will never yield voluntarily; the timer is the reason that does not freeze the machine. Without this single piece of silicon, the kernel could only ask processes to yield, and any program that didn't would own the CPU forever.

Modern Linux can run its timer tick in three modes: HZ_PERIODIC (a tick at a fixed interval, the historical default), NO_HZ_IDLE (skip ticks while the CPU is idle, to save power), and NO_HZ_FULL (skip ticks even while running, used for high-frequency trading and scientific computing, where every tick is an unwanted disruption). The tick period is a tunable at the millisecond scale; the principle is unchanged. Whatever the rate, the underlying mechanism is hardware that interrupts the CPU independently of the running program.

Why the timer is irreducible. Privilege rings, traps, and the MMU all give the kernel authority over what user code can do. The timer is what gives the kernel a chance to exercise that authority. Without it, the kernel has the right to act but no way to ever begin acting if the user code refuses to call it. Multitasking, time-sharing, fairness, deadline scheduling, even the ability to kill a misbehaving program — all rest on this one circuit firing on a schedule the user cannot disable.

05 — Talking to the World

How the CPU reaches beyond itself

The CPU on its own can compute, but it cannot type, see, hear, or speak. To do anything observable in the world, it has to communicate with devices: keyboards, screens, disks, network cards, audio chips, GPUs. The path from CPU to device is hardware too — and like everything else in the Bridge, the kernel is the only software that gets to use it.

Two channels carry every conversation between CPU and device. The first is memory-mapped I/O: device registers are placed at fixed addresses in the physical address space, and reading or writing those addresses is how the CPU talks to the device. The second is direct memory access: the device, on its own initiative, copies data into or out of RAM without asking the CPU on every byte. Both are coordinated by interrupts — the device tells the CPU "I'm done" by raising an IRQ, which goes through the same trap machinery from Section 02.

Memory-mapped I/O — devices live at memory addresses

Fig BR.7 — The physical address space · devices among the RAM
The physical address space is a single flat 64-bit map. Representative regions, top down:

0xFFFE_0000   BIOS / UEFI firmware ROM (read-only at runtime)
0xFEC0_0000   IO-APIC — writing here programs interrupt routing
0xFE00_0000   LAPIC · timer — interrupt controllers
0xA000_0000   PCIe MMIO — NIC, NVMe, USB controllers · GPU frame buffer
0x0000_0000   DRAM — your program's memory, page tables, kernel, all processes

Writing to the frame buffer puts pixels on screen at the next refresh; writing to PCIe MMIO is a device register write (the NIC sends a packet, the NVMe queues an op); writing to DRAM is an ordinary store and nothing is sent anywhere. Same instruction, different effect entirely, determined by the address.

The CPU sees one flat physical address space. Some of those addresses are RAM; some are a video framebuffer; some are device registers on PCIe controllers; some are interrupt-routing chips; some are firmware ROM. The CPU doesn't know which is which — it just issues a load or a store. The chipset routes the access to the right place. To send a packet, the kernel writes to the NIC's MMIO region. To draw a pixel, the kernel writes to the framebuffer. To program the timer's tick rate, the kernel writes to the LAPIC's MMIO region. The same MOV instruction does all of these — only the address differs. And only the kernel can map any of these regions into a process's address space (Section 03), so only the kernel can reach them.
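That point, that only the address distinguishes a memory store from a device command, is visible in two near-identical C functions. The NIC register offset here is hypothetical; the volatile qualifier is the real content, forcing the compiler to emit exactly one store to exactly that address.

```c
#include <stdint.h>

void store_to_ram(uint32_t *buf, uint32_t v)
{
    *buf = v;                      /* ordinary store: data sits in DRAM */
}

void ring_nic_doorbell(volatile uint32_t *nic_mmio, uint32_t tail)
{
    nic_mmio[0x18 / 4] = tail;     /* same MOV, but the chipset routes
                                      it to the NIC: a packet goes out */
}
```

Without volatile, the compiler could coalesce or delete the device store as a "redundant" write, which is correct for RAM and catastrophic for a doorbell register.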

DMA — when the device copies on its own

MMIO is fine for sending small commands ("read sector 4096 into the buffer at 0x4a200000"). It is not fine for the data itself. A modern NVMe SSD reads gigabytes per second; if the CPU had to copy every byte one MMIO load at a time, the CPU would do nothing else, and the SSD would still be the slow part. The hardware solution is direct memory access: the device is a bus master in its own right, and once the CPU has told it where to read and where to write, it shovels the data on its own. The CPU is free to run something else; when the device is done it raises an interrupt and the kernel takes over to clean up.

Fig BR.8 — Direct memory access · the device writes RAM by itself

A typical disk read with DMA. Step 1: the kernel does a small MMIO write to the SSD's command register, telling it where to put the result. Step 2: the SSD reads its sectors and writes them straight into RAM, on the memory bus, without involving the CPU. The CPU is free to run other processes during this transfer. Step 3: when the SSD is done, it raises an IRQ; the trap machinery from Section 02 dispatches the kernel's interrupt handler; the kernel marks the I/O complete and wakes whichever process was waiting. For a multi-megabyte read this is the difference between using one core's full attention and using essentially none of it.
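As code, the three steps look like this. The controller's register block, the field names, and the polling loop are all hypothetical (a real driver sleeps instead of spinning, and a real device has submission queues); the shape is the real pattern: one small command, zero CPU copies, one completion interrupt.

```c
#include <stdint.h>

/* Imaginary disk controller: one MMIO-mapped command block. */
struct dma_cmd {
    uint64_t sector;      /* where the device starts reading   */
    uint64_t dst_phys;    /* physical RAM address it writes    */
    uint32_t count;       /* bytes to transfer                 */
    uint32_t go;          /* writing 1 starts the transfer     */
};

extern volatile struct dma_cmd *disk;   /* mapped by the kernel   */
static volatile int io_done;            /* set by the IRQ handler */

void read_sector_dma(uint64_t sector, uint64_t buf_phys, uint32_t len)
{
    io_done        = 0;
    disk->sector   = sector;      /* step 1: small MMIO command  */
    disk->dst_phys = buf_phys;
    disk->count    = len;
    disk->go       = 1;           /* device begins its own copy  */

    while (!io_done)              /* step 2: stand-in for sleeping;
                                     the CPU would run other work */
        ;
    /* step 3 already happened: the IRQ handler below ran, and the
       data is in RAM without the CPU touching a byte of it */
}

void disk_irq_handler(void)       /* dispatched through the IDT   */
{
    io_done = 1;
}
```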

DMA introduces a security wrinkle. A device that can write any address in RAM is a dangerous device — a malicious or buggy network card could overwrite kernel code. Modern systems address this with another piece of silicon, the IOMMU: a second MMU that translates and bounds-checks the addresses devices use, exactly as the regular MMU does for the CPU. With the IOMMU, a NIC can be told it may DMA only into a specific buffer; an attempt to reach kernel memory raises a fault. The IOMMU is the reason VFIO and PCI passthrough into virtual machines is safe. We mention it once here; the rest of the kernel-as-program story belongs to Part II.

06 — Synthesis

What the kernel must therefore be

We can now answer the question Part I has been building toward — quietly, since Chapter 1 — without yet having looked at a single line of kernel source. Given everything the silicon provides, what shape must the kernel be? What is the minimum program that turns five hardware features into the abstractions every modern computer offers?

Each abstraction we use without thinking corresponds, almost one to one, to a hardware feature the previous five sections introduced. There is no abstraction without a corresponding piece of silicon to enforce it; and there is no piece of silicon that becomes useful until a program — the kernel — sets it up correctly at boot. The diagram below maps the chain.

Fig BR.9 — Hardware features → kernel → abstractions

Read top to bottom: each hardware feature in the upper row enables one (or more) of the abstractions in the lower row. The kernel is the middle layer — the program that takes the raw silicon primitives and arranges them into something programs can use. Without the privilege bit, there is no root vs user. Without the trap, there is no system call API. Without the MMU, processes share one memory space. Without the timer, multitasking is impossible. Without MMIO and DMA, the CPU cannot reach beyond itself. The whole tower of modern operating systems sits on these five features and the few thousand lines of kernel that bind them together at boot.

The kernel as the smallest program that arranges the silicon

Note what this argument does not say. It does not say the kernel is small, or that it does only what hardware demands. Linux is more than thirty million lines of code; macOS's XNU and Windows's NT are smaller but still vast. They contain filesystems, network stacks, GPU drivers, audio, USB, Bluetooth, virtual machines, container runtimes, schedulers, security frameworks, and the long tail of a thousand specialised subsystems. None of that is required by the silicon. What the silicon requires is much smaller: a kernel that sets up page tables, registers a trap handler, programs a timer, and gets out of the way.

The rest is engineering — and that engineering is what Part II is about. Part II takes the kernel as a real program and asks the questions only a piece of code can answer: how should processes be scheduled? How should virtual memory be organised in software, on top of the hardware tables we now know about? What data structure connects a filename to a sequence of disk blocks? How do processes talk to each other? What goes wrong when the kernel itself has bugs? None of those questions live in silicon. They live in code.

But every one of them assumes — silently, as a matter of course — that the bridge we have just walked across already exists. That somewhere underneath the schedule() function is a timer that called it. That somewhere underneath the read() system call is a SYSCALL instruction and an IDT entry. That somewhere underneath every memory dereference is a four-level page-table walker doing the translation in nanoseconds. The kernel chapter that opens Part II will not stop to mention these things. It does not have to. They are what Part I has built. They are what the Bridge is for.

The kernel is software. The reason the kernel can exist at all is hardware.

— the lesson Part I has been building to

Part I closes here. From sand to gates, from gates to instructions, from instructions to the boundary that makes a kernel possible. Part II opens the kernel and looks inside.

End of Part I

The Substrate is built.

From sand to gates, from gates to instructions. Part II opens the program that owns this hardware — the kernel — and then turns to the languages people write in.