Why does any of this exist?
To understand a modern computer, you have to understand that it was never inevitable. For most of human history, calculation was something humans did. The word computer itself originally referred to a person — usually a woman — employed to perform arithmetic by hand. The machine you are reading this on replaced an entire profession.
The transition from human computer to machine computer required a particular kind of insight: that thinking itself could be mechanized. Not all thinking — Alan Turing was careful about this — but a specific kind. The kind that follows rules. The kind that, given the same input, always produces the same output. Mathematics, in other words. And once you have a machine that can do mathematics, you have a machine that can do anything mathematics can describe. And mathematics, it turns out, can describe a remarkable amount of the world.
The 1936 paper that changed everything
In 1936, a 24-year-old Cambridge mathematician named Alan Turing published a paper titled On Computable Numbers, with an Application to the Entscheidungsproblem. It had nothing to do with machines, ostensibly. It was about a question David Hilbert had posed in 1928: is there a mechanical procedure that, given any mathematical statement, will decide whether it can be proved?
Turing's answer was no — but the way he proved it was extraordinary. To show that no such procedure could exist, he had to first define what "mechanical procedure" meant precisely. He invented an imaginary device: an infinite tape divided into cells, a read/write head that could move along the tape, and a finite set of rules that said what to do based on the current symbol and the machine's current state.
He called it an a-machine — automatic machine. We now call it a Turing machine. And he proved something that took decades to fully appreciate: a single, sufficiently complex Turing machine could simulate any other Turing machine, given a description of that machine on its tape. He called this a universal machine. It is the first formal description of what we now call a computer.
A Turing machine has three parts: an infinite tape of symbols, a head that reads and writes one cell at a time, and a finite table of rules. Each rule says: given the current state and the symbol under the head, write a new symbol, move left or right, and change to a new state. The head crawls along the tape, rewriting cells and handing control from one state to the next. Anything that can be computed can be computed by a machine like this.
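To make the rule table concrete, here is a toy simulation — a sketch in C, not anything from Turing's paper — of a machine whose only job is to flip every bit on the tape and halt at the first blank cell. The tape, the head, the state, and the rule table are all present, just very small.

```c
/* A toy Turing machine: the tape is finite here, and the "rule table" is a
   three-way branch, but the moving parts are the same ones Turing described. */
#include <stdio.h>
#include <string.h>

#define TAPE_LEN 32
#define BLANK '_'

int main(void) {
    char tape[TAPE_LEN + 1] = "1011001";             /* finite stand-in for the infinite tape */
    memset(tape + strlen(tape), BLANK, TAPE_LEN - strlen(tape));
    tape[TAPE_LEN] = '\0';

    int head = 0;                                    /* position of the read/write head */
    enum { SCANNING, HALTED } state = SCANNING;      /* the machine's current state */

    while (state != HALTED) {
        char symbol = tape[head];
        /* Rule table: (state, symbol) -> (symbol to write, head move, next state) */
        if (symbol == '0')      { tape[head] = '1'; head++; }
        else if (symbol == '1') { tape[head] = '0'; head++; }
        else /* BLANK */        { state = HALTED; }
    }
    printf("%s\n", tape);                            /* 0100110 followed by blank cells */
    return 0;
}
```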
"We can only see a short distance ahead, but we can see plenty there that needs to be done."
— Alan Turing, 1950

The war that built the first computers
Theory met urgency in 1939. Britain needed to break German naval codes — the Enigma cipher — faster than humans could. At Bletchley Park, Turing designed the Bombe, an electromechanical machine that automated the search through Enigma's key settings. Later, the engineer Tommy Flowers built Colossus, the first programmable electronic digital computer, used to break the higher-level Lorenz cipher that carried messages from Hitler's high command. These machines were not Turing machines in his theoretical sense — they were special-purpose. But they proved that mechanical computation worked at scale.
Why Enigma was so hard: the same key pressed three times in a row gives three different letters, because the rotors step between presses and change the internal wiring path. Multiplied across the rotor order, ring settings, and plugboard pairings — all changed daily — the keyspace was about 158 × 10¹⁸ possibilities. No pencil-and-paper attack could keep up. Turing's Bombe automated the search by exploiting structural weaknesses in the encryption: it tested rotor positions in parallel, stopping only when a hypothesis produced no logical contradiction. It was the first time a machine was built specifically to think through a problem human minds could not.
Historians have estimated that the Bombe and Colossus shortened the war by about two years, saving — by some estimates — as many as fourteen million lives. The machines worked. The question, after the war, became: how do you build one that isn't purpose-built for a single problem? How do you build a universal machine — Turing's theoretical idea, in actual hardware?
Historical note. Turing was convicted of "gross indecency" in 1952 — homosexuality was then a crime in Britain — and underwent court-ordered chemical castration in lieu of prison. He died in 1954, officially of cyanide poisoning, likely suicide. He never saw the computer revolution he made possible. Britain issued a formal apology in 2009. He now appears on the £50 note.
Everything is sand
The first general-purpose electronic computer, ENIAC (1945), was built from vacuum tubes — glass bulbs that controlled electrical current using a heated filament inside a vacuum. ENIAC had 17,468 of them. It filled a room thirty meters long, weighed thirty tons, consumed 150 kilowatts of power (enough, according to a persistent legend, to dim the lights of Philadelphia when it switched on), and broke down every day or two because a tube had burned out. Its mean time between failures was measured in hours and days, not years.
This was unsustainable. To make computers smaller, faster, more reliable, and affordable, the vacuum tube had to be replaced. The replacement was invented in 1947 at Bell Labs in New Jersey by three physicists — William Shockley, John Bardeen, and Walter Brattain. They called it the transistor, a contraction of transfer resistor.
Both devices are amplifiers and switches — both take a small input signal and use it to control a much larger one. The vacuum tube does it by boiling electrons off a heated filament inside an evacuated glass bulb, then steering them toward a positively charged plate using a control grid in between. The transistor does the same job by gating electrons through a sliver of doped silicon — no glass, no filament, far less heat, nothing to wear out. Once you can build one, you can build a billion of them; once you can build a billion, you have a CPU.
What a transistor actually is
A transistor is a switch. Specifically, it is a switch with no moving parts, controlled by electricity rather than by a finger or a relay arm. It has three terminals: a base (or gate, in the modern field-effect design), a collector (or drain), and an emitter (or source). When you apply a small voltage to the base, it lets a much larger current flow between collector and emitter. No voltage at the base, no flow.
That's it. That is the entire foundational mechanism of every computer ever built. Everything else — every program, every webpage, every video game, every neural network — is layers of abstraction built on top of switches turning on and off.
A transistor with no voltage at the base is open: no current flows from collector to emitter. With a small voltage at the base, the channel opens and current flows — electrons drift through the channel for as long as the gate signal is present. A modern CPU contains tens of billions of these, switching at gigahertz speeds.
Why semiconductors
The transistor works because it is built from a semiconductor — a material whose conductivity sits between a conductor (like copper) and an insulator (like glass), and which can be precisely tuned by adding tiny amounts of impurities. Almost all modern transistors are built from silicon, which is the second most abundant element in Earth's crust. Sand is mostly silicon dioxide. The entire digital economy is, in a literal sense, built on purified sand.
Underneath the schematic symbol is a stack of materials. The substrate is silicon doped with electron-acceptor atoms (P-type). Two regions on either side are doped with electron donors (N⁺) — these are the source and drain. Between them, an insulating layer of silicon dioxide separates the substrate from a conducting gate. With no voltage on the gate, no path exists between source and drain. Apply a small positive voltage and the substrate just under the oxide inverts — a thin layer of free electrons forms a conducting channel, and current flows. This effect — the field effect — is why these are called field-effect transistors, or MOSFETs (Metal-Oxide-Semiconductor Field-Effect Transistors). Every modern CPU is built from billions of these, fabricated at process nodes marketed as a few nanometres.
In 1958, Jack Kilby at Texas Instruments etched multiple transistors onto a single piece of germanium, creating the first integrated circuit. Within months, Robert Noyce at Fairchild Semiconductor independently did the same on silicon, with a more manufacturable process. Both filed patents. Both were right. Noyce's process became the basis of the entire industry.
In 1965, Gordon Moore — co-founder of Intel and a former colleague of Noyce — published an observation: the number of transistors that could fit on a single integrated circuit was doubling roughly every twelve months, and would likely continue to do so. He later revised this to every two years. This pattern, Moore's Law, held with remarkable accuracy for fifty years and drove the entire semiconductor industry's roadmap. It is why your phone, with perhaps fifteen billion transistors, exists at all.
Plotted on a logarithmic y-axis, transistor counts fall almost on a straight line — which is exactly what exponential growth looks like in log space. The slope corresponds to a doubling roughly every two years. The pattern began to slow after 2015 as transistor features approached atomic scales, but it has lasted longer than nearly any technological prediction in history.
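To get a feel for what doubling every two years compounds to, here is a back-of-the-envelope sketch in C — the starting point and cadence are idealized, and real chips did not track the curve exactly:

```c
/* An idealized Moore's Law projection: start from the Intel 4004's 2,300
   transistors in 1971 and double every two years. Real devices deviate. */
#include <stdio.h>

int main(void) {
    double transistors = 2300;                       /* Intel 4004, 1971 */
    for (int year = 1971; year <= 2023; year += 2) {
        if (year % 10 == 1)                          /* print one sample per decade */
            printf("%d: %.0f\n", year, transistors);
        transistors *= 2;
    }
    /* By 2021 the projection is in the tens of billions of transistors —
       the same ballpark as today's largest CPUs and GPUs. */
    return 0;
}
```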
1947 — the transistor. Shockley, Bardeen, and Brattain demonstrate the first working transistor on December 23, 1947. Nobel Prize 1956. The replacement of vacuum tubes makes miniaturization possible.
1958–59 — the integrated circuit. Kilby (TI) and Noyce (Fairchild) independently put multiple transistors on one chip. Kilby uses germanium, Noyce uses silicon. Silicon wins because it forms a superior native oxide insulator.
1965 — Moore's Law. Gordon Moore predicts transistor density doubles every year (later revised to two years). The prediction becomes a self-fulfilling industrial roadmap.
1971 — the Intel 4004. 2,300 transistors on a 10 μm process. 740 kHz clock. Originally designed for a Japanese calculator manufacturer. Marks the transition from "computers fill rooms" to "computers fit in pockets."
2024 — the Apple M4. 28 billion transistors. 3-nanometer process. Each transistor smaller than the width of a flu virus. Moore's Law is slowing — physical limits are being reached — but it has lasted over half a century.
Both machines compute. ENIAC filled a thirty-metre hall; the M4 fits under your thumbnail and contains roughly 1.6 million times as many switches. Both fundamentally do the same thing — flip switches according to instructions — but the engineering distance between them is the entire history of the integrated circuit. Put side by side, the actual size ratio between them is roughly 2,000 to 1.
Von Neumann's insight: instructions are data
Early computers were hardwired. ENIAC, to be reprogrammed, had to be physically rewired — operators (almost all women, called "the ENIAC girls") would unplug and replug cables for days to set up a new calculation. The hardware encoded the program. To change what the machine did, you changed the machine.
In 1945, John von Neumann — a Hungarian-American polymath who had worked on the Manhattan Project and had a hand in nearly every important mathematical development of the mid-20th century — circulated a paper called the First Draft of a Report on the EDVAC. It described a different kind of architecture, one that would become the dominant design for every computer built afterward.
His insight, distilled: store the program itself in memory, alongside the data it operates on. Instructions are just patterns of bits. Data is just patterns of bits. They can live in the same memory. The CPU reads instructions from memory exactly the same way it reads data — which means a program can read another program, modify it, and then run it. A machine can write programs.
This is not a small idea. It is the difference between a machine that does one thing very well and a machine that, given the right instructions, can do anything. The compiler that turns C code into machine code is a program. The web browser running on your computer is a program. The kernel that runs the browser is a program. They all live as data in memory until the CPU reads them as instructions.
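You can see the collapse of the instruction/data distinction from an ordinary program. The sketch below — which assumes a typical Linux or macOS toolchain, where code pages are mapped readable — takes the address of a function and prints its first few bytes. The machine code is just bytes at an address, readable like any other data:

```c
/* Instructions are data: print the first bytes of add()'s machine code.
   (Casting a function pointer to a data pointer is a common extension,
   not strict ISO C; it works on mainstream platforms.) */
#include <stdio.h>

int add(int a, int b) { return a + b; }

int main(void) {
    const unsigned char *code = (const unsigned char *)add;   /* code as data */
    for (int i = 0; i < 8; i++)
        printf("%02x ", code[i]);    /* e.g. "8d 04 37 c3 ..." with clang -O2 on x86-64 */
    printf("\n");
    return 0;
}
```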
Von Neumann architecture in its essence: one shared memory holds both program instructions and data. The CPU fetches an instruction across the bus, decodes it, executes it in the ALU, and writes results back to memory or registers. Then it fetches the next instruction. Nearly every computer built since 1945 follows this design, cycling through that loop billions of times per second.
The Von Neumann bottleneck
The architecture has one famous weakness, and it is a consequence of its core design choice. Because instructions and data share the same memory, they share the same bus — the same physical wires between the CPU and main memory. The CPU can only do one transfer at a time. And while CPUs got dramatically faster over the decades (clock speeds rose from kilohertz to gigahertz), main memory became faster much more slowly. Today, a modern CPU can perform a basic operation in under a nanosecond. A round trip to main RAM takes about a hundred nanoseconds. The CPU spends most of its time waiting.
The solution is a hierarchy of caches — small, fast memory built directly into the CPU that holds copies of recently used data. We'll cover this in detail later in this chapter. For now, it's enough to know that this bottleneck is one of the deepest design constraints of modern computing, and essentially every CPU optimization since the 1980s has been an attempt to work around it.
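A rough way to feel the bottleneck is to chase pointers through an array far too large to cache, so that every access pays the trip to DRAM. The sketch below is illustrative rather than a rigorous benchmark — it assumes a POSIX system with clock_gettime and a few hundred megabytes of free memory — but on typical hardware the small, cache-resident walk costs a nanosecond or two per access while the large one costs tens of nanoseconds:

```c
/* Pointer chasing: each load depends on the previous one, so the CPU cannot
   hide the latency. A cache-sized array is fast; a RAM-sized array is not. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double ns_per_access(size_t n, size_t steps) {
    size_t *next = malloc(n * sizeof *next);
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {         /* Sattolo shuffle: one big cycle, */
        size_t j = (size_t)rand() % i;           /* so the walk visits every element */
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t s = 0; s < steps; s++) p = next[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile size_t sink = p; (void)sink;        /* keep the loop from being optimized away */
    free(next);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / (double)steps;
}

int main(void) {
    printf("~32 KB  (fits in cache): %5.1f ns per access\n", ns_per_access(4u << 10, 10000000));
    printf("~256 MB (spills to RAM): %5.1f ns per access\n", ns_per_access(32u << 20, 10000000));
    return 0;
}
```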
Fetch. Decode. Execute. Repeat.
Every CPU ever built does the same four things in a loop — fetch an instruction, decode it, execute it, write back the result — billions of times per second. This loop is the heartbeat of every program you have ever run.
The instruction cycle is the fundamental unit of computation at the hardware level. A modern CPU executes billions of these per second on each of its cores. Everything else — your operating system, your browser, your music, this page — is just a particular sequence of instructions that the cycle runs through.
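The loop is simple enough to write down. The sketch below is a toy machine with an invented three-instruction "ISA" — nothing like real x86 or ARM — but the skeleton (fetch the word at the program counter, decode it, execute it, advance) is the same skeleton a real CPU implements in silicon:

```c
/* A toy fetch-decode-execute loop. The opcodes LOAD, ADD, HALT are invented
   for illustration; the machine has one register and a six-word memory. */
#include <stdio.h>

enum { HALT = 0, LOAD = 1, ADD = 2 };

int main(void) {
    int memory[] = { LOAD, 5, ADD, 3, HALT, 0 };  /* the program, sitting in memory as data */
    int pc  = 0;                                  /* program counter */
    int acc = 0;                                  /* accumulator register */

    for (;;) {
        int opcode  = memory[pc];                 /* fetch */
        int operand = memory[pc + 1];
        pc += 2;                                  /* advance to the next instruction */
        switch (opcode) {                         /* decode ... */
            case LOAD: acc = operand;  break;     /* ... execute and write back */
            case ADD:  acc += operand; break;
            case HALT: printf("acc = %d\n", acc); return 0;   /* prints acc = 8 */
        }
    }
}
```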
What an instruction actually is
At the hardware level, every instruction is simply a binary number — a specific pattern of ones and zeros. The CPU's control unit is a circuit that, when given a particular pattern, activates a particular sequence of internal signals. Different patterns trigger different operations. The mapping from binary patterns to operations is called the instruction set architecture, or ISA. Intel and AMD CPUs use the x86 ISA. Apple Silicon, your phone, the Raspberry Pi, and every modern Mac use ARM. They are mutually incompatible — code compiled for one will not run on the other unless translated.
Assembly language is just a human-readable label for these patterns. The mnemonic mov eax, 5 is the assembler's name for the binary instruction 10111000 00000101 00000000 00000000 00000000 on x86. They mean exactly the same thing — assembly is a one-to-one translation. We will spend all of Chapter 3 on this.
```asm
; A simple addition program
; This is roughly what C's "int x = 5 + 3;" compiles to

mov eax, 5      ; load the value 5 into register EAX
mov ebx, 3      ; load the value 3 into register EBX
add eax, ebx    ; EAX = EAX + EBX, result is now 8

; In memory, "add eax, ebx" is just two bytes: 01 D8
; The CPU's decoder maps this pattern to "ALU add" with
; source EBX and destination EAX. Then the ALU performs it.
```
A single x86 instruction in memory, taken apart byte by byte. The first byte is the opcode — but it isn't atomic: its top five bits encode the operation type ("move 32-bit immediate into register"), and its bottom three bits encode which register (000 = EAX). The next four bytes are the literal value 5, stored in little-endian order — lowest byte first, which is why 05 appears closest to the opcode and the zero padding follows. The CPU's instruction decoder is a circuit that recognises these bit patterns and routes them to the appropriate units in a fraction of a nanosecond.
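Little-endian storage is easy to verify from C. This sketch assumes a little-endian machine — which covers x86 and ARM in their usual configurations — and prints the bytes of the integer 5 exactly as they sit in memory:

```c
/* The four bytes of the 32-bit value 5, lowest byte first — the same order
   they appear in the encoded instruction above. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t value = 5;
    const unsigned char *bytes = (const unsigned char *)&value;
    for (int i = 0; i < 4; i++)
        printf("%02x ", bytes[i]);   /* prints: 05 00 00 00 */
    printf("\n");
    return 0;
}
```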
Pipelining and parallelism inside one core
A naive CPU would do each step of the instruction cycle one at a time, finishing one instruction completely before starting the next. Modern CPUs do not. They use pipelining — overlapping the stages of different instructions, like an assembly line. While instruction 3 is being executed, instruction 4 is being decoded, and instruction 5 is being fetched. A modern pipeline may have 14 to 20 stages.
Each instruction still takes five cycles to fully complete — but the CPU works on five instructions simultaneously, each in a different stage. The diagonal pattern in a pipeline diagram is the signature of this overlap: a new instruction enters the fetch stage every cycle, and a finished instruction leaves the writeback stage every cycle. Modern CPUs have 14 to 20 pipeline stages, multiple execution units of each kind, and complete several instructions per cycle.
They also use out-of-order execution — if instruction 5 doesn't depend on instruction 4's result, the CPU may execute 5 first while 4 waits for memory. And speculative execution: if there's a branch (an if statement), the CPU guesses which way it will go and starts executing that path before the condition has been computed. If it guesses right, it saves time. If it guesses wrong, it discards the work. Modern CPUs guess right about 95% of the time.
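The predictor's effect is easy to provoke from C. The sketch below runs the same branchy loop twice over the same values — first in random order, then sorted. With modest optimization (say, -O1, so the compiler keeps the branch rather than replacing it with branchless code), the sorted pass is typically severalfold faster purely because the branch becomes predictable; exact numbers vary by machine:

```c
/* Branch prediction demo: the data is identical in both passes,
   only its order changes. Needs a POSIX clock_gettime. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

static volatile long long sink;                      /* defeats dead-code elimination */

static int cmp(const void *a, const void *b) { return *(const int *)a - *(const int *)b; }

static double time_loop(const int *data) {
    struct timespec t0, t1;
    long long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int pass = 0; pass < 100; pass++)
        for (int i = 0; i < N; i++)
            if (data[i] >= 128)                      /* this branch is the whole experiment */
                sum += data[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = sum;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    int *data = malloc(N * sizeof *data);
    for (int i = 0; i < N; i++) data[i] = rand() % 256;

    printf("random order: %.2f s\n", time_loop(data));   /* branch is unpredictable */
    qsort(data, N, sizeof *data, cmp);
    printf("sorted order: %.2f s\n", time_loop(data));   /* branch is almost always right */
    free(data);
    return 0;
}
```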
When the CPU reaches a conditional branch, it can't afford to wait for the condition to be computed — that would idle the pipeline for many cycles. So it guesses, based on the branch's history, and starts executing one path immediately. If the guess turns out right, the speculative work is committed and time was saved. If wrong, the work is discarded and the alternate path begins. Modern CPUs guess right about 95% of the time. The catch — exposed in 2018 — is that "discarded" only means the architectural state is rolled back. Microarchitectural side effects, especially in caches, persist. Code that "shouldn't have run" leaves a fingerprint — the cache lines it touched — and a careful attacker can read that fingerprint.
Spectre and Meltdown (2018). Two catastrophic CPU vulnerabilities, disclosed in 2018, affecting nearly every processor made since 1995. They exploited speculative execution: when the CPU guessed wrong and discarded the work, traces of that work remained in the cache — traces an attacker could measure to read memory they should not have access to. Hardware has bugs too, and they are far harder to fix than software bugs. We will return to this in Chapter 15.
Why the kernel had to be invented
In the early 1950s, running a program meant booking the entire computer for yourself. You brought your stack of punched cards to the machine room, the operator loaded them, the machine ran your program, printed the output, and you came back an hour later to read it. One program at a time. No sharing.
For nearly two decades, this was the entire user experience of computing. The machine was a precious shared resource; the user — a programmer, scientist, or engineer — was a supplicant who handed over a stack of cards and waited. As computers got faster and programs got longer, the arrangement became absurd: the CPU sat idle most of the time, waiting for slow input/output devices like punch card readers and magnetic tape drives, or for the operator to load the next deck. Universities started asking the obvious questions: can multiple programs share a computer? Can one program run while another waits for I/O? Can different users be logged in simultaneously? The kernel was the answer.
The answer was yes — but only if something managed the sharing. That something became the operating system kernel. The kernel is the one program that always runs. It owns the hardware. Every other program must ask it for permission to do anything that touches the outside world.
x86 CPUs have four privilege levels (rings 0–3), but in practice operating systems use only two: Ring 0 (kernel mode) and Ring 3 (user mode). Code in Ring 0 can execute any CPU instruction and access any memory. Code in Ring 3 cannot. The boundary is enforced by the CPU hardware itself, not by software. When an ordinary user program executes a syscall instruction, the CPU traps into Ring 0, the kernel handles the request, and control returns to user space — typically in less than a microsecond; a busy desktop performs millions of these transitions per second.
The system call: crossing the boundary
When your Python script opens a file, it doesn't access the disk directly. Your program — running in Ring 3 — has no permission to talk to the disk controller. Instead, it calls open() in Python, which passes through the C library, which executes a special CPU instruction (syscall on modern x86-64). The CPU saves the program's state, switches to Ring 0, and jumps to a kernel entry point. The kernel then checks whether your process has permission to read that file, finds it on disk, and returns a numeric handle (a file descriptor) to your program.
Your program never touched the hardware. It asked the kernel, the kernel decided. This is the entire foundation of operating system security. Without this boundary, every program would have full access to every other program's memory, every file, every network packet. With it, programs are isolated from each other by hardware-enforced rules.
```c
// What you write in C:
FILE *f = fopen("data.csv", "r");

// What the C library does internally:
int fd = syscall(SYS_open, "data.csv", O_RDONLY, 0);

// What happens at the CPU level:
//   1. The syscall instruction triggers a switch from Ring 3 to Ring 0
//   2. The kernel reads the syscall number (SYS_open == 2 on Linux)
//   3. It dispatches to sys_open() inside the kernel
//   4. sys_open checks process permissions against file ownership
//   5. It walks the filesystem (directory tree) to find the file
//   6. It allocates a file descriptor in this process's table
//   7. It switches back to Ring 3 and returns the descriptor (e.g. 3)
//
// Your program sees: fd == 3. It never touched the disk hardware.
```
UNIX and the kernels we still use
In 1969, at Bell Labs, Ken Thompson and Dennis Ritchie designed an operating system called UNIX. Its design was austere and elegant: everything is a file. A regular file is a file. A keyboard is a file. A network connection is a file. A running process exposes a directory of files describing it. All of them accessed through the same system call interface: open, read, write, close.
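"Everything is a file" is easy to see from a few lines of C. The sketch below assumes a Linux system with /proc mounted: it reads information about the running process itself through the very same fopen/fgets calls it would use on an ordinary text file:

```c
/* Read the kernel's description of this very process through the file API. */
#include <stdio.h>

int main(void) {
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");        /* "self" = the calling process */
    if (!f) { perror("fopen"); return 1; }
    for (int i = 0; i < 5 && fgets(line, sizeof line, f); i++)
        fputs(line, stdout);                          /* name, state, PIDs, ... */
    fclose(f);
    return 0;
}
```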
UNIX became, directly or indirectly, the ancestor of nearly every operating system in current use. Linux is a UNIX-like kernel written from scratch by Linus Torvalds in 1991. macOS uses a kernel called XNU, derived from a hybrid of Mach and BSD (a UNIX descendant). iOS and Android are both built on UNIX-derived kernels. Even Windows, originally not UNIX-based, now ships with a Linux subsystem. The architectural ideas in UNIX — processes, file descriptors, the system call interface — define what an operating system is.
Why this matters for security. A privilege escalation attack is one in which an unprivileged process — your malicious program running in Ring 3 — finds a way to gain Ring 0 access. If it succeeds, it has full control of the machine: it can read any file, watch any keypress, install any rootkit. Every major operating system has had privilege escalation vulnerabilities. The kernel is the most security-critical code on any computer. We will spend significant time on this in Part II's kernel chapter and again in Part IV's unified security chapter.
The hierarchy of forgetting
Memory in a computer is not one thing. It is a hierarchy of increasingly larger, slower, cheaper storage, managed at different levels by the CPU and the kernel. Each level holds a copy of part of the level below it. Closer to the CPU means faster but smaller. Further away means larger but slower.
The numbers below tell the entire story of why optimizing software is hard. The CPU performs an arithmetic operation in roughly 0.3 nanoseconds. A round trip to main memory takes 60 nanoseconds — two hundred times longer. A read from a fast SSD takes 50,000 nanoseconds — 167,000 times longer than a register access. Most of computer architecture for the past forty years has been about hiding this gap.
| Level | Location | Typical size | Access time | Managed by |
|---|---|---|---|---|
| Registers | Inside the CPU core | ~1 KB | 1 cycle (~0.3 ns) | Compiler, CPU |
| L1 Cache | On the CPU die | 32–64 KB | 4 cycles (~1 ns) | CPU hardware |
| L2 Cache | On the CPU die | 256 KB – 1 MB | 12 cycles (~4 ns) | CPU hardware |
| L3 Cache | On the CPU package | 8–64 MB | 40 cycles (~13 ns) | CPU hardware |
| RAM (DRAM) | On the motherboard | 8–128 GB | ~100 cycles (~60 ns) | OS kernel |
| SSD (NVMe) | On the PCIe bus | 256 GB – 4 TB | ~50,000 ns | OS + filesystem |
| HDD (spinning disk) | SATA bus | 1–20 TB | ~5,000,000 ns | OS + filesystem |
Seen as a chart rather than a table, the gaps are stark: each level is orders of magnitude larger and slower than the one above it — registers and caches respond almost instantly, the SSD crawls by comparison, and the spinning disk is essentially standing still. Most of computer architecture for the past forty years — caching, prefetching, pipelining, branch prediction, parallelism — exists to hide this gap from the program.
The hierarchy works because programs do not access memory uniformly. They keep returning to the same regions over and over — looping over an array, calling a function repeatedly, reading the next word in a string. This is called locality of reference, and it is the reason a small fast cache speeds up a much larger slow memory by an enormous factor. The math is unforgiving but simple:
T_avg = h · t_cache + (1 − h) · t_RAM
If 95% of accesses hit L1 cache (1 ns) and only 5% miss into RAM (60 ns), the average access time is 0.95 × 1 + 0.05 × 60 = 3.95 ns. Without the cache, every access would cost 60 ns. The cache makes the whole system roughly fifteen times faster — and it does this with a hit rate that needs to be high but doesn't need to be perfect. The same calculation justifies every layer of the hierarchy: a small fast tier above a larger slow one, exploiting the fact that the next thing a program needs is usually close to the last thing it needed.
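Locality is easy to demonstrate. The sketch below sums the same matrix twice — once along rows, the order the elements are laid out in memory, and once down columns, jumping 32 KB between consecutive accesses. The sizes and the gap are illustrative and machine-dependent, but the column-major pass is typically several times slower even though it does exactly the same arithmetic:

```c
/* Locality of reference: identical work, different access pattern.
   Needs ~128 MB of memory and a POSIX clock_gettime. */
#include <stdio.h>
#include <time.h>

#define N 4096
static double a[N][N];                       /* ~128 MB, laid out row by row */

static double seconds(struct timespec t0, struct timespec t1) {
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    struct timespec t0, t1;
    volatile double sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) sum += a[i][j];   /* row-major: sequential, cache-friendly */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("row-major:    %.2f s\n", seconds(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++) sum += a[i][j];   /* column-major: 32 KB stride, cache-hostile */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("column-major: %.2f s\n", seconds(t0, t1));
    return 0;
}
```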
Peter Denning's working set theory (1968) made this rigorous, and every CPU since has been a refinement of the idea. We will see the same locality argument resurface in Part II's kernel chapter when we look at virtual memory and page caches, in Part III's web chapter when we examine DNS caching, and in Part IV's data chapter when we examine database buffer pools. The same equation, the same intuition, all the way up the stack.
Virtual memory: the lie the kernel tells
When your program asks for memory, the kernel does not give it a real physical address. It gives a virtual address — a fiction. Your program believes it has its own private address space starting at zero, with gigabytes of memory available. So does every other program. They all think they own the machine.
The CPU's Memory Management Unit (MMU), guided by tables that the kernel maintains, translates these virtual addresses to physical RAM addresses every time a program reads or writes memory. The translation tables — called page tables — divide memory into 4-kilobyte chunks called pages and map each virtual page to either a physical page in RAM or to a location on disk (if RAM is full and the page has been swapped out).
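The address translation itself is ordinary arithmetic. The sketch below splits a virtual address into a page number and an offset, assuming the 4-kilobyte pages described above; the physical frame number is invented for illustration, standing in for what a real page-table walk would return:

```c
/* Virtual-to-physical translation, minus the real page-table lookup. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096u                      /* 4 KB pages: the low 12 bits are the offset */

int main(void) {
    uint64_t virtual_addr = 0x00007f3a1c2b5d48;          /* an example user-space address */

    uint64_t page_number = virtual_addr / PAGE_SIZE;     /* index into the page table */
    uint64_t offset      = virtual_addr % PAGE_SIZE;     /* position within the page */

    uint64_t physical_frame = 0x3c0de;                   /* invented: what the kernel's
                                                            page table might say */
    uint64_t physical_addr  = physical_frame * PAGE_SIZE + offset;

    printf("virtual  0x%llx -> page 0x%llx, offset 0x%llx\n",
           (unsigned long long)virtual_addr,
           (unsigned long long)page_number,
           (unsigned long long)offset);
    printf("physical 0x%llx (frame 0x%llx)\n",
           (unsigned long long)physical_addr,
           (unsigned long long)physical_frame);
    return 0;
}
```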
This achieves three things at once: isolation (no program can read another's memory because the translation tables for different programs map to different physical pages), flexibility (the kernel can move pages around in RAM, swap them to disk, or load them on demand from a file), and protection (the page tables also encode permissions — read, write, execute — and the MMU enforces them).
Buffer overflow attacks exploit the memory model directly. If a program writes more data than a buffer can hold, the extra bytes spill into adjacent memory. If that memory contains a return address — an address the CPU will jump to when the current function ends — an attacker who controls the input controls where the CPU jumps next. Stack canaries, ASLR (Address Space Layout Randomization), and DEP (data execution prevention) are the layered hardware and OS defenses against this. We'll see exactly how this works at the assembly level in Chapter 3.
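Here is the shape of the bug in C — a deliberately unsafe sketch, not a working exploit. The buffer holds sixteen bytes; strcpy will keep writing past it, straight toward the saved return address that sits further up the stack:

```c
/* Deliberately unsafe: never copy untrusted input with strcpy into a fixed buffer. */
#include <stdio.h>
#include <string.h>

static void greet(const char *input) {
    char buffer[16];                 /* lives on the stack, near the saved return address */
    strcpy(buffer, input);           /* no length check: a long input spills past the buffer */
    printf("hello, %s\n", buffer);
}

int main(int argc, char **argv) {
    greet(argc > 1 ? argv[1] : "world");   /* run with a 50-character argument and watch it crash */
    return 0;
}
```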
What you now understand
You have followed the chain from sand to software. Transistors — silicon switches controlled by voltage — combine into logic gates, then into integrated circuits, then into CPUs. The CPU runs an endless loop: fetch, decode, execute, writeback. It executes instructions encoded as binary patterns drawn from an instruction set architecture like x86 or ARM. Programs and data live together in the same memory, a design choice — Von Neumann's — that defines what a modern computer is. The kernel mediates between programs and hardware, enforcing isolation and security through privilege rings and virtual memory.
This is the substrate. Every chapter that follows builds on it. Chapter 2 goes deeper into a single layer of this stack — the layer where mathematics and electricity meet. We will look at why a computer must be binary, how George Boole invented the logic that runs on those binary signals, and how arithmetic — the thing computers fundamentally do — emerges from a few simple gates. By the end of Chapter 2, you will understand at a physical level why 0.1 + 0.2 does not equal 0.3 in any modern programming language.