The kernel is just a program
It is easy, after Chapter 1, to start thinking of the kernel as some abstract authority — a kind of governing law of the machine. It is not. The Bridge at the close of Part I showed the hardware contract that makes a kernel possible at all — the privilege bit, the trap, the MMU, the timer. This chapter examines what the program on the kernel side of that contract actually looks like. The kernel is a program: a set of instructions stored in memory, executed by the same CPU that runs everything else, written in a real programming language by real people. The Linux kernel, as of 2025, is roughly 30 million lines of C, with a small and growing amount of Rust. What makes it the kernel is not what it is made of, but where it sits — in Ring 0, with hardware-enforced privileges no other program has.
When you press the power button on a computer, a sequence unfolds. The CPU starts executing instructions from a fixed firmware address — historically called the BIOS, now usually UEFI — built into the motherboard. The firmware initializes hardware, finds the boot device, and loads a small program called a bootloader (GRUB on Linux, Windows Boot Manager, iBoot on Apple devices). The bootloader's only job is to find the kernel image on disk, load it into memory, and jump to its entry point. From that moment, the kernel runs forever — or until the machine shuts down.
Monolithic vs microkernel: the great schism
There are two philosophies about how to build a kernel. They have been arguing with each other since the late 1980s.
A monolithic kernel puts everything in Ring 0: process scheduling, memory management, filesystems, device drivers, network stacks, all of it. Performance is excellent because subsystems can call each other directly — no boundary crossings. The cost is fragility and security: a bug in any part of the kernel can corrupt the whole machine, and a single device driver with a vulnerability can be the entry point for total compromise. Linux is monolithic. Windows is monolithic. Most kernels in production use are monolithic.
A microkernel takes the opposite stance. It puts only the absolute minimum in Ring 0 — typically just message passing, basic memory protection, and minimal scheduling. Filesystems, drivers, and network stacks all run as separate user-space processes that communicate via messages. A bug in a driver crashes only the driver, not the kernel. The cost is performance: every interaction crosses the user/kernel boundary, and message passing adds latency. Notable microkernels: MINIX (Andrew Tanenbaum's research OS), L4 (used in many embedded systems), seL4 (formally verified — mathematically proven to have no kernel bugs of certain classes).
In 1992, Tanenbaum publicly criticized the then-new Linux on the comp.os.minix USENET group, calling its monolithic design "obsolete" and "a giant step back into the 1970s." Linus Torvalds responded — bluntly, and at length — and the exchange, which ran for weeks, became one of the most-cited debates in computing history.
Tanenbaum, the established expert, argued from architecture and theory: microkernels were the future, monolithic designs were a regression, and Linux was tied to one specific CPU. Torvalds, twenty-two and writing Linux from his bedroom, argued from running code and pragmatic constraints. Three decades later, both have been vindicated, just for different things. The cleanest microkernels (seL4, QNX, the formally verified ones) win in safety-critical embedded systems, where a single kernel bug is unacceptable. Linux runs everywhere else — server farms, phones, supercomputers, refrigerators — because "ships and works" beats "is theoretically correct" for almost every workload. Tanenbaum was correct in theory; Torvalds was correct in practice. The schism has never closed because the two stances answer different questions.
macOS sits in the middle. Its kernel, XNU, is a hybrid: a Mach microkernel core wrapped with a BSD UNIX layer that runs in the same address space — so you get message-passing primitives and monolithic-kernel performance. iOS uses the same kernel. Android uses Linux. Windows NT was originally microkernel-influenced but has drifted increasingly monolithic for performance reasons.
In a monolithic kernel, every subsystem runs in Ring 0 and shares one address space. Calls between them are direct and fast. In a microkernel, only message passing and the bare minimum live in Ring 0; filesystems, drivers, and network stacks run as ordinary user-space processes that communicate by IPC. Microkernels are safer; monolithic kernels are faster.
Kernel modules: a third path
Linux added a feature called loadable kernel modules
(LKMs) that softens the monolithic stance. Drivers and filesystems can be compiled
separately from the main kernel and loaded or unloaded at runtime with commands
like insmod and rmmod. They still run in Ring 0 — so a
buggy module can crash the kernel — but they don't have to be linked into the
main image. Most graphics drivers, filesystem drivers, and hardware support on
Linux ships as modules. lsmod on a running Linux system will list
hundreds of them.
Linux at runtime is a base kernel image plus a constellation of modules that have been linked into it dynamically. Each insmod resolves the module's symbols against the kernel's symbol table, allocates kernel memory, copies the module's code in, and calls its module_init() entry. From that moment the module runs in Ring 0 — a buggy module can panic the whole machine — but the source tree, build system, and distribution channel are independent of the main kernel. Distributions ship pre-compiled modules; vendors (NVIDIA, Broadcom) ship out-of-tree modules signed for specific kernel versions. rmmod reverses the process: call module_exit(), free the memory, remove the symbols. This is how Linux maintains both the speed of monolithic design and the operational flexibility microkernels promised.
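The skeleton of a module is small. A minimal sketch (the file name and log messages are invented; building it requires the headers for your running kernel and the usual kbuild Makefile):

```c
/* hello.c: a minimal loadable kernel module (sketch; build against your kernel's headers) */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Minimal example module");

/* Called by the kernel when insmod loads the module. */
static int __init hello_init(void)
{
    pr_info("hello: loaded into Ring 0\n");
    return 0;                  /* a non-zero return would abort the load */
}

/* Called when rmmod unloads it. */
static void __exit hello_exit(void)
{
    pr_info("hello: unloading\n");
}

module_init(hello_init);
module_exit(hello_exit);
```

Loading it with insmod and watching dmesg shows the two messages; the same printk path is used by every driver on the system.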
The pattern shows up everywhere: a clean theoretical model, modified by practical necessity. The kernel is monolithic but modular. It runs in Ring 0 but increasingly delegates risky work (drivers in user space, eBPF in a verified sandbox, see Section 6) to safer compartments. The history of operating systems is the history of these compromises.
The art of taking turns
A modern laptop runs hundreds of processes simultaneously. A server may run thousands. Your machine has, at most, a few dozen CPU cores. The arithmetic doesn't work — most processes cannot be running at any given instant. The kernel creates the illusion that they all are, by switching between them many times per second, fast enough that you cannot perceive the gaps. The component that decides which process gets the CPU at any given moment is called the scheduler, and it is one of the deepest and most-studied parts of any kernel.
Cooperative vs preemptive: who interrupts whom
The earliest multi-tasking systems were cooperative: each running program voluntarily yielded control back to the scheduler when it had nothing useful to do. Mac OS through version 9 worked this way. So did 16-bit Windows. The model is simple and lightweight, but it has a fatal flaw: a single misbehaving program that never yields freezes the entire machine. Every Mac OS 9 user remembers the experience of one frozen application taking down everything else.
Modern systems are preemptive: the kernel forcibly takes the CPU back from a running process at regular intervals, regardless of whether the process is ready to give it up. The mechanism is a hardware timer interrupt — a chip on the motherboard that sends an electrical signal to the CPU at a fixed frequency (typically 100–1000 times per second on Linux). Each interrupt forces the CPU into Ring 0, where the scheduler runs, decides whether to switch processes, and either resumes the current one or chooses a different one. The user perceives perfectly smooth multitasking because the timer fires faster than human reaction.
What "scheduling" actually has to decide
The scheduler's job sounds trivial — pick a process to run — but the choices have surprising depth. The classical algorithms each capture a different trade-off:
| Algorithm | How it works | Trade-off |
|---|---|---|
| FCFS | First-come, first-served. Run processes in arrival order until they finish. | Simple. Terrible response time — one long process blocks all the short ones (the "convoy effect"). |
| SJF | Shortest Job First. Pick the process with the shortest remaining time. | Provably optimal for average wait time. Requires knowing job length in advance — usually you don't. |
| Round Robin | Each process gets a fixed time slice ("quantum"). When it expires, move to the next. | Fair. Responsive. Doesn't optimize for total throughput. |
| Priority | Each process has a priority number. Higher priority runs first. | Lets the system favor critical work. Risk: low-priority processes can be starved indefinitely. |
| MLFQ | Multilevel Feedback Queue. Multiple priority levels; processes that use a lot of CPU drift to lower priorities, processes that wait often rise. | Approximates SJF without knowing job length. Used by Windows, classic UNIX. |
| CFS | Completely Fair Scheduler. Linux's default from 2007 until EEVDF (a refinement built on the same machinery) replaced it in kernel 6.6, 2023. Tracks how much CPU each process has used; always runs whichever has used the least. | Approximates "everyone runs at exactly 1/N speed on N processes." O(log N) per scheduling decision via a red-black tree. |
Linux's CFS deserves a closer look because it's a particularly elegant idea. The scheduler keeps every runnable process in a self-balancing binary tree (a red-black tree), keyed by virtual runtime — a measure of how much CPU time the process has consumed, weighted by its niceness. Whenever the scheduler needs to pick the next process to run, it picks the leftmost node of the tree (smallest virtual runtime). When a process runs, its virtual runtime increases; when it blocks, it's removed; when it wakes, it's reinserted with its saved value. This naturally implements approximate fairness: every process tends toward equal CPU usage, and any process behind catches up first.
CFS stores all runnable processes in a red-black tree, keyed by virtual runtime (a measure of CPU consumption). The leftmost node has the smallest value — the process most "behind" on its share of the CPU — and is the next to run. Insertion, deletion, and finding the minimum are all O(log N). This is why a Linux machine with thousands of runnable processes still schedules in microseconds.
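The accounting is easy to see in miniature. A toy sketch of the pick-the-smallest-vruntime loop (real CFS keeps tasks in a red-black tree and derives weights from nice values; this sketch uses a linear scan and invented weights):

```c
/* Toy CFS-style scheduler: always run whichever task has the smallest
 * virtual runtime. Lower weight means vruntime grows faster, so the
 * task gets the CPU less often. */
#include <stdio.h>

struct task { const char *name; double vruntime; double weight; };

int main(void)
{
    struct task tasks[] = {
        { "editor",  0.0, 1.0  },   /* interactive, full weight        */
        { "compile", 0.0, 0.5  },   /* niced: charged double           */
        { "backup",  0.0, 0.25 },   /* heavily niced: charged 4x       */
    };
    const int n = sizeof tasks / sizeof tasks[0];
    const double slice_ms = 4.0;    /* hypothetical time slice */

    for (int tick = 0; tick < 12; tick++) {
        int next = 0;               /* smallest vruntime = leftmost node in real CFS */
        for (int i = 1; i < n; i++)
            if (tasks[i].vruntime < tasks[next].vruntime)
                next = i;

        printf("tick %2d: run %-8s (vruntime %.1f)\n",
               tick, tasks[next].name, tasks[next].vruntime);

        /* Charge the task for the slice it just used, weighted by priority. */
        tasks[next].vruntime += slice_ms / tasks[next].weight;
    }
    return 0;
}
```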
The same four jobs scheduled four different ways. FCFS serves them in arrival order — the eight-unit job A blocks everyone behind it (the convoy effect). SJF reorders by length — A still runs first because it arrived alone, but D (5 units) jumps ahead of C (9 units), giving the lowest possible average wait time. Round Robin with quantum 2 interleaves all four constantly — responsive but never letting any job make a long uninterrupted run. MLFQ demotes processes that use a lot of CPU into lower-priority queues; long-running C ends up at Q3 and waits, while interactive bursts of A get repeated Q1/Q2 attention. Each algorithm is correct, for a different definition of "correct" — average wait, total throughput, responsiveness, fairness. The choice depends on workload.
The mathematics underlying scheduling is queueing theory, developed in the early 20th century by Agner Krarup Erlang for telephone networks. The single most important result is Little's Law:
L = λ · W
The average number of items in a queueing system (L) equals the average arrival rate (λ) multiplied by the average time each item spends in the system (W). It holds for any stable queue, regardless of arrival distribution or service distribution. Combine it with how queues behave near saturation (as utilization approaches 100%, waiting time grows without bound) and you get the explosive wait times of a slightly oversubscribed system. Every web server, every database, every operating system kernel obeys this. It is why the difference between a 95%-loaded and a 99%-loaded server is not a four-point difference; it is often an order of magnitude in latency.
Little's Law in two pictures. Top: a queue with arrivals (λ) flowing into a buffer of average length L; each item waits average W. The relationship L = λW is true for any stable queue, regardless of how arrivals are distributed or how long service takes. Bottom: the consequence — average wait time as utilization approaches 100% follows W ~ 1/(1−ρ), which is hyperbolic. At 50% utilization, doubling load barely budges the wait. At 95%, doubling load is catastrophic. This is why operations teams panic when a server crosses about 80% sustained utilization — they are not panicking about "the server being slow," they are panicking about the math curve they can see coming.
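The blow-up is easy to reproduce. A small sketch, assuming the simplest textbook queue (M/M/1 with a 1 ms mean service time); Little's Law itself needs no such assumption:

```c
/* The W ~ 1/(1 - rho) blow-up for an M/M/1 queue, plus Little's Law
 * (L = lambda * W) applied to the result. */
#include <stdio.h>

int main(void)
{
    const double service_ms = 1.0;                    /* mean service time  */
    const double utils[] = { 0.50, 0.80, 0.90, 0.95, 0.99 };

    for (int i = 0; i < 5; i++) {
        double rho    = utils[i];
        double W      = service_ms / (1.0 - rho);     /* mean time in system */
        double lambda = rho / service_ms;             /* arrival rate        */
        double L      = lambda * W;                   /* Little's Law        */
        printf("util %4.0f%%  ->  W = %6.1f ms,  L = %5.1f items in system\n",
               rho * 100, W, L);
    }
    return 0;
}
```

At 50% utilization the wait is 2 ms; at 99% it is 100 ms, with ninety-nine requests in flight at any moment.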
Real-time scheduling: when missing a deadline kills
General-purpose schedulers like CFS optimize for average performance. They make no guarantees about worst-case latency. For most software this is fine. For some software it is catastrophically not fine: anti-lock braking systems, pacemakers, avionics flight control, industrial robots. These systems run on real-time kernels — variants of Linux (PREEMPT_RT) or specialized OSes (VxWorks, QNX, FreeRTOS) — that guarantee a task will run within a bounded time after it becomes ready, even under load. The mathematical foundation is rate-monotonic scheduling and earliest-deadline-first scheduling, analyzed by Liu and Layland in a foundational 1973 paper.
A real-time task that runs at 100Hz must complete its work within each 10ms window. The first three deadlines are met (5ms work, 10ms window — half the budget). The fourth job runs longer than expected (12ms) and overruns its deadline. In a non-real-time system this would be a slow frame or a stutter; in an anti-lock braking controller, a pacemaker, or a flight control loop, it is a system failure with physical consequences. Real-time kernels use schedulers (rate-monotonic, earliest-deadline-first) that mathematically guarantee deadlines provided the total task utilization stays below a known bound — about 69% for rate-monotonic, 100% for EDF. Going above that bound, missed deadlines become possible, and in a hard real-time system that is the same as broken.
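The bound is easy to check for a concrete task set. A sketch with three invented periodic tasks, using the Liu and Layland utilization test (compile with -lm):

```c
/* Liu & Layland (1973): n periodic tasks are guaranteed schedulable under
 * rate-monotonic priorities if total utilization is at most n*(2^(1/n) - 1).
 * The task set below is hypothetical. */
#include <stdio.h>
#include <math.h>

struct rt_task { double wcet_ms; double period_ms; };

int main(void)
{
    struct rt_task set[] = {
        {  2.0,  10.0 },   /* 100 Hz control loop, 2 ms worst case */
        {  5.0,  40.0 },   /* 25 Hz sensor fusion                  */
        { 10.0, 100.0 },   /* 10 Hz logging                        */
    };
    int n = sizeof set / sizeof set[0];

    double U = 0.0;
    for (int i = 0; i < n; i++)
        U += set[i].wcet_ms / set[i].period_ms;       /* utilization = C/T   */

    double rm_bound = n * (pow(2.0, 1.0 / n) - 1.0);  /* ~0.780 for n = 3;
                                                         tends to ln 2 ~ 0.69 */
    printf("utilization U = %.3f\n", U);
    printf("RM bound      = %.3f -> %s under rate-monotonic\n",
           rm_bound, U <= rm_bound ? "guaranteed schedulable" : "no guarantee");
    printf("EDF bound     = 1.000 -> %s under EDF\n",
           U <= 1.0 ? "schedulable" : "overloaded");
    return 0;
}
```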
The lie of unlimited memory
Chapter 1 introduced virtual memory as the kernel's mechanism for isolating processes; the Bridge showed the MMU and TLB as the silicon that makes the mechanism enforceable. We saw the headline: every process believes it has its own private address space starting at zero. Now we look at the machinery that sustains the illusion — the page tables, the translation cache called the TLB, the page faults that quietly bring memory into existence on demand, and the mathematical structures that make 64-bit addressing tractable at all.
The translation problem
Every memory access a program performs uses a virtual address. The CPU cannot use this directly; physical RAM is addressed by physical addresses. Some hardware must, on every load and store, translate one to the other. That hardware is the Memory Management Unit (MMU), built into the CPU. The translation table it consults is the page table, maintained by the kernel.
Memory is divided into fixed-size blocks called pages, typically 4 KB. The address space is therefore divided into virtual pages, and physical RAM into physical pages (often called "page frames"). The page table maps one to the other. When a virtual page has no corresponding physical page in RAM — because it's never been used, or has been swapped to disk, or belongs to a memory-mapped file not yet loaded — the table entry is marked invalid, and the MMU raises a hardware exception called a page fault. The kernel's page-fault handler decides what to do.
Why the page table cannot be flat
A naive page table would be a single flat array — one entry for every possible virtual page. The number is enormous. On 64-bit x86, the architecture defines a 48-bit usable virtual address space (256 terabytes), giving 2³⁶ pages. A flat table with one 8-byte entry per page would need 512 GB just for the table. Per process. This is plainly impossible.
The solution is a multi-level page table — a tree. The 48-bit virtual address is split into four 9-bit fields plus a 12-bit page offset. Each 9-bit field indexes into one level of a four-level tree. Most branches are empty (the address space is sparse — your process only uses a tiny fraction of the 256 TB available), so most of the tree is never allocated. A real x86-64 process typically uses a few megabytes of page tables to map its actual memory, not 512 gigabytes.
Translating one virtual address into a physical address requires walking four levels of page tables. The CPU register CR3 points to the root (L4). Each level uses 9 bits of the address to index into a 512-entry table; the entry points to the next level. The bottom 12 bits give the byte offset within the final page. Without optimization, every memory access would cost five memory accesses.
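The walk itself is just bit-slicing. A sketch that splits an arbitrary user-space address exactly the way the MMU does:

```c
/* Split a 48-bit x86-64 virtual address into the four 9-bit page-table
 * indices and the 12-bit page offset used by a four-level walk. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t vaddr = 0x00007f8a1c2d3e4fULL;    /* an arbitrary user-space address */

    unsigned offset = vaddr & 0xfff;           /* bits 0-11: byte within the page   */
    unsigned l1 = (vaddr >> 12) & 0x1ff;       /* bits 12-20: page table index      */
    unsigned l2 = (vaddr >> 21) & 0x1ff;       /* bits 21-29: page directory index  */
    unsigned l3 = (vaddr >> 30) & 0x1ff;       /* bits 30-38: PDPT index            */
    unsigned l4 = (vaddr >> 39) & 0x1ff;       /* bits 39-47: root index (CR3 points here) */

    printf("vaddr = 0x%016llx\n", (unsigned long long)vaddr);
    printf("L4=%u  L3=%u  L2=%u  L1=%u  offset=0x%03x\n", l4, l3, l2, l1, offset);
    return 0;
}
```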
The TLB: a cache for translations
A naive page-table walk would be ruinously slow — every memory access would require five memory accesses (four for the table walk, one for the actual data). The fix is the Translation Lookaside Buffer (TLB), a small, very fast cache inside the CPU that stores recent virtual-to-physical translations. A typical x86-64 TLB has between 64 and 1500 entries. When the CPU needs to translate an address, it first checks the TLB. If the entry is there (a "TLB hit"), translation takes one cycle. If not (a "TLB miss"), the MMU walks the page tables and inserts the result into the TLB.
Hit rates on the TLB are typically above 99%. The 1% miss rate, multiplied by billions of memory accesses per second, still matters — and is why CPU designers have steadily grown TLB sizes and added second-level TLBs over the past two decades. When the kernel switches between processes, it must flush parts of the TLB (since the new process has different page tables). This is one of the hidden costs of context switching, and one reason why excessive switching hurts performance.
Page faults — the productive kind
A page fault sounds like an error. Most of them are not. There are several kinds, and the everyday ones are how the kernel implements many of its most useful features. They fall into three rough categories:
A minor page fault happens when the page is in physical memory but isn't yet mapped into this process. Example: when a program first reads a page of a file the kernel has cached. The kernel just adds an entry to the page table and returns. Cost: microseconds.
A major page fault happens when the page is not in RAM and must be fetched from disk — typically because it was swapped out, or because the program is reading a memory-mapped file for the first time. The kernel issues a disk read, suspends the process, and resumes it when the data arrives. Cost: milliseconds. Thousands of times slower than a minor fault.
An invalid page fault happens when the access truly is illegal —
writing to a read-only page, dereferencing a null pointer, executing data marked
non-executable. The kernel sends the offending process a SIGSEGV
signal, and unless the process catches it, the program dies with the famous
"segmentation fault" message.
A "page fault" is a CPU exception, but most faults are not errors — they are how the kernel implements memory management on demand. Minor faults are pure bookkeeping: the page is in RAM (perhaps already in another process's mapping or in the kernel's page cache), it just hasn't been wired into this process's page tables yet. Major faults involve disk I/O — the page must be paged in from swap, or read from a memory-mapped file for the first time — and are about a thousand times slower. Invalid faults are the real errors: the dereference of a NULL pointer, the write to a read-only page, the jump into non-executable data. Only invalid faults trigger SIGSEGV. On a typical Linux desktop, the kernel handles thousands of minor faults per second invisibly.
Two beautiful uses of page faults
Memory-mapped files. When you call mmap() on a file,
the kernel doesn't read the file. It sets up page table entries marking the
relevant virtual addresses as backed by that file, but invalid (not yet present).
The first time you actually access a page, you take a page fault, and the kernel
reads just that one page from disk. Reading a 100 GB file as if it were a
contiguous array in memory becomes trivial; only the pages you touch are loaded.
This is how databases, search engines, and many high-performance systems handle
large data — and how every shared library loads on Linux.
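A sketch of the idiom (error handling trimmed; the file name is just an example):

```c
/* Map a file and touch one byte. No read() happens up front; the kernel
 * loads pages on demand via page faults. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("server.log", O_RDONLY);     /* any existing, non-empty file */
    struct stat st;
    fstat(fd, &st);

    /* Setting up the mapping is instant, regardless of file size. */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* This access takes a page fault; the kernel reads just that one page. */
    printf("first byte: %c\n", data[0]);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```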
Copy-on-write fork. When a UNIX process calls fork()
to create a child process, the child receives a copy of the parent's entire
address space. Naively, this would require duplicating gigabytes of memory. It
doesn't. The kernel marks all of the parent's pages as read-only and shares them
with the child. As long as neither process writes to a page, they share it. Only
when one of them tries to write does a page fault occur, and the kernel makes a
private copy at that moment. Most pages are never written; most fork-and-exec
sequences (the standard way to launch a new program on UNIX) never copy any pages
at all. The cheapest way to make a copy is to lie about having made it.
After fork(), both parent and child point at the same physical pages, all marked read-only. The parent and child page tables are different — they could diverge — but no actual copying has happened. The instant either side writes to a shared page, the MMU raises a page fault. The kernel's COW handler allocates a fresh physical page, copies the contents, fixes up the page table of the writer to point at the new page (read-write), and resumes the instruction. The non-writing process never noticed. The famous fork-and-exec idiom — fork a child that immediately calls execve() to load a new program — copies zero pages, because the child throws away its address space before writing anything. This is why launching a UNIX program is essentially free.
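The idiom in code, roughly (here the child runs ls, an arbitrary choice):

```c
/* The fork-and-exec idiom: the child's copy-on-write address space is
 * discarded by exec before anything is written, so almost no pages are copied. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();              /* parent and child now share all pages, read-only */

    if (pid == 0) {
        /* Child: replace the shared address space with a new program. */
        execlp("ls", "ls", "-l", (char *)NULL);
        perror("execlp");            /* reached only if exec failed */
        return 1;
    }

    int status;
    waitpid(pid, &status, 0);        /* parent waits for the child to finish */
    printf("child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}
```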
Dirty COW (CVE-2016-5195). A bug in Linux's copy-on-write logic, present in the kernel for nine years before being discovered, allowed an unprivileged attacker to write to read-only files — including /etc/passwd — by exploiting a race condition between the page fault handler and the kernel's COW machinery. Privilege escalation followed trivially. The bug's existence and longevity is a reminder of how much of operating system security depends on subtle correctness in subsystems most users will never see.
Dirty COW exploited a race in the COW path. A thread tries to write a read-only file mapping, taking a page fault. The kernel begins COW: it allocates a copy and is about to install it as read-write in the writer's page table. Between "allocate copy" and "install copy in page table" there is a brief window. A second thread in the same process calls madvise(MADV_DONTNEED) on the same page — which legally discards the COW mapping. The first thread resumes, retries the write, but now the page table no longer has the copy installed, so the write goes through to the original mapping — the read-only file. An unprivileged process can therefore write to any file it can read. /etc/passwd is world-readable. Local privilege escalation in 60 lines of C. The bug had been in the kernel since 2007. Phil Oester reported it after seeing it abused on a production system; Linus Torvalds patched it within hours; every Linux distribution shipped fixes within days.
A tree of names on a flat array of blocks
A disk, at its lowest level, is a flat array of fixed-size blocks. A spinning hard drive presents itself to the OS as billions of 512-byte sectors numbered 0, 1, 2, … An SSD presents the same abstraction even though the physical reality underneath is radically different. The kernel sees: a sequence of bytes, addressed by index. From this raw substrate, the filesystem builds the structure you actually use — files with names, organized into directories, organized into a tree, with metadata about ownership and permissions and timestamps. None of this structure exists at the hardware level. It exists because the filesystem code pretends it does.
The inode: where a file actually lives
In UNIX-derived filesystems (which is most of them), the central data structure is the inode — short for "index node." Every file has exactly one inode, and the inode contains everything about the file except its name and its data: the file's size, its owner and group, its permission bits, its timestamps, a count of how many names link to it, and the list of disk blocks (or extents) holding its contents.
Notice what's not there: the filename. In UNIX, a filename is a property
of the directory it lives in, not the file. A directory is itself a special kind
of file whose contents are a list of (name, inode-number) pairs. To open
/home/yki/notes.txt, the kernel looks up the inode for /,
reads its directory contents to find the entry for home, follows
that to the next directory's inode, and so on, until it reaches
notes.txt's inode. Then it uses the inode's block list to read the
actual data.
This separation of name from data has elegant consequences. A single file can have multiple names — multiple directory entries pointing to the same inode. These are called hard links. The file is only deleted when its link count drops to zero. It is also why renaming a huge file is instant: only the directory entry changes. The inode and its data don't move.
A path lookup walks directories to find an inode. The inode contains metadata and a list of data blocks. Two directory entries pointing to the same inode (notes.txt and backup.txt above) are hard links — the same file with two names. Data blocks are not necessarily contiguous; finding them quickly is one of the filesystem's main jobs.
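A sketch that makes the name/inode separation visible with stat() (file names as in the figure; run it in a scratch directory):

```c
/* Two directory entries, one inode: create a hard link and compare
 * inode numbers and link counts. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    FILE *f = fopen("notes.txt", "w");
    fputs("hello\n", f);
    fclose(f);

    link("notes.txt", "backup.txt");     /* a second name for the same inode */

    struct stat a, b;
    stat("notes.txt", &a);
    stat("backup.txt", &b);

    printf("notes.txt  inode %lu, links %lu\n",
           (unsigned long)a.st_ino, (unsigned long)a.st_nlink);
    printf("backup.txt inode %lu, links %lu\n",
           (unsigned long)b.st_ino, (unsigned long)b.st_nlink);
    /* Same inode number, link count 2: one file, two names. */

    unlink("backup.txt");                /* link count drops back to 1 */
    unlink("notes.txt");                 /* drops to 0: the file is gone */
    return 0;
}
```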
The everything-is-a-file principle
UNIX took the inode concept further than just regular files. Every kernel-managed resource is exposed as a file:
- Your keyboard is /dev/input/event0.
- Your sound card is /dev/snd/pcmC0D0p.
- Random numbers come from /dev/urandom.
- The state of every running process lives under /proc/<pid>.
- Network connections are accessed through file descriptors with read and write.
- Hardware sensors expose temperature and voltage as files in /sys.
The same five system calls — open, read, write,
close, lseek — work on all of them. This radical
uniformity is one of the reasons UNIX-derived systems became so dominant: tools
written for files just work on devices, network connections, and process state
without modification. The shell pipeline you can use to count lines in a text file
(cat file | wc -l) works equally well on the output of a sensor
driver or a debugging tool, because to the kernel they are all just files.
Journaling: the database technique that saved filesystems
A filesystem that simply writes data and metadata to disk wherever convenient has
a fatal weakness: power loss. If the machine crashes between writing data and
updating its corresponding metadata (or vice versa), the filesystem becomes
inconsistent — files exist whose blocks are still listed as free, or directory
entries point at non-existent inodes. The classic UNIX response was
fsck, a program that scans the entire disk after a crash to find
and fix inconsistencies. On a multi-terabyte disk this could take hours.
The modern solution, borrowed from database theory, is the journal — a small, dedicated region of the disk where every intended modification is written first, before being applied to the main filesystem. After a crash, the kernel only has to replay the journal from where it left off, applying or discarding incomplete operations. Recovery takes seconds instead of hours. ext4 (Linux), NTFS (Windows), and HFS+ (older macOS) all journal. The technique is called write-ahead logging; we'll meet it again in Chapter 13 when we get to databases.
A journaling filesystem does every write twice. First, the kernel writes a description of the intended change to a small journal region, followed by a single-block commit record (whose write is atomic by hardware guarantee). Then it applies the change to the actual filesystem at its leisure — possibly reordered, batched, or coalesced with other writes. If the machine crashes between commit and apply, recovery is straightforward: replay every journal entry whose commit record is intact, drop the rest. The technique comes from database write-ahead logging (Gray, IBM, 1981), and is now in ext4, NTFS, HFS+, JFS, XFS, and every other serious modern filesystem. It is also why a sudden power loss on a modern machine takes seconds to recover from, not hours.
The newest generation of filesystems — copy-on-write filesystems like ZFS, btrfs, and Apple's APFS — go further. They never overwrite existing data. Every modification writes new blocks; only after the write succeeds is the metadata updated to point to them. The old version remains until garbage-collected, which gives you essentially-free snapshots, atomic operations on entire directories, and built-in checksumming to detect silent disk corruption.
A copy-on-write filesystem never modifies a block in place. Updating one byte of block 103 means: allocate a new block (103′), copy 103's contents, apply the change, then re-write the parent that pointed at 103 to point at 103′ instead — and so on, all the way up to the root. The old version of every block remains valid until explicitly freed. A snapshot is just a saved copy of the old root pointer; it costs zero bytes until something diverges. Every block can include a checksum that the filesystem verifies on read, so silent disk corruption (which happens, even on enterprise hardware) gets caught instead of propagating. The trade is write amplification: a one-byte change touches a full block at every level of the tree. ZFS, btrfs, and APFS all make this trade and consider it a bargain. WAFL (NetApp) made the trade first, in 1992, and built a multi-billion-dollar storage business on it.
How processes talk
Processes are isolated by design. The whole point of virtual memory and privilege separation is that one process cannot reach into another's memory and read or modify it. This is the foundation of operating system security. But isolation taken to its extreme produces a useless system — programs that cannot share anything cannot compose into pipelines, cannot coordinate, cannot even tell each other when work is done. So the kernel exposes a controlled set of mechanisms for processes to communicate. These are collectively called inter-process communication, or IPC.
Each IPC mechanism is a different point on the trade-off between convenience, performance, and flexibility. The set below is roughly chronological — older mechanisms at the top, newer at the bottom.
| Mechanism | How it works | Best for |
|---|---|---|
| Pipe | A unidirectional byte stream between two related processes. The shell's pipe operator creates one. | Chaining commands together. The classic UNIX pipeline. |
| Named pipe (FIFO) | A pipe with a filesystem name. Any process with permission can connect. | Letting unrelated processes communicate via a known path. |
| Signal | A small numeric notification sent to a process. SIGTERM, SIGKILL, SIGINT (Ctrl+C). | Asynchronous control: "stop," "reload config," "we're shutting down." |
| Shared memory | Two processes map the same physical pages into both their address spaces. | The fastest IPC. No copying. Used by databases and high-performance systems. |
| Message queue | A kernel-managed queue of typed messages between processes. | Structured communication; less common today. |
| Semaphore | A kernel-managed counter used to coordinate access to a shared resource. | Synchronizing without sharing data — "is it my turn yet?" |
| UNIX socket | Like a network socket, but local — kernel-mediated, with proper authentication. | Modern desktop IPC. Used by Docker, systemd, X11, Wayland. |
| Network socket | TCP or UDP connection, possibly to another machine. | Distributed systems. We'll cover these in depth in Part III. |
The pipe: UNIX's most beautiful idea
Pipes entered UNIX in 1973, proposed by Doug McIlroy and implemented by Ken Thompson. Mechanically, a pipe is a kernel-allocated ring buffer with two file descriptors attached: one process writes to one end, another reads from the other. The kernel handles flow control automatically — if the buffer is full, the writer blocks; if it is empty, the reader blocks. No shared memory. No locks. No protocol. Just a stream of bytes.
A pipe is two file descriptors attached to one kernel-allocated ring buffer. The writer's write() deposits bytes; the reader's read() consumes them. The kernel handles synchronisation: when the buffer fills, the writer blocks until space appears; when it empties, the reader blocks until bytes arrive. No userspace synchronisation is required because the synchronisation lives in the kernel's well-tested code paths. This is the simplest possible mechanism for streaming data between two processes — and is the same machinery that fifty years of UNIX shell pipelines have been built on.
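A minimal sketch of the mechanism, one pipe shared by a parent and child:

```c
/* One pipe, two processes: the parent writes, the child reads. The kernel's
 * ring buffer and blocking behaviour do all the synchronisation. */
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    pipe(fds);                       /* fds[0] = read end, fds[1] = write end */

    if (fork() == 0) {
        /* Child: read from the pipe until data arrives. */
        close(fds[1]);
        char buf[64];
        ssize_t n = read(fds[0], buf, sizeof buf - 1);
        if (n < 0) n = 0;
        buf[n] = '\0';
        printf("child read: %s\n", buf);
        return 0;
    }

    /* Parent: write into the pipe, then close so the child sees EOF. */
    close(fds[0]);
    const char *msg = "hello through the kernel's ring buffer";
    write(fds[1], msg, strlen(msg));
    close(fds[1]);
    wait(NULL);
    return 0;
}
```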
The pipe's syntactic appearance in the shell is dazzlingly simple:
```sh
# Count how many unique lines start with "ERROR" in a log:
cat server.log | grep "^ERROR" | sort | uniq | wc -l

# Five separate processes, each doing one small thing.
# The shell creates pipes between them. Output of one
# becomes input of the next, streamed byte-by-byte.
```
Each program in the pipeline does one small task, knows nothing about the others, and reads its input from standard input and writes to standard output as if they were ordinary files. The kernel arranges those file descriptors to be the ends of pipes. Five processes execute in parallel; the output of one streams into the next as fast as either can handle it. The model is so productive that it has been imitated in essentially every shell since. It is one reason why UNIX produced a culture of small composable tools rather than monolithic applications.
Five processes, each doing one small task, with pipes between every pair. cat reads the file. grep filters to lines starting with ERROR. sort orders them. uniq drops adjacent duplicates. wc -l counts what's left. None of these programs was written knowing about the others; each just reads from stdin and writes to stdout. The shell connects them with pipes, and the kernel runs all five in parallel — bytes stream through the pipeline as fast as the slowest stage allows. This is the canonical illustration of the UNIX philosophy: composition through tiny, sharp tools and a universal interface (text bytes), so any new tool you write joins the pipeline for free.
"This is the Unix philosophy: write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface."
— Doug McIlroy, inventor of the UNIX pipe
Signals: a doorbell to a process
Pipes carry data. Signals carry no data — only the
bare fact that something happened. A signal is a small, numbered notification
delivered asynchronously to a process: number 2 (SIGINT) is what the
kernel sends when you press Ctrl+C; number 9 (SIGKILL) is the
uncatchable "stop now"; number 15 (SIGTERM) is the polite "please
stop." A process can register a handler function for most signals; when the signal
arrives, the kernel hijacks the running process, runs the handler, then resumes.
Three things can happen when a signal arrives: the process handles it, ignores it, or suffers the default action. For SIGKILL the default is termination and for SIGSTOP it is suspension, and for those two the default cannot be overridden at all.
A signal arrives at an arbitrary moment in the target process's execution. The kernel queues it, then — at the next kernel-to-userspace transition (a syscall return, a page-fault return, etc.) — diverts the process to its registered handler. The handler runs as if it were a function call inserted at that exact instruction; when it returns, the kernel restores the original CPU state and the process resumes. Some signals are uncatchable: SIGKILL (number 9) terminates the process unconditionally; the kernel does not even consult the process. SIGSTOP suspends it. Both exist because a privileged user must always be able to stop a misbehaving process — even one that has registered handlers for everything else. Signals are crude: a single integer per notification, no data, no queueing of duplicates of the same signal. They predate every other Unix IPC mechanism and remain the standard way to control the lifecycle of a process.
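A sketch of registering a handler with sigaction() (SIGINT is catchable; the same code could not catch SIGKILL):

```c
/* Catch SIGINT (Ctrl+C) with a handler and exit cleanly. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigint = 0;

/* Runs asynchronously, inserted into the process at its next return to userspace. */
static void on_sigint(int signo)
{
    (void)signo;
    got_sigint = 1;              /* only async-signal-safe work belongs here */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigint;
    sigaction(SIGINT, &sa, NULL);

    printf("press Ctrl+C to deliver SIGINT...\n");
    while (!got_sigint)
        pause();                 /* sleep until any signal arrives */

    printf("caught SIGINT, exiting cleanly\n");
    return 0;
}
```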
Shared memory and semaphores: the no-copy IPC
Pipes and signals are convenient. They are not fast. Every byte through a pipe is copied: from the writer's userspace into the kernel buffer, then out of the kernel buffer into the reader's userspace. For high-throughput communication — gigabytes per second between two processes on the same machine — shared memory is the only viable mechanism. Two processes call into the kernel to map the same physical pages into both their virtual address spaces. After that, reading and writing the shared region is no slower than reading and writing any other memory. There is no copy. There is no kernel involvement on every access.
The cost of zero-copy is that you now have a classic concurrency problem: two processes touching the same memory must coordinate so that neither sees the other mid-update. The kernel exposes semaphores for this — counters that processes increment and decrement atomically, used as gates. A semaphore initialised to 1 acts as a mutex (one process at a time may pass). Initialised to N, it acts as a resource pool. The mathematics is Edsger Dijkstra's from 1965, formalised in his note "Cooperating Sequential Processes," and underlies every modern concurrent system.
Two processes share a physical page by mapping it at (potentially different) virtual addresses. Reads and writes against that region are ordinary CPU loads and stores — no kernel involvement, no copying, no system calls in the fast path. The cost is coordination: if both processes write at the same time, the result is undefined. A semaphore, initialised to 1, lets exactly one of them through at a time. sem_wait() atomically decrements; if the value would go negative, the caller blocks until sem_post() elsewhere releases the gate. This is shared memory + semaphore: the IPC of choice for performance-sensitive systems where the cost of even one extra copy per byte is unacceptable. Database engines, video pipelines, audio mixers, and high-frequency trading systems are all built on it.
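A sketch of both pieces together: an anonymous shared mapping set up before fork(), guarded by a process-shared semaphore (compile with -pthread; error handling trimmed):

```c
/* A shared counter protected by a process-shared semaphore. The page is
 * mapped MAP_SHARED before fork(), so parent and child see the same memory. */
#include <semaphore.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

struct shared { sem_t lock; long counter; };

int main(void)
{
    /* One region of memory shared between parent and child: no copying, ever. */
    struct shared *s = mmap(NULL, sizeof *s, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    sem_init(&s->lock, 1, 1);        /* pshared=1: usable across processes */
    s->counter = 0;

    pid_t pid = fork();
    for (int i = 0; i < 100000; i++) {
        sem_wait(&s->lock);          /* one process in the critical section at a time */
        s->counter++;
        sem_post(&s->lock);
    }

    if (pid == 0) return 0;          /* child done */
    wait(NULL);
    printf("counter = %ld (expected 200000)\n", s->counter);
    return 0;
}
```

Without the semaphore, the two increment loops would interleave and the final count would fall short; with it, the result is exact.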
Sockets: when the other process is on another machine
A socket is a generalization of the pipe. Instead
of communicating with another process on the same machine through a kernel-managed
pipe, you communicate over a connection — either to another local process (a
"UNIX domain socket") or to a process on a different machine entirely (a "network
socket," using TCP or UDP). The same system calls — read,
write, close — apply. From the application's
perspective, the network is just another file. The kernel, the network drivers,
the protocol stack, and the wire are all hidden beneath a four-call API.
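A sketch of the client side of a UNIX domain socket (the socket path is hypothetical, and a server must already be listening there):

```c
/* A UNIX-domain socket client: after connect(), the socket is just
 * another file descriptor. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    struct sockaddr_un addr;
    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, "/tmp/example.sock", sizeof addr.sun_path - 1);

    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        perror("connect");
        return 1;
    }

    /* From here on it is write/read/close, the same as any file. */
    write(fd, "ping\n", 5);
    char buf[128];
    ssize_t n = read(fd, buf, sizeof buf);
    if (n > 0) fwrite(buf, 1, n, stdout);
    close(fd);
    return 0;
}
```

Swapping AF_UNIX for AF_INET and the path for an IP address and port turns the same shape of code into a network client.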
This abstraction is one of the most important pieces of glue in modern computing. It is why a web browser, a database client, a video conferencing app, and an SSH session all use roughly the same code shape: open a socket, write a request, read a reply, close. We will spend Part III opening up what happens between write and read on a network socket — a journey through ARPANET, packet switching, TCP, DNS, TLS, HTTP, and the rest of the internet stack. For now: the kernel makes it look like a file. That, more than anything else, is the kernel's job.
When the kernel breaks
Everything we've built in this chapter rests on one assumption: the kernel is correct. The privilege boundary, the address-space isolation, the file permission checks, the IPC mediation — all of it depends on the kernel actually doing what it says it does. A bug in the kernel is therefore not an ordinary bug. It is potentially the end of every security guarantee on the machine. A user-space program with a buffer overflow can be exploited to take over that program. A kernel with the same kind of bug can be exploited to take over everything running on the entire computer, including the kernel itself, every other process, every file, and every keystroke. This is what kernel security is about.
Privilege escalation, more carefully
A privilege escalation attack is one where a process gains more privilege than it was supposed to have. The two important versions:
Vertical escalation — moving up the privilege ladder. A normal user process exploits a bug to become root, or root exploits a bug to enter Ring 0 (kernel mode). Once in Ring 0, an attacker controls the machine without restriction.
Horizontal escalation — moving sideways at the same privilege level into another process's data. Reading another user's files, hijacking another user's session.
Memorable kernel-level privilege escalation bugs in the last decade include Dirty COW (mentioned earlier — copy-on-write race condition, 2016), DirtyPipe (a flaw in pipe-buffer initialization that allowed writes to read-only files, 2022), and dozens of bugs in device drivers — driver code is vast, often less audited than the core kernel, and runs in Ring 0. Most modern kernel exploits chain together multiple bugs: an information leak (to defeat KASLR, the kernel's version of ASLR), then a memory-corruption primitive (to overwrite something useful), then a privilege escalation.
Containers: not virtual machines
You've probably heard of Docker, Kubernetes, and "containers." They are often described as "lightweight VMs," which is misleading. A virtual machine emulates an entire computer — its own kernel, its own everything — running on top of a hypervisor. Containers do not. A container is a normal Linux process whose view of the system has been restricted by the kernel. There is one shared kernel. The container is just a process the kernel has lied to.
The lying is done with two Linux features:
Namespaces partition kernel resources so that different processes see different views. There are namespaces for process IDs (a containerized process believes it is PID 1, even though to the host it might be PID 8472), for the network (the container has its own virtual network interface), for mounts (its own root filesystem), for user IDs, for hostnames, and more. Inside its namespaces, the container looks like a complete isolated system.
Cgroups (control groups) limit how much of each resource a process group can use — CPU time, memory, disk I/O, network bandwidth. Together with namespaces, this gives Docker its model: cheap, fast, lightweight isolation that doesn't require a hypervisor.
A virtual machine runs a complete guest kernel on emulated hardware — strong isolation, but heavy. A container is just a process the host kernel has restricted with namespaces (what it can see) and cgroups (what it can use). One shared kernel, much less overhead — but a kernel bug in that shared kernel can break out of every container at once.
The trade-off matters: VMs isolate kernels, so a compromise of one guest doesn't affect others. Containers share a kernel, so a kernel privilege-escalation bug can let a malicious container take over the host and every sibling container. Cloud providers running multi-tenant workloads typically use VMs (or lightweight micro-VMs such as AWS Firecracker) for this reason; single-tenant deployments use plain containers because they're far cheaper.
Namespaces are how the kernel lies to a process about the world. There are eight kinds — PID, network, mount, IPC, UTS (hostname), user, cgroup, and (since Linux 5.6) time — and each can be independently configured per process. Container A's ps shows only the processes inside its PID namespace; its ifconfig shows only its virtual interfaces; its / is the rootfs of its container image. Container B sees a completely different version of all of those, even though both processes are running under the same kernel, on the same machine, accessing the same physical RAM. From the host's perspective, both containers' PIDs are just numbers in the global PID table; namespaces are translation layers between the global state and what each container sees.
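The smallest possible demonstration is a single namespace. A sketch that gives a child its own UTS (hostname) namespace with clone(); it needs root or CAP_SYS_ADMIN, and it is only the first ingredient of what Docker assembles:

```c
/* Give a child its own UTS namespace: changing the hostname inside the
 * child does not affect the host. */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/utsname.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1024 * 1024];

static int child(void *arg)
{
    (void)arg;
    sethostname("container", 9);          /* visible only inside this namespace */
    struct utsname u;
    uname(&u);
    printf("inside  namespace: hostname = %s\n", u.nodename);
    return 0;
}

int main(void)
{
    pid_t pid = clone(child, child_stack + sizeof child_stack,
                      CLONE_NEWUTS | SIGCHLD, NULL);
    if (pid < 0) { perror("clone (are you root?)"); return 1; }
    waitpid(pid, NULL, 0);

    struct utsname u;
    uname(&u);
    printf("outside namespace: hostname = %s\n", u.nodename);
    return 0;
}
```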
If namespaces decide what a process can see, cgroups decide what it can use. Each cgroup is a node in a hierarchy, with explicit limits on every resource the kernel can meter — CPU shares, memory bytes, block-device IOPS, network bandwidth, even the number of file descriptors. The kernel enforces these on every syscall and every scheduling decision: a process in the "batch" cgroup that tries to exceed its memory limit gets killed by the OOM killer; a process exceeding its CPU share gets descheduled until its budget refills. Cgroups are how Kubernetes packs ten workloads onto one machine without any of them noticing the others, how Docker's --cpus=2 flag works, and how systemd ensures a misbehaving service does not bring down the rest of the host.
eBPF: safe code inside the kernel
A more recent and remarkable Linux feature is eBPF — "extended Berkeley Packet Filter." It allows user programs to upload small pieces of code into the running kernel, where they are attached to specific events (network packets, system calls, function entries) and run inside Ring 0 with kernel-level performance.
The obvious worry is that this lets unprivileged users execute code in the kernel — historically the worst possible security outcome. eBPF makes it safe through a verifier: a static analyzer in the kernel that examines every uploaded program and refuses to load it unless it can prove the program will always terminate, never read out-of-bounds memory, and never crash. Programs that pass the verifier are then JIT-compiled to native machine code and run essentially as fast as compiled kernel code.
eBPF flips a forty-year-old assumption: that any code running in the kernel must be hand-vetted, audited, signed, and trusted. Instead, the kernel ships a verifier — a static analyser that mathematically proves an uploaded program is safe before letting it run. The proof obligations are concrete: every memory access must be in-bounds (so the program cannot read random kernel memory); every loop must be provably bounded (so the program cannot hang the kernel); the program must terminate within a fixed instruction budget. Programs that pass are JIT-compiled to native instructions and attached to hooks — system call entry, packet receive, function entry, scheduler tick — where they run in Ring 0 with no syscall overhead. Cilium uses eBPF for high-performance Kubernetes networking; bpftrace exposes a tracing language built on it; modern Linux observability is increasingly eBPF underneath. The pattern — sandboxed verified execution inside a privileged context — shows up again in Chapter 12 (browser JavaScript) and Chapter 14 (TLS), and is one of the most important architectural ideas of the last twenty years.
eBPF has become the foundation of modern Linux observability and networking: tools like Cilium (high-performance network policy), bpftrace (kernel tracing), and Falco (runtime security monitoring) all build on it. The general pattern is important — providing a sandboxed, formally checked execution environment inside a privileged context — and we will see it again when we discuss browser security models in Chapter 12 and TLS in Chapter 14.
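For a feel of what actually gets loaded, here is a sketch of a kernel-side eBPF program in the libbpf style: it attaches to the execve tracepoint and logs every program launch. Building it requires clang with the BPF target plus a user-space loader (libbpf, bpftool, or similar), and only code the verifier accepts ever attaches.

```c
/* Minimal eBPF program (kernel side, libbpf conventions): log each execve. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("tracepoint/syscalls/sys_enter_execve")
int log_execve(void *ctx)
{
    /* By the time this runs in Ring 0, the verifier has already proven it
     * terminates and touches only memory it is allowed to touch. */
    bpf_printk("execve called");
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```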
The recurring pattern. Every layer in this chapter — the kernel itself, the privilege boundary, virtual memory, filesystem permissions, container isolation, eBPF verification — is a mechanism that restricts what less-trusted code can do. Operating system security, fundamentally, is the discipline of building such mechanisms and then living with the fact that any one of them can have a bug. The kernel is not a fortress. It is a series of carefully designed walls, each one with a guard at every gate, and the guards themselves have to be checked by other guards. There is no bottom — only better and worse layers.
What you now understand
The kernel is a program — written in C, loaded by a bootloader, run forever in Ring 0 — that owns the hardware and mediates everything every other program does. It comes in two main shapes: monolithic (everything in Ring 0, fast, fragile) and microkernel (minimal Ring 0, message-passing, safer and slower), with most real systems sitting somewhere on that spectrum. Its scheduler, anchored mathematically in queueing theory, decides which process gets the CPU at each microsecond — Linux's CFS does this in O(log N) using a red-black tree of virtual runtimes. Its virtual memory subsystem maintains a four-level page table per process, accelerated by a translation cache (TLB), and uses page faults productively to implement memory-mapped files and copy-on-write fork. Its filesystem turns a flat array of disk blocks into a tree of inodes and names, made crash-safe by journaling or copy-on-write. Its IPC mechanisms — pipes, signals, shared memory, sockets — are the controlled boundary across which isolated processes can still cooperate. And its security depends entirely on its own correctness — which is why kernel bugs are so dangerous, and why each new mechanism (containers, eBPF) is layered behind further verification.
With this, the kernel is in view from both sides. The Bridge at the close of Part I showed it as silicon — privilege rings as a CPU bit, the trap as a wire-level mechanism, the MMU as a hardware unit, the timer as the single piece of hardware that makes preemption possible. This chapter showed it as code. Same object, two genuinely different perspectives, and the kernel becomes legible only when both are visible at once: thirty million lines of C running in Ring 0, sustained by a hardware contract that lets it own the machine.
The next chapter climbs out of the kernel and into the language it was written in. C — Ritchie at Bell Labs, 1972 — is the language every kernel in production today still uses. We'll see why pointers are the same idea as the addresses we just walked through, and why they are simultaneously the source of C's power and the cause of every memory bug we discussed in Chapter 3. From C we'll move to C++, then to the radically different choice that is Python. Each language is a particular set of decisions about which kinds of mistakes to make easy and which to make hard.