Chapter 4: B+ Tree & Storage Engine
Metadata Card
Dimension Value Difficulty (Intermediate) Prerequisites Chapter 2 (Relational Algebra), basic data structures (binary tree) Keywords B+ Tree, B-Tree, page, WAL, buffer pool, page cache, disk I/O, MySQL, InnoDB Code Language Python (simulation) / C (conceptual structure)
Your Progress
"You've learned SQL — how to talk to the warehouse. But how does the warehouse actually store your things? The B+ Tree is the skeleton of the filing cabinet system. The WAL (Write-Ahead Log) is the incoming goods register. The Buffer Pool is your workbench."
Why B+ Tree? Why Not Binary Search Tree?
Database storage must handle:
- Massive data — billions of rows, can't fit in memory
- Disk is slow — random I/O is ~10ms, sequential is ~100x faster
- Data grows — structure must support efficient inserts
A binary search tree fails: too deep (log₂(n) levels), each lookup may hit many disk pages.
B+ Tree solves this with high branching factor: each node (page) can hold hundreds of keys. A 4-level B+ Tree can index billions of rows.
B+ Tree Structure
[50, 100, 150] ← Internal node (index)
/ | | \
[10,30] [60,80] [120,140] [160,180] ← Internal nodes
| | | | | | | |
[1,5,8,...] [31,35,...] [...] [...] [...] ← Leaf nodes (data + linked list)Key difference from B-Tree:
- B-Tree: data stored in all nodes (internal and leaf)
- B+ Tree: data only in leaf nodes; internal nodes are pure indexes
- Leaf nodes are linked in a sorted linked list for efficient range scans
Page Structure
The database reads/writes data in pages (typically 16KB in InnoDB). A page contains:
- Page header (type, checksum, LSN)
- Record pointers (sorted slot array)
- Free space
- Records (packed in the "heap" area)
WAL — Write-Ahead Log
Before modifying a page, the database writes the change to the WAL. WAL is always written to disk first. This ensures durability — if the DB crashes, it can replay the WAL to recover.
Steal/No-Force:
- Steal: An uncommitted transaction's dirty pages may be flushed to disk (allows buffer pool to be smaller)
- No-Force: A committed transaction's dirty pages may remain in buffer pool (avoids forcing writes at commit)
Buffer Pool
An in-memory cache of database pages. When reading, check buffer pool first. When writing, modify buffer pool and mark page dirty; background threads flush dirty pages.
Replacement policies: LRU (Least Recently Used), Clock, LIRS.
Page Checksums
Each page has a checksum to detect hardware-level corruption (bit flips, partial writes). InnoDB uses CRC32 or other algorithms (configurable).
Traveler's Notes
The B+ Tree is the most elegant data structure in database engineering. It turns the fundamental problem of "disk is slow" into a solved problem by keeping the tree wide and shallow. Combined with the buffer pool and WAL, it gives you the performance of in-memory access with the durability of persistent storage.