Skip to content

Chapter 4: B+ Tree & Storage Engine


Metadata Card

DimensionValue
Difficulty(Intermediate)
PrerequisitesChapter 2 (Relational Algebra), basic data structures (binary tree)
KeywordsB+ Tree, B-Tree, page, WAL, buffer pool, page cache, disk I/O, MySQL, InnoDB
Code LanguagePython (simulation) / C (conceptual structure)

Your Progress

"You've learned SQL — how to talk to the warehouse. But how does the warehouse actually store your things? The B+ Tree is the skeleton of the filing cabinet system. The WAL (Write-Ahead Log) is the incoming goods register. The Buffer Pool is your workbench."

Why B+ Tree? Why Not Binary Search Tree?

Database storage must handle:

  1. Massive data — billions of rows, can't fit in memory
  2. Disk is slow — random I/O is ~10ms, sequential is ~100x faster
  3. Data grows — structure must support efficient inserts

A binary search tree fails: too deep (log₂(n) levels), each lookup may hit many disk pages.

B+ Tree solves this with high branching factor: each node (page) can hold hundreds of keys. A 4-level B+ Tree can index billions of rows.

B+ Tree Structure

              [50, 100, 150]                    ← Internal node (index)
             /    |    |     \
     [10,30]  [60,80] [120,140] [160,180]      ← Internal nodes
       |   |    |   |    |   |     |    |
     [1,5,8,...] [31,35,...] [...] [...] [...]  ← Leaf nodes (data + linked list)

Key difference from B-Tree:

  • B-Tree: data stored in all nodes (internal and leaf)
  • B+ Tree: data only in leaf nodes; internal nodes are pure indexes
  • Leaf nodes are linked in a sorted linked list for efficient range scans

Page Structure

The database reads/writes data in pages (typically 16KB in InnoDB). A page contains:

  • Page header (type, checksum, LSN)
  • Record pointers (sorted slot array)
  • Free space
  • Records (packed in the "heap" area)

WAL — Write-Ahead Log

Before modifying a page, the database writes the change to the WAL. WAL is always written to disk first. This ensures durability — if the DB crashes, it can replay the WAL to recover.

Steal/No-Force:

  • Steal: An uncommitted transaction's dirty pages may be flushed to disk (allows buffer pool to be smaller)
  • No-Force: A committed transaction's dirty pages may remain in buffer pool (avoids forcing writes at commit)

Buffer Pool

An in-memory cache of database pages. When reading, check buffer pool first. When writing, modify buffer pool and mark page dirty; background threads flush dirty pages.

Replacement policies: LRU (Least Recently Used), Clock, LIRS.

Page Checksums

Each page has a checksum to detect hardware-level corruption (bit flips, partial writes). InnoDB uses CRC32 or other algorithms (configurable).


Traveler's Notes

The B+ Tree is the most elegant data structure in database engineering. It turns the fundamental problem of "disk is slow" into a solved problem by keeping the tree wide and shallow. Combined with the buffer pool and WAL, it gives you the performance of in-memory access with the durability of persistent storage.

Built with VitePress | Software Systems Atlas