Chapter 5: LSM-Tree Deep Dive
Metadata Card
Dimension Value Difficulty (Intermediate) Prerequisites Chapter 4 (B+ Tree), basic understanding of disk I/O Keywords LSM-Tree, SSTable, MemTable, compaction, write amplification, read amplification, LevelDB, RocksDB, HBase, Cassandra Code Language Python (simulation)
Your Progress
"Another workshop — continuously merging and rearranging files. LSM-Tree's compaction is like an assembly line in a workshop. While the B+ Tree optimizes for reads, the LSM-Tree optimizes for writes."
LSM-Tree: Log-Structured Merge Tree
Two main data structures:
- MemTable: In-memory sorted tree (red-black tree / skip list). All writes go here first
- SSTable (Sorted String Table): Immutable, sorted, on-disk files
Write path: Write to WAL → Write to MemTable → MemTable full → Flush to SSTable (L0) → Background compaction merges SSTables across levels
Read path: Check MemTable → Check L0 SSTables (newest first) → Check L1, L2, ... SSTables
Write Amplification & Read Amplification
| Metric | B+ Tree (InnoDB) | LSM-Tree (LevelDB) |
|---|---|---|
| Write amplification | Low (~1-2x) | Medium-High (10-40x, depends on compaction) |
| Read amplification | Minimal (B+ depth) | Higher (check multiple SSTables) |
| Read vs write performance | Read-optimized | Write-optimized |
| Random write speed | Moderate | Very high |
Compaction Strategies
Size-Tiered Compaction (Cassandra): When N SSTables of similar size exist, compact them into one larger SSTable.
Leveled Compaction (LevelDB, RocksDB): Each level has a size limit. L0 may have overlapping ranges; L1+ have non-overlapping, sorted runs. Compaction merges SSTables from Ln into Ln+1.
Bloom Filter
A probabilistic data structure used to reduce read amplification. Before checking an SSTable, check the Bloom Filter: "Could key X be in this SSTable?" False positives are possible, false negatives are not. If the filter says "no", we skip the SSTable entirely.
Deletion and Updates
LSM-Tree doesn't overwrite data in place. Instead:
- Update: Insert a new key-value pair with a newer timestamp
- Delete: Insert a "tombstone" marker
During compaction, older entries are discarded when tombstones or newer entries are encountered.
Traveler's Notes
LSM-Tree represents a fundamentally different design philosophy from B+ Tree. Where B+ Tree optimizes for read-heavy workloads, LSM-Tree sacrifices some read efficiency to achieve dramatically better write performance. This trade-off makes LSM-Tree the engine of choice for modern distributed databases and key-value stores handling write-intensive workloads.