Skip to content

Chapter 5: LSM-Tree Deep Dive


Metadata Card

DimensionValue
Difficulty(Intermediate)
PrerequisitesChapter 4 (B+ Tree), basic understanding of disk I/O
KeywordsLSM-Tree, SSTable, MemTable, compaction, write amplification, read amplification, LevelDB, RocksDB, HBase, Cassandra
Code LanguagePython (simulation)

Your Progress

"Another workshop — continuously merging and rearranging files. LSM-Tree's compaction is like an assembly line in a workshop. While the B+ Tree optimizes for reads, the LSM-Tree optimizes for writes."

LSM-Tree: Log-Structured Merge Tree

Two main data structures:

  1. MemTable: In-memory sorted tree (red-black tree / skip list). All writes go here first
  2. SSTable (Sorted String Table): Immutable, sorted, on-disk files

Write path: Write to WAL → Write to MemTable → MemTable full → Flush to SSTable (L0) → Background compaction merges SSTables across levels

Read path: Check MemTable → Check L0 SSTables (newest first) → Check L1, L2, ... SSTables

Write Amplification & Read Amplification

MetricB+ Tree (InnoDB)LSM-Tree (LevelDB)
Write amplificationLow (~1-2x)Medium-High (10-40x, depends on compaction)
Read amplificationMinimal (B+ depth)Higher (check multiple SSTables)
Read vs write performanceRead-optimizedWrite-optimized
Random write speedModerateVery high

Compaction Strategies

Size-Tiered Compaction (Cassandra): When N SSTables of similar size exist, compact them into one larger SSTable.

Leveled Compaction (LevelDB, RocksDB): Each level has a size limit. L0 may have overlapping ranges; L1+ have non-overlapping, sorted runs. Compaction merges SSTables from Ln into Ln+1.

Bloom Filter

A probabilistic data structure used to reduce read amplification. Before checking an SSTable, check the Bloom Filter: "Could key X be in this SSTable?" False positives are possible, false negatives are not. If the filter says "no", we skip the SSTable entirely.

Deletion and Updates

LSM-Tree doesn't overwrite data in place. Instead:

  • Update: Insert a new key-value pair with a newer timestamp
  • Delete: Insert a "tombstone" marker

During compaction, older entries are discarded when tombstones or newer entries are encountered.


Traveler's Notes

LSM-Tree represents a fundamentally different design philosophy from B+ Tree. Where B+ Tree optimizes for read-heavy workloads, LSM-Tree sacrifices some read efficiency to achieve dramatically better write performance. This trade-off makes LSM-Tree the engine of choice for modern distributed databases and key-value stores handling write-intensive workloads.

Built with VitePress | Software Systems Atlas