Skip to content

Chapter 7: Hash Index & External Sorting


Metadata Card

DimensionValue
Difficulty(Intermediate)
PrerequisitesBasic data structure knowledge (hash table, sorting algorithms)
KeywordsHash index, hash table, static hashing, extendible hashing, linear hashing, external sort, merge sort, two-phase sort
Code LanguagePython

Your Progress

"Two most commonly used tools in the warehouse: the quick lookup counter (hash index) and the goods sorting room (external sorting)."

Hash Index

A hash index maps keys to specific bucket locations using a hash function. Excellent for point lookups (exact key match). Not suitable for range queries.

Static hashing: Fixed number of buckets. Simple but doesn't handle growth — overflow chains degrade performance.

Extendible hashing: Dynamic bucket growth using a directory of pointers. When a bucket overflows, split it and double the directory. Only the overflowing bucket is rehashed.

Linear hashing: Gradual bucket growth. A split pointer cycles through buckets, splitting one at a time as the table grows. No directory needed — just a round-robin splitting scheme.

Hash Index vs B+ Tree

FeatureHash IndexB+ Tree
Point lookupO(1)O(log n)
Range scanNot supportedVery efficient (sorted leaf chain)
Prefix matchingNot supportedSupported
Sort orderAnyPreserves sort order
Use caseKey-value stores, de-duplicationGeneral-purpose indexing

External Sorting

When data is too large to fit in memory, you need external sorting. The standard approach is external merge sort (two-phase):

Phase 1 — Sorting runs: Read chunks that fit in memory, sort each chunk, write as a sorted "run" to disk.

Phase 2 — Merging runs: Use a k-way merge to combine sorted runs into one sorted output.

Two-Phase Multi-Way Merge Sort

[Data on disk: unsorted pages]
  ↓ Read M pages at a time into memory
[In-memory sort → write sorted runs to disk]

[Sorted Run 1] [Sorted Run 2] [Sorted Run 3] ... [Sorted Run N]

[k-way merge using B buffer pages]

[Fully sorted data]

Replacement Selection

A variant for Phase 1 that produces runs of expected size 2M (twice the memory available). Uses a min-heap: when a key is output, the next input key is read. If it's >= the last output key, it enters the heap; otherwise, it goes to a "dead" queue for the next run.

Optimization

  • Double buffering: Pre-fetch next run during merge to overlap I/O and computation
  • Compression: Compress runs on disk to reduce I/O volume
  • Using Multiple Disks: Distribute runs across disks for parallel I/O

Traveler's Notes

Hash indexes give you the fastest possible point lookups, but at the cost of range queries and sort order. External sorting is the unsung hero of database systems — every time you run ORDER BY on a large dataset, the database is using it. Understanding these two tools is essential for both database design and general systems programming.

Built with VitePress | Software Systems Atlas