Chapter 7: Hash Index & External Sorting
Metadata Card
Dimension Value Difficulty (Intermediate) Prerequisites Basic data structure knowledge (hash table, sorting algorithms) Keywords Hash index, hash table, static hashing, extendible hashing, linear hashing, external sort, merge sort, two-phase sort Code Language Python
Your Progress
"Two most commonly used tools in the warehouse: the quick lookup counter (hash index) and the goods sorting room (external sorting)."
Hash Index
A hash index maps keys to specific bucket locations using a hash function. Excellent for point lookups (exact key match). Not suitable for range queries.
Static hashing: Fixed number of buckets. Simple but doesn't handle growth — overflow chains degrade performance.
Extendible hashing: Dynamic bucket growth using a directory of pointers. When a bucket overflows, split it and double the directory. Only the overflowing bucket is rehashed.
Linear hashing: Gradual bucket growth. A split pointer cycles through buckets, splitting one at a time as the table grows. No directory needed — just a round-robin splitting scheme.
Hash Index vs B+ Tree
| Feature | Hash Index | B+ Tree |
|---|---|---|
| Point lookup | O(1) | O(log n) |
| Range scan | Not supported | Very efficient (sorted leaf chain) |
| Prefix matching | Not supported | Supported |
| Sort order | Any | Preserves sort order |
| Use case | Key-value stores, de-duplication | General-purpose indexing |
External Sorting
When data is too large to fit in memory, you need external sorting. The standard approach is external merge sort (two-phase):
Phase 1 — Sorting runs: Read chunks that fit in memory, sort each chunk, write as a sorted "run" to disk.
Phase 2 — Merging runs: Use a k-way merge to combine sorted runs into one sorted output.
Two-Phase Multi-Way Merge Sort
[Data on disk: unsorted pages]
↓ Read M pages at a time into memory
[In-memory sort → write sorted runs to disk]
↓
[Sorted Run 1] [Sorted Run 2] [Sorted Run 3] ... [Sorted Run N]
↓
[k-way merge using B buffer pages]
↓
[Fully sorted data]Replacement Selection
A variant for Phase 1 that produces runs of expected size 2M (twice the memory available). Uses a min-heap: when a key is output, the next input key is read. If it's >= the last output key, it enters the heap; otherwise, it goes to a "dead" queue for the next run.
Optimization
- Double buffering: Pre-fetch next run during merge to overlap I/O and computation
- Compression: Compress runs on disk to reduce I/O volume
- Using Multiple Disks: Distribute runs across disks for parallel I/O
Traveler's Notes
Hash indexes give you the fastest possible point lookups, but at the cost of range queries and sort order. External sorting is the unsung hero of database systems — every time you run ORDER BY on a large dataset, the database is using it. Understanding these two tools is essential for both database design and general systems programming.