Chapter 7: Hash Index & External Sorting

Metadata Card
Dimension Value
Difficulty (Intermediate)
Prerequisites Basic data structure knowledge (hash table, sorting algorithms)
Keywords Hash index, hash table, static hashing, extendible hashing, linear hashing, external sort, merge sort, two-phase sort
Code Language Python

Dimension	Value
Difficulty	(Intermediate)
Prerequisites	Basic data structure knowledge (hash table, sorting algorithms)
Keywords	Hash index, hash table, static hashing, extendible hashing, linear hashing, external sort, merge sort, two-phase sort
Code Language	Python

Your Progress

"Two most commonly used tools in the warehouse: the quick lookup counter (hash index) and the goods sorting room (external sorting)."

Hash Index

A hash index maps keys to specific bucket locations using a hash function. Excellent for point lookups (exact key match). Not suitable for range queries.

Static hashing: Fixed number of buckets. Simple but doesn't handle growth — overflow chains degrade performance.

Extendible hashing: Dynamic bucket growth using a directory of pointers. When a bucket overflows, split it and double the directory. Only the overflowing bucket is rehashed.

Linear hashing: Gradual bucket growth. A split pointer cycles through buckets, splitting one at a time as the table grows. No directory needed — just a round-robin splitting scheme.

Hash Index vs B+ Tree

Feature	Hash Index	B+ Tree
Point lookup	O(1)	O(log n)
Range scan	Not supported	Very efficient (sorted leaf chain)
Prefix matching	Not supported	Supported
Sort order	Any	Preserves sort order
Use case	Key-value stores, de-duplication	General-purpose indexing

External Sorting

When data is too large to fit in memory, you need external sorting. The standard approach is external merge sort (two-phase):

Phase 1 — Sorting runs: Read chunks that fit in memory, sort each chunk, write as a sorted "run" to disk.

Phase 2 — Merging runs: Use a k-way merge to combine sorted runs into one sorted output.

Two-Phase Multi-Way Merge Sort

[Data on disk: unsorted pages]
  ↓ Read M pages at a time into memory
[In-memory sort → write sorted runs to disk]
  ↓
[Sorted Run 1] [Sorted Run 2] [Sorted Run 3] ... [Sorted Run N]
  ↓
[k-way merge using B buffer pages]
  ↓
[Fully sorted data]

Replacement Selection

A variant for Phase 1 that produces runs of expected size 2M (twice the memory available). Uses a min-heap: when a key is output, the next input key is read. If it's >= the last output key, it enters the heap; otherwise, it goes to a "dead" queue for the next run.

Optimization

Double buffering: Pre-fetch next run during merge to overlap I/O and computation
Compression: Compress runs on disk to reduce I/O volume
Using Multiple Disks: Distribute runs across disks for parallel I/O

Traveler's Notes

Hash indexes give you the fastest possible point lookups, but at the cost of range queries and sort order. External sorting is the unsung hero of database systems — every time you run ORDER BY on a large dataset, the database is using it. Understanding these two tools is essential for both database design and general systems programming.

Chapter 7: Hash Index & External Sorting ​

Your Progress ​

Hash Index ​

Hash Index vs B+ Tree ​

External Sorting ​

Two-Phase Multi-Way Merge Sort ​

Replacement Selection ​

Optimization ​

Traveler's Notes ​