Chapter 20: Search System
Metadata Card
| Attribute | Content |
|---|---|
| Difficulty | (Intermediate) |
| Prerequisites | Understanding of full-text search, basic data structures |
| Keywords | Inverted index, tokenization, TF-IDF, BM25, Elasticsearch, Lucene, relevance scoring |
| Code | Python, Elasticsearch queries |
Your Progress
"The final workshop of the data fortress — full-text search. No longer content with exact matches, the warehouse can now find any item by describing it."
Why Search?
SQL's LIKE '%keyword%' is slow (full table scan) and inflexible (no ranking, no relevance). Full-text search systems provide:
- Fast substring/fuzzy matching: Millisecond response on billions of documents
- Relevance ranking: Best results first
- Tolerance: Typos, stemming, synonyms
- Faceted search: Filter by categories, ranges
Inverted Index
The core data structure for full-text search:
Documents:
Doc 1: "The quick brown fox"
Doc 2: "Brown fox jumps over the lazy dog"
Doc 3: "The lazy dog sleeps"
Inverted Index:
brown → [Doc1(pos2), Doc2(pos1)]
dog → [Doc2(pos7), Doc3(pos3)]
fox → [Doc1(pos3), Doc2(pos2)]
jumps → [Doc2(pos3)]
lazy → [Doc2(pos6), Doc3(pos2)]
quick → [Doc1(pos2)]
sleeps → [Doc3(pos4)]
the → [Doc1(pos1), Doc2(pos5), Doc3(pos1)]Structure: term → list of (document_id, position_list, frequency)
Tokenization
Breaking text into tokens:
import re
def tokenize(text: str) -> list[str]:
# Lowercase, split on non-alphanumeric
return re.findall(r'\w+', text.lower())Stemming: Reduce words to root form ("running" → "run", "ran" → "run") Lemmatization: More sophisticated, uses vocabulary and morphological analysis Stop words: Remove common words ("the", "a", "is", "on")
Relevance Scoring
TF-IDF: Term Frequency × Inverse Document Frequency
TF = frequency of term in document / total terms in document
IDF = log(total documents / documents containing the term)
TF-IDF = TF × IDFBM25 (OKapi BM25): Improved version of TF-IDF, used by Elasticsearch/Lucene:
score(D,Q) = Σ (IDF × TF(D) × (k1 + 1)) / (TF(D) + k1 × (1 - b + b × |D| / avgdl))Key parameters: k1 (term frequency saturation), b (length normalization).
Elasticsearch Architecture
Built on Apache Lucene (Java search library):
Elasticsearch Cluster
┌─────────────────────────────────┐
│ Node 1 Node 2 │
│ ┌────────────┐ ┌────────────┐ │
│ │Index A │ │Index B │ │
│ │ Shard 0 │ │ Shard 0 │ │
│ │ Shard 1 │ │ Shard 1 │ │
│ └────────────┘ └────────────┘ │
└─────────────────────────────────┘Index → Shard (Lucene index) → Segment (immutable inverted index)
Query Types
| Query | Description |
|---|---|
| match | Full-text search with analysis |
| term | Exact value lookup (no analysis) |
| range | Numeric/date range query |
| fuzzy | Levenshtein edit distance tolerance |
| wildcard | Pattern matching (slow on large datasets) |
| bool | Boolean combination of queries (must, should, must_not, filter) |
Distributed Search
- Index-time routing: Which shard a document belongs to
- Query-time scatter-gather: Query all shards → merge ranked results → return top N
- Replication: Search can be served from any replica
Search vs Database
| Database (B+ Tree) | Search Engine (Inverted Index) | |
|---|---|---|
| Exact match | Fast | Fast |
| Full-text search | Slow (LIKE) | Very fast |
| Range queries | Fast | Supported (but slower) |
| Ranking | No | Yes (BM25) |
| Aggregation | SQL GROUP BY | Elasticsearch aggregations |
| Fuzzy/typo tolerance | No | Yes |
Traveler's Notes
Full-text search systems are specialized databases optimized for one thing: finding text quickly and ranking results by relevance. The inverted index is the key innovation — trading storage space for search speed. Elasticsearch and Solr/Lucene power much of the internet's search infrastructure. Understanding how they work helps you design better search experiences and troubleshoot performance issues.