Chapter 20: Search System

Metadata Card

Attribute	Content
Difficulty	(Intermediate)
Prerequisites	Understanding of full-text search, basic data structures
Keywords	Inverted index, tokenization, TF-IDF, BM25, Elasticsearch, Lucene, relevance scoring
Code	Python, Elasticsearch queries

Your Progress

"The final workshop of the data fortress — full-text search. No longer content with exact matches, the warehouse can now find any item by describing it."

Why Search?

SQL's LIKE '%keyword%' is slow (full table scan) and inflexible (no ranking, no relevance). Full-text search systems provide:

Fast substring/fuzzy matching: Millisecond response on billions of documents
Relevance ranking: Best results first
Tolerance: Typos, stemming, synonyms
Faceted search: Filter by categories, ranges

Inverted Index

The core data structure for full-text search:

Documents:
Doc 1: "The quick brown fox"
Doc 2: "Brown fox jumps over the lazy dog"
Doc 3: "The lazy dog sleeps"

Inverted Index:
brown  → [Doc1(pos2), Doc2(pos1)]
dog    → [Doc2(pos7), Doc3(pos3)]
fox    → [Doc1(pos3), Doc2(pos2)]
jumps  → [Doc2(pos3)]
lazy   → [Doc2(pos6), Doc3(pos2)]
quick  → [Doc1(pos2)]
sleeps → [Doc3(pos4)]
the    → [Doc1(pos1), Doc2(pos5), Doc3(pos1)]

Structure: term → list of (document_id, position_list, frequency)

Tokenization

Breaking text into tokens:

python

import re
def tokenize(text: str) -> list[str]:
    # Lowercase, split on non-alphanumeric
    return re.findall(r'\w+', text.lower())

Stemming: Reduce words to root form ("running" → "run", "ran" → "run") Lemmatization: More sophisticated, uses vocabulary and morphological analysis Stop words: Remove common words ("the", "a", "is", "on")

Relevance Scoring

TF-IDF: Term Frequency × Inverse Document Frequency

TF = frequency of term in document / total terms in document
IDF = log(total documents / documents containing the term)
TF-IDF = TF × IDF

BM25 (OKapi BM25): Improved version of TF-IDF, used by Elasticsearch/Lucene:

score(D,Q) = Σ (IDF × TF(D) × (k1 + 1)) / (TF(D) + k1 × (1 - b + b × |D| / avgdl))

Key parameters: k1 (term frequency saturation), b (length normalization).

Elasticsearch Architecture

Built on Apache Lucene (Java search library):

Elasticsearch Cluster
┌─────────────────────────────────┐
│  Node 1          Node 2          │
│ ┌────────────┐ ┌────────────┐  │
│ │Index A     │ │Index B     │  │
│ │ Shard 0    │ │ Shard 0    │  │
│ │ Shard 1    │ │ Shard 1    │  │
│ └────────────┘ └────────────┘  │
└─────────────────────────────────┘

Index → Shard (Lucene index) → Segment (immutable inverted index)

Query Types

Query	Description
match	Full-text search with analysis
term	Exact value lookup (no analysis)
range	Numeric/date range query
fuzzy	Levenshtein edit distance tolerance
wildcard	Pattern matching (slow on large datasets)
bool	Boolean combination of queries (must, should, must_not, filter)

Distributed Search

Index-time routing: Which shard a document belongs to
Query-time scatter-gather: Query all shards → merge ranked results → return top N
Replication: Search can be served from any replica

Search vs Database

	Database (B+ Tree)	Search Engine (Inverted Index)
Exact match	Fast	Fast
Full-text search	Slow (LIKE)	Very fast
Range queries	Fast	Supported (but slower)
Ranking	No	Yes (BM25)
Aggregation	SQL GROUP BY	Elasticsearch aggregations
Fuzzy/typo tolerance	No	Yes

Traveler's Notes

Full-text search systems are specialized databases optimized for one thing: finding text quickly and ranking results by relevance. The inverted index is the key innovation — trading storage space for search speed. Elasticsearch and Solr/Lucene power much of the internet's search infrastructure. Understanding how they work helps you design better search experiences and troubleshoot performance issues.

Chapter 20: Search System ​

Metadata Card ​

Your Progress ​

Why Search? ​

Inverted Index ​

Tokenization ​

Relevance Scoring ​

Elasticsearch Architecture ​

Query Types ​

Distributed Search ​

Search vs Database ​

Traveler's Notes ​