Skip to content

Chapter 20: Search System


Metadata Card

AttributeContent
Difficulty(Intermediate)
PrerequisitesUnderstanding of full-text search, basic data structures
KeywordsInverted index, tokenization, TF-IDF, BM25, Elasticsearch, Lucene, relevance scoring
CodePython, Elasticsearch queries

Your Progress

"The final workshop of the data fortress — full-text search. No longer content with exact matches, the warehouse can now find any item by describing it."

SQL's LIKE '%keyword%' is slow (full table scan) and inflexible (no ranking, no relevance). Full-text search systems provide:

  • Fast substring/fuzzy matching: Millisecond response on billions of documents
  • Relevance ranking: Best results first
  • Tolerance: Typos, stemming, synonyms
  • Faceted search: Filter by categories, ranges

Inverted Index

The core data structure for full-text search:

Documents:
Doc 1: "The quick brown fox"
Doc 2: "Brown fox jumps over the lazy dog"
Doc 3: "The lazy dog sleeps"

Inverted Index:
brown  → [Doc1(pos2), Doc2(pos1)]
dog    → [Doc2(pos7), Doc3(pos3)]
fox    → [Doc1(pos3), Doc2(pos2)]
jumps  → [Doc2(pos3)]
lazy   → [Doc2(pos6), Doc3(pos2)]
quick  → [Doc1(pos2)]
sleeps → [Doc3(pos4)]
the    → [Doc1(pos1), Doc2(pos5), Doc3(pos1)]

Structure: term → list of (document_id, position_list, frequency)

Tokenization

Breaking text into tokens:

python
import re
def tokenize(text: str) -> list[str]:
    # Lowercase, split on non-alphanumeric
    return re.findall(r'\w+', text.lower())

Stemming: Reduce words to root form ("running" → "run", "ran" → "run") Lemmatization: More sophisticated, uses vocabulary and morphological analysis Stop words: Remove common words ("the", "a", "is", "on")

Relevance Scoring

TF-IDF: Term Frequency × Inverse Document Frequency

TF = frequency of term in document / total terms in document
IDF = log(total documents / documents containing the term)
TF-IDF = TF × IDF

BM25 (OKapi BM25): Improved version of TF-IDF, used by Elasticsearch/Lucene:

score(D,Q) = Σ (IDF × TF(D) × (k1 + 1)) / (TF(D) + k1 × (1 - b + b × |D| / avgdl))

Key parameters: k1 (term frequency saturation), b (length normalization).

Elasticsearch Architecture

Built on Apache Lucene (Java search library):

Elasticsearch Cluster
┌─────────────────────────────────┐
│  Node 1          Node 2          │
│ ┌────────────┐ ┌────────────┐  │
│ │Index A     │ │Index B     │  │
│ │ Shard 0    │ │ Shard 0    │  │
│ │ Shard 1    │ │ Shard 1    │  │
│ └────────────┘ └────────────┘  │
└─────────────────────────────────┘

IndexShard (Lucene index) → Segment (immutable inverted index)

Query Types

QueryDescription
matchFull-text search with analysis
termExact value lookup (no analysis)
rangeNumeric/date range query
fuzzyLevenshtein edit distance tolerance
wildcardPattern matching (slow on large datasets)
boolBoolean combination of queries (must, should, must_not, filter)
  • Index-time routing: Which shard a document belongs to
  • Query-time scatter-gather: Query all shards → merge ranked results → return top N
  • Replication: Search can be served from any replica

Search vs Database

Database (B+ Tree)Search Engine (Inverted Index)
Exact matchFastFast
Full-text searchSlow (LIKE)Very fast
Range queriesFastSupported (but slower)
RankingNoYes (BM25)
AggregationSQL GROUP BYElasticsearch aggregations
Fuzzy/typo toleranceNoYes

Traveler's Notes

Full-text search systems are specialized databases optimized for one thing: finding text quickly and ranking results by relevance. The inverted index is the key innovation — trading storage space for search speed. Elasticsearch and Solr/Lucene power much of the internet's search infrastructure. Understanding how they work helps you design better search experiences and troubleshoot performance issues.

Built with VitePress | Software Systems Atlas