Skip to content

Chapter 9: Query Compilation


Metadata Card

AttributeContent
Difficulty(Advanced)
PrerequisitesChapter 8 (Query Execution & Optimizer), compiler basics
KeywordsQuery compilation, JIT, code generation, LLVM, interpreted execution, compiled execution
CodeConceptual Python/C++

Your Progress

"Master Chen discovered that 'verbal instructions' were too slow. He started compiling instructions directly into mechanical operations — interpretation vs compilation."

The Interpretation Overhead

Traditional database execution uses the Volcano iterator model: each operator (scan, filter, join) implements next() that returns one tuple. The overhead per tuple includes:

  • Function calls between operators
  • Virtual method dispatch
  • Type checking
  • Null checking
  • Branch misprediction

For queries processing billions of rows, this overhead becomes significant.

Compiled Execution

Instead of interpreting the query plan at runtime, compile the query into native machine code. This eliminates interpretation overhead.

Approaches

ApproachExampleHow
Template-basedHyper, Databricks PhotonGenerate C++ code from query plan, compile via gcc/LLVM
JIT-compiledHyPer, Actian VectorCompile to LLVM IR at runtime, JIT to native code
BytecodeSQLiteCompile to custom bytecode, interpreted by a VM loop

Code Generation Example

SQL: SELECT name, value FROM items WHERE type = 'Weapon' AND value > 100

Template-based compiled version (pseudocode):

c
void query_scan(int page_count, Page* pages, OutputBuffer* output) {
    for (int p = 0; p < page_count; p++) {
        Page* page = &pages[p];
        int row_count = page->row_count;
        for (int r = 0; r < row_count; r++) {
            Row* row = &page->rows[r];
            // Filter: type = 'Weapon' AND value > 100
            if (strcmp(row->type, "Weapon") == 0 && row->value > 100) {
                // Project: name, value
                output->write(row->name, row->value);
            }
        }
    }
}

No function calls per row! The entire query is a tight loop with the filters and projections baked in.

Benefits

AspectInterpretationCompilation
Per-tuple overhead10-100 CPU cycles< 5 CPU cycles
Branch predictionPoor (operators switch)Excellent (tight loops)
Cache utilizationPoor (scattered state)Excellent (localized data)
Compile timeNone10-100ms (acceptable for complex queries)

Compilation vs Vectorization

VectorizationCompilation
GranularityBatches (1024 rows)Per-row (tight loop)
SIMD benefitNaturalCan be added
Compile overheadNoneMedium
Best forSimple scans, aggregationsComplex expressions, multi-way joins

Modern Systems

  • Hyper (Tableau): Compiles query to LLVM IR
  • SingleStore: JIT compilation for complex queries
  • CockroachDB: Vectorized execution fallback
  • Umbra: State-of-the-art compiled execution system

Traveler's Notes

Query compilation represents the frontier of database performance optimization. By eliminating interpretation overhead, compiled databases can execute queries tens of times faster than traditional systems. The trade-off is compilation time — but for complex analytical queries, the optimization is well worth it.

Built with VitePress | Software Systems Atlas