Skip to content

Chapter 6: Columnar Storage


Metadata Card

DimensionValue
Difficulty(Intermediate)
PrerequisitesChapter 1 (SQL), understanding of rows and columns
KeywordsRow store, column store, PAX, ORC, Parquet, vectorized execution, compression, analytical workload
Code LanguagePython (conceptual simulation)

Your Progress

"Analysis warehouses don't need entire rows of shelves — they just need one column of each type. Row store vs column store: two different perspectives."

Row Store vs Column Store

Row store (MySQL, PostgreSQL): All fields of a row stored together. Good for OLTP (Online Transaction Processing) — point queries, single-row lookups.

Column store (ClickHouse, BigQuery, Redshift): All values of a column stored together. Good for OLAP (Online Analytical Processing) — aggregations, full-column scans.

Row Store:      Column Store:
Row 1: A1,B1,C1    Column A: A1,A2,A3,A4...
Row 2: A2,B2,C2    Column B: B1,B2,B3,B4...
Row 3: A3,B3,C3    Column C: C1,C2,C3,C4...

Why Columnar is Better for Analytics

Query: SELECT AVG(value) FROM items WHERE year = 2024

Row store: Read all rows (all columns) → filter → compute. Wasteful I/O reading unused columns.

Column store: Only read year and value columns. Massive I/O savings.

Compression Benefits

Column values tend to be similar (same type, range, distribution). This enables:

  • Dictionary encoding: Replace values with short integer codes
  • Run-length encoding (RLE): Consecutive identical values stored as (value, count)
  • Delta encoding: Store differences between consecutive values
  • Bit packing: Store integers in fewer bits when range is small

Hybrid: PAX Format

Partition Attributes Across (PAX) combines row and columnar approaches. Within each page, data is stored in columnar format. This gives the best of both worlds: row-oriented page management with column-oriented cache efficiency.

Vectorized Execution

Instead of processing one row at a time (tuple at a time, "Volcano" model), process batches of values (vectors). This reduces function call overhead and enables SIMD optimization.

Columnar Formats

FormatCreatorFeatures
ORCApache HiveStripe-level indexes, bloom filters, ACID support
ParquetApache (Twitter/Cloudera)Nested data support, schema evolution, widely adopted
ArrowApacheIn-memory columnar format, zero-copy sharing, cross-language

Traveler's Notes

The row vs column decision is one of the most impactful architectural choices in data systems. Row stores handle your daily transactions; column stores power your analytics dashboards. Understanding the difference — and knowing when to use each — is essential for any data engineer.

Built with VitePress | Software Systems Atlas