Chapter 6: Columnar Storage
Metadata Card
Dimension Value Difficulty (Intermediate) Prerequisites Chapter 1 (SQL), understanding of rows and columns Keywords Row store, column store, PAX, ORC, Parquet, vectorized execution, compression, analytical workload Code Language Python (conceptual simulation)
Your Progress
"Analysis warehouses don't need entire rows of shelves — they just need one column of each type. Row store vs column store: two different perspectives."
Row Store vs Column Store
Row store (MySQL, PostgreSQL): All fields of a row stored together. Good for OLTP (Online Transaction Processing) — point queries, single-row lookups.
Column store (ClickHouse, BigQuery, Redshift): All values of a column stored together. Good for OLAP (Online Analytical Processing) — aggregations, full-column scans.
Row Store: Column Store:
Row 1: A1,B1,C1 Column A: A1,A2,A3,A4...
Row 2: A2,B2,C2 Column B: B1,B2,B3,B4...
Row 3: A3,B3,C3 Column C: C1,C2,C3,C4...Why Columnar is Better for Analytics
Query: SELECT AVG(value) FROM items WHERE year = 2024
Row store: Read all rows (all columns) → filter → compute. Wasteful I/O reading unused columns.
Column store: Only read year and value columns. Massive I/O savings.
Compression Benefits
Column values tend to be similar (same type, range, distribution). This enables:
- Dictionary encoding: Replace values with short integer codes
- Run-length encoding (RLE): Consecutive identical values stored as (value, count)
- Delta encoding: Store differences between consecutive values
- Bit packing: Store integers in fewer bits when range is small
Hybrid: PAX Format
Partition Attributes Across (PAX) combines row and columnar approaches. Within each page, data is stored in columnar format. This gives the best of both worlds: row-oriented page management with column-oriented cache efficiency.
Vectorized Execution
Instead of processing one row at a time (tuple at a time, "Volcano" model), process batches of values (vectors). This reduces function call overhead and enables SIMD optimization.
Columnar Formats
| Format | Creator | Features |
|---|---|---|
| ORC | Apache Hive | Stripe-level indexes, bloom filters, ACID support |
| Parquet | Apache (Twitter/Cloudera) | Nested data support, schema evolution, widely adopted |
| Arrow | Apache | In-memory columnar format, zero-copy sharing, cross-language |
Traveler's Notes
The row vs column decision is one of the most impactful architectural choices in data systems. Row stores handle your daily transactions; column stores power your analytics dashboards. Understanding the difference — and knowing when to use each — is essential for any data engineer.