Chapter 10: SIMD Vectorization

Metadata Card

Attribute	Content
Difficulty	(Advanced)
Prerequisites	Chapter 8 (Query Execution), basic CPU architecture
Keywords	SIMD, SSE, AVX, vectorized execution, data parallelism, CPU pipeline
Code	C++ (intrinsics), Python (simulated)

Your Progress

"The warehouse introduced an automated assembly line — processing entire batches at once, instead of moving items one by one."

What is SIMD?

SIMD (Single Instruction, Multiple Data): A CPU instruction applies the same operation to multiple data elements simultaneously.

Scalar (one at a time):     Vector (SIMD, 4 at a time):
a[0] + b[0]                 a[0..3] + b[0..3]
a[1] + b[1]
a[2] + b[2]
a[3] + b[3]

One instruction → Multiple data lanes → Parallel computation in a single core.

SIMD Instruction Sets

Extension	Width	Registers	Available
MMX	64-bit	8	1997 (obsolete)
SSE	128-bit	16 (XMM)	1999
SSE2	128-bit	16 (XMM)	2001
AVX	256-bit	16 (YMM)	2011
AVX2	256-bit	16 (YMM)	2013
AVX-512	512-bit	32 (ZMM)	2017 (Xeon)
NEON	128-bit	32	ARM (smartphones)
SVE	Variable (128-2048-bit)	—	ARM (servers)

SIMD in Databases

Key operations that benefit from SIMD:

Filtering (selection): Compare a column of values against a predicate
- WHERE value > 100: Load vector of values, compare, collect matching indices
Aggregation: Sum, average, min, max over a column
- SELECT SUM(salary): Horizontal add across vector lanes
Hash probing: Compute hash for multiple keys simultaneously
Data serialization/deserialization: Batch encode/decode multiple values
Compression/Decompression: Dictionary lookups, delta decoding

Example: SIMD Filter

Without SIMD (scalar):

int count = 0;
for (int i = 0; i < N; i++) {
    if (data[i] > threshold) {
        result[count++] = i;
    }
}

With SIMD (AVX2, 256-bit = 8 int32s):

__m256i thresh_vec = _mm256_set1_epi32(threshold);
int count = 0;
for (int i = 0; i < N; i += 8) {
    __m256i data_vec = _mm256_loadu_si256((__m256i*)&data[i]);
    __m256i mask = _mm256_cmpgt_epi32(data_vec, thresh_vec);
    int mask_bits = _mm256_movemask_ps((__m256)mask);
    // Collect matching indices (compress store)
    while (mask_bits) {
        int idx = __builtin_ctz(mask_bits);
        result[count++] = i + idx;
        mask_bits &= mask_bits - 1;
    }
}

Challenges

Challenge	Description	Solution
Data alignment	Misaligned loads are slower	Use aligned allocation
Irregular data	String/long values	Columnar + dictionary encoding
Control flow	Branches break SIMD	Predication (compute both paths)
Portability	Different ISAs	Use libraries (Highway, SIMDe)

Libraries

Library	Features
Highway (Google)	Portable SIMD with auto-dispatch
SIMDe	Portable SIMD intrinsics (emulates advanced on older CPUs)
EVE	Expressive SIMD in C++20
xsimd	C++ wrapper for cross-platform SIMD

Traveler's Notes

SIMD turns the database into a parallel processing engine within a single core. Combined with columnar storage, it enables modern analytical databases to process billions of rows per second per core. It's a key reason why columnar databases are so much faster for analytical queries than row stores.

Chapter 10: SIMD Vectorization ​

Metadata Card ​

Your Progress ​

What is SIMD? ​

SIMD Instruction Sets ​

SIMD in Databases ​

Example: SIMD Filter ​

Challenges ​

Libraries ​

Traveler's Notes ​