Chapter 10: SIMD Vectorization
Metadata Card
| Attribute | Content |
|---|---|
| Difficulty | (Advanced) |
| Prerequisites | Chapter 8 (Query Execution), basic CPU architecture |
| Keywords | SIMD, SSE, AVX, vectorized execution, data parallelism, CPU pipeline |
| Code | C++ (intrinsics), Python (simulated) |
Your Progress
"The warehouse introduced an automated assembly line — processing entire batches at once, instead of moving items one by one."
What is SIMD?
SIMD (Single Instruction, Multiple Data): A CPU instruction applies the same operation to multiple data elements simultaneously.
Scalar (one at a time): Vector (SIMD, 4 at a time):
a[0] + b[0] a[0..3] + b[0..3]
a[1] + b[1]
a[2] + b[2]
a[3] + b[3]One instruction → Multiple data lanes → Parallel computation in a single core.
SIMD Instruction Sets
| Extension | Width | Registers | Available |
|---|---|---|---|
| MMX | 64-bit | 8 | 1997 (obsolete) |
| SSE | 128-bit | 16 (XMM) | 1999 |
| SSE2 | 128-bit | 16 (XMM) | 2001 |
| AVX | 256-bit | 16 (YMM) | 2011 |
| AVX2 | 256-bit | 16 (YMM) | 2013 |
| AVX-512 | 512-bit | 32 (ZMM) | 2017 (Xeon) |
| NEON | 128-bit | 32 | ARM (smartphones) |
| SVE | Variable (128-2048-bit) | — | ARM (servers) |
SIMD in Databases
Key operations that benefit from SIMD:
Filtering (selection): Compare a column of values against a predicate
WHERE value > 100: Load vector of values, compare, collect matching indices
Aggregation: Sum, average, min, max over a column
SELECT SUM(salary): Horizontal add across vector lanes
Hash probing: Compute hash for multiple keys simultaneously
Data serialization/deserialization: Batch encode/decode multiple values
Compression/Decompression: Dictionary lookups, delta decoding
Example: SIMD Filter
Without SIMD (scalar):
int count = 0;
for (int i = 0; i < N; i++) {
if (data[i] > threshold) {
result[count++] = i;
}
}With SIMD (AVX2, 256-bit = 8 int32s):
__m256i thresh_vec = _mm256_set1_epi32(threshold);
int count = 0;
for (int i = 0; i < N; i += 8) {
__m256i data_vec = _mm256_loadu_si256((__m256i*)&data[i]);
__m256i mask = _mm256_cmpgt_epi32(data_vec, thresh_vec);
int mask_bits = _mm256_movemask_ps((__m256)mask);
// Collect matching indices (compress store)
while (mask_bits) {
int idx = __builtin_ctz(mask_bits);
result[count++] = i + idx;
mask_bits &= mask_bits - 1;
}
}Challenges
| Challenge | Description | Solution |
|---|---|---|
| Data alignment | Misaligned loads are slower | Use aligned allocation |
| Irregular data | String/long values | Columnar + dictionary encoding |
| Control flow | Branches break SIMD | Predication (compute both paths) |
| Portability | Different ISAs | Use libraries (Highway, SIMDe) |
Libraries
| Library | Features |
|---|---|
| Highway (Google) | Portable SIMD with auto-dispatch |
| SIMDe | Portable SIMD intrinsics (emulates advanced on older CPUs) |
| EVE | Expressive SIMD in C++20 |
| xsimd | C++ wrapper for cross-platform SIMD |
Traveler's Notes
SIMD turns the database into a parallel processing engine within a single core. Combined with columnar storage, it enables modern analytical databases to process billions of rows per second per core. It's a key reason why columnar databases are so much faster for analytical queries than row stores.