Skip to content

Chapter 10: SIMD Vectorization


Metadata Card

AttributeContent
Difficulty(Advanced)
PrerequisitesChapter 8 (Query Execution), basic CPU architecture
KeywordsSIMD, SSE, AVX, vectorized execution, data parallelism, CPU pipeline
CodeC++ (intrinsics), Python (simulated)

Your Progress

"The warehouse introduced an automated assembly line — processing entire batches at once, instead of moving items one by one."

What is SIMD?

SIMD (Single Instruction, Multiple Data): A CPU instruction applies the same operation to multiple data elements simultaneously.

Scalar (one at a time):     Vector (SIMD, 4 at a time):
a[0] + b[0]                 a[0..3] + b[0..3]
a[1] + b[1]
a[2] + b[2]
a[3] + b[3]

One instructionMultiple data lanesParallel computation in a single core.

SIMD Instruction Sets

ExtensionWidthRegistersAvailable
MMX64-bit81997 (obsolete)
SSE128-bit16 (XMM)1999
SSE2128-bit16 (XMM)2001
AVX256-bit16 (YMM)2011
AVX2256-bit16 (YMM)2013
AVX-512512-bit32 (ZMM)2017 (Xeon)
NEON128-bit32ARM (smartphones)
SVEVariable (128-2048-bit)ARM (servers)

SIMD in Databases

Key operations that benefit from SIMD:

  1. Filtering (selection): Compare a column of values against a predicate

    • WHERE value > 100: Load vector of values, compare, collect matching indices
  2. Aggregation: Sum, average, min, max over a column

    • SELECT SUM(salary): Horizontal add across vector lanes
  3. Hash probing: Compute hash for multiple keys simultaneously

  4. Data serialization/deserialization: Batch encode/decode multiple values

  5. Compression/Decompression: Dictionary lookups, delta decoding

Example: SIMD Filter

Without SIMD (scalar):

c
int count = 0;
for (int i = 0; i < N; i++) {
    if (data[i] > threshold) {
        result[count++] = i;
    }
}

With SIMD (AVX2, 256-bit = 8 int32s):

c
__m256i thresh_vec = _mm256_set1_epi32(threshold);
int count = 0;
for (int i = 0; i < N; i += 8) {
    __m256i data_vec = _mm256_loadu_si256((__m256i*)&data[i]);
    __m256i mask = _mm256_cmpgt_epi32(data_vec, thresh_vec);
    int mask_bits = _mm256_movemask_ps((__m256)mask);
    // Collect matching indices (compress store)
    while (mask_bits) {
        int idx = __builtin_ctz(mask_bits);
        result[count++] = i + idx;
        mask_bits &= mask_bits - 1;
    }
}

Challenges

ChallengeDescriptionSolution
Data alignmentMisaligned loads are slowerUse aligned allocation
Irregular dataString/long valuesColumnar + dictionary encoding
Control flowBranches break SIMDPredication (compute both paths)
PortabilityDifferent ISAsUse libraries (Highway, SIMDe)

Libraries

LibraryFeatures
Highway (Google)Portable SIMD with auto-dispatch
SIMDePortable SIMD intrinsics (emulates advanced on older CPUs)
EVEExpressive SIMD in C++20
xsimdC++ wrapper for cross-platform SIMD

Traveler's Notes

SIMD turns the database into a parallel processing engine within a single core. Combined with columnar storage, it enables modern analytical databases to process billions of rows per second per core. It's a key reason why columnar databases are so much faster for analytical queries than row stores.

Built with VitePress | Software Systems Atlas