Software Systems Atlas

Metadata Card

Prerequisites: Chapter 7 (Semantic Analysis and Intermediate Representation)
Estimated time: 55 minutes
Core difficulty: Intermediate
Reading mode: High focus (recommend 55+ minutes)
Completion marker: Able to explain basic graph-coloring register allocation, list 3 common local optimization techniques, understand the core idea of JIT

The top floor of the Spell Tower.

The Librarian stands before a massive iron door, engraved with the line: "From thought to silicon." He pushes it open, and you see an ultimate weapon laid out on the table — the entire toolchain from source code to machine code.

But the Librarian doesn't start teaching you how to use these tools directly. He picks up a pair of scissors, snips at a line of intermediate code on the table — and a useless temporary variable flutters down. "This final step," he says, "is to fit the carefully constructed skeleton into the machine's nerves, and along the way, throw away as much baggage as you can."

You pause — you spent several chapters converting code into an intermediate representation, and now he's telling you to "throw away baggage"?

"Exactly. Semantic correctness is the baseline, not the destination. You want code that runs faster, uses less memory, and needs fewer registers. That's the meaning of optimization."

Your Progress

You've gone through the frontend — lexical analysis, parsing, semantic analysis, and intermediate code generation. Now you have a smooth sequence of intermediate code (TAC/SSA). It's semantically correct, but bloated — piles of temporary variables, unused assignments, redundant computations. Two things remain: optimize this code, then generate target machine code.

Your Task

This chapter completes the two core tasks of the compiler backend: producing target code and improving code quality. You'll learn the basic strategy for register allocation (graph coloring), master several common local (within basic blocks) and global optimization techniques (constant folding, common subexpression elimination, dead code elimination, loop optimization), and finally understand how modern compilers combine all of these, and how JIT compilers change the traditional "compile first, run later" model.

Breakthrough · Origin Story

Your TAC generator outputs code like this:

t0 = a + b
t1 = t0 + c
t2 = a + b
t3 = t2 + d
t4 = t3 + t1

It's correct but wasteful: a + b is computed twice. You could try to avoid this duplication in an earlier phase — but you can't always know in advance what will be recomputed. And a bunch of temporary variables — t0 through t4, 5 "registers" — but on real hardware you might only have 16 general-purpose registers.

You need two things: reduce instruction count (optimization) and map countless temporary variables onto limited hardware registers (register allocation).

I. Local Optimization — Working Within Basic Blocks

A Basic Block is a consecutive sequence of instructions with "one entry and one exit." The basic unit of optimization first operates within a basic block.

Constant Folding: If operands are constants, compute directly.

t0 = 3 * 4 + 2  →  t0 = 14

Constant Propagation: If a variable is guaranteed to be constant, replace all its references with the constant.

x = 5
y = x + 3    →  y = 8

Common Subexpression Elimination (CSE): If the same expression appears in multiple places and its operands haven't changed, compute it only once.

t1 = a + b           t1 = a + b
t2 = a * c           t2 = a * c
t3 = a + b    →      t3 = t1
t4 = t3 + d          t4 = t1 + d

Dead Code Elimination: If a variable's value is never used later, delete the instruction that computes it.

t1 = a + b    // t1 is never used later — delete
t2 = a * c
ret t2        // only uses t2

Copy Propagation: If x = y appears and y hasn't been modified since, replace x with y.

x = y
z = x + 1    →  z = y + 1

II. Register Allocation — Graph Coloring

Register allocation is the process of mapping unlimited virtual registers (t0, t1, ...) onto a limited number of physical registers (r0, r1, ..., r15). The classic method is graph coloring:

Create a node for each virtual register
If two virtual registers are both "live" at the same time, they cannot share a physical register — draw an edge (interference edge) between them
The resulting graph is the register interference graph
Color the graph with K colors (K = number of available registers) — if adjacent nodes have the same color, there's a conflict
If a node cannot be colored (no available color), spill its value to memory — temporarily store it on the stack

Assuming 3 physical registers available (K=3), interference graph:

  t0 --- t1
   |   /
   |  /
   t2

Coloring:
  t0 = r0
  t1 = r1
  t2 = r2   ← no conflicts, all three registers used

If another t3 appears and conflicts with all nodes — spill to stack.

This is an NP-hard problem (see Chapter 4), so practical implementations use heuristics that balance quality and speed. LLVM uses a priority-based greedy allocator: sort by "live interval length," trying to squeeze short-lived variables into registers.

III. Global Optimization — Across Basic Blocks

Optimizations beyond basic blocks require more complex analysis. The most typical global optimization is dataflow analysis — propagating properties across the control flow graph.

Reaching Definitions: For each use point, where did the last assignment to the variable occur? This is the foundation of all global optimizations.

Loop Invariant Code Motion: If an expression computes the same value in every iteration of a loop, move it outside the loop.

// Before optimization
for (i = 0; i < 100; i++) {
    x = 2 * PI;          // PI doesn't change inside the loop
    a[i] = a[i] + x;
}

// After optimization
x = 2 * PI;
for (i = 0; i < 100; i++) {
    a[i] = a[i] + x;
}

Strength Reduction: Replace expensive operations with cheaper ones. For example, incrementing by 1 each iteration (i = i + 1) is cheaper than i = i * 2. Multiplication can become addition, left-shift, etc.

IV. Target Code Generation — From IR to Assembly

The final step: convert the optimized intermediate code into assembly instructions for the target machine. The main work is instruction selection — each three-address code pattern corresponds to one or a set of target instruction sequences.

TAC: t1 = a + b

x86-64:  mov eax, [a]
         add eax, [b]
         mov [t1], eax

ARM64:   ldr w0, [sp, #a_offset]
         ldr w1, [sp, #b_offset]
         add w2, w0, w1
         str w2, [sp, #t1_offset]

The instruction selection process can be seen as tree pattern matching — your AST subtrees (now TAC patterns) must match target machine instruction templates. LLVM uses a representation called SelectionDAG for this.

V. JIT Compiler

AOT (Ahead-of-Time) compilation follows the full pipeline described in Chapter 7. But there is another approach: JIT (Just-In-Time) compilation, where code is compiled at runtime only for hot paths.

The core intuition of JIT: a lot of code only executes along certain paths and doesn't need to be compiled in advance. But some functions are executed frequently; the JIT detects these hot spots and switches them from interpreted execution to compiled execution. Java's HotSpot VM, V8's TurboFan, and LuaJIT all work this way.

JIT compilers typically use a tiered compilation strategy:

Source code → bytecode / interpreted execution
             → baseline compiler (fast, few optimizations)
             → optimizing compiler (slow, many optimizations, applies SSA, register allocation, etc.)

JIT can do better than AOT because it has access to actual runtime information — branch probabilities, type distributions (for dynamic languages), call frequencies. This information allows JIT to make optimization decisions that AOT compilation cannot determine.

Layer One: Good Enough — Read Compiler Output

You don't need to implement all this yourself, but you should be able to recognize optimization results in compiler output. Use gcc -O2 -S example.c to view assembly output, and compare -O0 with -O2 — you'll see the -O2 version is shorter, has fewer temporary variables, and some operations have been inlined.

Layer Two: Deep Dive — The Iron Rule: "Optimization Must Not Change Program Semantics"

No matter how many optimizations are applied, the compiler must guarantee that the optimized code is consistent with the original code in all observable behaviors. This constraint is called the "as-if" rule. The compiler cannot: arbitrarily reorder floating-point operations (due to precision differences), apply CSE to volatile variables, or delete external printf calls whose side effects are not determined. Violating this iron rule turns optimizations into bug generators.

Common Pitfalls

Pitfall 1: Thinking higher optimization levels always mean faster programs. -O3 includes some optimizations that may increase code size or produce negative effects (like loop unrolling and function inlining). In embedded systems or cache-constrained scenarios, -Os (optimize for size) might be faster than -O3.

Pitfall 2: Ignoring the non-commutativity of floating-point operations. x + y is not always equal to y + x (due to floating-point precision issues), so compilers generally do not reorder floating-point operations. If you want to allow floating-point reordering, use -ffast-math — but there may be precision loss.

Pitfall 3: Thinking local optimization is sufficient. Without global dataflow analysis, local optimization can only see within basic blocks — many cross-block optimization opportunities go to waste. For example, a variable assigned in block 1 and last used in block 3, with block 2 not modifying it — without global analysis, you can't delete the useless assignment at the end of block 1.

Challenge Questions

Warm-up (5 min): For the following TAC sequence, apply constant folding, dead code elimination, common subexpression elimination (CSE), and copy propagation one by one:
```
t0 = 2 + 3
t1 = a + b
t2 = t0 * 4
x = t2
y = a + b
t3 = x + y
ret t3
```
1
2
3
4
5
6
7
Challenge (30 min): Implement a minimal register allocator using your language of choice (graph coloring, K=4): input is a TAC sequence for one function, output is the register allocation result (which variables go to which registers, and which are spilled to the stack). You don't need to handle complex control flow — only linear code.
Observation: Pick a small C program and compile it with gcc -O0 -S, gcc -O1 -S, and gcc -O2 -S. Compare the generated assembly code length and instruction count. Can you see traces of CSE, constant folding, or loop optimization?

Traveler's Notes

The compiler backend operates under two constraints: having infinite possibilities (many optimizations) and being bound by iron rules (semantic equivalence). Register allocation maps virtual registers to physical registers; constant folding, CSE, dead code elimination, loop invariant code motion, etc., make code more efficient; JIT leverages actual runtime information to perform optimizations far beyond what AOT can achieve.

Next Stop Preview

You step out of the Spell Tower. The Librarian says, "You understand spells now." In the next volume, carrying a deep understanding of "computation," you enter the Data Prediction Hall — from code-driven systems to data-driven decisions, that is the world of Data Processing and Data Science.