Software Systems Atlas

Metadata Card

Prerequisites: Chapter 6 (Parsing and AST)
Estimated time: 50 minutes
Core difficulty: Intermediate
Reading mode: High focus (recommend 50+ minutes)
Completion marker: Able to explain the roles of type checking and symbol tables, describe the core ideas of three-address code and SSA

The AST looks like code, but the compiler still doesn't "understand" it. For instance, x = "hello" * 42 — parsing can pass it because it satisfies grammar rules. But you know this line makes no sense — you can't multiply a string by an integer. You can sense this because you understand types. The compiler needs to have a similar "sense."

Your Progress

Your compiler frontend has completed lexical analysis and parsing. You have an AST, hung with nodes, but without "semantic" information — types, scopes, variable visibility. This chapter performs semantic analysis on the AST and finally translates it into an "efficiently computable" intermediate representation for the next stage.

Your Task

You add "understanding" ability to the compiler. You'll traverse the AST, track each variable's type (symbol table), check whether expression types match (type checking), and handle scope nesting. Then you flatten this semantic information — converting the AST into linear instruction sequences, which is intermediate code. You'll learn about three-address code and SSA (Static Single Assignment), the two most widely used intermediate representations.

Breakthrough · Origin Story

You have an AST of a piece of code — the Abstract Syntax Tree has clearly unfolded the structure of the source code. Now you need to inject soul into it in the Spell Tower: semantic analysis checks whether it makes sense, and then translates it into intermediate code.

(function_decl
  name: "add"
  params: (a: int, b: int)
  body: (return (binop + a b))
)

It is syntactically flawless. But you still need to answer these questions:

What types are a and b? (Check parameter declarations)
Is the type of a + b valid? (Two ints added = int, valid)
Does the return type match the declaration? (Implicitly int -> int, matches)
Are a and b visible in this scope? (Yes, within the same function)

Symbol Table

A Symbol Table is the mapping you use to record "variable name -> type + other information." The compiler maintains a stack of symbol tables while traversing the AST:

When entering a new scope (such as a function body or code block { }), you push a new symbol table onto the stack. When leaving the scope, you pop it.

Symbol table stack when inside the function body "add":

Top:    { a: int, b: int }   ← current scope
Bottom: { printf: (char*, ...) -> int }  ← global scope

Name lookup goes from top of stack downward — check the current scope first, then outer scopes, finally the global scope. This is precisely the model for variable shadowing.

Type Checking

While traversing AST nodes, for each expression node, check whether child node types and operators are compatible:

Operator	Left Type	Right Type	Result Type
+	int	int	int
+	float	float	float
+	int	float	float (implicit conversion)
+	int *	int	illegal
==	any T	T	int

The type-checking function looks like recursive descent, but it operates on the AST, not the token stream:

The compiler performs recursive descent from the root node, doing type inference for each expression level. NODE_NUM directly returns int; NODE_ID looks up the variable type in the symbol table; NODE_BINOP first checks whether the left and right child node types are compatible, then returns the operation result type. This recursive process precisely mirrors the static checking you see in your IDE when writing code.

Type check_expr(ASTNode *node, SymTable *st) {
    switch (node->type) {
        case NODE_NUM:
            return TYPE_INT;
        case NODE_ID:
            return lookup_type(st, node->data.name);
        case NODE_BINOP:
            Type left_t = check_expr(node->data.binop.left, st);
            Type right_t = check_expr(node->data.binop.right, st);
            if (!compatible(left_t, right_t))
                error("type mismatch at %s", node->location);
            return result_type(node->data.binop.op, left_t, right_t);
        // ...
    }
}

From AST to Intermediate Code: Three-Address Code

The AST is a tree, but the computer executes linear instruction sequences. Three-Address Code (TAC) is a compromise between the tree and machine code — each instruction has at most three addresses (two operands, one destination).

AST node: (a + b) * c

Corresponding three-address code:
  t1 = a + b
  t2 = t1 * c

Each instruction handles one basic operation. "Three addresses" means two sources plus one destination. The compiler introduces temporary variables (t1, t2) to hold intermediate results. This makes each instruction very "atomic" — one operation, one result.

TAC instruction types typically include:

t = x op y        // binary operation
t = op x          // unary operation
t = x             // assignment
goto L            // unconditional jump
if x relop y goto L  // conditional jump
param x           // pass parameter
call f, n         // function call (n parameters)
ret x             // return
t = &x            // address-of
t = *p            // pointer dereference
*p = t            // pointer assignment

What you need to do is recursively traverse the AST and generate the corresponding TAC instruction sequence for each AST node. The traversal order is the code execution order.

SSA: Static Single Assignment

An upgraded version of three-address code is SSA (Static Single Assignment). The core idea is simple: each variable is assigned only once.

TAC version:
  t1 = a + b
  t2 = t1 * c
  t1 = t2 + d     // reassign to t1 — SSA doesn't allow this

SSA version:
  t1 = a + b
  t2 = t1 * c
  t3 = t2 + d     // must use a new name

When control flow converges (e.g., after the two branches of an if-else), SSA uses a special φ (Phi) function to "select" the variable's value:

if (cond)
  x1 = 1
else
  x2 = 2
x3 = φ(x1, x2)    // at runtime, choose between x1 and x2

SSA's benefit: it makes dataflow analysis extremely simple — you always know the "single definition point" for a variable. This allows the optimizer to perform constant propagation, dead code elimination, etc., without complex dataflow analysis.

Layer One: Good Enough — Write a Minimal Type Checker + TAC Generator

You can translate a "safe expression" containing only integer addition, subtraction, multiplication, and division (with no type errors) into TAC:

These few dozen lines of code demonstrate how AST-to-intermediate-code achieves "one traversal, complete generation." Recursively traverse the tree; for each binary operation node, allocate a new temporary variable and print an instruction of the form 'tN = left op right' — the tree structure of the AST is flattened into a linear instruction sequence, which is the first step of code generation.

int tac_temp = 0;

char* new_temp() {
    char buf[16];
    snprintf(buf, 16, "t%d", tac_temp++);
    return strdup(buf);
}

void gen_tac(ASTNode *node) {
    if (node->type == NODE_NUM) {
        // a number's "result" is itself
        return;
    }
    if (node->type == NODE_BINOP) {
        gen_tac(node->data.binop.left);
        gen_tac(node->data.binop.right);
        char *t = new_temp();
        printf("%s = %s %c %s\n", t,
            node->data.binop.left->result,
            node->data.binop.op,
            node->data.binop.right->result);
        node->result = t;
    }
}

This simple function prints TAC. For (a + b) * c, it outputs:

t0 = a + b
t1 = t0 * c

Layer Two: Deep Dive — Why Compilers Introduce Multiple Intermediate Representations

Modern compilers typically have more than one IR. The Clang/LLVM pipeline is:

C source → AST (Clang) → LLVM IR (SSA) → Machine IR → Assembly

Each IR layer discards information not needed by the higher layer and adds details needed by the lower layer. The AST retains variable names and types; when converted to LLVM IR, it becomes SSA-form virtual registers; when reaching Machine IR, virtual registers are mapped to real hardware registers. This step-wise lowering approach lets you use only the data you need at each stage.

Common Pitfalls

Pitfall 1: Thinking type checking only happens at variable declarations. Type checking is a recursive process applied to every expression node. f(1, "hello") is checked at the call site, not at f's declaration.

Pitfall 2: Confusing "lvalue" and "rvalue" during intermediate code generation. In a = b, the left-hand a is an address (lvalue), the right-hand b is a value (rvalue). In a = a + 1, the a on the right is a read of the value, the a on the left is a write to the address. TAC generation must distinguish: the left-hand side generates an "address expression," the right-hand side generates a "value expression."

Pitfall 3: SSA's φ function is not an "executed instruction." φ is a conceptual "selector" indicating the source of a variable when control flow converges at runtime. It doesn't correspond to a real CPU instruction — during register allocation, φ functions are eliminated and converted to move instructions.

Challenge Questions

Warm-up (5 min): Convert the following code to AST → TAC: int x = 10; int y = x * 2 + 5;
Challenge (30 min): Extend the TAC generator above to add if-else branch structures. You need to generate two code blocks for the branches and connect them via jump instructions. Input if (a > b) c = a; else c = b; should generate three-address code containing conditional jumps and labels.
Observation: Run clang -S -emit-llvm example.c in LLVM and inspect the generated .ll file. Note its SSA form — the % prefix before each variable name, the phi instruction, and the boundary between load/store instructions.

Traveler's Notes

Semantic analysis imbues the AST with "meaning" — type checking, scope resolution; intermediate code (three-address code/SSA) flattens the tree structure into linear instruction sequences, the foundation for optimization and code generation.

Next Stop Preview

The AST has become TAC or SSA; you can now go from source code to linear code. But this linear code is crude — it uses far more temporary variables and instructions than necessary. The final chapter handles code generation and optimization, making it truly efficient on target machines.