Metadata Card
- Prerequisites: Chapter 6 (Parsing and AST)
- Estimated time: 50 minutes
- Core difficulty: Intermediate
- Reading mode: High focus (recommend 50+ minutes)
- Completion marker: Able to explain the roles of type checking and symbol tables, describe the core ideas of three-address code and SSA
The AST looks like code, but the compiler still doesn't "understand" it. For instance, x = "hello" * 42 — parsing can pass it because it satisfies grammar rules. But you know this line makes no sense — you can't multiply a string by an integer. You can sense this because you understand types. The compiler needs to have a similar "sense."
Your Progress
Your compiler frontend has completed lexical analysis and parsing. You have an AST, hung with nodes, but without "semantic" information — types, scopes, variable visibility. This chapter performs semantic analysis on the AST and finally translates it into an "efficiently computable" intermediate representation for the next stage.
Your Task
You add "understanding" ability to the compiler. You'll traverse the AST, track each variable's type (symbol table), check whether expression types match (type checking), and handle scope nesting. Then you flatten this semantic information — converting the AST into linear instruction sequences, which is intermediate code. You'll learn about three-address code and SSA (Static Single Assignment), the two most widely used intermediate representations.
Breakthrough · Origin Story
You have an AST of a piece of code — the Abstract Syntax Tree has clearly unfolded the structure of the source code. Now you need to inject soul into it in the Spell Tower: semantic analysis checks whether it makes sense, and then translates it into intermediate code.
(function_decl
name: "add"
params: (a: int, b: int)
body: (return (binop + a b))
)It is syntactically flawless. But you still need to answer these questions:
- What types are
aandb? (Check parameter declarations) - Is the type of
a + bvalid? (Two ints added = int, valid) - Does the return type match the declaration? (Implicitly int -> int, matches)
- Are
aandbvisible in this scope? (Yes, within the same function)
Symbol Table
A Symbol Table is the mapping you use to record "variable name -> type + other information." The compiler maintains a stack of symbol tables while traversing the AST:
When entering a new scope (such as a function body or code block { }), you push a new symbol table onto the stack. When leaving the scope, you pop it.
Symbol table stack when inside the function body "add":
Top: { a: int, b: int } ← current scope
Bottom: { printf: (char*, ...) -> int } ← global scopeName lookup goes from top of stack downward — check the current scope first, then outer scopes, finally the global scope. This is precisely the model for variable shadowing.
Type Checking
While traversing AST nodes, for each expression node, check whether child node types and operators are compatible:
| Operator | Left Type | Right Type | Result Type |
|---|---|---|---|
| + | int | int | int |
| + | float | float | float |
| + | int | float | float (implicit conversion) |
| + | int * | int | illegal |
| == | any T | T | int |
The type-checking function looks like recursive descent, but it operates on the AST, not the token stream:
The compiler performs recursive descent from the root node, doing type inference for each expression level. NODE_NUM directly returns int; NODE_ID looks up the variable type in the symbol table; NODE_BINOP first checks whether the left and right child node types are compatible, then returns the operation result type. This recursive process precisely mirrors the static checking you see in your IDE when writing code.
Type check_expr(ASTNode *node, SymTable *st) {
switch (node->type) {
case NODE_NUM:
return TYPE_INT;
case NODE_ID:
return lookup_type(st, node->data.name);
case NODE_BINOP:
Type left_t = check_expr(node->data.binop.left, st);
Type right_t = check_expr(node->data.binop.right, st);
if (!compatible(left_t, right_t))
error("type mismatch at %s", node->location);
return result_type(node->data.binop.op, left_t, right_t);
// ...
}
}From AST to Intermediate Code: Three-Address Code
The AST is a tree, but the computer executes linear instruction sequences. Three-Address Code (TAC) is a compromise between the tree and machine code — each instruction has at most three addresses (two operands, one destination).
AST node: (a + b) * c
Corresponding three-address code:
t1 = a + b
t2 = t1 * cEach instruction handles one basic operation. "Three addresses" means two sources plus one destination. The compiler introduces temporary variables (t1, t2) to hold intermediate results. This makes each instruction very "atomic" — one operation, one result.
TAC instruction types typically include:
t = x op y // binary operation
t = op x // unary operation
t = x // assignment
goto L // unconditional jump
if x relop y goto L // conditional jump
param x // pass parameter
call f, n // function call (n parameters)
ret x // return
t = &x // address-of
t = *p // pointer dereference
*p = t // pointer assignmentWhat you need to do is recursively traverse the AST and generate the corresponding TAC instruction sequence for each AST node. The traversal order is the code execution order.
SSA: Static Single Assignment
An upgraded version of three-address code is SSA (Static Single Assignment). The core idea is simple: each variable is assigned only once.
TAC version:
t1 = a + b
t2 = t1 * c
t1 = t2 + d // reassign to t1 — SSA doesn't allow this
SSA version:
t1 = a + b
t2 = t1 * c
t3 = t2 + d // must use a new nameWhen control flow converges (e.g., after the two branches of an if-else), SSA uses a special φ (Phi) function to "select" the variable's value:
if (cond)
x1 = 1
else
x2 = 2
x3 = φ(x1, x2) // at runtime, choose between x1 and x2SSA's benefit: it makes dataflow analysis extremely simple — you always know the "single definition point" for a variable. This allows the optimizer to perform constant propagation, dead code elimination, etc., without complex dataflow analysis.
Layer One: Good Enough — Write a Minimal Type Checker + TAC Generator
You can translate a "safe expression" containing only integer addition, subtraction, multiplication, and division (with no type errors) into TAC:
These few dozen lines of code demonstrate how AST-to-intermediate-code achieves "one traversal, complete generation." Recursively traverse the tree; for each binary operation node, allocate a new temporary variable and print an instruction of the form 'tN = left op right' — the tree structure of the AST is flattened into a linear instruction sequence, which is the first step of code generation.
int tac_temp = 0;
char* new_temp() {
char buf[16];
snprintf(buf, 16, "t%d", tac_temp++);
return strdup(buf);
}
void gen_tac(ASTNode *node) {
if (node->type == NODE_NUM) {
// a number's "result" is itself
return;
}
if (node->type == NODE_BINOP) {
gen_tac(node->data.binop.left);
gen_tac(node->data.binop.right);
char *t = new_temp();
printf("%s = %s %c %s\n", t,
node->data.binop.left->result,
node->data.binop.op,
node->data.binop.right->result);
node->result = t;
}
}This simple function prints TAC. For (a + b) * c, it outputs:
t0 = a + b
t1 = t0 * cLayer Two: Deep Dive — Why Compilers Introduce Multiple Intermediate Representations
Modern compilers typically have more than one IR. The Clang/LLVM pipeline is:
C source → AST (Clang) → LLVM IR (SSA) → Machine IR → AssemblyEach IR layer discards information not needed by the higher layer and adds details needed by the lower layer. The AST retains variable names and types; when converted to LLVM IR, it becomes SSA-form virtual registers; when reaching Machine IR, virtual registers are mapped to real hardware registers. This step-wise lowering approach lets you use only the data you need at each stage.
Common Pitfalls
Pitfall 1: Thinking type checking only happens at variable declarations. Type checking is a recursive process applied to every expression node. f(1, "hello") is checked at the call site, not at f's declaration.
Pitfall 2: Confusing "lvalue" and "rvalue" during intermediate code generation. In a = b, the left-hand a is an address (lvalue), the right-hand b is a value (rvalue). In a = a + 1, the a on the right is a read of the value, the a on the left is a write to the address. TAC generation must distinguish: the left-hand side generates an "address expression," the right-hand side generates a "value expression."
Pitfall 3: SSA's φ function is not an "executed instruction." φ is a conceptual "selector" indicating the source of a variable when control flow converges at runtime. It doesn't correspond to a real CPU instruction — during register allocation, φ functions are eliminated and converted to move instructions.
Challenge Questions
Warm-up (5 min): Convert the following code to AST → TAC:
int x = 10; int y = x * 2 + 5;Challenge (30 min): Extend the TAC generator above to add if-else branch structures. You need to generate two code blocks for the branches and connect them via jump instructions. Input
if (a > b) c = a; else c = b;should generate three-address code containing conditional jumps and labels.Observation: Run
clang -S -emit-llvm example.cin LLVM and inspect the generated.llfile. Note its SSA form — the%prefix before each variable name, thephiinstruction, and the boundary betweenload/storeinstructions.
Traveler's Notes
Semantic analysis imbues the AST with "meaning" — type checking, scope resolution; intermediate code (three-address code/SSA) flattens the tree structure into linear instruction sequences, the foundation for optimization and code generation.
Next Stop Preview
The AST has become TAC or SSA; you can now go from source code to linear code. But this linear code is crude — it uses far more temporary variables and instructions than necessary. The final chapter handles code generation and optimization, making it truly efficient on target machines.