Software Systems Atlas

Metadata Card

Prerequisites: Chapter 3 (CFG/PDA), Chapter 5 (Lexical Analysis)
Estimated time: 55 minutes
Core difficulty: Intermediate
Reading mode: High focus (recommend 55+ minutes)
Completion marker: Able to write a recursive descent parser, explain the difference between LL and LR, manually construct a parse tree for an expression

On the sixth floor, the tower is filled with various tools: some look like top-down funnels, others like bottom-up cranes. You'll use them to turn the token stream from the lexer into a tree — not a decorative tree, but the skeleton of the entire compilation pipeline: the Abstract Syntax Tree (AST).

Your Progress

In the previous chapter, you turned source code into a token sequence — now you have a string of IF, LPAREN, ID, .... But a token sequence is like a pile of lumber — not yet a house. You need to use grammar rules to reconstruct the hierarchical structure of the code.

Your Task

This chapter completes the most core part of the compiler frontend — parsing. You'll learn to write a recursive descent parser (the hand-written approach), understand the difference between LL and LR automatic parsing algorithms, and then understand how the Abstract Syntax Tree (AST) emerges from the parsing process. The goal: given a token stream, you can write code to turn it into a tree.

Breakthrough · Origin Story

You have a token sequence: ID(=x) ASSIGN NUM(5) SEMI. You instinctively know this corresponds to an assignment statement x = 5;. But you need to turn this "instinct" into mathematical rules and programs.

Recursive Descent Parsing (Top-Down)

Recursive descent is the easiest parser to write by hand. Write one function for each non-terminal; the function body expands according to productions, checking the current token when encountering a terminal and calling the corresponding parse function when encountering a non-terminal. Start with the function for the start symbol and recurse downward.

Below is a core part of a simplified expression parser implemented in C (supporting only addition, subtraction, and numbers):

Each grammar non-terminal corresponds to a C function — this is the most practical forging technique in the Spell Tower. Go, Rust, and Swift compilers are all hand-written using this pattern — you're building tools of the same caliber.

typedef struct {
    Token tokens[1024];
    int pos;
} Parser;

Token peek(Parser *p) {
    return p->tokens[p->pos];
}

Token advance(Parser *p) {
    return p->tokens[p->pos++];
}

// Grammar: E -> E + T | T
// After left-recursion elimination: E -> T { + T }
// T -> num

void parse_E(Parser *p) {
    parse_T(p);
    while (peek(p).type == TOKEN_PLUS) {
        advance(p);         // consume '+'
        parse_T(p);
        // AST addition node can be constructed here
    }
}

void parse_T(Parser *p) {
    if (peek(p).type == TOKEN_NUM) {
        advance(p);         // consume number
        // number node can be constructed here
    } else {
        // syntax error
        error("expected number");
    }
}

This pattern is clear, intuitive, and debuggable. Each function corresponds to a grammar non-terminal; recursive calls reflect the nested structure. Since the grammar has eliminated left recursion, there's no infinite recursion.

Note the most critical point: the parser must not only determine whether input is valid, but also construct the syntax tree. Inside parse_E, when encountering +, you take the left subexpression node already constructed and the right subexpression node about to be constructed, and create a new "addition node."

AST (Abstract Syntax Tree) vs CST (Concrete Syntax Tree)

The complete tree derived from a CFG includes all intermediate production steps, including "syntactic sugar" like parentheses. The AST retains only semantically meaningful nodes:

Input: 1 + 2 * 3

CST (Complete derivation tree):
    E
   /|\
  E + T
  |   /|\
  T  T * F
  |  |   |
  F  F   num(3)
  |  |
num(1) num(2)

AST (semantic structure only):
     +
    / \
   1   *
      / \
     2   3

The AST removes parentheses (because parentheses are encoded in the tree structure), removes intermediate non-terminals (E, T, F), and keeps only operators and operands. AST node types are typically an enum with a union:

typedef enum { NODE_NUM, NODE_BINOP } NodeType;

typedef struct Node {
    NodeType type;
    union {
        int num_val;            // NODE_NUM
        struct {                // NODE_BINOP
            struct Node *left;
            struct Node *right;
            int op;             // PLUS, MINUS, MULT, DIV
        } binop;
    } data;
} Node;

LL Parsing vs LR Parsing

Feature	LL (Top-Down)	LR (Bottom-Up)
Perspective	Derive from start symbol to input	Reduce input to start symbol
Implementation	Recursive descent + prediction	Shift-Reduce + state stack
Grammar restrictions	No left recursion, must factor out common prefixes	More relaxed, can handle more grammars
Hand-writing difficulty	Low	High (needs analysis tables)
Tools	Primarily hand-written	Yacc/Bison (LALR(1))
Error recovery	Good (can insert close-brace, etc.)	Poor (has already consumed a lot when error is detected)

The vast majority of hand-written parsers are LL-style recursive descent. Yacc/Bison are automatic generators for LR parsers. LL is more programmer-friendly; LR is more language-designer-friendly (fewer grammar restrictions). Modern language design trends toward LL — Go, Rust, and Swift all use hand-written recursive descent parsers.

Layer One: Good Enough — Parse Expressions with a Recursive Descent Parser

Trace through the parsing of 1 + 2 + 3:

parse_E():
  call parse_T(): consume number 1
  see '+', consume it
  call parse_T(): consume number 2
  see '+', consume it
  call parse_T(): consume number 3
  no more '+', return

Constructed AST:

(Natural result of left-associative addition. If it were right-associative — like exponentiation — the AST order would reverse.)

Layer Two: Deep Dive — Embedded Actions and Reductions

In Yacc/Bison, you don't need to write recursion by hand. You write grammar rules and the corresponding action for each rule (C code):

expr: expr '+' term   { $$ = make_binop($1, PLUS, $3); }
    | term            { $$ = $1; }
    ;

term: term '*' factor { $$ = make_binop($1, MULT, $3); }
    | factor          { $$ = $1; }
    ;

factor: '(' expr ')'  { $$ = $2; }
      | NUMBER        { $$ = make_num($1); }
      ;

Bison generates the LR parsing table and AST construction code from these rules. Your job is to define the rules and provide node constructors. The underlying "Shift (push token onto the stack)" and "Reduce (reduce to a non-terminal)" — Bison does those for you.

Common Pitfalls

Pitfall 1: Left recursion causing recursive descent parser crashes. If the grammar is E -> E + T | T, the recursive descent parse_E function will call itself on its first line — infinite recursion. Left recursion must be eliminated.

Pitfall 2: Losing token position information in the AST. When constructing AST nodes, remember to attach source position (line, column). Without position information, error messages from the parser will say "wrong here" but not "wrong at line 23, column 5" — which is nearly unusable for programmers.

Pitfall 3: Confusing the representation of parentheses in the AST. The AST doesn't need a special "parenthesis node" — parentheses are encoded as tree depth. 1 + (2 * 3) and (1 + 2) * 3 produce entirely different ASTs, which is exactly the effect of parentheses. If you put parent node in the AST, you're doing it wrong.

Challenge Questions

Warm-up (5 min): Draw the AST for the expression a + b * c - d. First, follow standard precedence (multiplication before addition/subtraction, left-associative). Then change the grammar to make addition take priority over multiplication, and draw the AST again.
Challenge (30 min): Implement a complete recursive descent parser in C or Java supporting: addition, subtraction, multiplication, division, parentheses, numbers, single-letter variables. Use a three-level priority grammar (addition level, multiplication level, base level). Verify that the AST output for input (a + b) * c + d is correct.
Observation: Open the source code of the Rust or Go compiler, find parser.rs or parser.go, and observe how their recursive descent functions correspond to the CFG structure. Note how they handle error recovery (jumping to the next semicolon on syntax error to continue parsing).

Traveler's Notes

Parsing reconstructs a token stream into an Abstract Syntax Tree; recursive descent (LL style) is the easiest parser to hand-write; the AST removes syntactic surface and keeps only the semantic skeleton.

Next Stop Preview

The AST tells you the formal structure of the code — but formal correctness doesn't mean it makes sense. "hello" + 42 produces a valid AST after parsing, but it's semantically invalid. In the next chapter, you perform semantic analysis on the AST, making it truly understood and translating it into intermediate code.