Metadata Card
- Prerequisites: Chapter 1 (Formal Languages Overview), Chapter 2 (DFA/NFA)
- Estimated time: 50 minutes
- Core difficulty: Intermediate
- Reading mode: High focus (recommend 50+ minutes)
- Completion marker: Able to write a grammar for simple expressions, detect and eliminate ambiguity, understand how PDAs use a stack to match parentheses
On the third floor, a spell is engraved on the wall: "abracadabra" — but strangely, it contains multiple nested parentheses, like (ab(ra(ca)dab)ra). You try to match it using the DFA from the previous floor and find it impossible; no matter how you draw the state diagram, you can't remember "how many parentheses are currently open." You need a device with memory — a stack.
Your Progress
Regular languages and DFAs can only handle "flat" rules — local patterns in a character sequence. But programming languages universally have nested structures: parentheses, expressions, statement blocks, scopes — all nested. You need to upgrade from Type 3 grammar to Type 2.
Your Task
This chapter lets you master the weapons of the second level of computational power: Context-Free Grammars (CFG) and Pushdown Automata (PDA). You'll learn how to use CFGs to describe parentheses matching, arithmetic expressions, and even a small language, then see why a PDA (DFA + a stack) is exactly what can recognize such grammars. You'll also encounter "ambiguous grammars" — a problem frequently encountered in practice but easily overlooked.
Breakthrough · Origin Story
You try to match parentheses with a regular expression — \(.*\) can only match the outermost pair of parentheses, but can't guarantee correct nesting. To ensure every level of parentheses is properly paired, you need to remember "what level you're currently reading." Regular languages are memoryless — they only care about the current character and current state, not "how many left parentheses you've read so far."
You need a stack.
CFG: Context-Free Grammar
A CFG adds one thing over a regular grammar: the right-hand side of a production can be any mixture of terminals and non-terminals of arbitrary length.
What you're forging in the Spell Tower is exactly this kind of recursive generation rule — each production is a casting template; expanding it once moves you one step closer to concrete program code.
Three-level decomposition of arithmetic expressions:
E -> E + T | T
T -> T * F | F
F -> ( E ) | idThis grammar describes the hierarchy of arithmetic expressions: addition is lowest priority, multiplication is middle, parentheses are innermost. You start expanding from E (Expression) and eventually get a string of terminals — a valid arithmetic expression.
Derivation example: id + id * id
E -> E + T
-> T + T (since E -> T)
-> F + T (T -> F)
-> id + T (F -> id)
-> id + T * F (T -> T * F)
-> id + F * F (T -> F)
-> id + id * F (F -> id)
-> id + id * id (F -> id)Note that each derivation step replaces only one non-terminal. This "leftmost derivation" is the most common expansion method in syntactic analysis.
PDA: Pushdown Automaton
PDA = DFA + stack. The stack has unbounded capacity (but LIFO access only).
A PDA is defined as:
- State set Q
- Input alphabet ∑
- Stack alphabet Γ
- Transition function δ: Q × (∑ U {ε}) × Γ -> subset of Q × Γ*
- Start state q0
- Initial stack symbol Z0
- Accepting state set F
The transition function's triple means: (current state, next input character or ε, top-of-stack symbol) -> (new state, stack replacement string)
Take the language a^n b^n as an example — equal numbers of a's and b's, a's first:
PDA execution:
Read a: push a onto the stack
Read b: pop one a from the stack
After reading all input: check if the stack is empty
Specific transitions (simplified notation):
δ(q0, a, Z) -> (q0, aZ) // first a, push
δ(q0, a, a) -> (q0, aa) // more a's, continue pushing
δ(q0, b, a) -> (q1, ε) // encounter b, pop one a, enter q1
δ(q1, b, a) -> (q1, ε) // continue popping a's
δ(q1, ε, Z) -> (q2, ε) // stack back to initial symbol, enter accepting stateThis example is intuitive: the stack is your "parenthesis counter."
Ambiguous Grammar
A grammar is ambiguous if it can produce more than one distinct derivation tree (parse tree) for the same sentence.
G: E -> E + E | E * E | id
For input id + id * id, there can be two parse trees:
1. Expand addition first (id + (id * id))
2. Expand multiplication first ((id + id) * id)
Two different interpretations — in a compiler, this means you can't determine operator precedence.The solution is to decompose by precedence: encode precedence information into the structure of the grammar. The three-level grammar E -> E + T | T above is the standard solution. Another solution is to use operator-precedence parsing, but it's less common.
Layer One: Good Enough — Write a CFG for a Small Language
Suppose you need to write a grammar for a mini-language supporting variable declarations, assignments, and procedure calls:
Program -> Statement*
Statement -> VarDecl | Assign | Call
VarDecl -> type id ;
Assign -> id = Exp ;
Call -> id ( Args ) ;
Args -> Exp ( , Exp )* | ε
Exp -> id | num | Exp + Exp | Exp * Exp | ( Exp )Although small, this grammar already covers the key syntactic structures of most programming languages. Once written, you can feed it to a parser generator (like Yacc/Bison) to produce a parser.
Layer Two: Deep Dive — CFG Parsing Complexity
The general parsing algorithm for CFGs (like the CYK algorithm) has complexity O(n³), where n is the input length. This is unacceptable in practice. So compilers don't parse arbitrary CFGs — they restrict the grammar subset and use faster algorithms:
- LL grammars: Top-down (recursive descent), O(n). Suitable for hand-written parsers. Requires no left recursion.
- LR grammars: Bottom-up (Shift-Reduce), O(n). Can handle more grammars (including all LL grammars). Yacc/Bison use the LR variant LALR(1).
So the grammars used in actual compilers aren't "general CFG" but designed subsets that can be parsed efficiently. Parsability must be considered when designing languages and grammars, rather than writing arbitrary CFGs.
Common Pitfalls
Pitfall 1: Left recursion causing infinite loops in recursive descent parsers. If you write A -> Aα | β, the recursive descent parseA function will call parseA directly on its first line, creating infinite recursion. The solution is to eliminate left recursion: rewrite A -> Aα | β as A -> βA' and A' -> αA' | ε.
Pitfall 2: Common prefixes causing FIRST set conflicts. A -> αβ | αγ has both branches starting with α; the recursive descent parser doesn't know which path to take after seeing one token of α. Extract the common prefix: A -> αA' and A' -> β | γ.
Pitfall 3: Confusing "pushdown automaton" and "two-stack automaton." A PDA has only one stack. A two-stack automaton is computationally equivalent to a Turing machine (TM), far exceeding a PDA. Adding one more stack jumps to the next level of computational power.
Challenge Questions
Warm-up (5 min): Write a CFG for matching nested
{and}(C language brace blocks). Don't worry about the internal syntax of braces; just write the brace-matching grammar itself.Challenge (30 min): Write a CFG that includes if-else matching. Note: the "dangling else" problem of if-else is a classic ambiguity —
if (c1) if (c2) s1 else s2— which if does the else match? The rule is "the closest unmatched if." Modify your grammar to make the matching unambiguous. Hint: use a "mandatory match" strategy, meaning the inner if must have an else.Observation: Find the BNF of an ANSI C grammar file (e.g., the C example in
man yacc), and identify which CFG patterns are used to resolve operator precedence.
Traveler's Notes
Context-free grammars (CFG) describe nested structures; pushdown automata (PDA) use a stack for recognition; compilers actually use restricted subsets like LL/LR in exchange for linear parsing time.
Next Stop Preview
CFG + PDA can describe the syntax of almost all programming languages. But "syntax" is only the "form" of a program — the next level asks: are there problems that no computational model can solve?