Software Systems Atlas

Metadata Card

Prerequisites: Chapter 1 (Formal Languages and Automata Overview)
Estimated time: 50 minutes
Core difficulty: Intermediate
Reading mode: High focus (recommend 50+ minutes)
Completion marker: Able to hand-draw DFA/NFA and convert between them, able to construct NFA from regular expressions

The second floor of the Spell Tower: the walls are covered with arrows and circles, like a giant roadmap. You notice each path corresponds to a type of "spell template" — as long as a spell matches a certain template, its corresponding effect is automatically triggered. This floor teaches you exactly how to design this kind of "template matching" machine.

Your Progress

In the previous chapter, you learned that "a language is a set of strings." But you were verifying manually — taking a string of characters and checking it against the rules one by one. That work is too slow and error-prone. You need a machine that can automatically recognize spells.

Your Task

This chapter starts with two types of automata — Deterministic Finite Automata (DFA) and Nondeterministic Finite Automata (NFA) — to solve the problem of "how to mechanically determine whether a string belongs to a regular language." You will first learn how to construct DFAs, then see how NFAs are more convenient for "designing rules by humans," and finally use Thompson's construction to convert regular expressions into NFAs — the theoretical foundation of compiler lexical analyzers.

Breakthrough · Origin Story

Before you are two spell rules:

Rule A: Start with a, followed by any number of b's, ending with b Rule B: Contains at least one a and at least one b, and a must come before b

You check each spell against the rules one by one and quickly get confused. Because Rule B says "contains at least one a and at least one b" — you can't be sure whether the string you're reading already has these characters.

What you need is a "stateful" checking process: it doesn't need to remember the entire string, only what mode it's currently in. This is precisely the use of states.

DFA: Deterministic Finite Automaton

A DFA consists of five parts:

Q: Finite set of states
∑: Input alphabet
δ: Transition function Q x ∑ -> Q (each state + each character, exactly one transition)
q0: Start state
F: Set of accepting states

The "deterministic" nature of a DFA means: given the current state and the next character, the next step is unique. It's like a one-way path where each intersection has only one signpost.

Here's a DFA for checking a "C language integer identifier (starting with a digit)" — in the actual C standard, identifiers cannot start with a digit; we're just demonstrating the automaton:

When you write a lexer, the compiler scans your tokens using exactly this kind of state table. Each character is one table lookup, O(n) time, O(1) space — no algorithm can be faster.

State set: Q = {q0, q1}
Alphabet: ∑ = {d (digit), o (other)}
Transition function:
  δ(q0, d) = q1
  δ(q0, o) = q0
  δ(q1, d) = q1
  δ(q1, o) = q0
Start state: q0
Accepting states: {q0, q1} (all states are accepting, since any string containing a digit is accepted)

This DFA scans input left to right, one character at a time, pushing the state forward. Time O(n), space O(1). You can't design a faster algorithm because you have to look at each input character at least once to know what it is.

NFA: Nondeterministic Finite Automaton

Now use a more flexible approach. The only difference between NFA and DFA is: from the same state, reading the same character, there can be multiple transitions, or even "epsilon transitions" (jumping states without reading any character).

NFA example: Recognize "starts with a, ends with b"
  
  State diagram: q0 --a--> q1 --b--> q2 (accepting)
                  q1 --a--> q1
                  q1 --b--> q1

For input "aab":
  From q0 read a -> q1
  Read a -> q1 (can take self-loop)
  Read b -> q1 or q2 (two paths)

NFA seems 'unreliable' — the same input can have multiple possible paths. But the computer tracks all paths behind the scenes; as long as one path reaches an accepting state, it's a match. This is why people find writing regex natural (NFA thinking), while the machine needs to convert to a DFA at runtime.

An NFA's "nondeterminism" sounds unreliable — how can a machine take multiple paths? In theory, an NFA "tries all possible paths," accepting if any one reaches an accepting state. In engineering, we never use NFA as an execution model; it's a "design intermediate."

NFA is designed for humans. DFA is executed by machines.

NFA -> DFA: Subset Construction

The algorithm for converting an NFA to a DFA is called "subset construction," with a simple core idea:

Each state of the DFA = a power set of NFA states (the set of all possible NFA states)

Steps:

Start from the NFA's start state, add all states reachable via ε (empty) transitions — this becomes the DFA's start state
For this set of states, for each input character, compute the "next state set" reachable by the NFA
Treat each distinct "state set" as a new DFA state
Repeat until no new states emerge

This conversion can, in the worst case, cause exponential state growth (the DFA's state count may be 2^n times the NFA's), but for practical regex scenarios, the DFA state count is usually manageable.

Regular Expression -> NFA: Thompson's Construction

This is the most common operation in lexical analysis: given a regular expression, automatically construct the corresponding NFA. Ken Thompson gave these construction rules in 1968:

The brilliance of Thompson's construction is that it's a set of 'spell-building blocks': each regex syntactic structure (concatenation, alternation, repetition) corresponds to a small NFA fragment, and these can be pieced together using epsilon transitions to form a complete NFA. Every regex you write, inside flex or re2c, is translated into an NFA using this rule.

Basic rules:
  character a       ->  q0 --a--> q1
  empty string      ->  q0 (epsilon transition to q1)
  
Compound rules:
  Concatenation AB  : connect A's accept state via epsilon to B's start
  Alternation A|B   : from new start state, epsilon transitions to the starts of A and B in parallel
  Repetition A*     : add epsilon self-loops between A's start and end

You don't need to memorize the details, but you should understand the spirit: each syntactic structure of a regular expression — concatenation, alternation, repetition — has a corresponding NFA fragment, and these fragments can be assembled using epsilon transitions into a complete NFA. This construction is linear in size and won't cause exponential explosion.

Layer One: Good Enough — Hand-Drawing DFA to Recognize Simple Tokens

In a C compiler's lexer, you need to recognize three types of tokens: the keyword if, identifiers (letter followed by letters/digits), and numeric constants. You write a regex for each, construct the corresponding DFA, then merge them into one large DFA — each DFA's final state is tagged with the corresponding token type. The lexer walks on this big DFA as it scans input, emitting whichever token it reaches.

When you write if (x > 0), the compiler's lexical phase identifies the if keyword and x identifier using this DFA — multiple DFAs embedded in the same machine, selecting the result using the longest-match strategy.

Regex:
  if        -> "if"
  identifier-> [a-zA-Z][a-zA-Z0-9]*
  number    -> [0-9]+

Part of the merged DFA:
  q0 --i--> q1 --f--> qif (emit IF token)
  q0 --letter--> qid --letter/digit--> qid (emit ID token)
  q0 --digit--> qnum --digit--> qnum (emit NUM token)

Layer Two: Deep Dive — Why NFA is "More Advanced" but Weaker in Engineering

DFA execution efficiency is unbeatable: one table lookup per input character, fixed number of states. But DFA construction cost is high — for complex regexes, DFA states can explode (something like (a|b)*a(a|b)(a|b), a medium-complexity pattern, already has DFA state count several times that of the NFA). Worse, some regexes correspond to DFAs whose size is exponential in the length of the regex.

So modern lexer generators (like flex, re2c) typically do this: use NFA as the internal representation during construction, but decompose or cache DFA states in the final product. Some use "NFA simulation" (directly tracking the NFA's current state set), sacrificing a bit of speed for construction time and memory.

Most of the time, you don't need to implement these yourself — lex/flex or hand-written lexers call existing regex libraries. But this understanding lets you know: your lexer is using the fastest possible tool to process the surface-level structure of a language.

Common Pitfalls

Pitfall 1: Thinking NFA is more powerful than DFA. NFA and DFA are completely equivalent in expressive power — they both describe regular languages. Every NFA has an equivalent DFA, and vice versa. The difference lies in size and construction difficulty, not expressive power.

Pitfall 2: Not understanding "epsilon closure" in subset construction. NFA's ε transitions may allow a state to jump to the next without any input. In subset construction, a state set must include all states reachable via ε transitions. Miss this step, and your DFA will miss matches.

Pitfall 3: Confusing the "longest match" rule used by lexers. Even with a DFA, when faced with if (keyword) and ifx (identifier), the lexer uses the "longest match" rule — it keeps reading characters until the DFA has no path forward, then backtracks to the last accepting state. This isn't inherent to the DFA itself, but an engineering strategy of the lexer.

Challenge Questions

Warm-up (5 min): Hand-draw a DFA that recognizes all strings of a and b where "the number of a's is odd." Only two states (even number of a's, odd number of a's).
Challenge (30 min): Given the regex (a|b)*abb, draw the NFA using Thompson's construction, then convert to DFA using subset construction. Verify whether aabb and abba can be recognized respectively.
Observation: Open a Linux terminal and run man regex to see the seven subexpressions of POSIX regex. See which can be directly mapped using Thompson's construction and which need extensions (e.g., [a-z] range matching, backreferences — the latter already go beyond regular languages).

Traveler's Notes

DFA is the static, efficient executor; NFA is the flexible design intermediate; regular expressions are converted to NFA via Thompson's construction, then to DFA via subset construction — this is the theoretical skeleton of lexical analysis.

Next Stop Preview

Regular languages can only describe "flat" string structures. The next chapter introduces stronger grammars for handling nesting and recursion — such as parentheses matching. That will be the stage for context-free grammars and pushdown automata.