Skip to content

Metadata Card

  • Prerequisites: Volume 10 (Foundations of Mathematics), especially basic set theory and graph theory concepts
  • Estimated time: 40 minutes
  • Core difficulty: Beginner
  • Reading mode: High focus (recommend 40+ minutes)
  • Completion marker: Able to explain the four levels of the Chomsky hierarchy, able to manually construct the simplest finite automaton

"Study the Creator's incantations. The Librarian says 'only by understanding spells can you create new things.'" You stand on the first floor of the Spell Tower, facing a stone door covered in strange symbols — some look like letters, some like arrows, some like lines winding in circles. You vaguely sense an hidden order among these symbols.


Your Progress

You obtained the basic tools (sets, mappings, graph theory) from the Mathematics Tower, and now stand at the entrance of the Spell Tower. This floor addresses a fundamental question: what is meant by "a language"? Not spoken language, not written text — but what programming languages, configuration languages, DSLs, and even regular expressions — these "artificial languages" — all have in common.


Your Task

This chapter establishes the theoretical framework for understanding all "artificial languages." You will see that every programming language can be categorized into a "complexity level," and the stages of a compiler correspond exactly to these levels. First, clarify the mathematical definition of language, then progressively understand the Chomsky hierarchy, and finally use automata (the simplest computational model) to get an intuition for "checking whether a string belongs to a language." This process requires no programming, but it does require you to slow down and draw a few examples on paper.


Breakthrough — Origin Story

You stand before the stone door, which is engraved with the following rule:

Only by reciting the correct spell will the stone door open.

But what does "correct spell" mean? You try a few words — "open", "hello", "abracadabra" — nothing happens. You notice a line of small text at the bottom of the door: "All spells consist of the letters a and b, and must start with a and end with b."

This, in fact, defines a language. In mathematical terms:

A language is a subset of all possible strings over a given alphabet.

For instance, if the alphabet ∑ = {a, b}, then all possible strings form an infinite set: {ε, a, b, aa, ab, ba, bb, aaa, ...}. The rule "starts with a, ends with b" filters out a subset — a language.

This is the first lesson of formal language theory: precisely define "what is valid."

The Chomsky Hierarchy — Four Levels of Language

Linguist Chomsky discovered that all formal languages can be divided into four levels, each corresponding to a type of grammar and a class of automaton — a "machine" that can recognize this language.

LevelGrammar TypeRule FormCorresponding AutomatonTypical Example
Type 3Regular GrammarA → aB or A → aFinite Automaton (FA)Regular expressions, lexical rules
Type 2Context-Free GrammarA → γPushdown Automaton (PDA)Programming language syntax
Type 1Context-Sensitive GrammarαAβ → αγβLinear Bounded Automaton (LBA)Some natural language structures
Type 0Unrestricted Grammarα → βTuring Machine (TM)All computable languages

Level 4 (Type 0) is the most powerful; Level 3 (Type 3) is the most restricted. The various stages of a compiler progress exactly from bottom to top: lexical analysis (Type 3) → parsing (Type 2) → semantic analysis (core of Type 1) → code generation (Type 0 intuition).

You don't need to memorize all four levels at once. Focus first on the lowest level — Type 3 (regular grammar) — because it's the smallest, most controllable building block.

Finite Automaton — The Simplest Computational Model

A finite automaton has only two core components: states and transitions. It uses a finite number of states to remember "how far you've read."

A finite automaton for recognizing "starts with a, ends with b" can be drawn like this:

This is the simplest recognition rule you can describe to a spell machine. Three states are like three guards: the first guard only lets through spells starting with a, the second guard waits for the end of the spell, and the third guard confirms it ends with b. Each guard guides the flow of spells to the next step.

States: q0 (start), q1 (has received a), q2 (has ended with b")
Transitions:
  q0 --a--> q1
  q1 --a--> q1
  q1 --b--> q2
  q2 --a--> q1
  q2 --b--> q2

Execution:

  • Read "ab": q0 → q1 → q2 (accept)
  • Read "aab": q0 → q1 → q1 → q2 (accept)
  • Read "ba": q0 cannot read b from q0, stays at q0 (reject)

This is how a Deterministic Finite Automaton (DFA) works. It is "deterministic" because each state + input symbol combination has exactly one next state.

Nondeterministic Finite Automaton (NFA)

An NFA allows one state + input symbol to have multiple transitions, or allows "epsilon transitions" (jumping without consuming input). An NFA appears to "guess" — but in practice, the computer can track all possible states (subset construction), proving that NFA and DFA are exactly equivalent in expressive power.

Why NFA is useful: When writing regular expressions, what you naturally produce is often an NFA ("or" relationships expressed as multiple transitions), while the computer actually needs a DFA at runtime. Converting NFA to DFA is the core step of a regex engine.

Regular Languages vs. Context-Free Languages — Where's the Boundary

Although regular languages are simple and useful, they have a clear boundary of capability. A classic example: checking whether parentheses are balanced.

This simple example draws the boundary of regular languages. The difficulty you feel when trying to match nested parentheses in a string — no matter how complex your regular expression, it can never get it right — has its roots here. Regular languages have no memory, but parentheses need a counter.

"()"       -> balanced
"(())"     -> balanced
"((())))"  -> unbalanced

Balancing parentheses requires "remembering how many I've opened," which needs an unbounded counter. A finite automaton only has a finite number of states and cannot do this. Therefore parentheses matching is not a regular language — it requires a context-free grammar (Type 2).

This is why compilers are split into lexical and parsing phases: lexical analysis uses regular grammars (fast, simple, linear time), while parsing uses context-free grammars (more powerful, but also slower).


Common Pitfalls

Pitfall 1: Confusing languages and automata. "A language is the problem; an automaton is the solution." An automaton is a judgment tool, a language is the object being judged. First define the language clearly, then choose the appropriate automaton.

Pitfall 2: Thinking NFA is more powerful than DFA. NFA and DFA are exactly equivalent in expressive power. NFA is just easier to write; DFA is more efficient. Any NFA can be converted into an equivalent DFA (though the number of states may grow exponentially).

Pitfall 3: Underestimating the limitations of regular languages. Parentheses matching, palindrome checking, prime counting — these are all problems regular languages cannot handle. When writing a compiler, lexical analysis only handles word-level structure; don't try to solve syntactic problems at the lexical stage.


Challenge Questions

  1. Draw a DFA that recognizes "all strings of a and b that contain at least one a."
  2. Rewrite the DFA above as an NFA, then manually convert it back to a DFA using subset construction.
  3. Explain why "all strings of a and b where the number of a's equals the number of b's" is not a regular language.

Traveler's Notes

The core of formal language theory in one sentence: Precisely describe what is correct, and let the machine check it for you.


Next Stop Preview

Having understood the lowest-level regular languages, the next chapter will dive into the construction details of DFA and NFA, and have you manually translate regular expressions into automata.

Built with VitePress | Software Systems Atlas