Skip to content

Metadata Card

  • Prerequisites: Chapters 1–2 (Regular Languages and Automata), basic C or Java programming
  • Estimated time: 45 minutes
  • Core difficulty: Beginner
  • Reading mode: Casual walkthrough
  • Completion marker: Able to hand-write a simple lexer, understand how Lex/flex works

You've climbed the first four floors of the Spell Tower — all pure mathematics: regular languages, pushdown automata, Turing machines. Every floor was abstract symbols and state diagrams, not a single line of running code.

When the door to the fifth floor opens, the atmosphere is completely different. You see rows of compiler tool chains displayed like a weapons rack. The smallest tool is at the front; it doesn't need complex input — its job is very simple: convert the source code character stream into meaningful tokens.

The Librarian stands before the tool rack, picking up the smallest blade on the left. "The first four floors of the Spell Tower taught you the theory of spells — from this floor, you start forging real spell tools. This blade is called a lexer. Every line of code you write must first pass through it before it can become spell symbols the compiler understands."


Your Progress

The previous four chapters gave you theoretical foundations: regular languages correspond to lexical analysis (this blade), CFG corresponds to parsing (the next blade), TM corresponds to code optimization and static analysis (the sledgehammer). This chapter, you put down paper and pick up a real tool — write a working lexer.


Your Task

From a string of characters, separate out tokens — keywords, identifiers, numbers, operators, delimiters. You'll start with a hand-written state machine (implementing a simple lexer in C), then understand auto-generating tools like Lex/flex, and finally understand the lexer's place in the compiler and how it collaborates with the parser.


Breakthrough · Origin Story

Before you is a line of C code — a small snippet from battlefield code you've used on expeditions:

c
int count = 42;

To you, this is clearly divided into: int (keyword), count (identifier), = (assignment operator), 42 (integer constant), ; (statement terminator). But the compiler only sees a string of ASCII characters: i, n, t, space, c, o, u, n, t... It doesn't understand concepts like "keyword" or "identifier." You need to transform the character stream into a token stream.

Building a Lexer by Hand

The most straightforward approach: maintain a state machine where each character triggers a state transition. Here's a very simplified lexer implemented in C:

c
#include <stdio.h>
#include <ctype.h>
#include <string.h>

typedef enum {
    TOKEN_EOF, TOKEN_NUM, TOKEN_ID, TOKEN_IF,
    TOKEN_ASSIGN, TOKEN_SEMI, TOKEN_UNKNOWN
} TokenType;

typedef struct {
    TokenType type;
    char lexeme[256];
} Token;

Token next_token(FILE *fp) {
    Token tok;
    int c = fgetc(fp);

    // skip whitespace
    while (c == ' ' || c == '\t' || c == '\n')
        c = fgetc(fp);

    if (c == EOF) {
        tok.type = TOKEN_EOF;
        return tok;
    }

    // number
    if (isdigit(c)) {
        int i = 0;
        do {
            tok.lexeme[i++] = c;
            c = fgetc(fp);
        } while (isdigit(c));
        tok.lexeme[i] = '\0';
        tok.type = TOKEN_NUM;
        ungetc(c, fp);  // put back the extra character
        return tok;
    }

    // identifier or keyword
    if (isalpha(c) || c == '_') {
        int i = 0;
        do {
            tok.lexeme[i++] = c;
            c = fgetc(fp);
        } while (isalnum(c) || c == '_');
        tok.lexeme[i] = '\0';
        ungetc(c, fp);
        if (strcmp(tok.lexeme, "if") == 0)
            tok.type = TOKEN_IF;
        else
            tok.type = TOKEN_ID;
        return tok;
    }

    // operator
    if (c == '=') {
        tok.lexeme[0] = '=';
        tok.lexeme[1] = '\0';
        tok.type = TOKEN_ASSIGN;
        return tok;
    }

    // semicolon
    if (c == ';') {
        tok.lexeme[0] = ';';
        tok.lexeme[1] = '\0';
        tok.type = TOKEN_SEMI;
        return tok;
    }

    tok.lexeme[0] = c;
    tok.lexeme[1] = '\0';
    tok.type = TOKEN_UNKNOWN;
    return tok;
}

This function is a simplified state machine. Notice the pattern: read characters one by one, and with each character decide where to go next. When a pattern ends (e.g., when a non-digit character is read, ending the number), you need to put the extra character back (ungetc) — this is an extremely common "longest match" engineering strategy in lexers.

Longest Match and Backtracking

As mentioned before: the lexer always reads as many characters as possible until it can't proceed. For instance, input ifx — if the lexer stops after reading if, it would treat it as the keyword if instead of the identifier ifx. The correct strategy is to continue reading — after reading ifx, finding it can't continue (next character is a space), check accepting states: if ifx itself isn't a complete token but if is — backtrack to the if position. In DFA implementation, you always maintain "the most recent accepting state position" as a backtrack point.

Lex/flex Lexer Generator

Hand-writing state machines is tedious to maintain. Lex and its GNU version flex are tools that automatically generate lexers. You only need to write a set of rule templates; flex generates C code.

A flex example:

%{
/* C headers and helper functions can go here */
#include "tokens.h"
%}

%%
"if"          { return IF; }
[a-zA-Z][a-zA-Z0-9]*  { return IDENTIFIER; }
[0-9]+        { return NUMBER; }
"="           { return ASSIGN; }
";"           { return SEMI; }
[ \t\n]       { /* skip whitespace */ }
.             { return UNKNOWN; }
%%

Each line consists of three parts: a regex + action code in curly braces. What flex does internally is exactly the process described in Chapter 2 — convert regexes to NFAs, merge all NFAs, convert to a DFA, and generate a jump table. This DFA, at flex runtime, is a large switch statement or jump table.

Lexer-Parser Collaboration

You don't usually run the lexer alone — it's "pulled" by the parser: whenever the parser needs the next token, it calls next_token(). This collaboration is called "single-pass scanning": you don't need to read all tokens into memory before parsing; instead, you tokenize and parse simultaneously.

Parser ←calls→ Lexer ←reads← Source file
        (tokens)       (characters)

Layer One: Good Enough — Write a Working Simple Lexer

The C code above is already functional — you can combine it with a main function to read code from standard input and output a token stream. Give it a try:

c
int main(int argc, char *argv[]) {
    Token t;
    do {
        t = next_token(stdin);
        printf("Token: type=%d, lexeme=%s\n", t.type, t.lexeme);
    } while (t.type != TOKEN_EOF);
    return 0;
}

Compile and run:

$ echo "int x = 42;" | ./lexer
Token: type=4, lexeme=int
Token: type=3, lexeme=x
Token: type=5, lexeme==
Token: type=2, lexeme=42
Token: type=6, lexeme=;
Token: type=1, lexeme=

Layer Two: Deep Dive — Why Lex/flex Is Faster Than Hand-Written

Flex generates a DFA that does at most one table lookup + jump per character. Hand-written lexers with if-else chains or switch statements may execute multiple condition checks per character. Flex's DFA table is pre-compiled, making the decision time per character O(1). On modern hardware, flex-generated lexers can process several MB of source code per second — one or two orders of magnitude faster than equivalent if-else chains.


Common Pitfalls

Pitfall 1: Forgetting to skip comments. In classic C lexing, /* ... */ comments are multi-line structures — the lexer needs to enter a "comment mode" until it encounters */ before exiting. This state needs separate handling in the lexer, after which it continues outputting a token stream without comments.

Pitfall 2: Escape sequences in string literals. When handling "hello\nworld", \n is one character, not a backslash plus letter n. The lexer needs to recognize escape sequences while in string mode. This is typically handled by adding an "inside string literal" state.

Pitfall 3: Preprocessor directives. C's #include, #define, etc., are handled by the preprocessor before lexical analysis and are not part of the token stream. If you're writing a lexer for a complete compiler, be careful not to treat #include as an identifier.


Challenge Questions

  • Warm-up (5 min): Add support for == and != operators to the lexer above. Note: when encountering =, you also need to check whether the next character is = — this requires "lookahead."

  • Challenge (30 min): Write a complete lexer using flex that supports a subset of C: keywords (int, if, else, while, return), numbers, identifiers, operators (+ - * / = == != < > <= >=), delimiters (; { } ( )). Verify that the input int gcd(int a, int b) { if (b == 0) return a; } produces correct output.

  • Observation: Download the source code of an open-source compiler (like TinyCC or 8cc), find its lexer (usually called lexer.c or a next_token function), and see how many states it has. Guess how much of the "longest match + backtracking" strategy it uses.


Traveler's Notes

The lexer converts a character stream into a token stream; its core weapon is regular languages (DFA); flex automates the entire "regex -> NFA -> DFA" pipeline, letting you write only rules, not state machines.


Next Stop Preview

The lexer turns code into a token sequence, but what structure does this token sequence have? Next chapter, you take these tokens and rebuild the code's tree structure — the Abstract Syntax Tree.

Built with VitePress | Software Systems Atlas