Metadata Card
- Prerequisites: ch01 Search, ch02 Probabilistic Reasoning, Vol 10 Dynamic Programming
- Estimated time: 50 minutes
- Core difficulty: Advanced
- Reading mode: High focus
- Completion: Able to model decision problems with MDP, implement Q-Learning and Policy Gradient
Your Progress
You set up two systems in the Model Workshop: one for search, one for reasoning. But another letter from Ahua arrived:
"Your systems only learn when someone tells them the answer. In reality, nobody's always there to give you the answer."
You look deeper into the workshop—there's a third workbench, inscribed with: "Only feedback, no answers."
Your Task
Reinforcement learning is a framework where agents learn decision policies through reward signals from the environment. You start with the mathematical modeling of Markov Decision Processes (MDPs), then implement two core methods: value-based Q-Learning and policy-based Policy Gradient.
Chapter Layers
- Required: MDP modeling, Q-Learning, epsilon-greedy exploration
- Optional: Policy Gradient and the REINFORCE algorithm
- Advanced: Actor-Critic methods, Deep Q-Networks
Breaking Through · Tracing the Origin
Imagine designing an automatic maze-navigating robot. You don't manually label each step as "correct" or "wrong"—you only tell it: "Get 100 points at the finish line, lose 1 point per step."
Now it has to find the optimal route through trial and error. Why is this harder than BFS? First, the environment may be stochastic (wind might blow the robot off course); second, it doesn't have a complete map (the environment is unknown).
The reinforcement learning framework is designed for this setting: the triple (state s, action a, reward r).
Markov Decision Processes
MDP is the mathematical foundation of reinforcement learning. It assumes the future depends only on the current state (Markov property), described by the following elements:
In the Model Workshop, you first abstract the agent's decision-making environment into four components—the rest is just filling in algorithmic details within this framework.
(S, A, P, R, gamma)
- S: Set of all possible states
- A: Available actions in each state
- P(s'|s,a): State transition probability
- R(s,a,s'): Immediate reward function
- gamma: Discount factor (0~1), controls emphasis on long-term rewardsThe goal is to find a policy π(a|s) that maximizes the expected cumulative discounted reward.
The value function V*(s) represents the optimal expected return in state s. The Bellman equation gives the recursive definition:
V*(s) = max_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * V*(s')]Q-Learning
Q-Learning is a model-free algorithm—it doesn't explicitly learn P(s'|s,a), but directly learns the value Q(s,a) of each state-action pair.
The reinforcement learning workbench in the Model Workshop has no standard answers, only reward signals. Q-Learning uses a table storing "the value of each action in each state," continuously updating this table through trial and error. Epsilon-greedy is its exploration strategy—most of the time choose the known best, occasionally try new paths.
import numpy as np
class QLearning:
def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9, epsilon=0.1):
self.Q = np.zeros((n_states, n_actions))
self.alpha = alpha # learning rate
self.gamma = gamma # discount factor
self.epsilon = epsilon # exploration rate
def act(self, state):
"""epsilon-greedy policy"""
if np.random.random() < self.epsilon:
return np.random.randint(self.Q.shape[1])
return np.argmax(self.Q[state])
def update(self, state, action, reward, next_state):
"""Q-value update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))"""
best_next = np.max(self.Q[next_state])
td_target = reward + self.gamma * best_next
td_error = td_target - self.Q[state][action]
self.Q[state][action] += self.alpha * td_errorCore insight: Q-Learning is an off-policy algorithm—it uses the greedy policy (max_a' Q(s',a')) to evaluate the next step's value, but actually executes the epsilon-greedy policy. This balances exploration (trying new actions) and exploitation (executing the known best).
Test it on the classic "Frozen Lake" environment (4x4 grid, from S to G, avoiding H):
S . . .
. H . H
. . . H
H . . GAfter several thousand training episodes, the Q-table converges: the agent learns to slide to the goal while avoiding ice holes.
Policy Gradient
Q-Learning learns the state-action value function and derives the policy from it. Policy Gradient directly learns the policy parameters θ, maximizing expected returns through gradient ascent.
The REINFORCE algorithm is the simplest method:
Unlike Q-Learning, Policy Gradient doesn't learn a value table—it directly learns the policy itself, a mapping function from states to action probabilities. Q-Learning suits discrete action spaces; Policy Gradient extends naturally to continuous actions.
class REINFORCE:
def __init__(self, n_states, n_actions, lr=0.01):
# Simple linear policy: P(a|s) = softmax(s * W + b)
self.W = np.random.randn(n_states, n_actions) * 0.01
self.b = np.zeros(n_actions)
self.lr = lr
def policy(self, state):
logits = state @ self.W + self.b
exp_logits = np.exp(logits - np.max(logits))
return exp_logits / exp_logits.sum()
def act(self, state):
probs = self.policy(state)
return np.random.choice(len(probs), p=probs)
def update(self, trajectory):
"""Update policy parameters using the full trajectory's returns"""
states, actions, rewards = zip(*trajectory)
# Compute discounted returns G_t
G = np.zeros(len(rewards))
running = 0
for t in reversed(range(len(rewards))):
running = rewards[t] + self.gamma * running
G[t] = running
for t in range(len(states)):
probs = self.policy(states[t])
# Policy gradient approximation: grad = G_t * grad(log pi(a_t|s_t))
d_log = probs.copy()
d_log[actions[t]] -= 1 # gradient of cross-entropy
grad = np.outer(states[t], d_log) * G[t]
self.W += self.lr * gradPolicy Gradient's direct advantage: it can learn stochastic policies (useful in Rock-Paper-Scissors), naturally handles continuous action spaces, and naturally parameterizes with neural networks.
Common Pitfalls
- Q-Learning's Q-table fails immediately when the state space is continuous—need function approximation (Deep Q-Networks).
- Epsilon decaying too slowly leads to slow convergence; too fast leads to local optima. Recommended: decay exponentially from 1.0 to 0.01.
- REINFORCE has high variance: the same policy can produce wildly different returns across different trajectories. Solutions include baselines (subtracting average return) and Advantage functions.
- Reward design is the hidden engineering of reinforcement learning—sparse rewards (only giving points at the end) are nearly impossible to learn; reward shaping (giving small intermediate rewards) is needed.
- A discount factor gamma close to 1 considers long-term returns but converges slowly; close to 0 focuses only on the immediate step, suitable for "each step independent" scenarios.
Pass Challenges
- Warm-up (15 min): Manually derive the Q-Learning Bellman update on a 3x3 grid. Start (0,0), goal (2,2), reward +10 at goal, -1 per step.
- Challenge (40 min): Implement a Q-Learning solver for CartPole. Note: you need to discretize the continuous state space (discretize angle, position into buckets).
- Observation: During Q-Learning, print epsilon values and average per-episode returns, observing the transition from exploration to exploitation.
Traveler's Notes
Reinforcement learning takes the final step beyond "data-driven"—even the data itself is collected through interaction. MDP modeling, Q-value iteration, policy gradient—three progressive layers, from tables to function approximation, from discrete to continuous.
-> Next Chapter Preview
Search, reasoning, and reinforcement learning form the three pillars of "classical AI." But starting from the next chapter, the protagonist changes to "data"—Machine Learning takes the stage.