Metadata Card
- Prerequisites: ch05 Linear Models, Vol 10 Calculus (Chain Rule)
- Estimated time: 60 minutes
- Core difficulty: Advanced/In-depth
- Reading mode: High focus
- Completion: Able to manually derive backpropagation gradient flow, understand differences between activation functions, train a two-layer network with PyTorch
Your Progress
You've tried tree models and linear models in the Model Workshop—they perform well on tabular data. You've handled user features, sensor data, price predictions... everything was going fine.
But today you pulled a picture covered in numbers from your pocket, a handwritten "7", and threw it onto the workshop's test bench. The tree model took one look and started babbling nonsense. The linear model could accept it, but only by flattening it into 784 separate inputs (28×28 pixels), and it had no concept that "neighboring pixels are related."
You stood frozen in the workshop, staring at that crooked "7." You can't write rules for it ("a horizontal line in the middle and a vertical line on top?"—every "7" looks different), and linear models can't handle it (pixels aren't independent). You need a completely new approach.
You remembered the chain rule from the Math Tower, how to differentiate layer by layer, how computation graphs break complex operations into basic steps. The curator said back then: "You'll use this later."—Now it finds its home here.
The Model Workshop needs a new production line. This production line is called Neural Networks.
Your Task
Neural networks stack linear layers, add a nonlinear activation function after each hidden layer, and train end-to-end with backpropagation. This chapter moves from the perceptron to multi-layer networks, works through backpropagation by hand, and surveys how optimizers have evolved.
Chapter Layers
- Required: Perceptron → MLP, manual backpropagation derivation, common activation functions
- Optional: Intuitive differences between SGD vs Adam vs RMSProp
- Advanced: Theoretical analysis of vanishing and exploding gradients
Breaking Through · Tracing the Origin
Recall logistic regression from ch05. It's a simple pipeline: input → weighted sum → sigmoid output. It performs well on binary classification—as long as the data is linearly separable.
But there's a classic test that stumps it: the XOR problem.
Input: (0,0) → 0
       (0,1) → 1
       (1,0) → 1
       (1,1) → 0
Plot these four points in 2D space: (0,0) and (1,1) are class 0, while (0,1) and (1,0) are class 1. However you draw it, no single straight line separates the 0s from the 1s. Logistic regression collapses on XOR.
The solution is surprisingly simple: add one more layer.
The first layer doesn't directly classify; it learns intermediate features. For example, one neuron learns to detect "at least one input is 1" (OR), and another learns "both inputs are 1" (AND). The second layer then classifies based on these intermediate features: it outputs 1 when the OR neuron fires and the AND neuron does not.
In the (OR, AND) feature space the four points become linearly separable. This is the essence of neural networks: a composition of multiple nonlinear transformations, each layer transforming the previous layer's representation into a more abstract form.
From Perceptron to Multi-Layer Network
A single perceptron is a linear classifier:
y = sigmoid(w1*x1 + w2*x2 + ... + b)
Stack them up: the first layer learns low-level features, the second layer learns feature combinations, and deeper layers learn more abstract patterns.
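To make the stacking concrete, here is a minimal sketch of XOR solved by two layers of perceptron-style units. The weights are hand-picked rather than learned, a hard threshold stands in for sigmoid, and the names `step` and `xor_net` are illustrative assumptions. OR and AND are one workable choice of hidden features, not the only one.

```python
def step(z):
    """Threshold activation: 1 if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0


def xor_net(x1, x2):
    # Hidden layer: two neurons with hand-picked (not learned) weights.
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    # Output layer: a single linear threshold on the two hidden features.
    return step(h_or - h_and - 0.5)


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", xor_net(x1, x2))
```

Each unit on its own is still a linear classifier. The network can separate XOR only because the output unit works on the hidden features instead of the raw inputs.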
import torch
import torch.nn as nn
import torch.optim as optim
class TwoLayerNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)   # input -> hidden
        self.activation = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)  # hidden -> class scores

    def forward(self, x):
        h = self.activation(self.fc1(x))
        # Return raw logits: CrossEntropyLoss applies log-softmax internally
        return self.fc2(h)

model = TwoLayerNet(784, 128, 10)
print(model)
All learnable parameters are in the Linear layers: each Linear layer has two parameter tensors, a weight W (stored by PyTorch with shape [out_features, in_features]) and a bias b (shape [out_features]). Parameters per layer = in_features * out_features + out_features, so this model has (784*128 + 128) + (128*10 + 10) = 101,770 parameters.
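The parameter arithmetic can be checked directly. This is a small sketch that assumes PyTorch is installed; `count_params` is a hypothetical helper, not a PyTorch function. Note that `nn.Linear` stores its weight as [out_features, in_features].

```python
import torch.nn as nn

# Same layer sizes as the TwoLayerNet(784, 128, 10) example.
fc1 = nn.Linear(784, 128)
fc2 = nn.Linear(128, 10)

# Weight is stored as [out_features, in_features]; bias as [out_features].
print(fc1.weight.shape, fc1.bias.shape)


def count_params(layer):
    # Sum the element counts of every learnable tensor in the layer.
    return sum(p.numel() for p in layer.parameters())


total = count_params(fc1) + count_params(fc2)
print(total)  # (784*128 + 128) + (128*10 + 10) = 101770
```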
Backpropagation—"Learning from mistakes, distributing blame backward"
Imagine training a neural network in the Model Workshop. You showed it a picture of "7" and it guessed "1." Wrong.
Now the question is: whose fault is it? Is the first hidden layer's weight off? Or is the output layer's bias wrong? Or is the activation function choice wrong?—You need to trace the "misclassification blame" backward, distributing it to every parameter.
That's what backpropagation does. It's the core mechanism of training, using the chain rule you learned in the Math Tower to compute the gradient of the loss with respect to every parameter—starting from the output layer and pushing backward layer by layer.
Manually tracing forward and backward through a simple network:
# Tiny network: one hidden neuron + one output neuron
# Forward:
# z1 = w1*x + b1
# a1 = relu(z1)
# z2 = w2*a1 + b2
# loss = 0.5 * (z2 - y)^2
# Backward: push backward from the loss
# d_loss/d_z2 = z2 - y
# d_loss/d_w2 = d_loss/d_z2 * a1
# d_loss/d_b2 = d_loss/d_z2
# d_loss/d_a1 = d_loss/d_z2 * w2
# d_loss/d_z1 = d_loss/d_a1 * (1 if z1 > 0 else 0) [relu derivative]
# d_loss/d_w1 = d_loss/d_z1 * x
# d_loss/d_b1 = d_loss/d_z1
# PyTorch does all this automatically for you
x = torch.randn(4, 784)
y = torch.randint(0, 10, (4,))
output = model(x)
loss = nn.CrossEntropyLoss()(output, y)
loss.backward()  # automatically computes all parameter gradients
The core insight of backpropagation is "gradient sharing": an intermediate gradient such as d_loss/d_z2 is needed by several later results (d_loss/d_w2, d_loss/d_b2, and d_loss/d_a1). By traversing the computation graph in reverse topological order, backpropagation computes each intermediate gradient only once and reuses it.
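The hand derivation for the tiny network can be checked numerically. The sketch below implements those same formulas in plain Python and compares them with a central finite difference. The input, target, and parameter values are illustrative assumptions, and `forward_backward` is a hypothetical helper name.

```python
def forward_backward(x, y, w1, b1, w2, b2):
    # Forward pass: one hidden ReLU neuron, one linear output neuron.
    z1 = w1 * x + b1
    a1 = max(z1, 0.0)
    z2 = w2 * a1 + b2
    loss = 0.5 * (z2 - y) ** 2
    # Backward pass: apply the chain rule from the loss back to each parameter.
    d_z2 = z2 - y                 # computed once, reused three times below
    d_w2 = d_z2 * a1
    d_b2 = d_z2
    d_a1 = d_z2 * w2
    d_z1 = d_a1 * (1.0 if z1 > 0 else 0.0)   # relu derivative
    d_w1 = d_z1 * x
    d_b1 = d_z1
    return loss, {"w1": d_w1, "b1": d_b1, "w2": d_w2, "b2": d_b2}


# Illustrative values (assumptions, not taken from the chapter).
params = {"w1": 0.5, "b1": 0.1, "w2": -0.3, "b2": 0.2}
x, y = 2.0, 1.0
loss, grads = forward_backward(x, y, **params)

# Central finite difference: nudge one parameter at a time and watch the loss.
eps = 1e-6
numeric = {}
for name in params:
    up = dict(params)
    up[name] += eps
    down = dict(params)
    down[name] -= eps
    numeric[name] = (forward_backward(x, y, **up)[0]
                     - forward_backward(x, y, **down)[0]) / (2 * eps)
    print(name, grads[name], numeric[name])
```

If the analytic and numeric columns agree to several decimal places, the derivation is very likely right. The same check is a handy way to debug hand-written gradients.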
Activation Functions vs Linear Layers: Why can't we use only linear layers?
Stacking two linear layers is equivalent to a single linear layer: W2*(W1*x) = (W2*W1)*x, and any biases fold into a single bias in the same way. Nonlinear activation functions break this degeneracy, which lets a network with enough hidden units approximate a very wide class of functions.
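The collapse is easy to see on a small example. This sketch uses plain Python lists and made-up 2×2 matrices (all values are illustrative assumptions). It compares two stacked linear maps, the single merged map, and the same stack with a ReLU in between.

```python
def matvec(A, v):
    # Matrix-vector product for a list-of-rows matrix.
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def matmul(A, B):
    # Matrix-matrix product A @ B for list-of-rows matrices.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def relu(v):
    return [max(x, 0.0) for x in v]


W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

two_layers = matvec(W2, matvec(W1, x))        # W2 * (W1 * x)
one_layer = matvec(matmul(W2, W1), x)         # (W2 * W1) * x
with_relu = matvec(W2, relu(matvec(W1, x)))   # W2 * relu(W1 * x)

print(two_layers, one_layer, with_relu)
```

The first two results match, so the extra linear layer adds no expressive power. The third differs because ReLU zeroes the negative hidden value before the second layer sees it.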
# Common activation functions
activation_fns = {
    'sigmoid': nn.Sigmoid(),           # output in (0, 1); gradient near zero at both tails (vanishing gradients)
    'tanh': nn.Tanh(),                 # output in (-1, 1); zero-centered, but still saturates at both tails
    'ReLU': nn.ReLU(),                 # output in [0, ∞); simple and fast, but zero gradient for negative inputs
    'LeakyReLU': nn.LeakyReLU(0.01),   # small slope on the negative side, mitigates "dead neurons"
    'GELU': nn.GELU(),                 # smooth ReLU-like curve, common in Transformers
}
ReLU is one of the most widely used activation functions in practice. Its gradient is exactly 1 on the positive side, so the activation itself neither shrinks nor amplifies the gradient, which helps avoid the exponential gradient decay that saturating activations cause in deep networks.
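A rough back-of-the-envelope sketch of why this matters. The chain rule multiplies one activation derivative per layer, so the code below multiplies ten of them together. It deliberately ignores the weight matrices, which also scale the gradient, and it evaluates sigmoid at its most favorable point. The depth of 10 is an arbitrary assumption.

```python
import math


def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


depth = 10
# Best case for sigmoid: its derivative peaks at z = 0, where it equals 0.25.
sigmoid_factor = sigmoid_grad(0.0) ** depth
# Any positive pre-activation gives ReLU a derivative of exactly 1.
relu_factor = relu_grad(1.0) ** depth

print(sigmoid_factor)  # 0.25 ** 10, roughly 9.5e-07
print(relu_factor)     # 1.0
```

Even in sigmoid's best case the activation terms alone shrink the gradient by about six orders of magnitude over ten layers. The ReLU terms leave it unchanged for units that are active.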
Optimizers
The gradient tells you "which direction to go"; the optimizer determines "how big a step to take" (learning rate) and "how to take smarter steps."
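# SGD: plain gradient descent
optimizer = optim.SGD(model.parameters(), lr=0.01)
# Adam: adaptive learning rate + momentum (this assignment replaces the SGD optimizer above; keep only one)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
# Training loop (reuses the toy batch x, y from the backpropagation example)
for epoch in range(100):
    optimizer.zero_grad()          # clear gradients left over from the previous step
    output = model(x)              # forward pass
    loss = criterion(output, y)
    loss.backward()                # backward pass: compute gradients
    optimizer.step()               # update parameters
Key optimizer comparison: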
| Optimizer | Core Mechanism | Use Case |
|---|---|---|
| SGD | Raw gradient | Simple tasks, need fine tuning |
| SGD+Momentum | Accumulates historical gradient direction | Avoid saddle point stagnation |
| RMSProp | Per-parameter adaptive learning rate | RNNs, non-stationary objectives |
| Adam | Momentum + RMSProp | General purpose, default recommendation |
Adam adjusts each parameter's learning rate using two statistics: first moment estimate of the gradient (momentum) and second moment estimate (adaptive scaling). For most deep learning tasks, Adam(lr=0.001) is a usable starting point.
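To see the two statistics at work, here is a minimal single-parameter sketch of the standard Adam update with bias correction. `adam_step` is a hypothetical helper written for illustration, not PyTorch's implementation. Its default hyperparameters mirror the usual Adam defaults.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: exponential moving average of the gradient (momentum).
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponential moving average of the squared gradient.
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Scale the step by the root of the second moment (adaptive scaling).
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Two very different gradient magnitudes, same first step size.
step_sizes = {}
for g in (0.01, 100.0):
    p, m, v = adam_step(1.0, g, m=0.0, v=0.0, t=1)
    step_sizes[g] = 1.0 - p
    print(g, step_sizes[g])
```

On the first step both gradients move the parameter by roughly `lr`, even though they differ by a factor of 10,000. That insensitivity to gradient scale is a large part of why one default learning rate works across many tasks.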
Common Pitfalls
- Weight initialization is important: zero initialization causes all neurons to learn the same features (symmetry). Recommended: Xavier/Glorot or He initialization.
- Loss function selection: regression → MSELoss, binary classification → BCEWithLogitsLoss, multi-class → CrossEntropyLoss (includes softmax, no need to add separately).
- Learning rate too high causes loss to diverge (NaN), too low barely converges. Observe the loss curve—if oscillating, reduce learning rate.
- BatchNorm: before or after the activation function? There's debate in practice—recommended before activation, after the fully connected layer.
- Batch size impact: very large batches give smooth gradient estimates and fast epochs but can generalize worse unless the learning rate is retuned; very small batches give noisy gradients. 32~256 are common choices.
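The symmetry problem from the first pitfall can be reproduced on a tiny network. This sketch uses one input, two tanh hidden units, one output, and no biases, all in plain Python. The architecture, the values, and the helper name `hidden_grads` are illustrative assumptions.

```python
import math


def hidden_grads(w_hidden, w_out, x, y):
    """Gradient of 0.5 * (pred - y)**2 with respect to each hidden weight."""
    a = [math.tanh(w * x) for w in w_hidden]            # hidden activations
    pred = sum(wo * ai for wo, ai in zip(w_out, a))     # linear output
    d_pred = pred - y
    # Chain rule: d_pred * output weight * tanh derivative * input.
    return [d_pred * wo * (1 - ai * ai) * x for wo, ai in zip(w_out, a)]


# All-zero init: the hidden gradients are zero, so the hidden weights cannot move yet.
zero_init = hidden_grads([0.0, 0.0], [0.0, 0.0], x=1.0, y=1.0)
# Identical non-zero init: both units get exactly the same gradient.
same_init = hidden_grads([0.5, 0.5], [0.5, 0.5], x=1.0, y=1.0)
# Distinct init: the gradients differ, so the units can specialize.
distinct_init = hidden_grads([0.5, -0.3], [0.5, 0.8], x=1.0, y=1.0)

print(zero_init, same_init, distinct_init)
```

Units that start identical receive identical updates and therefore stay identical, which wastes the hidden layer's capacity. Random schemes such as Xavier/Glorot or He initialization break that tie.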
Pass Challenges
- Warm-up (15 min): On paper, manually compute backpropagation for a 2-2-1 network (2 inputs → hidden layer of 2 ReLU units → 1 output). Given input (x1,x2)=(1,0), target y=1, and initial weights of your own choosing, compute the gradient of the loss with respect to w11.
- Challenge (40 min): Use PyTorch to train a three-layer network (784→512→256→10) on MNIST, using ReLU + Adam. Reach 97%+ test accuracy.
- Observation: Train several runs, each with a fixed learning rate: 0.1, 0.01, 0.001, 0.0001. Plot the loss curves and observe which learning rate causes oscillation, which is too slow, and which works best.
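Before running the learning-rate sweep on MNIST, the three regimes can be previewed on a toy problem. This sketch minimizes f(w) = w² with plain gradient descent. The function, the learning rates, and the helper name `run_gd` are illustrative assumptions, and the thresholds for a real network will be different.

```python
def run_gd(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the loss history."""
    w = w0
    history = []
    for _ in range(steps):
        history.append(w * w)
        w -= lr * 2 * w   # the gradient of w**2 is 2w
    return history


for lr in (1.1, 0.4, 0.001):
    losses = run_gd(lr)
    print(f"lr={lr}: first loss {losses[0]:.3g}, last loss {losses[-1]:.3g}")
```

With lr=1.1 every step overshoots the minimum by more than it corrects, so the loss grows. With lr=0.4 the loss collapses within a few steps, and with lr=0.001 it has barely moved after 20 steps.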
Traveler's Notes
A neural network is a stack of linear transformations interleaved with nonlinear activations. Backpropagation uses the chain rule to enable end-to-end learning across the entire network. The choice of optimizer affects how quickly, and how well, training reaches a "good solution." The specific techniques in all three areas keep evolving, but the core framework remains unchanged.
-> Next Chapter Preview
Fully connected networks have limited performance on images and sequences. Next chapter introduces two major breakthroughs in deep learning: CNN and Transformer.