
Metadata Card

  • Prerequisites: ch09 Neural Networks Foundations
  • Estimated time: 60 minutes
  • Core difficulty: In-depth
  • Reading mode: High focus
  • Completion: Able to understand the core mechanisms of CNN, RNN, Attention, and Transformer

Your Progress

The fully connected network runs in the Model Workshop and can recognize handwritten digits.

But when you throw a 256×256 image at it, the parameter count jumps from tens of thousands to millions. Training for a day still hasn't converged.

You realize: fully connected networks don't exploit data structure—adjacent pixels should share parameters, and temporal information should be processed sequentially. Deep in the Model Workshop, two more advanced engines await: convolution and attention.

Your Task

The core breakthroughs in deep learning come from structured network design: Convolution (CNN) leverages spatial locality, Recurrent (RNN) leverages temporal dependencies, and Attention leverages global relevance. Transformer pushes attention to its extreme, becoming the cornerstone of modern generative AI.

Chapter Layers

  • Required: Convolution operation, self-attention mechanism, Transformer architecture
  • Optional: LSTM gating mechanism, residual connections
  • Advanced: Mathematical principles of positional encoding, multi-head attention computation

Breaking Through · Tracing the Origin

Every neuron in a fully connected network sees the "entire image"—a pixel in one corner is treated as equally important as a pixel in another corner. This is wasteful for images: a cat's ears and eyes often appear in adjacent regions, and your network should exploit this spatial structure.

CNN's solution: let each neuron only see a local window (receptive field), and different local windows share the same set of parameters (convolution kernel/filter). So a "vertical edge detector" uses the same weights everywhere in the image, drastically reducing parameters.

Convolutional Neural Networks

Convolution operation: a small kernel (e.g., 3×3) slides across the entire image, performing element-wise multiplication and summation at each position.
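To make the sliding window concrete, here is a minimal sketch that applies a 3×3 kernel by hand and compares the result with PyTorch's `F.conv2d`. The 5×5 image and the vertical-edge kernel are made-up illustration values, and `conv2d_naive` is a hypothetical helper written for this example, not a library function.

```python
import torch
import torch.nn.functional as F

def conv2d_naive(image, kernel):
    """Stride-1 sliding window with no padding over a single 2-D image."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = torch.zeros(H - kH + 1, W - kW + 1)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + kH, j:j + kW]      # local receptive field
            out[i, j] = (window * kernel).sum()     # multiply element-wise, then sum
    return out

# Illustration data: dark columns (0) on the left, bright columns (1) on the right
image = torch.tensor([[0., 0., 1., 1., 1.]] * 5)
# Assumed "vertical edge detector": responds where brightness rises left to right
kernel = torch.tensor([[-1., 0., 1.]] * 3)

manual = conv2d_naive(image, kernel)
# F.conv2d expects (batch, channels, H, W) inputs and (out_ch, in_ch, kH, kW) weights
builtin = F.conv2d(image[None, None], kernel[None, None])[0, 0]

print(manual)
print(torch.allclose(manual, builtin))
```

The same nine weights are reused at every position, which is the parameter sharing described above. Note that `F.conv2d` computes cross-correlation (the kernel is not flipped), which is also what the loop does.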

Fully connected networks treat each pixel as an independent feature, ignoring relationships between neighboring pixels. CNN solves this with local receptive fields and parameter sharing—a single 3×3 convolution kernel slides across the entire image, reducing parameters from hundreds of thousands to just hundreds. This is the core engine of the image workbench in the Model Workshop.

python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc = nn.Linear(64 * 7 * 7, 10)  # MNIST: 28x28 -> after pooling 7x7

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))  # 28→14
        x = self.pool(torch.relu(self.conv2(x)))  # 14→7
        x = x.view(x.size(0), -1)
        return self.fc(x)

model = SimpleCNN()
print(f"Parameters: {sum(p.numel() for p in model.parameters())}")

Compare with a fully connected network: for the same 28×28 MNIST image, the first FC layer alone (784 inputs, 128 hidden units) has 784×128 ≈ 100K weights, while the CNN's first conv layer (conv1) has only 1×3×3×32 + 32 = 320 parameters.
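The comparison above can be checked directly by counting parameters. This is a small sketch; the hidden width of 128 is taken from the figure quoted in the text, and `count_params` is a hypothetical helper.

```python
import torch.nn as nn

def count_params(module):
    """Total number of learnable scalars (weights and biases) in a module."""
    return sum(p.numel() for p in module.parameters())

# First fully connected layer: 28*28 = 784 inputs -> 128 hidden units
fc_first = nn.Linear(28 * 28, 128)
# First conv layer of SimpleCNN: 1 input channel, 32 kernels of size 3x3
conv_first = nn.Conv2d(1, 32, kernel_size=3, padding=1)

fc_params = count_params(fc_first)        # 784*128 weights + 128 biases
conv_params = count_params(conv_first)    # 1*3*3*32 weights + 32 biases

print(f"FC first layer:   {fc_params}")
print(f"Conv first layer: {conv_params}")
```

The conv layer's count does not depend on the image size at all, whereas the FC layer's count grows with every extra input pixel.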

Key evolution in CNN design: residual connections (ResNet) solve the degradation problem in deep networks—direct skip-connections allow gradients to flow to shallow layers without going through nonlinearities.
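A skip-connection is easy to see in code. The block below is a simplified sketch, not the exact ResNet block: it omits batch normalization and keeps the channel count fixed so the input can be added back without a projection. `ResidualBlock` is a name chosen for this example.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convs plus an identity shortcut: out = relu(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = torch.relu(self.conv1(x))
        out = self.conv2(out)
        # The shortcut adds the input back, so the block only has to learn
        # a correction F(x) on top of the identity mapping.
        return torch.relu(out + x)

block = ResidualBlock(channels=32)
x = torch.randn(4, 32, 14, 14)   # (batch, channels, height, width)
y = block(x)
print(y.shape)
```

Because of the `out + x` term, the gradient of the output with respect to `x` always contains a direct path that bypasses both convolutions, which is what keeps very deep stacks trainable.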

Recurrent Neural Networks

Sequential data (text, audio, time series) needs "memory"—the meaning of the current input depends on previous context. RNN receives an input and the previous time step's hidden state at each time step, outputting a new hidden state.
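The recurrence described above can be unrolled by hand. The sketch below reuses the weights of a small `nn.RNN` layer (sizes chosen arbitrarily for illustration) and applies h_t = tanh(x_t·W_ih^T + b_ih + h_{t-1}·W_hh^T + b_hh) one time step at a time, then compares the result with the layer's own output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=3, batch_first=True)
x = torch.randn(2, 5, 4)                 # (batch, time steps, features)

with torch.no_grad():
    h = torch.zeros(2, 3)                # initial hidden state
    manual_steps = []
    for t in range(x.size(1)):
        # New hidden state depends on the current input and the previous state
        h = torch.tanh(
            x[:, t] @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
            + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0
        )
        manual_steps.append(h)
    manual_output = torch.stack(manual_steps, dim=1)   # (batch, time, hidden)

    output, last_hidden = rnn(x)

print(torch.allclose(manual_output, output, atol=1e-5))
```

The loop cannot be parallelized over `t`, because each step needs the hidden state produced by the previous one.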

python
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, vocab_size, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 128)
        self.rnn = nn.RNN(128, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        x = self.embedding(x)
        output, hidden = self.rnn(x, hidden)
        return self.fc(output), hidden

RNN problem: gradients must backpropagate through time (BPTT), and the chain rule causes gradients to vanish or explode exponentially. LSTM (Long Short-Term Memory) alleviates this with three gates (input gate, forget gate, output gate) and an independent memory cell.
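The three gates and the memory cell can be written out in a few lines. This sketch computes one LSTM step by hand with the weights of an `nn.LSTMCell` and checks it against the built-in result. The sizes are arbitrary illustration values; the gate order (input, forget, candidate, output) follows how PyTorch stacks the weight matrices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_dim, hidden_dim = 4, 3
cell = nn.LSTMCell(input_dim, hidden_dim)

x = torch.randn(2, input_dim)      # input at the current time step
h = torch.randn(2, hidden_dim)     # previous hidden state
c = torch.randn(2, hidden_dim)     # previous memory cell

with torch.no_grad():
    gates = (x @ cell.weight_ih.T + cell.bias_ih
             + h @ cell.weight_hh.T + cell.bias_hh)
    i, f, g, o = gates.chunk(4, dim=1)
    i = torch.sigmoid(i)           # input gate: how much new content to write
    f = torch.sigmoid(f)           # forget gate: how much old memory to keep
    o = torch.sigmoid(o)           # output gate: how much memory to expose
    g = torch.tanh(g)              # candidate content for the memory cell

    c_next = f * c + i * g         # additive update of the memory cell
    h_next = o * torch.tanh(c_next)

    h_ref, c_ref = cell(x, (h, c))

print(torch.allclose(h_next, h_ref, atol=1e-5), torch.allclose(c_next, c_ref, atol=1e-5))
```

The key line is `c_next = f * c + i * g`: the memory is updated by addition rather than by repeated matrix multiplication, so when the forget gate stays close to 1 the gradient along the cell state shrinks much more slowly.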

Attention Mechanism

Attention lets the model selectively focus on different positions when processing sequences. RNN can only "read" sequences sequentially—attention allows the model to "look back" at any position when computing the current position's output.

python
# Scaled Dot-Product Attention
import torch.nn.functional as F

def attention(query, key, value, mask=None):
    """Attention computation: Q*K^T / sqrt(d_k) -> softmax -> *V"""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / (d_k ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return weights @ value

Core formula: Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) * V

  • Q (Query): the query vector at the current position, asking "what should I focus on"
  • K (Key): the key vector at each position, telling Q "I have this information here"
  • V (Value): the actual information at each position

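Dot-product attention measures the similarity between Q and every K, turns the scaled scores into weights with softmax (each row sums to 1), and then uses those weights to compute a weighted sum of the V vectors. The result is the context vector for that position.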
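A short usage sketch helps to see the weights themselves. The function below repeats the computation of `attention()` but also returns the weight matrix (`attention_with_weights` is a name introduced for this example), and it is called with a causal mask so that each position can only look back at itself and earlier positions. The tensors are random illustration data.

```python
import torch
import torch.nn.functional as F

def attention_with_weights(query, key, value, mask=None):
    """Scaled dot-product attention that also returns the attention weights."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / (d_k ** 0.5)
    if mask is not None:
        # Positions where mask == 0 get -inf, so softmax assigns them weight 0
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return weights @ value, weights

torch.manual_seed(0)
seq_len, d_k = 4, 8
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)
v = torch.randn(seq_len, d_k)

# Lower-triangular mask: position t may attend to positions 0..t only
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
context, weights = attention_with_weights(q, k, v, mask=causal_mask)

print(weights)               # upper triangle is zero
print(weights.sum(dim=-1))   # every row sums to 1
```

The first position can only see itself, so its weight row is a single 1 and its context vector equals its own value vector.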

Transformer

Transformer = remove RNN, use attention everywhere. The 2017 paper "Attention Is All You Need" caused the entire NLP field to rewrite its foundations.

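Transformer's core innovation: replacing recurrence with self-attention. Tokens are no longer processed one at a time; all tokens directly "see" each other. This brings two benefits: parallel training (all positions in a sequence are computed at once, which is typically far faster on GPUs than stepping through an RNN) and long-range dependencies (any two positions are connected by a single attention step instead of a long chain of hidden states).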

python
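import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        # batch_first=True: inputs and outputs are (batch, seq_len, d_model),
        # matching the batch_first=True convention used by SimpleRNN
        self.attention = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout, batch_first=True
        )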
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )

    def forward(self, x, mask=None):
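        # Multi-head self-attention + residual connection + layer normalization.
        # Note: for nn.MultiheadAttention a boolean attn_mask marks positions that
        # must NOT be attended to (True = blocked), the opposite of the
        # `mask == 0` convention in attention() above.
        attn_out, _ = self.attention(x, x, x, attn_mask=mask)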
        x = self.norm1(x + attn_out)
        # Feed-forward network + residual connection
        ffn_out = self.ffn(x)
        x = self.norm2(x + ffn_out)
        return x

What Transformer changes relative to RNN:

  • Parallel computation: RNN must process sequentially through time steps. Transformer processes the entire sequence in parallel, so training makes much better use of GPUs; the actual speedup depends on sequence length and hardware.
  • Long-range dependencies: RNN's memory of long sequences decays with distance. Transformer's attention directly connects any two positions.
  • Multi-head: N heads compute N sets of attention in parallel, and each head can learn to focus on a different type of relevance (syntax, semantics, anaphora, etc.).
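The multi-head idea in the last bullet amounts to a reshape: the model dimension is split into `n_heads` smaller slices, and attention runs on each slice independently. The sketch below shows that mechanism by hand. `split_heads` is a hypothetical helper, the projection layers are freshly initialized (untrained), and the final output projection that a full implementation applies after merging is left out for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def split_heads(x, n_heads):
    """(batch, seq, d_model) -> (batch, n_heads, seq, d_head)."""
    batch, seq_len, d_model = x.shape
    return x.view(batch, seq_len, n_heads, d_model // n_heads).transpose(1, 2)

torch.manual_seed(0)
batch, seq_len, d_model, n_heads = 2, 8, 128, 4
x = torch.randn(batch, seq_len, d_model)

# Separate learned projections for queries, keys, and values
w_q, w_k, w_v = (nn.Linear(d_model, d_model) for _ in range(3))
q, k, v = (split_heads(w(x), n_heads) for w in (w_q, w_k, w_v))

# Each head gets its own (seq x seq) weight matrix over a d_head-sized slice
scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
weights = F.softmax(scores, dim=-1)        # (batch, n_heads, seq, seq)
heads = weights @ v                        # (batch, n_heads, seq, d_head)

# Concatenate the heads back into a single d_model-sized vector per token
merged = heads.transpose(1, 2).reshape(batch, seq_len, d_model)

print(weights.shape, merged.shape)
```

With d_model=128 and 4 heads, each head works in a 32-dimensional subspace, so the total cost is similar to one full-width attention while allowing four different weighting patterns.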

Transformer cost: self-attention has O(n^2) time and memory complexity, where n is the sequence length, because every token is compared with every other token. Long texts (like entire books) require sparse attention or approximation methods.
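A back-of-the-envelope calculation shows how quickly the quadratic term grows. The sketch counts only the score matrix for one attention head, one layer, and one sample, stored as float32 (4 bytes per value); these are simplifying assumptions, and real memory use also depends on batch size, number of heads and layers, and the attention implementation. `attention_scores_bytes` is a hypothetical helper.

```python
def attention_scores_bytes(seq_len, n_heads=1, bytes_per_value=4):
    """Memory for the (seq_len x seq_len) score matrices of one layer, one sample."""
    return seq_len * seq_len * n_heads * bytes_per_value

for n in (512, 4096, 32768):
    mib = attention_scores_bytes(n) / 2**20
    print(f"n={n:>6}: {n * n:>13,} scores, {mib:>6,.0f} MiB per head in float32")
```

Doubling the sequence length quadruples the matrix: going from 512 to 32768 tokens is 64 times longer, but the score matrix becomes 4096 times larger.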

Common Pitfalls

  • CNN's receptive field grows with depth: a single 3×3 conv has a 3×3 receptive field; two stacked 3×3 layers (stride 1) are equivalent to 5×5; N stacked layers are equivalent to (2N+1)×(2N+1). Pooling, strides, and dilation enlarge it further.
  • RNN gradient explosion: gradient clipping is the standard treatment. Gradient vanishing: switch to LSTM or Transformer.
  • Self-attention has no position information: the original Transformer injects it with fixed sine-cosine positional encodings, while BERT and GPT-2 both use learnable absolute position embeddings.
  • LayerNorm vs BatchNorm: LayerNorm normalizes across feature dimensions for each sample, suitable for NLP (variable sequence lengths); BatchNorm normalizes across samples for each feature dimension, suitable for CV (stable batch sizes).
  • Dropout is only enabled during training, disabled during inference—PyTorch's nn.Dropout handles this automatically (model.train() vs model.eval()).
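The positional-encoding pitfall above can be demonstrated numerically. In the sketch below, self-attention uses Q = K = V = x with no learned projections (a simplification for illustration), and the sine-cosine encoding follows the formula from the original Transformer paper: PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)). The function names and the random input are assumptions for this example.

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x):
    """Single-head self-attention with Q = K = V = x (no learned projections)."""
    scores = x @ x.transpose(-2, -1) / math.sqrt(x.size(-1))
    return F.softmax(scores, dim=-1) @ x

def sinusoidal_encoding(seq_len, d_model):
    """Fixed sine-cosine positional encoding of shape (seq_len, d_model)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)     # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                  # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

torch.manual_seed(0)
x = torch.randn(6, 16)                        # 6 tokens, d_model = 16
perm = torch.tensor([5, 4, 3, 2, 1, 0])       # reverse the token order

# Without position information, reordering the tokens only reorders the outputs
order_blind = torch.allclose(self_attention(x)[perm],
                             self_attention(x[perm]), atol=1e-5)

# With positional encoding added, the same tokens in a new order give new outputs
pe = sinusoidal_encoding(6, 16)
still_blind = torch.allclose(self_attention(x + pe)[perm],
                             self_attention(x[perm] + pe), atol=1e-5)

print(order_blind, still_blind)
```

The first check shows that plain self-attention treats a sentence as a bag of tokens; adding the encoding to the input is what makes word order visible to the model.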

Pass Challenges

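  • Warm-up (10 min): Train SimpleCNN on MNIST. Set kernel sizes to 1, 3, 5, 7 (adjust padding to kernel_size // 2 so the feature maps stay 28×28 and 14×14 and the fc layer still fits) and observe changes in parameter count and accuracy.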
  • Challenge (45 min): Implement a simplified Transformer (2-layer encoder, d_model=128, n_heads=4) with PyTorch and train it on IMDB sentiment classification. Compare convergence speed between RNN and Transformer.
  • Observation: Print the attention weight matrix for a sentence (8 words, 8×8 matrix) and observe which words put high attention on which positions. In a trained model, a pronoun such as "it" often puts high weight on its referent, e.g. "the cat."

Traveler's Notes

CNN encodes image structure (locality, translation invariance) into the network architecture. Transformer pushes attention to its extreme, removing the sequentiality assumption and making parallel training possible. On top of these, modern generative AI begins to flourish.

-> Next Chapter Preview

Transformer enables large-scale pre-training—from BERT to GPT, from Tokenization to Prompt Engineering. Next chapter enters the world of modern generative AI.
