Metadata Card
- Prerequisites: Chapter 10 — Linear Algebra
- Estimated time: 60 minutes
- Core difficulty: Intermediate
- Completion marker: Understand the meaning of derivatives and integrals, grasp the idea of gradient descent
Your Progress
The walls of the second floor of the Applied Tower flow — not static symbols, but curves and tangents. "Change isn't a series of jumps. Objects in freefall change velocity continuously — calculus is the tool for analyzing this kind of continuous change," says the Librarian.
Your Task
Optimizing loss functions in machine learning, simulating motion in physics engines, controlling error in numerical analysis — in these scenarios, things don't change discretely but flow continuously. Calculus tells you: at a given instant, how fast is it changing (derivative), and the total amount of change over an interval (integral). More critically, it gives you the method to "find the optimum" (gradient descent).
Chapter Layers
- Required reading: Derivatives and differentiation, integral concepts, gradients and optimization
- Optional reading: Multivariable calculus, Lagrange multipliers
Breakthrough · Origin Story
The Librarian's words strike a chord. You look down at the flowing floor beneath you — curves shifting before your eyes — and ask yourself: 'If I were building a catapult to maximize range, how does the range change for each degree of elevation?'
This is the core problem of calculus: when one thing changes, how does another follow? You don't need to understand neural networks first to grasp this; picturing a walk down a hillside is enough.
You're standing on a mountain and your goal is the lowest point in the valley. You don't know where the bottom is, but with each step you can feel the slope beneath your feet. Steep slope, take a big step; gentle slope, take a small step — this is the core intuition behind gradient descent.
This process will appear repeatedly in Vol 13 on neural networks. But today, you're looking at the mathematics alone: how do you describe the 'slope' of a function? How do you find the bottom of the slope?
Start with the simplest case.
The derivative f'(x) is the instantaneous rate of change of the function f at the point x. Geometrically, it is the slope of the tangent line to the curve at x.
f'(x) = lim_{h→0} (f(x+h) - f(x)) / h
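To see the limit at work, here is a minimal sketch that evaluates the difference quotient for shrinking values of h. The function `square` and the point x = 3 are illustrative choices, not part of the definition. For this function the quotient works out algebraically to 6 + h, so the printed values should approach the exact derivative 6.

```python
def difference_quotient(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h

def square(x):
    # Illustrative function; its exact derivative is 2 * x.
    return x * x

# The exact derivative of x**2 at x = 3 is 2 * 3 = 6.
for h in (1.0, 0.1, 0.01, 0.001):
    print(f"h = {h:<6} quotient = {difference_quotient(square, 3.0, h):.6f}")
```

The secant slope is only an approximation for any fixed h; the derivative is what it tends to as h shrinks.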
A few basic derivative rules:
- f(x) = xⁿ → f'(x) = nxⁿ⁻¹
- f(x) = sin(x) → f'(x) = cos(x)
- f(x) = eˣ → f'(x) = eˣ (up to a constant multiple, the only function that is its own derivative)
- Chain rule: (f(g(x)))' = f'(g(x)) × g'(x) — this is the principle behind neural network backpropagation
The significance of the chain rule for all of deep learning: each layer of a neural network is a function; forward propagation is function composition; backpropagation uses the chain rule to compute gradients layer by layer from output back to input.
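The forward-then-backward idea can be sketched on a toy composition, sin(x²). This is a simplified illustration, not how a real framework is implemented: the `layers` list, `forward`, and `backward` are hypothetical names, and each "layer" is just a scalar function paired with its derivative.

```python
import math

# Each hypothetical "layer" is a pair: (function, derivative of that function).
layers = [
    (lambda x: x * x, lambda x: 2 * x),  # g(x) = x^2
    (math.sin, math.cos),                # f(u) = sin(u)
]

def forward(x):
    """Apply the layers in order, remembering the input each one received."""
    inputs = []
    for func, _ in layers:
        inputs.append(x)
        x = func(x)
    return x, inputs

def backward(inputs):
    """Multiply local derivatives from the last layer back to the first."""
    grad = 1.0
    for (_, deriv), layer_input in zip(reversed(layers), reversed(inputs)):
        grad *= deriv(layer_input)
    return grad

x0 = 1.5
output, inputs = forward(x0)
chain_rule_grad = backward(inputs)

# Independent check with a small central difference.
h = 1e-6
numeric_grad = (forward(x0 + h)[0] - forward(x0 - h)[0]) / (2 * h)
print(chain_rule_grad, numeric_grad)
```

The backward pass reuses the inputs saved during the forward pass, which is why the gradient costs roughly one extra sweep through the layers rather than a fresh derivation.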
Integration is the inverse operation of differentiation: it finds the "accumulated quantity." The definite integral of f(x) over the interval [a, b] equals the signed area between the curve and the x-axis (area below the axis counts as negative).
∫ₐᵇ f(x) dx = F(b) - F(a), where F' = f (the Newton-Leibniz formula)
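The formula can be checked numerically by adding up thin rectangles under the curve. The sketch below assumes f(x) = x² on [0, 1], with antiderivative F(x) = x³/3; the helper `midpoint_sum` is a hypothetical name for a basic midpoint rule.

```python
def midpoint_sum(f, a, b, n):
    """Approximate the area under f on [a, b] with n midpoint rectangles."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

def f(x):
    return x * x

def F(x):
    # An antiderivative of f: F'(x) = x**2.
    return x ** 3 / 3

exact = F(1.0) - F(0.0)
for n in (10, 100, 1000):
    approx = midpoint_sum(f, 0.0, 1.0, n)
    print(f"n = {n:<5} approx = {approx:.8f}  error = {abs(approx - exact):.2e}")
```

As the rectangles get thinner, the sum should close in on F(1) - F(0) = 1/3.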
In computer science, integration appears in probability theory for calculating the cumulative area under a probability density function, and in traffic monitoring for calculating "accumulated traffic" over time.
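As an illustration of the probability case, the sketch below integrates a probability density over an interval to get a probability. The standard normal density and the interval [-1, 1] are assumptions chosen for the example; the result is compared against the closed form based on `math.erf`.

```python
import math

def normal_pdf(x):
    """Density of the standard normal distribution (mean 0, variance 1)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def integrate(f, a, b, n=1000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

# Probability that a standard normal value falls within one standard deviation.
prob = integrate(normal_pdf, -1.0, 1.0)
reference = math.erf(1 / math.sqrt(2))
print(f"numeric = {prob:.6f}, reference = {reference:.6f}")
```

Both numbers should land near the familiar "about 68%" figure for one standard deviation.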
The gradient generalizes the derivative to multivariable functions: it is a vector whose components are the partial derivatives of the function with respect to each variable.
∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]ᵀ
The gradient points in the direction of the steepest increase of the function; its negative points in the direction of steepest decrease.
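A gradient can be approximated one coordinate at a time by nudging each variable and watching the output. The sketch below assumes an illustrative function f(x, y) = x² + 3y², whose exact gradient is [2x, 6y]; `numerical_gradient` and `bowl` are hypothetical names.

```python
def numerical_gradient(f, point, h=1e-6):
    """Approximate each partial derivative with a central difference."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad

def bowl(p):
    # Illustrative function f(x, y) = x^2 + 3y^2.
    x, y = p
    return x ** 2 + 3 * y ** 2

point = [1.0, 2.0]
grad = numerical_gradient(bowl, point)
print(grad)  # close to the exact gradient [2x, 6y] = [2, 12]

# A small step along the gradient raises f; a step against it lowers f.
step = 1e-3
uphill = [p + step * g for p, g in zip(point, grad)]
downhill = [p - step * g for p, g in zip(point, grad)]
print(bowl(uphill) > bowl(point) > bowl(downhill))
```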
Gradient descent is the foundational optimization algorithm in machine learning:
w_{t+1} = w_t - η × ∇f(w_t)
where η is the learning rate, the size of each step. If η is too large, you overshoot the minimum; if it is too small, convergence is slow.
This single equation is the core of the deep learning training engine. Whether the optimizer you use is called SGD, Adam, or RMSprop, it builds on this update by adding momentum or adaptive step sizes. This is where calculus makes its most direct landing in programming.
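The update rule w_{t+1} = w_t - η × ∇f(w_t) translates almost line for line into code. The sketch below is a bare-bones version under stated assumptions: the function f(w) = (w - 3)² is an illustrative choice with its minimum at w = 3, and the three learning rates are picked only to show slow, healthy, and diverging behavior.

```python
def gradient_descent(grad, w0, learning_rate, steps):
    """Repeat w <- w - learning_rate * grad(w) and return the final w."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

def grad_f(w):
    # Derivative of the illustrative function f(w) = (w - 3)**2, minimum at w = 3.
    return 2 * (w - 3)

for eta in (0.001, 0.1, 1.1):
    w_final = gradient_descent(grad_f, w0=0.0, learning_rate=eta, steps=50)
    print(f"eta = {eta:<6} w after 50 steps = {w_final:.4f}")
```

For this particular function each step multiplies the distance to the minimum by (1 - 2η), so η = 0.1 shrinks it steadily, η = 0.001 barely moves, and η = 1.1 makes it grow with every step.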
Optimization theory in brief: For convex functions, gradient descent with a suitably small learning rate converges to the global optimum. For non-convex functions (like the loss surface of neural networks), it may settle at a local optimum or stall near a saddle point, but in practice a "local optimum" is often good enough.
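The difference between convex and non-convex cases shows up as sensitivity to the starting point. The sketch below assumes a hypothetical non-convex function f(x) = x⁴ - 3x² + x, which has two valleys of different depth, and runs the same descent from two starts.

```python
def f(x):
    # Hypothetical non-convex function with two valleys of different depth.
    return x ** 4 - 3 * x ** 2 + x

def grad_f(x):
    return 4 * x ** 3 - 6 * x + 1

def descend(x, learning_rate=0.01, steps=500):
    """Plain gradient descent from starting point x."""
    for _ in range(steps):
        x = x - learning_rate * grad_f(x)
    return x

left = descend(-2.0)
right = descend(2.0)
print(f"start -2.0 -> x = {left:.3f}, f(x) = {f(left):.3f}")
print(f"start +2.0 -> x = {right:.3f}, f(x) = {f(right):.3f}")
```

The two runs should end in different valleys, with the run that starts on the left reaching the lower value of f. On a convex function, both starts would arrive at the same point.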
Common Pitfalls
- Neglecting the chain rule. Backpropagation is repeated application of the chain rule. Without understanding the chain rule, you can't understand why backprop "efficiently computes gradients" (rather than re-deriving from scratch every time).
- Confusing differentiability and continuity. Differentiability implies continuity, but continuity does not imply differentiability (e.g., f(x) = |x| is continuous but not differentiable at x=0). This has practical implications for the ReLU activation function: ReLU is not differentiable at x=0, so in practice its derivative at that point is set by convention to 0 (or sometimes 1).
- Thinking calculus is "just pure math." Gradient descent is the accelerator of the deep learning engine; integration is the basis of probability models making predictions. Calculus is everywhere in computer science.
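The ReLU pitfall above can be seen numerically: at x = 0 the slope measured from the right and the slope measured from the left disagree, so no single derivative exists there. In this sketch `relu_grad` is a hypothetical helper, and returning 0 at x = 0 is an assumed convention rather than the only valid one.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Convention assumed here: use 0 at x == 0 (1 would be an equally valid choice).
    return 1.0 if x > 0 else 0.0

h = 1e-6
right_slope = (relu(0.0 + h) - relu(0.0)) / h  # slope approaching from the right
left_slope = (relu(0.0) - relu(0.0 - h)) / h   # slope approaching from the left
print(right_slope, left_slope)
print(relu_grad(-2.0), relu_grad(0.0), relu_grad(2.0))
```

Away from zero the two one-sided slopes agree, which is why a fixed convention at that single point rarely matters in training.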
Challenge Questions
- Compute the derivative of f(x) = 3x² + 2x + 1.
- Use numerical methods in Python to verify convergence of the derivative formula f'(x) ≈ (f(x+h) - f(x-h)) / (2h) as h → 0.
- Implement univariate gradient descent to find the minimum of f(x) = x² + 3x + 2. Verify on paper that the minimum occurs at x = -1.5.
Traveler's Notes
Derivatives tell you "how fast"; integrals tell you "how much." The gradient lets you find the direction of descent across hundreds of dimensions. These are the mathematical core of optimization — finding "the best one."
→ Next stop: We move toward two areas that are both more abstract and more practical: information theory explains how you "quantify" information, and numerical computation explains how computers survive amidst error.