Metadata Card
- Prerequisites: ch04 ML Foundations, Vol 10 Linear Algebra & Calculus
- Estimated time: 50 minutes
- Core difficulty: Advanced
- Reading mode: High focus
- Completion: Able to implement linear regression and logistic regression with gradient descent, and explain the effects of regularization
Your Progress
The data workbench in the Model Workshop is covered with data: house area vs. price, advertising spend vs. click-through rate, study time vs. exam scores.
Each dataset has a trend line hidden within—a line connecting data to predictions. You pick up the tools, starting from the simplest point: finding a formula that maps input to output.
Your Task
Linear models assume the output is a linear combination of inputs. This assumption works for both regression and classification: linear regression predicts continuous values, logistic regression outputs probabilities. You start with the closed-form solution of least squares, transition to gradient descent, and introduce regularization for high-dimensional data.
Chapter Layers
- Required: Linear regression (closed-form + gradient descent), logistic regression, L1/L2 regularization
- Optional: Softmax regression, generalized linear models
- Advanced: Convergence of linear models from a convex optimization perspective
Breaking Through · Tracing the Origin
Your friend recorded some data: daily study time and exam scores. Your intuition says they're related—more study time, higher scores. But what's the exact relationship between "more" and "higher"? You need a line to fit these points.
Linear regression does this: find a straight line y = wx + b that minimizes the sum of squared vertical distances from all points to the line.
Linear Regression
Least squares, intuitively: choose w and b so that the mean squared error between predictions and targets is as small as possible.
The first modeling tool in the Model Workshop is finding a line that connects scattered data points. sklearn's LinearRegression solves the least-squares problem directly with a LAPACK-backed solver, so fitting takes a single line of code.
import numpy as np
from sklearn.linear_model import LinearRegression
# Simulated data: y = 3x + 2 + noise
np.random.seed(42)
X = np.random.randn(100, 1)
y = 3 * X.ravel() + 2 + np.random.randn(100) * 0.5
model = LinearRegression()
model.fit(X, y)
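print(f"w = {model.coef_[0]:.3f}, b = {model.intercept_:.3f}")
The textbook closed-form solution is the normal equation, w = (X^T X)^{-1} X^T y (sklearn reaches the same minimizer through a LAPACK-backed least-squares solver rather than an explicit inverse). For n samples and d features, forming X^T X costs O(n d^2) and solving the d × d system costs O(d^3), so the direct approach becomes infeasible when the feature dimension d is large.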
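A minimal sketch of the closed-form route, assuming similar simulated data (the seed and the names Xb, w_normal, and w_lstsq are illustrative, not part of the example above). It solves the normal equation with np.linalg.solve, which avoids forming an explicit inverse, and compares the result with NumPy's least-squares solver:

```python
import numpy as np

# Simulated data in the same spirit as above: y = 3x + 2 + noise
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 1))
y = 3 * X.ravel() + 2 + rng.standard_normal(100) * 0.5

# Prepend a column of ones so the intercept b is learned as a weight
Xb = np.c_[np.ones(X.shape[0]), X]

# Normal equation: solve (X^T X) w = X^T y without an explicit inverse
w_normal = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Least-squares solver, more robust when X^T X is ill-conditioned
w_lstsq, *_ = np.linalg.lstsq(Xb, y, rcond=None)

print(f"normal equation: b = {w_normal[0]:.3f}, w = {w_normal[1]:.3f}")
print(f"lstsq:           b = {w_lstsq[0]:.3f}, w = {w_lstsq[1]:.3f}")
```

Both routes should agree to many decimal places on well-conditioned data like this; the difference matters mainly when features are nearly collinear.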
Gradient Descent
Replace the closed-form solution with iteration:
When feature dimensions grow large, the O(d^3) solve behind the closed-form solution becomes infeasible. Gradient descent iterates instead: each full-batch step computes only the gradient, which costs O(nd) for n samples and d features, and mini-batch variants scale to millions of samples. This is the prototype of deep learning training engines.
class LinearRegressionGD:
    def __init__(self, lr=0.01, epochs=1000):
        self.lr = lr
        self.epochs = epochs

    def fit(self, X, y):
        X = np.c_[np.ones(X.shape[0]), X]  # add bias column
        self.weights = np.random.randn(X.shape[1]) * 0.01
        self.loss_history = []
        for _ in range(self.epochs):
            y_pred = X @ self.weights
            error = y_pred - y
            # MSE gradient: (2/n) * X^T * (y_pred - y)
            grad = (2 / len(y)) * X.T @ error
            self.weights -= self.lr * grad
            self.loss_history.append(np.mean(error ** 2))

    def predict(self, X):
        X = np.c_[np.ones(X.shape[0]), X]
        return X @ self.weights
After 1000 epochs, your gradient descent approaches the same w and b as the closed-form solution.
Because each update needs only the gradient, which can be estimated on a small batch of samples, gradient descent scales to datasets far too large for a direct solve. The optimizers behind deep learning training are variants of gradient descent.
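One way to see how gradient descent scales is a mini-batch variant, sketched below under stated assumptions: the function name minibatch_gd, the batch size, the learning rate, and the simulated data are all illustrative choices, not part of the class above. Each update touches only batch_size rows, so the cost per step does not grow with the dataset:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.05, epochs=50, batch_size=32, seed=0):
    """Fit linear regression with mini-batch gradient descent (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    Xb = np.c_[np.ones(X.shape[0]), X]       # add bias column
    weights = np.zeros(Xb.shape[1])
    n = Xb.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)            # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = Xb[idx] @ weights - y[idx]
            # MSE gradient estimated on the current batch only
            grad = (2 / len(idx)) * Xb[idx].T @ error
            weights -= lr * grad
    return weights

# Simulated data: y = 3x + 2 + noise, with 10,000 samples
rng = np.random.default_rng(42)
X = rng.standard_normal((10_000, 1))
y = 3 * X.ravel() + 2 + rng.standard_normal(10_000) * 0.5

weights = minibatch_gd(X, y)
print(f"b = {weights[0]:.3f}, w = {weights[1]:.3f}")
```

Because each gradient is a noisy estimate, the weights hover near the optimum rather than settling exactly on it; lowering lr over time is a common remedy.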
Logistic Regression
Linear regression predicts continuous values. But what if you want "pass/fail" (0 or 1)? Logistic regression wraps a sigmoid function around the linear combination, outputting a probability between 0 and 1:
P(y=1|x) = 1 / (1 + exp(-(wx + b)))
The loss function changes from MSE to binary cross-entropy (also called logistic loss). Writing p_i = P(y=1|x_i):
L(w, b) = -(1/n) * sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]
When you need to answer "yes or no" instead of "how much," swap linear regression for logistic regression. The sigmoid function squeezes the linear output into the 0 to 1 range; cross-entropy loss penalizes confident wrong predictions far more heavily than hesitant ones.
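To make the effect of confidence concrete, here is a small sketch that evaluates the loss for a single positive example at several predicted probabilities. The helper name binary_cross_entropy and the eps clipping value are assumptions made for illustration:

```python
import numpy as np

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Mean binary cross-entropy; eps keeps log() away from 0."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# One positive example (y = 1) scored with decreasing predicted probability
y_true = np.array([1.0])
for p in (0.9, 0.6, 0.4, 0.1, 0.01):
    loss = binary_cross_entropy(y_true, np.array([p]))
    sq_err = (1.0 - p) ** 2
    print(f"P(y=1)={p:<5} cross-entropy={loss:.3f} squared error={sq_err:.3f}")
```

The squared error can never exceed 1, while the cross-entropy grows without bound as the predicted probability for the true class approaches 0.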
class LogisticRegressionGD:
    def __init__(self, lr=0.01, epochs=1000):
        self.lr = lr
        self.epochs = epochs

    def sigmoid(self, z):
        # Clip z so np.exp cannot overflow for extreme inputs
        return 1 / (1 + np.exp(-np.clip(z, -250, 250)))

    def fit(self, X, y):
        X = np.c_[np.ones(X.shape[0]), X]  # add bias column
        self.weights = np.random.randn(X.shape[1]) * 0.01
        for _ in range(self.epochs):
            z = X @ self.weights
            y_pred = self.sigmoid(z)
            # Cross-entropy gradient: (1/n) * X^T * (y_pred - y)
            grad = (1 / len(y)) * X.T @ (y_pred - y)
            self.weights -= self.lr * grad

    def predict_proba(self, X):
        X = np.c_[np.ones(X.shape[0]), X]
        return self.sigmoid(X @ self.weights)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)
Logistic regression's decision boundary is still linear; the sigmoid only maps the linear output into probability space. A linear boundary can classify every point correctly only when the two classes are linearly separable (by a straight line, or a hyperplane in higher dimensions). If the true boundary is curved, no choice of w and b can fit it, and you need feature transformations or more complex models.
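As an illustration of a feature transformation, the sketch below uses sklearn's make_circles dataset (an assumption chosen for the demo, not data from this chapter), where one class forms a ring around the other. Adding degree-2 polynomial features lets the same linear classifier draw a circular boundary in the original space:

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two concentric rings: no straight line separates the classes
X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=0)

# Logistic regression on the raw (x1, x2) coordinates
linear_clf = LogisticRegression().fit(X, y)

# Same model on x1, x2, x1^2, x1*x2, x2^2
poly_clf = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(),
).fit(X, y)

# Training accuracy, used here only to compare the two feature sets
acc_linear = linear_clf.score(X, y)
acc_poly = poly_clf.score(X, y)
print(f"raw features:      accuracy = {acc_linear:.2f}")
print(f"degree-2 features: accuracy = {acc_poly:.2f}")
```

The model is still linear in its inputs; only the inputs changed. The squared terms let it express a threshold on the distance from the origin.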
Regularization
When feature dimensions explode (e.g., every word is a feature in text classification), linear models easily overfit. Regularization adds a penalty on coefficients to the loss function:
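- L2 (Ridge): adds a penalty proportional to the sum of squared weights, shrinking all weights toward zero and spreading weight across correlated features
- L1 (Lasso): adds a penalty proportional to the sum of |w|, encouraging sparsity (many weights become exactly 0)
- Elastic Net: adds a weighted combination of the L1 and L2 penalties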
from sklearn.linear_model import Ridge, Lasso, ElasticNet
# L2 regularization
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
# L1 regularization (automatic feature selection)
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print(f"Number of non-zero coefficients: {np.sum(lasso.coef_ != 0)}")
In text classification, L1 regularization acts as automatic feature selection: the weights of uninformative words are driven to exactly 0, so those features can be dropped, which can shrink the model and speed up inference.
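The sparsity effect is easier to see with more than one feature. The sketch below assumes a synthetic dataset with 50 features, of which only the first 5 influence the target; the ground-truth weights and the alpha values are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_samples, n_features = 200, 50
X = rng.standard_normal((n_samples, n_features))

# Hypothetical ground truth: only the first 5 features matter
true_w = np.zeros(n_features)
true_w[:5] = [4.0, -3.0, 2.0, -2.0, 1.5]
y = X @ true_w + rng.standard_normal(n_samples) * 0.5

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Count coefficients that are not exactly zero
n_ridge = int(np.sum(ridge.coef_ != 0))
n_lasso = int(np.sum(lasso.coef_ != 0))
print(f"Ridge non-zero coefficients: {n_ridge} of {n_features}")
print(f"Lasso non-zero coefficients: {n_lasso} of {n_features}")
```

Ridge shrinks the irrelevant weights but leaves them non-zero, while Lasso sets most of them to exactly 0 and keeps the informative ones.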
Common Pitfalls
- Learning rate in gradient descent: too large and training oscillates or diverges; too small and convergence is slow. Try a few log-scale values (0.1, 0.01, 0.001) and observe the loss curve.
- Feature scale greatly affects gradient descent. When two features differ by 1000× in range, the gradient vector is dominated by the large feature, which forces a tiny learning rate and slows progress on the small one. Standardize features (zero mean, unit variance) before training.
- Class imbalance in logistic regression: too many negative examples bias the model toward predicting negative. Use class_weight='balanced' or oversampling/undersampling.
- Linear regression assumes errors are independent, identically distributed with constant variance (homoscedasticity). If violated (e.g., larger predicted values have larger errors), standard error estimates become inaccurate.
- Collinearity: two highly correlated features cause coefficient estimates to fluctuate wildly, and signs may be counterintuitive.
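The feature-scale pitfall can be sketched with a small experiment. The feature names area and rooms, their ranges, and the learning rates are hypothetical values chosen for illustration; gd_mse is a bare-bones full-batch gradient descent, not the class from earlier:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def gd_mse(X, y, lr, epochs=500):
    """Full-batch gradient descent for linear regression; returns the final MSE."""
    Xb = np.c_[np.ones(X.shape[0]), X]        # add bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        error = Xb @ w - y
        w -= lr * (2 / len(y)) * Xb.T @ error
    return np.mean((Xb @ w - y) ** 2)

rng = np.random.default_rng(0)
area = rng.uniform(50, 250, size=200)         # hypothetical feature, values in the hundreds
rooms = rng.uniform(1, 6, size=200)           # hypothetical feature, single-digit values
X = np.c_[area, rooms]
y = 0.5 * area + 10 * rooms + rng.standard_normal(200) * 2

# Raw features: the step size must be tiny to stay stable, so progress is slow
mse_raw = gd_mse(X, y, lr=1e-5)

# Standardized features: a much larger step size is stable
X_std = StandardScaler().fit_transform(X)
mse_std = gd_mse(X_std, y, lr=0.1)

print(f"raw features, lr=1e-5:         MSE = {mse_raw:.2f}")
print(f"standardized features, lr=0.1: MSE = {mse_std:.2f}")
```

With the same number of epochs, the standardized run should end much closer to the noise floor, because no single feature dictates the usable learning rate.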
Pass Challenges
- Warm-up (10 min): Use sklearn.datasets.load_breast_cancer, train a logistic regression, compare convergence speed and final accuracy with and without standardization.
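- Challenge (30 min): What exactly happens near the optimal solution of linear regression? Fix the x values, manually compute the Hessian matrix of the loss function, and verify that it is positive semidefinite (the condition for convexity) and positive definite when the feature columns are linearly independent (which makes the minimum unique).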
- Observation: In Lasso, vary alpha from 0.0001 to 10 and observe how the number of non-zero coefficients changes with regularization strength.
Traveler's Notes
Linear models are the "Hello, World" of machine learning. They make simple assumptions, are highly interpretable, and often generalize well, but they are limited to linear decision boundaries. When data is plentiful, noise is low, and the features are informative, linear models remain the best starting point.
-> Next Chapter Preview
Linear models are limited by their linear decision boundaries. Next chapter, you'll remove this limitation—using trees and ensemble methods to fit arbitrarily complex boundaries.