Software Systems Atlas

Metadata Card

Prerequisites: ch04 ML Foundations, Vol 10 Linear Algebra & Calculus
Estimated time: 50 minutes
Core difficulty: Advanced
Reading mode: High focus
Completion: Able to implement linear regression and logistic regression with gradient descent, understand the effects of regularization

Your Progress

The data workbench in the Model Workshop is covered with data: house area vs. price, advertising spend vs. click-through rate, study time vs. exam scores.

Each dataset has a trend line hidden within—a line connecting data to predictions. You pick up the tools, starting from the simplest point: finding a formula that maps input to output.

Your Task

Linear models assume the output is a linear combination of inputs. This assumption works for both regression and classification: linear regression predicts continuous values, logistic regression outputs probabilities. You start with the closed-form solution of least squares, transition to gradient descent, and introduce regularization for high-dimensional data.

Chapter Layers
Required: Linear regression (closed-form + gradient descent), logistic regression, L1/L2 regularization
Optional: Softmax regression, generalized linear models
Advanced: Convergence of linear models from a convex optimization perspective

Breaking Through · Tracing the Origin

Your friend recorded some data: daily study time and exam scores. Your intuition says they're related—more study time, higher scores. But what's the exact relationship between "more" and "higher"? You need a line to fit these points.

Linear regression does this: find a straight line y = wx + b that minimizes the sum of squared vertical distances from the points to the line.

Linear Regression

Least squares, intuitively: choose w and b so that the mean squared error between predictions and targets is as small as possible.
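Written out for n data points (x_i, y_i), the objective can be stated as below. The index i, the count n, and the hat notation are introduced here for illustration. Because 1/n is a constant, minimizing the mean squared error and minimizing the sum of squared errors select the same line.

```latex
\mathrm{MSE}(w, b) = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - (w x_i + b) \bigr)^2,
\qquad
(\hat{w}, \hat{b}) = \arg\min_{w,\, b} \; \mathrm{MSE}(w, b)
```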

The first modeling tool in the Model Workshop is finding a line that connects scattered data points. sklearn solves the least-squares problem with a LAPACK-backed solver behind the scenes, so fitting takes a single line of code.

python

import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated data: y = 3x + 2 + noise
np.random.seed(42)
X = np.random.randn(100, 1)
y = 3 * X.ravel() + 2 + np.random.randn(100) * 0.5

model = LinearRegression()
model.fit(X, y)
print(f"w = {model.coef_[0]:.3f}, b = {model.intercept_:.3f}")

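The least-squares solution has a closed form, the normal equation w = (X^T X)^{-1} X^T y (with a column of ones appended to X for the bias). sklearn reaches the same solution through a LAPACK least-squares routine rather than an explicit inverse. Either way, for n samples and d features the cost grows roughly as O(n d^2 + d^3), which becomes infeasible when the feature dimension d is large.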
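The closed form can be checked directly with NumPy. The sketch below is a minimal illustration on simulated data of the same shape as above (the variable names `Xb`, `theta_normal`, and `theta_lstsq` are made up for this example). It solves the normal equation with `np.linalg.solve` instead of forming an inverse, and compares the result with NumPy's SVD-based `np.linalg.lstsq`.

```python
import numpy as np

# Simulated data in the same form as above: y = 3x + 2 + noise
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 1))
y = 3 * X.ravel() + 2 + rng.standard_normal(100) * 0.5

# Append a column of ones so the bias b is learned as an extra weight
Xb = np.c_[np.ones(X.shape[0]), X]

# Normal equation: solve (X^T X) theta = X^T y without forming the inverse
theta_normal = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# SVD-based least squares, numerically safer when X^T X is ill-conditioned
theta_lstsq, *_ = np.linalg.lstsq(Xb, y, rcond=None)

b_hat, w_hat = theta_normal
print(f"normal equation: w = {w_hat:.3f}, b = {b_hat:.3f}")
print(f"lstsq:           w = {theta_lstsq[1]:.3f}, b = {theta_lstsq[0]:.3f}")
```

Both routes should agree to many decimal places on well-conditioned data like this. They differ mainly in numerical robustness when features are nearly collinear.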

Gradient Descent

Replace the closed-form solution with iteration.

When the feature dimension d grows large, the O(d^3) matrix solve behind the closed-form solution becomes infeasible. Gradient descent iterates instead: each full-batch step computes only the gradient, which costs O(nd) for n samples, and mini-batch variants cut the per-step cost further, so the method scales to millions of samples. This is the prototype of deep learning training engines.
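For reference, the gradient used by the update can be derived in one line. In this sketch, θ stacks the bias and the weights, X carries a leading column of ones, and η is the learning rate. These symbols are notation chosen for this example.

```latex
L(\theta) = \frac{1}{n} \lVert X\theta - y \rVert_2^2
\;\;\Longrightarrow\;\;
\nabla_\theta L = \frac{2}{n} X^\top (X\theta - y),
\qquad
\theta \leftarrow \theta - \eta \, \nabla_\theta L
```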

python

class LinearRegressionGD:
    def __init__(self, lr=0.01, epochs=1000):
        self.lr = lr
        self.epochs = epochs

    def fit(self, X, y):
        X = np.c_[np.ones(X.shape[0]), X]  # add bias column
        self.weights = np.random.randn(X.shape[1]) * 0.01
        self.loss_history = []

        for _ in range(self.epochs):
            y_pred = X @ self.weights
            error = y_pred - y
            # MSE gradient: (2/n) * X^T * (y_pred - y)
            grad = (2 / len(y)) * X.T @ error
            self.weights -= self.lr * grad
            self.loss_history.append(np.mean(error ** 2))

    def predict(self, X):
        X = np.c_[np.ones(X.shape[0]), X]
        return X @ self.weights

After 1000 epochs, your gradient descent approaches the same w and b as the closed-form solution.
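One way to check this claim is to track the distance between the gradient descent weights and the least-squares solution as training proceeds. The sketch below is self-contained and uses a plain loop rather than the class above, with the same defaults (lr=0.01, 1000 epochs). The data and the names `theta_closed` and `gaps` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 1))
y = 3 * X.ravel() + 2 + rng.standard_normal(100) * 0.5
Xb = np.c_[np.ones(len(X)), X]  # bias column first, as in the class above

# Reference: least-squares solution
theta_closed, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Full-batch gradient descent with lr=0.01 for 1000 epochs
theta = np.zeros(Xb.shape[1])
gaps = []
for epoch in range(1000):
    grad = (2 / len(y)) * Xb.T @ (Xb @ theta - y)
    theta -= 0.01 * grad
    gaps.append(np.linalg.norm(theta - theta_closed))

for epoch in (0, 99, 499, 999):
    print(f"epoch {epoch + 1:4d}: distance to closed form = {gaps[epoch]:.2e}")
```

The distance should shrink geometrically here because the single feature is already on a unit scale. Badly scaled features slow this down considerably.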

Because each update needs only a gradient, which can be estimated from a small batch of samples, gradient descent scales to datasets far too large for a matrix solve. Deep learning training engines are variants of gradient descent.
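A minimal sketch of the mini-batch idea follows. Each step estimates the gradient from 32 randomly chosen samples instead of the full dataset. The learning rate, batch size, and epoch count are illustrative values, not tuned recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
X = rng.standard_normal((n, 1))
y = 3 * X.ravel() + 2 + rng.standard_normal(n) * 0.5
Xb = np.c_[np.ones(n), X]  # bias column first

theta = np.zeros(2)
lr, batch_size, epochs = 0.05, 32, 5  # illustrative values, not tuned

for _ in range(epochs):
    order = rng.permutation(n)  # reshuffle the samples every epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        Xi, yi = Xb[idx], y[idx]
        # Gradient estimated from the batch only: cost is O(batch_size * d)
        grad = (2 / len(idx)) * Xi.T @ (Xi @ theta - yi)
        theta -= lr * grad

print(f"w = {theta[1]:.3f}, b = {theta[0]:.3f}")
```

The estimates hover around the true values (w = 3, b = 2) rather than settling exactly, because each batch gives a noisy gradient. Decaying the learning rate over time is a common way to reduce that jitter.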

Logistic Regression

Linear regression predicts continuous values. But what if you want "pass/fail" (0 or 1)? Logistic regression wraps a sigmoid function around the linear combination, outputting a probability between 0 and 1:

P(y=1|x) = 1 / (1 + exp(-(wx + b)))

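The loss function changes from MSE to binary cross-entropy (also called logistic loss), where p_i = P(y=1|x_i) and the sum runs over the n training examples:

L = -(1/n) * sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]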

When you need to answer "yes or no" instead of "how much," swap linear regression for logistic regression. The sigmoid function squeezes the linear output into the range between 0 and 1; cross-entropy loss penalizes confident wrong predictions far more heavily than hesitant ones.
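To see how cross-entropy reacts to the confidence of an error, the sketch below evaluates the loss for a single example whose true label is 1, at several predicted probabilities. The helper name `binary_cross_entropy` and the `eps` clipping constant are choices made for this example.

```python
import numpy as np

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Mean logistic loss; eps guards against log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1.0])  # the true label is "pass"

# Per-example loss for increasingly wrong predicted probabilities
for p in (0.9, 0.6, 0.4, 0.1, 0.01):
    loss = binary_cross_entropy(y_true, np.array([p]))
    print(f"P(y=1|x) = {p:4.2f} -> loss = {loss:.3f}")
```

Predictions of 0.4 and 0.1 are both wrong at a 0.5 threshold, yet the loss is roughly 0.92 for the first and 2.30 for the second: the more confident mistake costs much more.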

python

class LogisticRegressionGD:
    def __init__(self, lr=0.01, epochs=1000):
        self.lr = lr
        self.epochs = epochs

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -250, 250)))

    def fit(self, X, y):
        X = np.c_[np.ones(X.shape[0]), X]
        self.weights = np.random.randn(X.shape[1]) * 0.01

        for _ in range(self.epochs):
            z = X @ self.weights
            y_pred = self.sigmoid(z)
            # Cross-entropy gradient: (1/n) * X^T * (y_pred - y)
            grad = (1 / len(y)) * X.T @ (y_pred - y)
            self.weights -= self.lr * grad

    def predict_proba(self, X):
        X = np.c_[np.ones(X.shape[0]), X]
        return self.sigmoid(X @ self.weights)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)

Logistic regression's decision boundary is still linear: it just maps the linear output into probability space. A linear boundary can separate the two classes perfectly only if they are linearly separable (by a straight line in 2D, a hyperplane in general). On data that isn't linearly separable the model still trains, but its accuracy is capped by what a straight boundary can achieve, so you need feature transformations or more complex models.
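A small sketch of the feature-transformation route follows, on a made-up dataset where one class sits inside a circle. It uses sklearn's `LogisticRegression` with default settings and adds squared features by hand. The dataset, the variable names, and the choice of squared features are assumptions for illustration, and the accuracies are measured on the training data only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: class 1 lies inside the unit circle, class 0 outside.
# No straight line separates the two classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)

# Linear boundary on the raw features
acc_raw = LogisticRegression().fit(X, y).score(X, y)

# Add squared features: the circle becomes a linear boundary in the new space
X_sq = np.c_[X, X ** 2]
acc_sq = LogisticRegression().fit(X_sq, y).score(X_sq, y)

print(f"raw features:     training accuracy = {acc_raw:.2f}")
print(f"squared features: training accuracy = {acc_sq:.2f}")
```

The model is unchanged in both runs. Only the inputs differ, which is why the boundary stays linear in the transformed space while appearing circular in the original one.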

Regularization

When feature dimensions explode (e.g., every word is a feature in text classification), linear models easily overfit. Regularization adds a penalty on coefficients to the loss function:

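L2 (Ridge): adds a penalty on the sum of squared weights, shrinks all weights toward zero and spreads weight across correlated features
L1 (Lasso): adds a penalty on the sum of |w|, encourages sparsity (many weights become exactly 0)
Elastic Net: adds a weighted combination of the L1 and L2 penalties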
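The three penalties can be written side by side as follows. Here L(w) stands for the unregularized loss and λ, λ_1, λ_2 for the penalty strengths, all notation chosen for this sketch. sklearn exposes the strength as alpha, as in the code that follows, and its exact scaling of each term differs slightly between estimators. The bias term is typically left unpenalized.

```latex
\begin{aligned}
\text{Ridge (L2):}\quad & J(w) = L(w) + \lambda \sum_{j} w_j^2 \\
\text{Lasso (L1):}\quad & J(w) = L(w) + \lambda \sum_{j} \lvert w_j \rvert \\
\text{Elastic Net:}\quad & J(w) = L(w) + \lambda_1 \sum_{j} \lvert w_j \rvert + \lambda_2 \sum_{j} w_j^2
\end{aligned}
```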

python

from sklearn.linear_model import Ridge, Lasso, ElasticNet

# L2 regularization
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)

# L1 regularization (automatic feature selection)
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

print(f"Number of non-zero coefficients: {np.sum(lasso.coef_ != 0)}")

In text classification, L1 regularization acts as automatic feature selection: it keeps the most informative words and drives the other words' weights to exactly 0, which shrinks the model and can speed up inference.
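The sparsity effect is easiest to see on data with many irrelevant features. The sketch below builds a synthetic regression problem (50 features, only the first 5 of which affect y) and counts the non-zero coefficients under Ridge and Lasso. The data and the alpha values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 50 features, but only the first 5 influence y
rng = np.random.default_rng(0)
n_samples, n_features, n_informative = 200, 50, 5
X = rng.standard_normal((n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:n_informative] = 3.0
y = X @ true_w + rng.standard_normal(n_samples) * 0.5

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

n_ridge = int(np.sum(ridge.coef_ != 0))
n_lasso = int(np.sum(lasso.coef_ != 0))
print(f"Ridge non-zero coefficients: {n_ridge} of {n_features}")
print(f"Lasso non-zero coefficients: {n_lasso} of {n_features}")
```

Ridge shrinks the irrelevant weights but leaves them non-zero, while Lasso sets most of them to exactly zero and keeps the informative ones.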

Common Pitfalls

Learning rate in gradient descent: too large and the loss oscillates or diverges, too small and training is slow. Try a few log-scale values (0.1, 0.01, 0.001) and observe the loss curve.
Feature scale greatly affects gradient descent. When two features differ by 1000× in range, the gradient is dominated by the large-scale feature, and a learning rate small enough to keep that direction stable is far too small for the other. Standardize features before training.
Class imbalance in logistic regression: too many negative examples bias the model toward predicting negative. Use class_weight='balanced' (in sklearn) or oversampling/undersampling.
Linear regression assumes errors are independent, identically distributed with constant variance (homoscedasticity). If violated (e.g., larger predicted values have larger errors), standard error estimates become inaccurate.
Collinearity: two highly correlated features cause coefficient estimates to fluctuate wildly, and signs may be counterintuitive.
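The learning-rate and feature-scale pitfalls interact, which the following sketch tries to show on made-up data with one feature on a scale of 1 and another on a scale of 1000. The helper `gd_final_mse`, the data, and both learning rates are assumptions chosen for illustration. The raw-feature learning rate is picked to be just small enough to stay stable.

```python
import numpy as np

def gd_final_mse(X, y, lr, epochs=1000):
    """Full-batch gradient descent on MSE; returns the final training MSE."""
    Xb = np.c_[np.ones(len(X)), X]
    theta = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        grad = (2 / len(y)) * Xb.T @ (Xb @ theta - y)
        theta -= lr * grad
    return np.mean((Xb @ theta - y) ** 2)

rng = np.random.default_rng(0)
x_small = rng.standard_normal(200)          # range on the order of 1
x_large = rng.standard_normal(200) * 1000   # range on the order of 1000
X_raw = np.c_[x_small, x_large]
y = 3 * x_small + 0.002 * x_large + rng.standard_normal(200) * 0.5

# Raw features: the large feature forces a tiny learning rate to stay stable,
# and at that rate the weight of the small feature barely moves.
mse_raw = gd_final_mse(X_raw, y, lr=4e-7)

# Standardized features: one learning rate suits both directions.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
mse_std = gd_final_mse(X_std, y, lr=0.1)

print(f"raw features,          lr=4e-7: final MSE = {mse_raw:.3f}")
print(f"standardized features, lr=0.1:  final MSE = {mse_std:.3f}")
```

With the same number of epochs, the standardized run should end near the noise level, while the raw run stays far above it because the small-scale feature's weight has hardly been updated.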

Pass Challenges

Warm-up (10 min): Use sklearn.datasets.load_breast_cancer, train a logistic regression, compare convergence speed and final accuracy with and without standardization.
Challenge (30 min): What exactly happens near the optimal solution of linear regression? Fix x values, manually compute the Hessian matrix of the loss function, and verify it's positive semi-definite (the condition for convexity), and positive definite when the features are not collinear.
Observation: In Lasso, vary alpha from 0.0001 to 10 and observe how the number of non-zero coefficients changes with regularization strength.

Traveler's Notes

Linear models are the "Hello, World" of machine learning. They make simple assumptions, are highly interpretable, and often generalize well, but they are limited by linear decision boundaries. When the dataset is large, noise is low, and the features are informative, linear models remain the best starting point.

-> Next Chapter Preview

Linear models are limited by their linear decision boundaries. Next chapter, you'll remove this limitation—using trees and ensemble methods to fit arbitrarily complex boundaries.