Metadata Card
- Prerequisites: Vol 12 Data Processing & Feature Engineering, Vol 10 Probability & Statistics
- Estimated time: 45 minutes
- Core difficulty: Beginner to Advanced
- Reading mode: High focus
- Completion: Able to split datasets, understand overfitting and underfitting, complete an end-to-end ML pipeline
Your Progress
You look up from the first three workbenches in the Model Workshop and notice they share something in common: every system requires you to manually write rules. The search cost function is written by you, the reasoning logic rules are written by you, and the reinforcement learning reward function is also written by you.
You pick up Ahua's letter and read it again: "I wish it could learn on its own."
You change direction—instead of giving rules, give examples. The new workbench in the Model Workshop has a stack of labeled data.
Your Task
The essence of machine learning is learning a function from a limited set of samples that generalizes to unseen data. This chapter establishes the foundational framework: data splitting, feature engineering, the three phases of training/validation/testing, and diagnosing and addressing overfitting and underfitting.
Chapter Layers
- Required: Dataset splitting, bias-variance dilemma, overfitting and regularization
- Optional: Learning curves, AIC/BIC
- Advanced: Intuitive understanding of PAC learning theory
Breaking Through · Tracing the Origin
Suppose you're teaching a child to recognize cats. You wouldn't write a "Cat Recognition Algorithm Manual"—you just keep pointing: "This is a cat, this isn't." Pointing and pointing, and gradually they learn on their own.
Sounds simple, right? But the first problem immediately appears: the cat photos you use to teach them—and the photos you test them with at the end—should they be the same batch?
If the answer is "yes," then they haven't really "learned" to recognize cats—they've just memorized the answers. Show them a cat photo they've never seen before, and they won't recognize it at all. This is called overfitting: the model memorizes noise in the training data rather than the true pattern.
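Conversely, if the photos are too blurry to show what matters, or the child only looks at something crude like "has four legs," they never learn the difference between cats and dogs, even on the photos they studied. This is called underfitting: the model fails to capture the pattern even on its training data.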
This is the core contradiction of supervised learning: you can only give the model a limited number of samples, but it must generalize to an infinite set of unseen data.
Dataset Splitting
The first step in every ML project: split the data into three parts. It's like separating materials in the Model Workshop into three piles—one for learning, one for tuning, one for the final exam.
- Training set: the model learns on this
- Validation set: tune hyperparameters, select models
- Test set: final evaluation (absolutely must not be touched until the very end)
from sklearn.model_selection import train_test_split
import numpy as np
# Generate example data
X = np.random.randn(1000, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
# First split out the test set (20% of all data)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Then split training and validation (0.25 of the remaining 80% = 20% of all data)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)
print(f"Training: {len(X_train)}, Validation: {len(X_val)}, Test: {len(X_test)}")
Why two steps? The test set is split off first so that nothing done afterwards can touch it. Every time you tune parameters against the validation set, a little of its information leaks into model selection; after enough rounds the validation set is no longer "unseen data," and you're indirectly overfitting to it. The untouched test set is what still gives an honest final estimate. With the sizes above, the result is a 60/20/20 split (600/200/200 samples).
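The two-step split can also be wrapped in a small helper. The sketch below is one possible way to do it, not the chapter's required method: the function name `three_way_split` and its parameters are hypothetical, and it additionally passes `stratify` so that each part keeps roughly the same class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, val_frac=0.2, test_frac=0.2, seed=42):
    """Split (X, y) into train/validation/test, preserving class proportions.

    Hypothetical helper for illustration; val_frac and test_frac are
    fractions of the full dataset.
    """
    # Step 1: hold out the test set.
    X_temp, X_test, y_temp, y_test = train_test_split(
        X, y, test_size=test_frac, random_state=seed, stratify=y
    )
    # Step 2: carve the validation set out of what is left.
    # val_frac refers to the full data, so rescale it to the remainder.
    rel_val = val_frac / (1.0 - test_frac)
    X_train, X_val, y_train, y_val = train_test_split(
        X_temp, y_temp, test_size=rel_val, random_state=seed, stratify=y_temp
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 10))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

X_tr, X_va, X_te, y_tr, y_va, y_te = three_way_split(X_demo, y_demo)
print("sizes:", len(X_tr), len(X_va), len(X_te))
print("positive rate:", y_tr.mean(), y_va.mean(), y_te.mean())
```

Stratifying matters most when one class is rare; on balanced data like this it changes little.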
Feature Engineering
Raw data is rarely fed directly to the model. Feature engineering transforms raw data into a form the model can effectively use:
- Standardization (StandardScaler): subtract mean, divide by std, making features of different scales comparable
- Normalization (MinMaxScaler): scale to [0, 1]
- Categorical encoding: One-Hot or Label Encoding
- Feature crossing: combining features to create new ones
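The transforms in the list above other than standardization could look like the following sketch. The toy arrays are hypothetical, and feature crossing is shown here with `PolynomialFeatures(interaction_only=True)`, which is only one of several ways to build crossed features.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, PolynomialFeatures

# Hypothetical toy data: two numeric columns and one categorical column.
num_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
num_new = np.array([[2.5, 300.0]])
cat_train = np.array([["red"], ["green"], ["red"]])
cat_new = np.array([["blue"]])  # a category never seen during fit

# Normalization: learn min/max on training data, reuse them on new data.
minmax = MinMaxScaler()
num_train_01 = minmax.fit_transform(num_train)
num_new_01 = minmax.transform(num_new)

# One-hot encoding: one column per category seen in training.
# handle_unknown="ignore" encodes an unseen category as all zeros.
onehot = OneHotEncoder(handle_unknown="ignore")
cat_train_oh = onehot.fit_transform(cat_train).toarray()
cat_new_oh = onehot.transform(cat_new).toarray()

# Feature crossing: append the pairwise product x0 * x1 as a new column.
cross = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
num_train_cross = cross.fit_transform(num_train_01)

print(num_new_01)
print(cat_train_oh)
print(cat_new_oh)
print(num_train_cross.shape)
```

As with standardization, every transformer is fitted on training data only and then reused unchanged on new data.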
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)    # Note: uses training set statistics
X_test_scaled = scaler.transform(X_test)  # Absolutely cannot call fit
Overfitting and Underfitting
This is one of the most useful diagnostic frameworks in machine learning.
- Underfitting: high training error (and equally high validation error) → model too simple / features insufficient
- Overfitting: low training error but high validation error → model memorized noise instead of pattern
# Demonstrating overfitting on polynomial regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
np.random.seed(42)
X_small = np.sort(np.random.rand(20, 1), axis=0)
y_small = np.sin(2 * np.pi * X_small).ravel() + np.random.normal(0, 0.1, 20)
for degree in [1, 3, 15]:
    poly = PolynomialFeatures(degree)
    X_poly = poly.fit_transform(X_small)
    model = LinearRegression()
    model.fit(X_poly, y_small)
    train_pred = model.predict(X_poly)
    train_mse = mean_squared_error(y_small, train_pred)
    print(f"degree={degree:2d}, train MSE: {train_mse:.4f}")
Running this tells a classic story:
- degree=1 (underfitting): a straight line trying to fit a sine wave. Training error is high because the model is too simple to capture the true pattern.
- degree=3 (just right): a cubic curve that roughly follows the sine wave's shape. Training error is low but not the lowest, and that is fine.
- degree=15 (overfitting): 16 coefficients fitted to 20 points, so the curve chases almost every training point and training error is the lowest of the three. But if you draw this curve, you'll see it oscillates wildly between data points, nothing like a sine wave. On new data, predictions are terrible.
Key insight: training error is not the truth. A model that performs perfectly on the training set isn't necessarily good—it might just have memorized the answers. A truly good model is one that still performs well on unseen data.
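To see the gap between training error and error on unseen data directly, one can score each polynomial on fresh points drawn from the same sine curve. The sketch below is self-contained and uses its own random sample (the helper `sample_sine` and the names `X_fit` and `X_new` are illustrative assumptions), so the exact numbers will differ from any earlier run.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)


def sample_sine(n):
    """Draw n noisy points from y = sin(2*pi*x) + noise, x uniform on [0, 1]."""
    x = rng.uniform(0.0, 1.0, size=(n, 1))
    y = np.sin(2 * np.pi * x).ravel() + rng.normal(0.0, 0.1, size=n)
    return x, y


X_fit, y_fit = sample_sine(20)    # small training sample
X_new, y_new = sample_sine(200)   # fresh points the model never saw

train_mse, new_mse = {}, {}
for degree in [1, 3, 15]:
    # Polynomial expansion followed by ordinary least squares.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_fit, y_fit)
    train_mse[degree] = mean_squared_error(y_fit, model.predict(X_fit))
    new_mse[degree] = mean_squared_error(y_new, model.predict(X_new))
    print(f"degree={degree:2d}, train MSE: {train_mse[degree]:.4f}, "
          f"new-data MSE: {new_mse[degree]:.4f}")
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. The new-data column is the one that shows whether the extra flexibility helped.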
Methods to combat overfitting:
- Regularization: add penalties to the loss function
- Cross-validation: evaluate with multiple train/validation splits
- Early stopping: stop training when validation error starts rising
- Data augmentation: enlarge the training set with additional or transformed samples
- Dropout/noise: randomly drop units (in neural networks) or add noise during training
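Cross-validation from the list above can be sketched with `cross_val_score`. This example assumes a logistic regression classifier and its own synthetic data so that it runs on its own; in a real workflow you would pass only the training portion, never the test set. The candidate values for `C` are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_cv = rng.normal(size=(500, 10))
y_cv = (X_cv[:, 0] + X_cv[:, 1] > 0).astype(int)

# In LogisticRegression, a smaller C means a stronger L2 penalty.
results = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000)
    # Five folds: each fold serves once as the validation part.
    scores = cross_val_score(model, X_cv, y_cv, cv=5)
    results[C] = scores
    print(f"C={C:<6} mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Averaging over several folds gives a steadier estimate than a single validation split, at the cost of fitting the model several times.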
# L2 regularized linear regression (Ridge)
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)  # alpha sets the strength of the L2 penalty
ridge.fit(X_train, y_train)
# score() returns R^2; the 0/1 labels are treated as a regression target here
print(f"Ridge train: {ridge.score(X_train, y_train):.3f}, "
      f"val: {ridge.score(X_val, y_val):.3f}")
Bias-Variance Dilemma
Bias measures systematic error: how far the model's average prediction, taken over many possible training sets, lies from the true pattern. Variance measures the model's sensitivity to changes in training data.
- High bias → underfitting (assumptions too strong)
- High variance → overfitting (assumptions too weak, sensitive to data)
They usually trade off: making a model more flexible tends to lower bias and raise variance, and vice versa. A good model sits at the balance point where their combined error is smallest.
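For squared-error loss, this trade-off has a standard textbook decomposition. The notation is introduced here and is not the chapter's own: assume $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, let $\hat{f}$ be the model fitted on a random training set, and take expectations over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The last term does not depend on the model, so only the first two can be traded against each other.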
Common Pitfalls
- Peeking at the test set again and again: once you use test set performance to tune parameters, even "accidentally," the test set is no longer an unbiased evaluation standard.
- Fit the standardizer on the training set only. Fitting on the entire dataset and then splitting leaks validation and test statistics into training (data leakage).
- Class imbalance: if 99% of samples are negative, a model that always predicts negative has 99% accuracy but is useless. Evaluate with F1 or PR curves instead, and consider resampling the training data.
- Time series data must not be split randomly. Split by time order: train on earlier data, test on later data.
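Two of the pitfalls above are easy to reproduce in a few lines. The sketch below uses hypothetical data: a 990/10 label vector scored with a majority-class baseline, and twelve rows assumed to be already sorted by time and split with `TimeSeriesSplit`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import TimeSeriesSplit

# Pitfall: class imbalance. 990 negatives, 10 positives.
y_imb = np.array([0] * 990 + [1] * 10)
X_imb = np.zeros((len(y_imb), 1))  # features are irrelevant for this baseline

# A "model" that always predicts the most frequent class.
always_majority = DummyClassifier(strategy="most_frequent").fit(X_imb, y_imb)
pred = always_majority.predict(X_imb)
acc = accuracy_score(y_imb, pred)
f1 = f1_score(y_imb, pred, zero_division=0)  # F1 of the positive class
print(f"accuracy: {acc:.2f}, F1: {f1:.2f}")

# Pitfall: time series. Rows are assumed to be in chronological order.
X_time = np.arange(12).reshape(-1, 1)
splits = list(TimeSeriesSplit(n_splits=3).split(X_time))
for train_idx, test_idx in splits:
    # Every training index comes before every test index.
    print("train:", train_idx, "test:", test_idx)
```

Accuracy looks excellent while F1 exposes that no positive sample is ever found. In the time-ordered splits the training window grows forward and never includes rows later than the test window.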
Pass Challenges
- Warm-up (10 min): Use sklearn.datasets.make_classification to generate a binary classification dataset, split it manually into training/validation/test sets, fit a logistic regression, and print accuracy on all three sets.
- Challenge (30 min): Train polynomial models at 10 different degrees between 1 and 20, plot training error and validation error as functions of degree, and observe at which degree overfitting starts.
- Observation: Inject label noise by flipping 10% of the training labels at random, then observe how validation accuracy changes.
Traveler's Notes
Machine learning is 30% model, 70% data. Understanding overfitting is the key to understanding all of ML—from simple linear regression to trillion-parameter large language models, they all solve the same problem: learning generalizable knowledge from limited samples.
-> Next Chapter Preview
With the data framework in place, you start learning the first complete model family: linear models.