Chapter 6: Tree Models & Ensemble Learning

Metadata Card

Prerequisites: ch04 ML Foundations, ch05 Linear Models
Estimated time: 50 minutes
Core difficulty: Advanced
Reading mode: High focus
Completion: Able to implement decision tree training, understand the difference between Bagging and Boosting, use XGBoost

Your Progress

Linear models ran for a few days in the workshop and performed well—but the data kept getting more complex. The relationship between area and price isn't a simple straight line; there are inflection points and layers.

You looked at the nearby tool rack and picked up a coarser but more flexible tool: decision trees. It doesn't calculate formulas—it just keeps asking questions, splitting the data at each step until each group is pure enough.

Your Task

Decision trees are a completely different learning paradigm: partitioning the input space through a series of "yes/no" questions. A single tree easily overfits, but through ensembles—Bagging (Random Forest) and Boosting (XGBoost)—it becomes one of the most practical model families of the last decade.

Chapter Layers
Required: Decision tree splitting criteria (Gini/information gain), Random Forest, XGBoost
Optional: Mathematical derivation of gradient boosting
Advanced: Histogram acceleration, quantile approximation

Breaking Through · Tracing the Origin

Imagine you're classifying monsters. You pull out a ruler and start asking: taller than 2 meters? Heavier than 100kg? Does it have wings?

Each question cuts away some possibilities, splitting finer and finer—until each leaf node contains only one class. This "question tree" requires no mathematical transformation, no feature standardization—it's just how you'd classify intuitively.

Decision Trees

The core of decision tree training: at each node, select a feature and a split point that maximizes the "purity" of the split data.

Common purity measures:

Information gain = parent entropy - weighted children entropy
Gini impurity = 1 - sum(p_k^2)

Decision tree training is a continuous process of asking 'yes/no' questions, making each split's subset purer. The Gini coefficient measures 'if you guessed randomly, how often would you be wrong'—each split picks the cutoff that reduces impurity the most.

python

# Decision tree from scratch (simplified)
import numpy as np

class DecisionTreeNode:
    def __init__(self, depth=0, max_depth=5, min_samples=2):
        self.depth = depth
        self.max_depth = max_depth
        self.min_samples = min_samples
        self.feature = None
        self.threshold = None
        self.left = None
        self.right = None
        self.value = None

    def gini(self, y):
        classes, counts = np.unique(y, return_counts=True)
        prob = counts / len(y)
        return 1 - np.sum(prob ** 2)

    def build(self, X, y):
        n_samples, n_features = X.shape
        n_classes = len(np.unique(y))

        # Stopping conditions
        if (n_samples < self.min_samples
            or n_classes == 1
            or self.depth >= self.max_depth):
            self.value = np.bincount(y).argmax()
            return

        best_gain = -1
        for f in range(n_features):
            thresholds = np.unique(X[:, f])
            for t in thresholds:
                mask = X[:, f] <= t
                if sum(mask) == 0 or sum(~mask) == 0:
                    continue
                gain = self.gini(y) - (
                    sum(mask)/n_samples * self.gini(y[mask]) +
                    sum(~mask)/n_samples * self.gini(y[~mask]))
                if gain > best_gain:
                    best_gain = gain
                    self.feature = f
                    self.threshold = t

        if best_gain < 0:
            self.value = np.bincount(y).argmax()
            return

        mask = X[:, self.feature] <= self.threshold
        self.left = DecisionTreeNode(self.depth+1, self.max_depth, self.min_samples)
        self.right = DecisionTreeNode(self.depth+1, self.max_depth, self.min_samples)
        self.left.build(X[mask], y[mask])
        self.right.build(X[~mask], y[~mask])

A single decision tree is extremely prone to overfitting (unless pruned or depth-limited). But it has inherent advantages: strong interpretability (you can draw the tree and see it directly), robustness to missing values, and no need for feature scaling.

Random Forest

Random Forest uses Bagging (Bootstrap Aggregating) to solve decision tree overfitting: train multiple trees, each on a different data subset (sampling with replacement) and different feature subsets, then average or vote the results.

python

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    random_state=42
)
rf.fit(X_train, y_train)
print(f"RF accuracy: {rf.score(X_test, y_test):.3f}")

Random Forest's two randomization sources (data randomness + feature randomness) reduce correlation between trees, making voting results more stable. On tabular data, Random Forest is usually the first model to try.

Feature importance can be read directly from Random Forest: traverse all trees, count the total impurity decrease contributed by each feature at node splits.

python

importances = rf.feature_importances_
for i, imp in enumerate(importances):
    print(f"Feature {i}: {imp:.3f}")

XGBoost

Boosting has the opposite philosophy to Bagging: sequentially train a series of weak learners, with each new learner focusing on the previous one's prediction errors.

XGBoost is the engineered implementation of Gradient Boosted Decision Trees (GBDT), dominating Kaggle competitions for years.

Unlike Random Forest training trees in parallel and voting, Boosting trains sequentially—each new tree specifically corrects the previous tree's prediction residuals. With regularization and second-order gradient approximation, XGBoost achieves excellent accuracy.

python

import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {
    'objective': 'binary:logistic',
    'max_depth': 6,
    'eta': 0.3,      # learning rate
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'eval_metric': 'logloss'
}

model = xgb.train(
    params, dtrain,
    num_boost_round=100,
    evals=[(dtest, 'eval')],
    early_stopping_rounds=10
)

Key engineering optimizations in XGBoost:

Second-order Taylor expansion to approximate the loss function (more accurate than first-order gradient)
Regularization terms directly added to the objective function (controls model complexity)
Column and row sampling (similar to Random Forest)
Quantile approximation to accelerate feature search
Sparsity-aware algorithm (automatically learns branch direction for missing values)

Bagging reduces variance, Boosting reduces bias. Random Forest can work with full-depth trees; XGBoost needs shallow trees (max_depth=3~6) with a learning rate to gradually approach the optimum.

Common Pitfalls

Decision trees are very sensitive to small changes in the data—removing one training sample can cause the entire tree to be rebuilt. That's why single trees have poor stability.
Random Forest's n_estimators doesn't easily overfit with more trees (unlike Boosting), but more trees require more memory and inference time.
XGBoost's early_stopping_rounds depends on the validation set—the validation set can't be too large or too small.
Tuning order: first tune tree depth/learning rate, then regularization parameters, then sampling ratios.
Tree models' natural robustness to feature scales is an advantage, but it doesn't mean feature engineering is unimportant—feature crossing still helps trees.

Pass Challenges

Warm-up (10 min): Use sklearn.datasets.load_iris, train a decision tree (max_depth=3), Random Forest (100 trees), and XGBoost (50 rounds). Compare the three models' accuracy.
Challenge (30 min): Use Random Forest's feature_importances_ for feature selection—keep only the 5 most important features and retrain with XGBoost. Observe performance changes.
Observation: Record training and validation logloss curves for each iteration in XGBoost. Observe how Boosting progressively reduces error.

Traveler's Notes

Ensemble learning embodies the "two heads are better than one" engineering mindset: combine multiple imperfect models, using their diversity to reduce overall error. Random Forest and XGBoost each excel in different areas and remain the kings of tabular data.

-> Next Chapter Preview

Supervised learning requires labels. But in reality, massive amounts of data have no labels—unsupervised learning takes the stage.