Metadata Card

  • Prerequisites: ch04-ch07 (all supervised/unsupervised model fundamentals)
  • Estimated time: 45 minutes
  • Core difficulty: Advanced
  • Reading mode: High focus
  • Completion: Able to correctly interpret confusion matrices and ROC curves, use cross-validation and grid search for model tuning

Your Progress

The walls of the Model Workshop are covered with training curves. You stare at one: training loss heading steadily downward, beautiful. But when you take the model out of the workshop and test it on new data, the results are terrible.

An old apprentice in the Model Workshop glanced over: "You overfit on the training set."

You realize: training the model is only the first half. The second half is—how do you know if your model is actually useful?

Your Task

Model evaluation isn't just looking at accuracy. When classes are imbalanced, misclassification costs differ, and model complexity varies, you need more refined metrics and more reliable validation strategies. This chapter fills out this toolbox, ultimately teaching you to systematically search for hyperparameters.

Chapter Layers

  • Required: Confusion matrix, Precision/Recall/F1, ROC-AUC, cross-validation, GridSearch
  • Optional: Bootstrap confidence intervals, Bayesian optimization
  • Advanced: Statistical hypothesis testing to compare models

Breaking Through · Tracing the Origin

You trained a disease detection model with 99% accuracy—sounds great. But the incidence rate is only 1%, so a model that always says "healthy" also achieves 99% accuracy. Accuracy is completely useless in this scenario.
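
The arithmetic behind this can be checked in a few lines. The sketch below uses made-up screening data (1,000 patients, 10 of them sick); the variable names are illustrative and not part of the chapter's running example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical screening data: 1,000 patients, 1% of them sick (label 1).
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A "model" that ignores its input and always answers "healthy" (label 0).
y_always_healthy = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_healthy)
rec = recall_score(y_true, y_always_healthy)

print(f"Accuracy: {acc:.2f}")  # 990 of 1,000 labels match
print(f"Recall:   {rec:.2f}")  # none of the 10 sick patients is found
```

Accuracy comes out at 0.99 while recall is 0.00: the score looks excellent even though the model never detects a single case.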

You need a set of evaluation tools that reveal the model's true capability. Some problems care about "don't miss anything" (recall), some care about "don't misclassify" (precision), and some care about both (F1).

Confusion Matrix

The confusion matrix is the starting point for evaluation. Four cells: True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN).

             Predicted (+)   Predicted (-)
Actual (+)      TP              FN
Actual (-)      FP              TN

Accuracy alone hides which kind of error the model makes. On a dataset with 1% incidence, a model that always says "healthy" scores 99% accuracy, yet every sick patient lands in the FN cell. You need the confusion matrix to see the full picture.

python
from sklearn.metrics import confusion_matrix, classification_report

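# Rows are actual classes, columns are predicted classes, in sorted label order.
# For labels {0, 1} the output is [[TN, FP], [FN, TP]]: the negative class comes first,
# unlike the table above.
cm = confusion_matrix(y_test, y_pred)
print(cm)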

# More detailed report
print(classification_report(y_test, y_pred))

Core metrics:

  • Accuracy: (TP+TN) / (TP+TN+FP+FN). Informative only when classes are roughly balanced.
  • Precision: TP / (TP+FP). Of your positive predictions, how many are truly positive.
  • Recall (Sensitivity): TP / (TP+FN). Of all actual positives, how many did you find.
  • F1 Score: 2 * (Precision * Recall) / (Precision + Recall). The harmonic mean of precision and recall.

python
from sklearn.metrics import precision_score, recall_score, f1_score

print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1:        {f1_score(y_test, y_pred):.3f}")
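
To connect the formulas to the matrix cells, here is a small sketch that computes each metric by hand and compares it with scikit-learn. The ten labels and predictions are invented for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels and predictions, invented for illustration (1 = positive).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"Accuracy={accuracy:.2f} Precision={precision:.2f} Recall={recall:.2f} F1={f1:.2f}")

# The hand-computed values agree with scikit-learn's helpers.
print(np.isclose(precision, precision_score(y_true, y_hat)))
print(np.isclose(recall, recall_score(y_true, y_hat)))
print(np.isclose(f1, f1_score(y_true, y_hat)))
```

With these inputs the cells are TP=3, FP=1, TN=5, FN=1, so accuracy is 0.80 and precision, recall, and F1 are all 0.75.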

ROC Curve & AUC

Many models output probabilities, not hard classifications. You can adjust the threshold (e.g., predict positive if >0.5, or >0.3), and each threshold gives a different False Positive Rate (FPR = FP / (FP+TN)) and True Positive Rate (TPR = TP / (TP+FN), the same quantity as recall). The ROC curve traces these (FPR, TPR) points as the threshold sweeps across its range.

AUC (Area Under the Curve) tells you the model's discrimination ability: AUC=0.5 is random guessing, AUC=1.0 is a perfect classifier.

The value of the ROC curve is that it does not depend on any single classification threshold: it summarizes performance across all thresholds, so AUC reflects only the model's inherent ability to rank positives above negatives.
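
One way to see the "ranking only" property is to distort the scores with a strictly increasing function and watch what changes. The sketch below uses synthetic scores (not a trained model), so every name in it is an assumption made for illustration.

```python
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)

# Hypothetical scores: positives tend to score higher than negatives.
scores = rng.normal(loc=y_true * 1.0, scale=1.0)
proba = 1 / (1 + np.exp(-scores))  # squashed into (0, 1)
proba_cubed = proba ** 3           # strictly increasing transform, same ranking

auc_a = roc_auc_score(y_true, proba)
auc_b = roc_auc_score(y_true, proba_cubed)

# Recall at a fixed 0.5 threshold depends on the score scale, not just the ranking.
rec_a = recall_score(y_true, (proba >= 0.5).astype(int))
rec_b = recall_score(y_true, (proba_cubed >= 0.5).astype(int))

print(f"AUC    original={auc_a:.3f} cubed={auc_b:.3f}")
print(f"Recall original={rec_a:.3f} cubed={rec_b:.3f}")
```

The two AUC values match because the ordering of the samples is unchanged, while recall at the fixed 0.5 threshold can only drop after cubing, since every score shrinks. A threshold-based metric therefore needs the threshold reported alongside it.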

python
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# model is assumed to be an already fitted classifier; column 1 holds P(class = 1)
y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)

print(f"AUC: {auc:.3f}")

plt.plot(fpr, tpr, label=f'AUC={auc:.3f}')
plt.plot([0, 1], [0, 1], 'k--', label='Random guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()  # without this call the AUC label is never drawn
# plt.show()

PR curves are often more informative than ROC curves when classes are imbalanced: precision and recall are both computed without true negatives, so a large pool of easy negatives cannot flatter the picture of minority-class performance.
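
A minimal sketch of computing a PR curve, assuming synthetic imbalanced data from make_classification (about 5% positives) and a logistic regression as a stand-in model. Neither choice comes from the chapter; substitute your own fitted classifier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (about 5% positives); all values are illustrative.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)[:, 1]

# precision and recall have one more entry than thresholds (the final point is recall=0).
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
ap = average_precision_score(y_test, y_proba)

# A classifier that flags everything as positive has precision equal to the positive rate.
baseline = y_test.mean()

print(f"Average precision: {ap:.3f} (baseline {baseline:.3f})")
```

Plotting recall on the x-axis against precision on the y-axis gives the PR curve. Unlike ROC, the reference level is not a fixed diagonal but the positive rate of the dataset.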

Cross-Validation

Results from a single train-validation split have high variance—if you're unlucky with the split and the validation set is too hard or too easy, results fluctuate. K-fold cross-validation splits the data into K parts, takes turns using K-1 parts for training and 1 part for validation, averaging the K validation results.

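Cross-validation gives you a more reliable estimate of the model's generalization performance. K=5 or K=10 are common values: with a very small K each model trains on a much smaller share of the data, so the estimate tends to be pessimistic and less stable, while a very large K is computationally expensive.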

python
from sklearn.model_selection import cross_val_score

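# With no scoring argument, the estimator's own score method is used
# (accuracy for classifiers); pass e.g. scoring='roc_auc' to change it.
scores = cross_val_score(model, X, y, cv=5)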
print(f"Per-fold scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std():.3f})")

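Stratified K-Fold keeps the class proportions in each fold close to the overall proportions. It is recommended for classification problems. When cv is an integer and the estimator is a classifier, cross_val_score already stratifies; passing a StratifiedKFold object explicitly lets you control shuffling and the random seed.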

python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    # Integer-array indexing assumes NumPy arrays; with pandas objects use .iloc[train_idx]
    X_train_fold = X[train_idx]
    y_train_fold = y[train_idx]
    # ...
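
To see what stratification buys you, the sketch below counts positives per validation fold for plain KFold and for StratifiedKFold. The label array is hypothetical: 100 samples with the 10 positives sorted to the front, a worst case for unshuffled splitting.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels: 100 samples, 10% positives, sorted so positives come first.
y = np.array([1] * 10 + [0] * 90)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself


def positives_per_fold(splitter):
    """Count positive labels in each validation fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]


plain = positives_per_fold(KFold(n_splits=5))
stratified = positives_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)       # [10, 0, 0, 0, 0]
print("StratifiedKFold:", stratified)  # [2, 2, 2, 2, 2]
```

With plain KFold, four of the five validation folds contain no positives at all, so recall or AUC cannot even be computed meaningfully on them. Shuffling helps on average, but only stratification controls the class proportions in every fold.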

Hyperparameter Search

Manual tuning is unreliable and unreproducible. Two systematic search methods:

  • Grid Search: enumerate all parameter combinations
  • Random Search: randomly sample in the parameter space

python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'n_estimators': [50, 100, 200],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best validation score: {grid_search.best_score_:.3f}")
# score() applies the same 'f1' scorer, using the best estimator refit on all of X_train
print(f"Test set score: {grid_search.score(X_test, y_test):.3f}")

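Random Search is often more efficient in high-dimensional parameter spaces: the number of grid points grows exponentially with the number of parameters, while Random Search, for the same budget, tries more distinct values of each individual parameter. That matters most when only a few of the parameters strongly affect performance.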
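
For comparison, the grid shown earlier has 4 × 3 × 3 = 36 combinations, so with cv=5 it fits 180 models before the final refit. A random search fixes the budget instead. The sketch below is one possible setup on synthetic data; the ranges, n_iter=8, and cv=3 are arbitrary choices made to keep it fast, not recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic dataset so the sketch runs quickly.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Distributions instead of fixed lists; randint(low, high) excludes the upper bound.
param_distributions = {
    "max_depth": randint(3, 11),         # integers 3..10
    "n_estimators": randint(20, 61),     # integers 20..60
    "min_samples_split": randint(2, 11), # integers 2..10
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=8,         # the budget: 8 sampled configurations, whatever the space size
    cv=3,
    scoring="f1",
    random_state=42,  # makes the sampled configurations reproducible
)
random_search.fit(X, y)

print(f"Configurations tried: {len(random_search.cv_results_['params'])}")
print(f"Best parameters: {random_search.best_params_}")
print(f"Best validation score: {random_search.best_score_:.3f}")
```

Adding a fourth parameter to param_distributions leaves the cost at n_iter × cv fits, whereas the same change would multiply the size of a grid.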

Note: GridSearchCV's cv parameter already runs cross-validation inside the search, so together with a held-out test set the data plays three roles: training folds fit each candidate, validation folds compare the candidates, and the test set, touched only once at the end, gives the final estimate. This is the correct approach and doesn't leak test set information.

Common Pitfalls

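  • Using the test set in GridSearchCV: fit the search on training data only. If test samples are passed to fit, they end up in the internal folds that select the hyperparameters, and the later test score is optimistically biased. GridSearchCV handles validation internally; keep a separate test set outside it.
  • Leaving scoring unset: GridSearchCV then falls back to the estimator's default score (accuracy for classifiers). Set the scoring parameter explicitly ('f1', 'roc_auc', 'accuracy', etc.), choosing the metric most relevant to the business objective. If you track several metrics at once, also use refit to name the one that selects the final model.
  • Cross-validation K value: K=5 or K=10 are common. With K too small (like 2), each model sees only half the data, so the estimate is pessimistic and unstable; with K too large (like 20), the training sets overlap almost entirely and the computational cost is high.
  • Data leakage in cross-validation: any preprocessing that "looks at the entire dataset" (like PCA or standardization over the whole X) must be fitted inside the cross-validation loop, on the training folds only, for example by wrapping the preprocessing and the model in a scikit-learn Pipeline.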
  • Selecting the best model after multiple experiments can overfit the validation set due to multiple comparisons—nested cross-validation or an independent test set is needed.
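
The last two pitfalls can be addressed together. The sketch below is one possible arrangement, not the only one: a Pipeline keeps the scaler inside each training fold, GridSearchCV tunes on inner folds, and an outer cross_val_score evaluates the whole tuning procedure (nested cross-validation). The choice of logistic regression, the C grid, and the fold counts are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is a pipeline step, so it is re-fitted on each training fold only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # picks C
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # estimates performance

# "clf__C" addresses the C parameter of the pipeline step named "clf".
search = GridSearchCV(
    pipe,
    {"clf__C": [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="roc_auc",
)

# Each outer fold runs a full grid search on its training part, then scores on its held-out part.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")

print(f"Outer-fold AUCs: {nested_scores}")
print(f"Mean: {nested_scores.mean():.3f} (+/- {nested_scores.std():.3f})")
```

The outer scores describe how well "tune, then fit" generalizes, because no outer validation sample ever influences the scaler statistics or the choice of C for its own fold.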

Pass Challenges

  • Warm-up (10 min): On sklearn's breast_cancer dataset, train logistic regression, Random Forest, and XGBoost. Compare the three models' average AUC with 5-fold cross-validation.
  • Challenge (30 min): Use GridSearchCV to tune an XGBoost model with a search space including max_depth (3~10), learning_rate (0.01,0.1,0.3), subsample (0.6~1.0). Print the best parameters and test set F1 score.
  • Observation: On the ROC curve, annotate points corresponding to different thresholds (0.3, 0.5, 0.8). Observe how threshold changes affect the Precision-Recall trade-off.

Traveler's Notes

No evaluation, no progress. Model evaluation isn't a score-chasing game—it's the conversation between you and the model: you ask "what did you learn," and the evaluation metrics answer "these I know, these I'm not sure about." Master it, and you can truly compare models, select models, and trust models.

-> Next Chapter Preview

With evaluation in place, you stand at the end of classical ML and the starting line of deep learning. Next chapter: Neural Networks, from Perceptron to Backpropagation.

Built with VitePress | Software Systems Atlas