
Metadata Card

  • Prerequisites: Chapter 5 (Statistical Inference), Math C (Linear Algebra)
  • Estimated time: 55 minutes
  • Core difficulty: Advanced
  • Reading mode: High focus
  • Completion: Able to build a linear regression model, perform complete model diagnostics, and use regularization when necessary

Your Progress

In the previous chapter you learned how to infer populations from samples in the Data Prophecy Hall: the A/B test showed that after the lighthouse redesign, the error rate dropped by 2.3 percentage points with a p-value of 0.03, statistically significant at the 5% level.

But the intelligence officer asked again: 'Can you predict? If supplies increase by 10%, how much would the mission completion rate improve?' This is more than inference—it's modeling: finding relationships between variables and using them to predict.

You remembered the linear algebra you learned in the Math Tower. Those vectors and matrices finally found their battlefield.

Your Task

You want to predict a mission's success rate. You know some factors: execution time, team size, resource investment. Intuition tells you that you can use these factors to fit a model. Linear regression is the first modeling tool you should learn—it not only makes predictions but also tells you the magnitude of each factor's impact.


From Correlation to Regression

In Chapter 3, you drew a scatter plot of duration vs success_rate and saw they had a roughly linear relationship. Linear regression turns this "rough" line into a precise formula:

success_rate = b0 + b1 * duration + noise

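b0 is the intercept (the expected success_rate when duration is 0) and b1 is the slope: the expected change in success_rate for each one-unit increase in duration. The noise term collects everything the line does not explain.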
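To make the formula concrete, here is a minimal sketch that recovers b0 and b1 by least squares using only NumPy. The data is synthetic and the "true" values 0.9 and -0.004 are assumptions chosen for illustration; they are not taken from the mission dataset.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the mission dataset)
rng = np.random.default_rng(0)
duration = rng.uniform(10, 120, size=200)
true_b0, true_b1 = 0.9, -0.004
success_rate = true_b0 + true_b1 * duration + rng.normal(0, 0.02, size=200)

# Design matrix: a column of ones (for the intercept) next to the feature
A = np.column_stack([np.ones_like(duration), duration])

# Least squares picks b0 and b1 to minimize the sum of squared residuals
(b0, b1), *_ = np.linalg.lstsq(A, success_rate, rcond=None)

print(f"b0 = {b0:.3f}, b1 = {b1:.5f}")
```

The estimates should land close to the values used to generate the data, which is all that "fitting" means here.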

python
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("missions_clean.csv")

# Prepare features and target
X = df[["duration_minutes", "team_size", "resources_used"]]
y = df["success_rate"]

# statsmodels method — see the full statistical report
X_sm = sm.add_constant(X)  # add intercept term
model_sm = sm.OLS(y, X_sm).fit()
print(model_sm.summary())

Key fields in the output:

  • R-squared: The share of the target's variance explained by the model. 0.85 means 85% of the variation in success rate on the training data is explained by these features.
  • coef: The estimated coefficient for each feature: the expected change in success_rate for a one-unit increase in that feature, holding the other features constant. The sign gives the direction of the association.
  • P>|t|: The p-value for each coefficient. If p > 0.05, the data do not provide evidence at the 5% level that the coefficient differs from zero.
  • F-statistic: Tests whether the model as a whole explains more than an intercept-only model. The accompanying Prob (F-statistic) is its p-value.
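The R-squared field can be reproduced by hand, which helps when reading the report. The sketch below uses synthetic data (an assumption for illustration) and computes R-squared as one minus the ratio of unexplained to total variation.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=300)  # synthetic, for illustration

# Fit intercept and slope by least squares
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef

ss_res = np.sum((y - fitted) ** 2)      # variation left unexplained
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation around the mean
r_squared = 1.0 - ss_res / ss_tot

print(f"R-squared: {r_squared:.3f}")
```

With a single feature and an intercept, this value equals the squared correlation between x and y.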

Model Diagnostics

Getting the model is just step one. You must check if the model is reasonable. The most common diagnostic checks are for four issues:

1. Are residuals randomly distributed?

Residuals = actual values - predicted values. In a good model, residuals should be randomly scattered around 0 with no obvious pattern.

python
import matplotlib.pyplot as plt

predictions = model_sm.predict(X_sm)
residuals = y - predictions

plt.figure(figsize=(12, 4))

# Residuals vs fitted values
plt.subplot(1, 3, 1)
plt.scatter(predictions, residuals, alpha=0.3)
plt.axhline(y=0, color="r", linestyle="--")
plt.xlabel("Predicted")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted")

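If this plot shows a "fan shape" (residuals spread out as predicted values increase), it indicates heteroscedasticity: the error variance is not constant. Consider weighted least squares, a transformation of the target variable (such as a log), or heteroscedasticity-robust standard errors.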

2. Are residuals normally distributed?

python
# Q-Q plot
plt.subplot(1, 3, 2)
sm.qqplot(residuals, line='s', ax=plt.gca())
plt.title("Q-Q Plot")

If the points are near the red line, residuals are approximately normal. If the endpoints deviate significantly, it may indicate a heavy-tailed distribution.
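A rough numeric companion to the Q-Q plot is the excess kurtosis of the residuals: near 0 for normal data, clearly positive for heavy tails. The sketch below uses simulated "residuals" (an assumption for illustration) and a hypothetical helper named excess_kurtosis.

```python
import numpy as np

def excess_kurtosis(values):
    """Sample excess kurtosis: about 0 for normal data, positive for heavy tails."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    variance = np.mean(centered ** 2)
    return np.mean(centered ** 4) / variance ** 2 - 3.0

rng = np.random.default_rng(3)
normal_resid = rng.normal(size=5000)
heavy_resid = rng.standard_t(df=3, size=5000)  # heavier tails than normal

print(f"normal:       {excess_kurtosis(normal_resid):.2f}")
print(f"heavy-tailed: {excess_kurtosis(heavy_resid):.2f}")
```

This is a summary number, not a replacement for the plot: the Q-Q plot also shows where the deviation occurs.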

3. Are there high-influence outliers?

python
# Cook's distance
from statsmodels.stats.outliers_influence import OLSInfluence
influence = OLSInfluence(model_sm)
cooks_d = influence.cooks_distance[0]

plt.subplot(1, 3, 3)
plt.stem(range(len(cooks_d)), cooks_d)
plt.title("Cook's Distance")
plt.tight_layout()
plt.show()

A common rule of thumb treats points with Cook's distance above 4/n (where n is the sample size) as high-influence points worth inspecting.
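To see what the 4/n rule does in practice, the sketch below computes Cook's distance by hand with NumPy on synthetic data (an assumption for illustration) in which one gross outlier has been planted at the last row, then lists the rows above the threshold.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
x = np.linspace(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[-1] += 15.0  # plant one gross outlier at the edge of the x range

X = np.column_stack([np.ones(n), x])   # design matrix with intercept
p = X.shape[1]                         # number of fitted parameters

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Leverage: diagonal of the hat matrix H = X (X'X)^-1 X'
leverage = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
mse = resid @ resid / (n - p)

# Cook's distance combines residual size with leverage
cooks_d = resid**2 / (p * mse) * leverage / (1.0 - leverage) ** 2
flagged = np.flatnonzero(cooks_d > 4 / n)
print("Flagged observations:", flagged)
```

A flagged point is a prompt to investigate (data entry error, unusual mission, genuine extreme), not an automatic reason to delete it.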

4. Multicollinearity

If two features are highly correlated, the model can't distinguish their individual effects, and coefficients become unstable.

python
from statsmodels.stats.outliers_influence import variance_inflation_factor

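# Use the design matrix with the constant; VIFs computed without an intercept are misleading
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X_sm.values, i) for i in range(1, X_sm.shape[1])]  # skip const at index 0
print(vif_data)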

As a rule of thumb, VIF > 10 indicates severe multicollinearity. Solutions: drop one of the highly correlated features, or use regularized regression.
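The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below implements that definition directly with NumPy in a hypothetical helper named vif. The three synthetic features are assumptions for illustration: x2 is nearly a copy of x1, and x3 is unrelated.

```python
import numpy as np

def vif(features):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    features = np.asarray(features, dtype=float)
    n, k = features.shape
    result = np.empty(k)
    for j in range(k):
        target = features[:, j]
        # Regress column j on an intercept plus every other column
        others = np.column_stack([np.ones(n), np.delete(features, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        ss_res = np.sum((target - others @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        result[j] = ss_tot / ss_res   # algebraically equal to 1 / (1 - R_j^2)
    return result

rng = np.random.default_rng(5)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # nearly a copy of x1
x3 = rng.normal(size=500)                   # unrelated to the others

vif_values = vif(np.column_stack([x1, x2, x3]))
print(vif_values.round(1))
```

The two near-duplicate features should show very large values while the unrelated one stays close to 1.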

Regularized Regression

When there are too many features, when features are correlated, or when you suspect overfitting, add a regularization (penalty) term to the loss function.

python
from sklearn.linear_model import Ridge, Lasso

# Ridge regression (L2) — shrinks coefficients but doesn't zero them out
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)

# Lasso (L1) — can push unimportant feature coefficients to zero
lasso = Lasso(alpha=0.01)
lasso.fit(X, y)

# Elastic Net — combines L1 and L2
from sklearn.linear_model import ElasticNet
elastic = ElasticNet(alpha=0.01, l1_ratio=0.5)
elastic.fit(X, y)

print("Lasso coefficients:", dict(zip(X.columns, lasso.coef_)))

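Regularized regression is especially useful when the number of features exceeds the sample size, or when features are highly correlated. Because the penalty acts on coefficient magnitudes, standardize the features first so that the penalty treats them equally.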
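Regularization penalties depend on the scale of each feature, so a common pattern is to standardize inside a pipeline before fitting. The sketch below is one way to do that with scikit-learn. The two synthetic features and alpha=0.5 are assumptions chosen so the effect is easy to see; they are not tuned values.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
n = 200
signal = rng.normal(size=n)
X = np.column_stack([
    100.0 * signal,               # informative feature on a large scale
    0.01 * rng.normal(size=n),    # pure-noise feature on a tiny scale
])
y = 3.0 * signal + rng.normal(size=n)

# Scaling first puts every coefficient on the same footing for the penalty
model = make_pipeline(StandardScaler(), Lasso(alpha=0.5))
model.fit(X, y)

# Coefficients are expressed per standard deviation of each feature
coefs = model.named_steps["lasso"].coef_
print(dict(zip(["informative", "noise"], coefs)))
```

Lasso should keep the informative feature and set the noise feature's coefficient to exactly zero. In real use, alpha is usually chosen by cross-validation rather than fixed by hand.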

Feature Selection Principles

More features do not mean a better model. Adding more features:

  • Never lowers training R², and can push it all the way to 1.0 (overfitting)
  • Can decrease test set performance
  • Increases model complexity and interpretation difficulty
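The gap between training and test performance is easy to demonstrate. In the sketch below, one real feature is padded with 30 columns of pure noise. All data is synthetic, and the sizes (40 training rows, 200 test rows) are assumptions chosen to make the effect visible.

```python
import numpy as np

def r2(y_true, y_pred):
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit OLS with an intercept on the training rows; return (train R2, test R2)."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return r2(y_train, A_train @ coef), r2(y_test, A_test @ coef)

rng = np.random.default_rng(7)
n_train, n_test, n_noise = 40, 200, 30
real = rng.normal(size=(n_train + n_test, 1))
noise_features = rng.normal(size=(n_train + n_test, n_noise))
y = 2.0 * real[:, 0] + rng.normal(size=n_train + n_test)

X_small = real                                # the one feature that matters
X_big = np.hstack([real, noise_features])     # plus 30 irrelevant columns

small = fit_and_score(X_small[:n_train], y[:n_train], X_small[n_train:], y[n_train:])
big = fit_and_score(X_big[:n_train], y[:n_train], X_big[n_train:], y[n_train:])
print(f"1 feature:   train R2 = {small[0]:.3f}, test R2 = {small[1]:.3f}")
print(f"31 features: train R2 = {big[0]:.3f}, test R2 = {big[1]:.3f}")
```

The larger model fits the training rows at least as well, yet generalizes worse, because the extra coefficients were fitted to noise.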

Use adjusted R² or AIC/BIC to balance fit and complexity.

python
print("Adjusted R-squared:", model_sm.rsquared_adj)
print("AIC:", model_sm.aic)
print("BIC:", model_sm.bic)
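For reference, adjusted R² can be written as 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of features (excluding the intercept). A minimal sketch, with a hypothetical helper named adjusted_r2 and made-up inputs:

```python
def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), for p features plus an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Same raw R2 of 0.85 on 100 rows, but a different number of features
print(adjusted_r2(0.85, n_obs=100, n_features=3))
print(adjusted_r2(0.85, n_obs=100, n_features=30))
```

The same raw R² of 0.85 is worth roughly 0.845 with 3 features but only about 0.785 with 30, which is the penalty for complexity at work.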

Common Pitfalls

  • Training without diagnostics. You fit a model with R²=0.99, but the residual plot shows a clear nonlinear pattern—the model is wrong.
  • Treating high R² as a good model. R²=0.999 could signal overfitting, or data leakage, where a feature encodes information about the target that would not be available at prediction time.
  • Ignoring collinearity. When two features are highly correlated, the model gives coefficients that "look reasonable but are actually meaningless."
  • Reusing the test set for selection. You tuned parameters on the test set and then used the same test set for evaluation—your evaluation is biased. Remember to keep a validation set.
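The last pitfall is avoided by splitting the data three ways: tune on the validation set and touch the test set only once, at the end. The sketch below shows one simple way to do this with NumPy. The helper three_way_split and the 60/20/20 proportions are assumptions for illustration.

```python
import numpy as np

def three_way_split(n_rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle row indices once and cut them into train / validation / test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(n_rows * test_fraction)
    n_val = int(n_rows * val_fraction)
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The index arrays can then be used with df.iloc to slice features and target consistently across the three sets.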

Pass Challenges

  • Warm-up: Load a dataset, fit a linear regression with statsmodels, and explain the meaning of each coefficient.
  • Challenge: Perform a full model diagnostic—residual plots, Q-Q plot, VIF, Cook's Distance—find and fix at least one issue.
  • Troubleshooting: Your model has R²=0.99 but performs poorly on the test set. Use diagnostics to find the cause of overfitting.

Acceptance Criteria

  • Can build an OLS regression model and interpret the output report
  • Can perform four basic model diagnostics (residual randomness, normality, influence points, multicollinearity)
  • Knows when to use regularized regression
  • Understands the difference between R² and Adjusted R²

Traveler's Notes

Fitting the model is just the beginning. The real work is in the diagnostic phase—checking whether your model has made stupid mistakes.


Next Chapter Preview

The model is built. But what features can you feed into it? Raw data is almost never directly usable—next chapter, Feature Engineering teaches you how to extract useful features from raw data.
