Software Systems Atlas

Metadata Card

Prerequisites: Chapter 4 (SQL for Analysis), Chapter 6 (Linear Regression)
Estimated time: 50 minutes
Core difficulty: Advanced
Reading mode: High focus
Completion: Able to perform complete feature engineering on raw data—encoding, scaling, binning, crossing, selection

Your Progress

Linear regression ran successfully. You used a formula to link supply levels to mission completion rates—each 10% increase in supply lifted the completion rate by approximately 2.1 percentage points.

But the intelligence officer pushed more data in front of you: temperature, humidity, sentry experience levels, equipment grades, days since last resupply... You realized these aren't readily available input variables. They need processing—temperature needs binning, equipment levels need encoding, time needs feature extraction.

The real modeling work isn't tuning the model—it's creating features.

Your Task

In Chapter 6, you used three numeric features (duration_minutes, team_size, resources_used) for regression. In real data, features are rarely usable as they arrive: there is categorical text, timestamps, null values, and overly sparse columns. Feature engineering transforms raw data into a form that models can digest. Good feature engineering can have a bigger impact on results than model selection.

Feature Engineering Is Not Optional

The same model trained on raw data and on well-engineered features can perform very differently. This is because models (especially linear models) make assumptions about the form of their inputs: numerical values should be continuous, categorical values should be encoded as numbers, and scales should be comparable.

Numerical Features: Scaling

When feature scales differ dramatically (one feature ranges 0-1, another 0-1000000), gradient descent converges slowly or oscillates, and regularization penalizes features unevenly because the size of each coefficient depends on the units of its feature.

Features don't auto-align like the crystal screens in the Data Prophecy Hall. StandardScaler, MinMaxScaler, and RobustScaler are three alignment tools—your choice depends on whether your data has outliers.

python

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import pandas as pd

df = pd.read_csv("missions_clean.csv")

# StandardScaler: subtract mean, divide by std → mean=0, variance=1
scaler = StandardScaler()
df[["duration_scaled"]] = scaler.fit_transform(df[["duration_minutes"]])

# MinMaxScaler: scale to [0, 1] range
mms = MinMaxScaler()
df[["resources_scaled"]] = mms.fit_transform(df[["resources_used"]])

# RobustScaler: uses median and IQR, insensitive to outliers
rs = RobustScaler()
df[["team_size_scaled"]] = rs.fit_transform(df[["team_size"]])

How to choose? If your data has no outliers, StandardScaler is the default. If there are many outliers, RobustScaler is safer. If the model is sensitive to scale and you need a fixed range (e.g., neural networks with sigmoid activation), use MinMaxScaler.
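
To make the outlier argument concrete, here is a minimal sketch on a made-up column (the five values are hypothetical, not taken from missions_clean.csv). It measures how much room the four typical values keep after each scaler has dealt with one extreme value.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical toy column: four typical values and one extreme outlier
values = np.array([[10.0], [12.0], [11.0], [13.0], [1000.0]])

standard = StandardScaler().fit_transform(values).ravel()
minmax = MinMaxScaler().fit_transform(values).ravel()
robust = RobustScaler().fit_transform(values).ravel()

# Spread (max - min) of the four typical values after each transformation
typical_spread = {
    "standard": standard[:4].max() - standard[:4].min(),
    "minmax": minmax[:4].max() - minmax[:4].min(),
    "robust": robust[:4].max() - robust[:4].min(),
}
print(typical_spread)
```

With StandardScaler and MinMaxScaler the outlier dominates the statistics, so the typical values are squeezed into a very narrow band. RobustScaler keeps them well separated because the median and IQR barely react to a single extreme value.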

Categorical Features: Encoding

Models eat numbers, not strings. You need to convert categorical text into numerical values.

Categorical encoding is one of the easiest steps to get wrong before modeling. One-Hot turns each category into a separate 0/1 column—simple and direct but inflates dimensionality. When facing hundreds of city names, you need a more compact encoding strategy.

python

from sklearn.preprocessing import OneHotEncoder

# One-Hot Encoding — suitable for low-cardinality categories
df_encoded = pd.get_dummies(df, columns=["mission_type"], prefix="type")

# Or use sklearn's OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = encoder.fit_transform(df[["mission_type"]])

If-Else decision:

Number of categories = 2 → encode directly as 0/1
Fewer than 10 unordered, mutually exclusive categories → One-Hot Encoding
Many categories (hundreds of city names) → target encoding or frequency encoding
Ordered categories (low < medium < high) → Ordinal Encoding
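
The high-cardinality branch of the decision list can be sketched with plain pandas. The city names, the completion_rate target, and the fallback values for unseen categories are all assumptions made for illustration. This is also the simplest unsmoothed form of target encoding, which can overfit on rare categories.

```python
import pandas as pd

# Hypothetical training data: a high-cardinality column and a numeric target
train = pd.DataFrame({
    "city": ["Aria", "Aria", "Brel", "Brel", "Brel", "Cyn"],
    "completion_rate": [0.9, 0.7, 0.5, 0.6, 0.4, 0.8],
})

# Frequency encoding: share of training rows that fall in each category
freq_map = train["city"].value_counts(normalize=True)
train["city_freq"] = train["city"].map(freq_map)

# Target encoding: mean target per category, learned on training rows only
target_map = train.groupby("city")["completion_rate"].mean()
global_mean = train["completion_rate"].mean()
train["city_target"] = train["city"].map(target_map)

# Categories unseen during training fall back to neutral defaults
new_rows = pd.DataFrame({"city": ["Brel", "Dune"]})
new_rows["city_freq"] = new_rows["city"].map(freq_map).fillna(0.0)
new_rows["city_target"] = new_rows["city"].map(target_map).fillna(global_mean)
print(new_rows)
```

Both encodings replace one text column with a single numeric column, so the dimensionality stays flat no matter how many cities appear. Because the maps are learned from data, they should be built from the training set only and then reused on new rows.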

python

# Ordinal Encoding: suitable for ordered categories (low < medium < high)
from sklearn.preprocessing import OrdinalEncoder
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
# fit_transform returns a 2D array; ravel() flattens it into a single column
df["skill_level_encoded"] = ordinal.fit_transform(df[["skill_level"]]).ravel()

Time Features: Decomposition

Timestamps are among the most underestimated information sources. A single timestamp can be decomposed into a dozen features.

Hour, weekday, month, and a weekend flag can each carry predictive signal. Note that a linear model has no way of knowing that hour 23 and hour 0 are adjacent, so periodic features need a sin/cos transformation.

python

df["timestamp"] = pd.to_datetime(df["timestamp"])

df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["month"] = df["timestamp"].dt.month
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
df["day_of_year"] = df["timestamp"].dt.dayofyear
df["elapsed_days"] = (df["timestamp"] - df["timestamp"].min()).dt.days

Some temporal patterns are periodic (hour, weekday). For linear models, using hour directly (0-23) leads the model to think that 23 and 0 are far apart. A better approach is to use sin/cos transformation:

python

import numpy as np

df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
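
As a quick check of why this works, the sketch below places each hour on the unit circle and measures the distance between pairs of hours. The helper names encode_hour and circle_distance are hypothetical, introduced only for this illustration.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a (sin, cos) point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

def circle_distance(hour_a, hour_b):
    """Euclidean distance between two encoded hours."""
    return float(np.linalg.norm(encode_hour(hour_a) - encode_hour(hour_b)))

print(circle_distance(23, 0))   # one hour apart, across midnight
print(circle_distance(11, 12))  # one hour apart, at midday
print(circle_distance(0, 12))   # opposite sides of the clock
```

The first two distances come out equal, so the jump across midnight is no longer special. Both columns are needed: sin alone or cos alone maps two different hours to the same value.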

Missing Value Handling

Missing value handling in feature engineering is more nuanced than in data cleaning. You can't simply fill and forget, because the fill values learned during training must be reused unchanged when the model later scores new data.

python

from sklearn.impute import SimpleImputer

# Numeric columns: fill with median
num_imputer = SimpleImputer(strategy="median")
df[["resources_used"]] = num_imputer.fit_transform(df[["resources_used"]])

# Categorical columns: fill with mode
cat_imputer = SimpleImputer(strategy="most_frequent")
df[["region"]] = cat_imputer.fit_transform(df[["region"]])

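A better approach is to also add an "is missing" indicator column. Create it before imputing, because after imputation isna() is always False:

python

# Run this before the SimpleImputer step above overwrites the NaN values
df["resources_missing"] = df["resources_used"].isna().astype(int)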

This tells the model that "this value was originally missing"—the information itself may have predictive power.
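
Both ideas, reusing the training fill value and keeping a missing flag, can be combined in one step with SimpleImputer's add_indicator option. This is a minimal sketch on made-up numbers; the arrays train_resources and new_resources are hypothetical stand-ins for the resources_used column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical resources_used values; np.nan marks a missing entry
train_resources = np.array([[10.0], [np.nan], [30.0], [50.0]])
new_resources = np.array([[np.nan], [20.0]])

# add_indicator=True appends a 0/1 "was missing" column to the output
imputer = SimpleImputer(strategy="median", add_indicator=True)
train_out = imputer.fit_transform(train_resources)

# transform() reuses the median learned from the training data
new_out = imputer.transform(new_resources)

print(imputer.statistics_)  # fill value learned during fit
print(new_out)              # columns: imputed value, missing flag
```

Note that the indicator column is only produced for features that had missing values during fit, so a column that is complete in training gets no flag.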

Feature Selection

You don't necessarily need to use all features. Three common approaches to feature selection:

python

from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.ensemble import RandomForestRegressor

# X is the engineered feature DataFrame and y the target Series (both assumed to exist)

# Method 1: Univariate selection (k must not exceed the number of columns in X)
kbest = SelectKBest(score_func=f_regression, k=10)
selected = kbest.fit_transform(X, y)

# Method 2: Recursive feature elimination
rfe = RFE(RandomForestRegressor(), n_features_to_select=10)
rfe.fit(X, y)
print("Selected features:", X.columns[rfe.support_])

# Method 3: Regularization-based automatic selection — Lasso pushes unimportant feature coefficients to zero
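
Method 3 has no code above, so here is one possible sketch on synthetic data. The data, the names X_demo and y_demo, and the value alpha=0.1 are assumptions chosen for illustration, not settings from the mission dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): only features 0 and 1 drive y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Lasso penalizes coefficient size, so features should share a scale first
X_scaled = StandardScaler().fit_transform(X_demo)

# alpha controls the penalty strength; 0.1 is an arbitrary choice for this sketch
lasso = Lasso(alpha=0.1).fit(X_scaled, y_demo)

# Features whose coefficient is exactly zero have been dropped by the penalty
kept = np.flatnonzero(lasso.coef_)
print("Coefficients:", np.round(lasso.coef_, 2))
print("Kept feature indices:", kept)
```

In practice alpha is usually tuned by cross-validation (for example with LassoCV) rather than fixed by hand. A larger alpha removes more features, a smaller one keeps more.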

Common Pitfalls

Scaling and encoding before splitting data. This causes data leakage: statistics from the held-out rows (means, standard deviations, category lists) leak into training. Always split first, fit on the training set only, then apply transform to both the training set and the test set.
Too many features, too few samples. Each additional feature requires more samples for stable estimation. Rule of thumb: sample count should be at least 10× the feature count.
Ignoring feature interactions. Combined features like duration * team_size may have more predictive power than the two separate features.
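
The first and third pitfalls can be sketched together. The data below is synthetic, and the column duration_x_team and the target formula are assumptions made for illustration. The point is the ordering: a row-wise cross feature is safe to build before the split, while anything that learns statistics belongs inside a pipeline fitted on the training rows.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the mission data (values are made up)
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "duration_minutes": rng.uniform(30, 300, size=200),
    "team_size": rng.integers(2, 12, size=200).astype(float),
})
# Assumed target that depends on the product of the two features
target = 0.01 * data["duration_minutes"] * data["team_size"] + rng.normal(0, 0.5, size=200)

# A row-wise cross feature uses no dataset statistics, so it is safe before the split
data["duration_x_team"] = data["duration_minutes"] * data["team_size"]

# Split first; every step that learns statistics is fitted on the training part only
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.25, random_state=0
)

model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)  # the scaler's mean and std come from X_train alone
print("Test R^2:", round(model.score(X_test, y_test), 3))
```

Wrapping the scaler and the model in one pipeline means model.score and model.predict apply the training statistics to new rows automatically, which removes the most common way leakage sneaks in.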

Pass Challenges

Warm-up: Find all categorical columns in your dataset that need encoding, and choose the right encoding strategy.
Challenge: Complete a pipeline from raw data to feature matrix—including scaling, encoding, time decomposition, missing value indicators—and demonstrate that feature engineering improved model performance.
Observation: Compare the differences in linear regression coefficients after using StandardScaler vs RobustScaler.

Acceptance Criteria

Can choose the correct encoding and scaling strategy based on feature type
Knows how to decompose temporal features
Understands why scaling should be done after data splitting (to avoid data leakage)
Can perform feature selection using at least one method

Traveler's Notes

Good features make simple models powerful. The time spent on feature engineering is one of the highest-ROI investments in the modeling phase.

Next Chapter Preview

Features are ready, the model is ready. But are you sure the "effect" you see is real? Next chapter, Sampling & Causal Inference.