Chapter 8: Sampling & Causal Inference

Metadata Card

Prerequisites: Chapter 5 (Statistical Inference)
Estimated time: 50 minutes
Core difficulty: Advanced
Reading mode: High focus
Completion: Able to design a sampling scheme and distinguish correlation from causation

Your Progress

You built a set of features, trained the model, and prediction accuracy was reasonable. But a sharper question surfaced:

You found in the training data that mission attendance rates were generally higher on the front lines than in the rear. Can you conclude that a front-line posting raises attendance? No. More elite sentries may simply have been deployed to the front lines already, so the difference in the data is not necessarily causal.

You've discovered that between 'correlation' and 'causation' in data, there lies a gap you hadn't fully appreciated before.

Your Task

In Chapter 3 you discovered that "longer training time correlates with higher mission success rates," and in Chapter 5 you confirmed the difference is statistically significant. But can you say "increasing training time causes higher success rates"? No. One possible explanation: teams that were already proficient choose to spend more time training, so the causal direction is reversed. This chapter teaches you to distinguish correlation from causation and to use data for causal inference.

Sampling: The First Gateway of Data Collection

The data you analyze never includes every individual in the population—you can only sample. The sampling method directly determines how broadly your analytical conclusions can be generalized.

The crystal screens in the Prophecy Hall only reflect what they can see. Whether you choose simple random sampling, stratified sampling, or systematic sampling determines whether your conclusions hold up.

python

import pandas as pd
import numpy as np

# Simple random sampling
sample = df.sample(n=1000, random_state=42)

# Stratified sampling — proportional sampling from each mission type
def stratified_sample(df, strata_col, frac=0.1):
    # GroupBy.sample draws within each stratum and keeps the strata column
    return df.groupby(strata_col).sample(frac=frac, random_state=42)

stratified = stratified_sample(df, "mission_type", frac=0.1)

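# Systematic sampling: take every k-th row, starting from a random offset
k = max(1, len(df) // 1000)  # guard against k = 0 when df has fewer than 1000 rows
start = np.random.default_rng(42).integers(k)
systematic = df.iloc[start::k]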

Each method has its use case:

Method	When to Use	Risk
Simple Random	Population is uniform, no stratification	Minority groups may be completely missed
Stratified	Uneven group sizes, need representation from each group	Need to know group size proportions
Systematic	Data is ordered, can't access randomly	May introduce bias if data has periodic patterns
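
To make the first two rows of the table concrete, here is a minimal sketch on synthetic data. The column name mission_type follows the example above; the mission types (patrol, escort, rescue) and their sizes are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical population: one common mission type and two rare ones
df = pd.DataFrame({
    "mission_type": ["patrol"] * 9000 + ["escort"] * 900 + ["rescue"] * 100,
})
df["success_rate"] = np.random.default_rng(0).uniform(0, 1, size=len(df))

# Simple random sample: the number of rare "rescue" rows varies by chance
simple = df.sample(n=1000, random_state=42)

# Stratified sample: exactly 10% of every mission type
stratified = df.groupby("mission_type").sample(frac=0.1, random_state=42)

print("Simple random sample:")
print(simple["mission_type"].value_counts())
print("Stratified sample:")
print(stratified["mission_type"].value_counts())
```

The stratified counts are fixed by design, while the simple random counts for the rare group change with every seed. With a smaller sample, the rare group could disappear from the simple random sample entirely.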

Much sampling bias comes from samples that are not drawn at random. If you only analyze data from the past week, and that week happened to include a special event, your conclusions won't generalize to other time periods.

Correlation ≠ Causation

This is the first commandment of statistics. Three scenarios illustrate why:

Reverse causation: The relationship exists, but the direction is reversed. "Longer training → higher success" might actually be "teams that already succeed more often choose to train longer."
Confounding factors: A hidden third variable affects both the cause and effect. "Missions traversing vast wilderness have higher success rates"—is it because crossing wilderness is beneficial, or because only the most experienced teams are assigned such tasks?
Coincidence: If you run 100 independent tests on variables that are truly unrelated, you still expect about 5 to show p < 0.05. Test enough pairs and some spurious correlations will always turn up.
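
The coincidence scenario is easy to reproduce. The sketch below is an illustration on pure noise, not part of the chapter's dataset: it tests 100 pairs of variables that are independent by construction and counts how many look significant. The sample size of 50 and the hard-coded critical value are assumptions chosen to keep the example free of extra dependencies.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tests, n_obs = 100, 50

# Two-sided 5% critical value of Student's t with n_obs - 2 = 48 degrees of freedom
T_CRIT = 2.0106

false_positives = 0
for _ in range(n_tests):
    # Two variables that are independent by construction
    x = rng.normal(size=n_obs)
    y = rng.normal(size=n_obs)
    r = np.corrcoef(x, y)[0, 1]
    # t statistic for testing whether the correlation is zero
    t = r * np.sqrt((n_obs - 2) / (1 - r**2))
    if abs(t) > T_CRIT:
        false_positives += 1

print(f"{false_positives} of {n_tests} unrelated pairs look 'significant' at p < 0.05")
```

The exact count depends on the seed, but it should hover around 5 even though no real relationship exists in any pair.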

Causal Inference Toolkit

When you need to establish causation, these are the main tools. Start with an experiment if you can run one; the later methods are for observational data.

Randomized Controlled Experiments

The most reliable method, and the gold standard of causal inference. You randomly assign experimental units to treatment and control groups, which makes every other feature, measured or not, comparable between the groups in expectation. Randomization does not remove confounders from the world; it balances them across groups, so the observed difference can be attributed to the treatment itself.

python

# Random assignment simulation
np.random.seed(42)
n_missions = len(df)

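# Randomly assign each mission to control (0) or treatment (1).
# This must happen before the missions run: attaching random labels to
# outcomes that were already recorded yields a difference near zero.
df["treatment"] = np.random.choice([0, 1], size=n_missions, p=[0.5, 0.5])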

# Compare average success rates between groups
treatment_effect = (
    df[df["treatment"] == 1]["success_rate"].mean()
    - df[df["treatment"] == 0]["success_rate"].mean()
)
print(f"Treatment effect: {treatment_effect:.3f}")

The key value of randomization: in expectation, the two groups are comparable on every feature except the treatment, including features you never measured. The guarantee holds only in expectation, so a small experiment can still end up unbalanced by chance.
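
A small simulation can show what randomization buys you. Everything in this sketch is assumed for illustration: the data is synthetic, experience is a made-up confounder, and the true effect is fixed at 0.10 so the two estimates can be compared against a known answer.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
TRUE_EFFECT = 0.10  # assumed effect of the treatment on the outcome

# Hypothetical confounder: more experienced teams succeed more often
experience = rng.normal(size=n)

def outcome(treatment):
    """Outcome depends on experience, the treatment, and noise."""
    noise = rng.normal(scale=0.2, size=n)
    return 0.3 * experience + TRUE_EFFECT * treatment + noise

def mean_difference(y, treatment):
    """Difference in mean outcome between treated and untreated units."""
    return y[treatment == 1].mean() - y[treatment == 0].mean()

# Observational world: experienced teams opt into the treatment more often
self_selected = (experience + rng.normal(size=n) > 0).astype(int)
observational_estimate = mean_difference(outcome(self_selected), self_selected)

# Experimental world: a coin flip decides, independent of experience
randomized = rng.integers(0, 2, size=n)
randomized_estimate = mean_difference(outcome(randomized), randomized)

print(f"True effect:            {TRUE_EFFECT:.3f}")
print(f"Self-selected estimate: {observational_estimate:.3f}")
print(f"Randomized estimate:    {randomized_estimate:.3f}")
```

The self-selected comparison absorbs the advantage of experienced teams and overstates the effect, while the randomized comparison lands close to the assumed true value.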

But many scenarios can't be randomized (you can't randomly assign "team size" or "task difficulty"). That's when you need observational study tools.

DAG (Directed Acyclic Graph)

Before drawing a causal graph, ask yourself: which variables might affect both the cause and the outcome?

team_experience ──→ training_time
     ↓                    ↓
success_rate  ←───────────┘

This DAG tells you: team_experience is a confounder. If you only look at the correlation between training_time and success_rate, you'll be misled. After controlling for team_experience, the results might be completely different.

python

# Controlling for confounders — stratified analysis
for exp_level in df["team_experience"].unique():
    subset = df[df["team_experience"] == exp_level]
    corr = subset["training_time"].corr(subset["success_rate"])
    print(f"Experience={exp_level}: correlation between training time and success rate = {corr:.3f}")
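
When the confounder is continuous rather than a handful of levels, stratifying becomes awkward, and regression adjustment is a common alternative. The sketch below is a swapped-in technique, not the stratified analysis above. It generates synthetic data that follows the DAG, with an assumed true effect of 0.2, and treats success_rate as an unbounded score for simplicity.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
TRUE_EFFECT = 0.2  # assumed effect of one unit of training time

# Data generated according to the DAG: experience drives both variables
team_experience = rng.normal(size=n)
training_time = 2.0 * team_experience + rng.normal(size=n)
success_rate = (1.0 * team_experience + TRUE_EFFECT * training_time
                + rng.normal(scale=0.5, size=n))

def ols_coefficients(y, *predictors):
    """Least-squares coefficients; index 0 is the intercept."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Before controlling: regress success on training time alone
naive_slope = ols_coefficients(success_rate, training_time)[1]

# After controlling: include the confounder as a second predictor
adjusted_slope = ols_coefficients(success_rate, training_time, team_experience)[1]

print(f"True effect:    {TRUE_EFFECT:.3f}")
print(f"Naive slope:    {naive_slope:.3f}")
print(f"Adjusted slope: {adjusted_slope:.3f}")
```

Adjustment only works here because the confounder is measured and the relationships are linear by construction. With an unmeasured confounder, the adjusted slope would still be biased.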

Instrumental Variables

When there are unobservable confounding factors, find a variable that "only affects the outcome through the cause" as an instrument.

This is one of the core strategies for causal inference from observational data. A valid instrument must satisfy three conditions:

Relevance: The instrument is correlated with the treatment variable
Exclusion: The instrument only affects the outcome through the treatment variable
Independence: The instrument shares no unobserved common cause with the outcome

In an A/B test where not everyone follows their assigned group, the random assignment itself is a natural instrument: it shifts who receives the treatment, it was set by a coin flip, and it affects the outcome only through the treatment actually received.
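
One simple way to use a binary instrument is the Wald estimator: divide the instrument's effect on the outcome by its effect on treatment take-up. The sketch below is a minimal illustration on synthetic data. The variable names, the take-up rule, and the constant true effect of 0.5 are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
TRUE_EFFECT = 0.5  # assumed constant effect of the treatment

unobserved = rng.normal(size=n)        # confounder we cannot measure
assigned = rng.integers(0, 2, size=n)  # instrument: random A/B assignment

# Take-up depends on the assignment and on the unobserved confounder
treated = (assigned + unobserved + rng.normal(size=n) > 0.5).astype(int)
outcome = TRUE_EFFECT * treated + unobserved + rng.normal(scale=0.5, size=n)

# Naive comparison of treated vs untreated is contaminated by the confounder
naive_estimate = outcome[treated == 1].mean() - outcome[treated == 0].mean()

# Wald estimator: effect of assignment on the outcome, divided by
# effect of assignment on treatment take-up
outcome_shift = outcome[assigned == 1].mean() - outcome[assigned == 0].mean()
takeup_shift = treated[assigned == 1].mean() - treated[assigned == 0].mean()
iv_estimate = outcome_shift / takeup_shift

print(f"True effect:    {TRUE_EFFECT:.3f}")
print(f"Naive estimate: {naive_estimate:.3f}")
print(f"IV estimate:    {iv_estimate:.3f}")
```

The IV estimate is noisier than a direct comparison because it divides by the take-up shift. A weak instrument (a small take-up shift) makes it unstable.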

Common Pitfalls

Treating observed correlation directly as causation. Drawing causal conclusions without controlling for confounders.
Over-relying on statistical significance for decisions. A significant association in a small sample may disappear in a larger sample.
Ignoring selection bias. You only analyzed data from completed missions—those that failed mid-mission were not recorded, making your sample biased.
Choosing the wrong control variables in causal inference. Controlling for an improper variable can introduce "collider bias": a variable affected by both the cause and the outcome should not be controlled for.
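
The last two pitfalls share one mechanism, which a short simulation can make visible. In this sketch the traits team_skill and terrain_ease are hypothetical and independent by construction. Keeping only completed missions, where completion depends on both, is enough to create a correlation between them.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000

# Two hypothetical traits that are independent by construction
team_skill = rng.normal(size=n)
terrain_ease = rng.normal(size=n)

# A mission is completed (and therefore recorded) only when skill and
# terrain together clear a threshold: completion is a collider
completed = team_skill + terrain_ease + rng.normal(scale=0.5, size=n) > 1.0

corr_all = np.corrcoef(team_skill, terrain_ease)[0, 1]
corr_completed = np.corrcoef(team_skill[completed], terrain_ease[completed])[0, 1]

print(f"Correlation across all missions:     {corr_all:.3f}")
print(f"Correlation among completed missions: {corr_completed:.3f}")
```

Among recorded missions, skilled teams appear to draw harder terrain, even though the two were generated independently. The same distortion arises whether you filter on the collider or add it as a control variable.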

Pass Challenges

Warm-up: Find an example of "correlation but not causation" from your surroundings, and draw its DAG.
Challenge: Using a dataset with confounding factors, demonstrate the difference in effects "before controlling for confounders" and "after controlling for confounders."
Troubleshooting: A report claims "teams that participated in extra training have a 10% higher success rate." List at least three alternative explanations.

Acceptance Criteria

Can explain why correlation does not equal causation
Can draw a simple DAG to identify confounding factors
Understands why randomized controlled experiments are the gold standard for causal inference
Knows in which scenarios instrumental variables are needed

Traveler's Notes

Correlation describes the past; causal inference guides future action. Knowing the difference is the most important judgment a data scientist can have.

Next Chapter Preview

You've learned to infer populations from samples and to distinguish causation from correlation. But data continues to grow. When your DataFrame is too large to fit in memory, you need distributed data processing.