Metadata Card
- Prerequisites: Chapter 5 (Statistical Inference)
- Estimated time: 50 minutes
- Core difficulty: Advanced
- Reading mode: High focus
- Completion: Able to design a sampling scheme and distinguish correlation from causation
Your Progress
You built a set of features, trained the model, and prediction accuracy was reasonable. But a sharper question surfaced:
You found in the training data that mission attendance rates were generally higher on the front lines than in the rear. Can you conclude that deploying to the front lines raises attendance? No. The gap might exist simply because more elite sentries were already deployed to the front lines, so the difference in the data is not necessarily causal.
You've discovered that between 'correlation' and 'causation' in data, there lies a gap you hadn't fully appreciated before.
Your Task
In Chapter 3 you discovered that "longer training time correlates with higher mission success rates," and in Chapter 5 you confirmed the difference is statistically significant. But can you say "increasing training time causes higher success rates"? No. One alternative explanation: teams that were already proficient chose to spend more time training, so the causal direction is reversed. This chapter teaches you to distinguish correlation from causation and to use data for causal inference.
Sampling: The First Gateway of Data Collection
The data you analyze never includes every individual in the population—you can only sample. The sampling method directly determines how broadly your analytical conclusions can be generalized.
The crystal screens in the Prophecy Hall only reflect what they can see. Whether you choose simple random sampling, stratified sampling, or systematic sampling determines whether your conclusions hold up.
import pandas as pd
import numpy as np

# df is assumed to be the mission DataFrame built in earlier chapters

# Simple random sampling (assumes df has at least 1000 rows)
sample = df.sample(n=1000, random_state=42)

# Stratified sampling: draw the same fraction from each mission type
def stratified_sample(df, strata_col, frac=0.1):
    return df.groupby(strata_col).sample(frac=frac, random_state=42)

stratified = stratified_sample(df, "mission_type", frac=0.1)

# Systematic sampling: take every k-th row (k is at least 1, so the step is never zero)
k = max(len(df) // 1000, 1)
systematic = df.iloc[::k]

Each method has its use case:
| Method | When to Use | Risk |
|---|---|---|
| Simple Random | Population is uniform, no stratification | Minority groups may be completely missed |
| Stratified | Uneven group sizes, need representation from each group | Need to know group size proportions |
| Systematic | Data is ordered, can't access randomly | May introduce bias if data has periodic patterns |
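The periodic-pattern risk in the last row can be made concrete with a small sketch. It uses only the standard library and a made-up series (the `rates` list and its weekly dip are assumptions for illustration, not the chapter's data). When the sampling interval matches the period of the data, every sampled point falls on the same day of the week.

```python
import statistics

# Hypothetical series: 52 weeks of daily success rates, with a dip on
# days 5 and 6 of every week (an assumed pattern, for illustration only).
rates = [0.5 if day % 7 in (5, 6) else 0.8 for day in range(364)]

true_mean = statistics.mean(rates)

# k = 7 matches the weekly period: every sampled day is day 0 of its week,
# so the dip is never observed.
mean_k7 = statistics.mean(rates[::7])

# k = 5 shares no factor with 7, so the sample cycles through all weekdays.
mean_k5 = statistics.mean(rates[::5])

print(f"Population mean: {true_mean:.3f}")
print(f"Systematic, k=7: {mean_k7:.3f}")
print(f"Systematic, k=5: {mean_k5:.3f}")
```

The k=7 sample overstates the mean because it never sees the low days, while the k=5 sample lands close to the population value.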
Many biases come from sampling that isn't random. If you only analyze data from the past week, and that week happened to include a special event, your conclusions won't generalize to other time periods.
Correlation ≠ Causation
This is the first commandment of statistics. Three scenarios illustrate why:
- Reverse causation: The relationship exists, but the direction is reversed. "Longer training → higher success" might actually be "high-success missions require more training."
- Confounding factors: A hidden third variable affects both the cause and effect. "Missions traversing vast wilderness have higher success rates"—is it because crossing wilderness is beneficial, or because only the most experienced teams are assigned such tasks?
- Coincidence: If you run 100 independent tests on variables that have no real relationship, you still expect about 5 of them to show p < 0.05 by chance alone. Finding correlations is too easy.
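The coincidence scenario can be checked with a short simulation. This is a minimal sketch under stated assumptions: both groups are drawn from the same standard normal distribution, so there is no real effect, and the p-value comes from a two-sided z-test with known unit variance. The function name `null_test_p_value` and the sample sizes are illustrative choices.

```python
import random
from statistics import NormalDist

random.seed(42)
std_normal = NormalDist()  # standard normal, used to turn z into a p-value

def null_test_p_value(n=50):
    """Two-sided z-test p-value for two groups drawn from the SAME distribution."""
    group_a = [random.gauss(0, 1) for _ in range(n)]
    group_b = [random.gauss(0, 1) for _ in range(n)]
    diff = sum(group_a) / n - sum(group_b) / n
    # Both groups have known variance 1, so the standard error is sqrt(2 / n).
    z = diff / (2 / n) ** 0.5
    return 2 * (1 - std_normal.cdf(abs(z)))

n_tests = 2000
p_values = [null_test_p_value() for _ in range(n_tests)]
false_positive_rate = sum(p < 0.05 for p in p_values) / n_tests
print(f"Share of no-effect tests with p < 0.05: {false_positive_rate:.3f}")
```

The printed share should sit near 0.05 even though no test has a real effect behind it, which is why a single significant result among many comparisons is weak evidence.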
Causal Inference Toolkit
When you need to establish causation, these methods help. They range from controlled experiments to tools for observational data.
Randomized Controlled Experiments
The most reliable method, and the gold standard of causal inference. You randomly assign experimental units to a treatment group and a control group, which makes the groups comparable in expectation on every other feature, measured or not. Randomization does not remove confounders from any single sample, but it balances them on average, so the observed difference can be attributed to the treatment itself.
# Random assignment simulation
np.random.seed(42)
n_missions = len(df)

# Randomly assign each mission to control (0) or treatment (1)
df["treatment"] = np.random.choice([0, 1], size=n_missions, p=[0.5, 0.5])

# Compare average success rates between groups.
# In a real experiment, success_rate must be measured after the treatment is applied.
# If the outcomes were recorded before assignment, this difference should be close to zero.
treatment_effect = (
    df.loc[df["treatment"] == 1, "success_rate"].mean()
    - df.loc[df["treatment"] == 0, "success_rate"].mean()
)
print(f"Treatment effect: {treatment_effect:.3f}")

The key value of randomization: in expectation, the two groups are comparable on all features except the treatment variable.
But many scenarios can't be randomized (you can't randomly assign "team size" or "task difficulty"). That's when you need observational study tools.
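The contrast between randomized and self-selected assignment can be sketched with simulated data. Everything here is an assumption for illustration: the true effect of 0.10, the `experience` confounder, the outcome formula, and the rule that experienced teams opt into training more often. The outcome is a synthetic success score, not the chapter's `success_rate` column.

```python
import random

random.seed(0)
TRUE_EFFECT = 0.10  # assumed effect of extra training on the success score

def simulate(n, randomized):
    """Return the naive difference in mean outcome, treated minus control."""
    treated, control = [], []
    for _ in range(n):
        experience = random.random()  # hypothetical confounder in [0, 1]
        if randomized:
            gets_training = random.random() < 0.5
        else:
            # Self-selection: experienced teams opt in more often.
            gets_training = random.random() < experience
        outcome = (
            0.5
            + TRUE_EFFECT * int(gets_training)
            + 0.3 * experience  # experience also raises the outcome directly
            + random.gauss(0, 0.1)
        )
        (treated if gets_training else control).append(outcome)
    return sum(treated) / len(treated) - sum(control) / len(control)

rct_estimate = simulate(20_000, randomized=True)
observational_estimate = simulate(20_000, randomized=False)

print(f"True effect:            {TRUE_EFFECT:.3f}")
print(f"Randomized estimate:    {rct_estimate:.3f}")
print(f"Self-selected estimate: {observational_estimate:.3f}")
```

Under these assumptions the randomized estimate recovers the true effect, while the self-selected estimate is inflated because experience drives both who trains and how well they do.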
DAG (Directed Acyclic Graph)
Before drawing a causal graph, ask yourself: which variables might affect both the cause and the outcome?
team_experience ──→ training_time
       │                  │
       ↓                  │
 success_rate ←───────────┘

This DAG tells you: team_experience is a confounder. If you only look at the correlation between training_time and success_rate, you'll be misled. After controlling for team_experience, the results might be completely different.
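To see how a confounder of this kind can manufacture a correlation, here is a small simulation sketch. It deliberately sets the direct effect of training time on success to zero, which is an assumption made to keep the point stark and is stronger than what the DAG claims. The three experience levels, the coefficients, and the `pearson` helper are all illustrative.

```python
import random

random.seed(1)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rows = []
for _ in range(6000):
    experience = random.choice([1, 2, 3])                 # confounder level
    training_time = 10 * experience + random.gauss(0, 3)  # experience -> training
    success = 0.2 * experience + random.gauss(0, 0.1)     # experience -> success only
    rows.append((experience, training_time, success))

# Raw correlation, ignoring the confounder.
overall = pearson([r[1] for r in rows], [r[2] for r in rows])

# Correlation within each experience level (the confounder is held fixed).
within = {}
for level in (1, 2, 3):
    subset = [r for r in rows if r[0] == level]
    within[level] = pearson([r[1] for r in subset], [r[2] for r in subset])

print(f"Overall correlation: {overall:.3f}")
for level, corr in within.items():
    print(f"Experience={level}: correlation = {corr:.3f}")
```

In this synthetic data the overall correlation is strongly positive even though training has no effect, and it vanishes once experience is held fixed.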
# Controlling for confounders: stratified analysis
for exp_level in df["team_experience"].unique():
    subset = df[df["team_experience"] == exp_level]
    corr = subset["training_time"].corr(subset["success_rate"])
    print(f"Experience={exp_level}: correlation between training time and success rate = {corr:.3f}")

Instrumental Variables
When confounders cannot be observed, you cannot control for them directly. Instead, look for an instrument: a variable that affects the outcome only through the cause.
This is one of the core strategies for causal inference from observational data in data science. Instrumental variables must satisfy two conditions:
- Relevance: The instrument is correlated with the treatment variable
- Exclusion: The instrument affects the outcome only through the treatment variable, and it shares no unobserved causes with the outcome
In an A/B test, the random group assignment is itself a natural instrumental variable: it determines treatment status at random and affects the outcome only through the treatment.
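The two conditions can be illustrated with a simulated instrument. This is a sketch under assumptions, not a recipe for real data: the instrument is a coin flip, the hidden confounder is never used by the analysis, and the true effect of 0.3 is chosen for the example. With a single instrument, the IV estimate is the ratio of two covariances, often called the Wald or IV ratio estimator.

```python
import random

random.seed(7)
TRUE_EFFECT = 0.3  # assumed causal effect of treatment on outcome

def covariance(xs, ys):
    """Sample covariance of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

instrument, treatment, outcome = [], [], []
for _ in range(20_000):
    z = random.randint(0, 1)   # instrument: random, unrelated to the confounder
    hidden = random.gauss(0, 1)  # unobserved confounder
    t = 1.0 * z + 0.8 * hidden + random.gauss(0, 0.5)          # relevance: z shifts t
    y = TRUE_EFFECT * t + 1.0 * hidden + random.gauss(0, 0.5)  # exclusion: z is absent here
    instrument.append(z)
    treatment.append(t)
    outcome.append(y)

# Naive regression slope of outcome on treatment, biased by the hidden confounder.
naive_slope = covariance(treatment, outcome) / covariance(treatment, treatment)

# IV ratio estimator: effect of z on y divided by effect of z on t.
iv_estimate = covariance(instrument, outcome) / covariance(instrument, treatment)

print(f"True effect:  {TRUE_EFFECT:.3f}")
print(f"Naive slope:  {naive_slope:.3f}")
print(f"IV estimate:  {iv_estimate:.3f}")
```

The naive slope absorbs the confounder's influence, while the IV estimate uses only the variation in treatment that the instrument induced. If the instrument were weakly related to the treatment, the denominator would be close to zero and the estimate would become unstable.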
Common Pitfalls
- Treating observed correlation directly as causation. Drawing causal conclusions without controlling for confounders.
- Over-relying on statistical significance for decisions. A significant association in a small sample may disappear in a larger sample.
- Ignoring selection bias. You only analyzed data from completed missions—those that failed mid-mission were not recorded, making your sample biased.
- Choosing the wrong control variables. Controlling for an improper variable can introduce "collider bias": a variable that is affected by both the cause and the outcome should not be controlled for.
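Collider bias and selection bias are easier to believe once you see them appear in data that has no real relationship. In this sketch, `training` and `success` are generated independently, and a hypothetical `commended` flag depends on both. The variable names, the threshold of 1, and the `pearson` helper are assumptions for illustration.

```python
import random

random.seed(3)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

teams = []
for _ in range(20_000):
    training = random.gauss(0, 1)  # standardized training score
    success = random.gauss(0, 1)   # generated independently of training
    commended = training + success > 1  # collider: caused by both variables
    teams.append((training, success, commended))

corr_all = pearson([t[0] for t in teams], [t[1] for t in teams])

# Conditioning on the collider: look only at commended teams.
selected = [t for t in teams if t[2]]
corr_selected = pearson([t[0] for t in selected], [t[1] for t in selected])

print(f"All teams:            correlation = {corr_all:.3f}")
print(f"Commended teams only: correlation = {corr_selected:.3f}")
```

Across all teams the correlation is near zero, but among commended teams it turns clearly negative: a team commended despite low training must have had high success, and vice versa. Analyzing only records that passed some filter has the same effect as controlling for the filter.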
Pass Challenges
- Warm-up: Find an example of "correlation but not causation" from your surroundings, and draw its DAG.
- Challenge: Using a dataset with confounding factors, demonstrate the difference in effects "before controlling for confounders" and "after controlling for confounders."
- Troubleshooting: A report claims "teams that participated in extra training have a 10% higher success rate." List at least three alternative explanations.
Acceptance Criteria
- Can explain why correlation does not equal causation
- Can draw a simple DAG to identify confounding factors
- Understands why randomized controlled experiments are the gold standard for causal inference
- Knows in which scenarios instrumental variables are needed
Traveler's Notes
Correlation describes the past; causal inference guides future action. Knowing the difference is the most important judgment a data scientist can have.
Next Chapter Preview
You've learned to infer populations from samples and to tell causation apart from correlation. But data continues to grow. When your DataFrame is too large to fit in memory, you need distributed data processing.