Software Systems Atlas

Metadata Card

Prerequisites: Chapter 3 (EDA), Math B (probability/statistics)
Estimated time: 50 minutes
Core difficulty: Advanced
Reading mode: High focus
Completion: Able to perform hypothesis testing on two groups of data and correctly interpret p-values and confidence intervals

Your Progress

You stand in the Data Prophecy Hall, surrounded by battle report tables from various fortresses. Using SQL, you found specific numbers: Fortress A's mission completion rate is 83%, Fortress B's is 79%—a difference of 4 percentage points.

But is this difference meaningful? If you ran the statistics again tomorrow, would the results flip? You've discovered a more fundamental problem—what you have isn't necessarily the truth. The 83% you see might just be random fluctuation.

You need statistical inference—judging the population from the sample.

Your Task

In EDA you discovered that the task group using the new training regimen seemed to have a 5 percentage point higher success rate than the old regimen. But is this difference a real improvement or just random fluctuation? The problem you face can be summarized as: of the differences we observe, how much is signal and how much is noise? Statistical inference gives you a framework for judgment.

From Description to Inference

EDA tells you what the sample looks like. Statistical inference helps you answer: what is the truth about the population behind the sample?

python

import numpy as np
import pandas as pd
from scipy import stats

# Load data
df = pd.read_csv("missions_clean.csv")

# Two groups: old regimen vs new regimen
old = df[df["training"] == "old"]["success_rate"]
new = df[df["training"] == "new"]["success_rate"]

print(f"Old regimen: mean={old.mean():.3f}, std={old.std():.3f}, n={len(old)}")
print(f"New regimen: mean={new.mean():.3f}, std={new.std():.3f}, n={len(new)}")

You see that the new regimen's mean is 0.05 higher (5 percentage points). But these 5 percentage points could come from:

True effect (the training regimen is indeed better)
Random sampling (the new regimen group happened to have better data points)
Confounding factors (tasks in the new regimen group were inherently easier)

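Statistical inference helps you distinguish between the first two possibilities. The third, confounding, has to be handled through study design, for example by assigning tasks to regimens at random.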
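
To get a feel for the second possibility, here is a minimal simulation sketch. It assumes a made-up population (mean 0.80, standard deviation 0.10) and a hypothetical group size of 30; none of these numbers come from missions_clean.csv. Both groups are drawn from the same population, so every difference it reports is pure sampling noise.

```python
import numpy as np

# Assumed, made-up population: success rates centred on 0.80 with std 0.10.
rng = np.random.default_rng(seed=0)
n_per_group = 30      # hypothetical group size
n_repeats = 1000      # how many times the "experiment" is repeated

diffs = np.empty(n_repeats)
for i in range(n_repeats):
    # Both groups come from the SAME population, so the true effect is zero.
    group_a = rng.normal(loc=0.80, scale=0.10, size=n_per_group)
    group_b = rng.normal(loc=0.80, scale=0.10, size=n_per_group)
    diffs[i] = group_a.mean() - group_b.mean()

# How often does chance alone produce a gap of 5 percentage points or more?
share_large = np.mean(np.abs(diffs) >= 0.05)
print(f"Spread of chance differences (std): {diffs.std():.3f}")
print(f"Share of repeats with |difference| >= 0.05: {share_large:.1%}")
```

With groups this small, a gap of several percentage points shows up now and then even though nothing real separates the groups. Larger groups shrink the spread of chance differences.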

Confidence Intervals

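A 95% confidence interval is built by a procedure with this property: if you repeated the sampling many times and computed an interval each time, about 95% of those intervals would contain the true population mean. Any single interval either contains the true mean or it doesn't; the 95% describes the method, not one particular interval.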
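
One way to see what the 95% refers to is to simulate repeated sampling. The sketch below assumes a known "true" mean of 0.80 and a standard deviation of 0.10 (both invented for the simulation), builds a t-based interval from each sample, and counts how many intervals contain the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
true_mean = 0.80      # assumed; known only because the data are simulated
n, n_repeats = 40, 1000

covered = 0
for _ in range(n_repeats):
    sample = rng.normal(loc=true_mean, scale=0.10, size=n)
    # Half-width of the 95% t-interval for this one sample
    h = stats.sem(sample) * stats.t.ppf(0.975, n - 1)
    if sample.mean() - h <= true_mean <= sample.mean() + h:
        covered += 1

coverage = covered / n_repeats
print(f"Intervals that contain the true mean: {coverage:.1%}")
```

The share should land close to 95%. In real work you only ever get one of these intervals, and you can't tell whether it is one of the ones that missed.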

python

def confidence_interval(data, confidence=0.95):
    n = len(data)
    mean = np.mean(data)
    se = stats.sem(data)  # standard error = std / sqrt(n)
    h = se * stats.t.ppf((1 + confidence) / 2, n - 1)
    return mean - h, mean + h

ci_old = confidence_interval(old)
ci_new = confidence_interval(new)
print(f"Old regimen 95% CI: ({ci_old[0]:.3f}, {ci_old[1]:.3f})")
print(f"New regimen 95% CI: ({ci_new[0]:.3f}, {ci_new[1]:.3f})")

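If the two confidence intervals don't overlap, the difference between the groups is statistically significant at roughly that level. The reverse does not hold: intervals can overlap even when the difference is significant, so overlap alone doesn't settle the question. The more direct check is a confidence interval for the difference itself: if it excludes 0, the difference is significant.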
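
Comparing two intervals by eye is a rougher check than testing the difference directly. The sketch below uses hypothetical summary statistics (means 0.80 and 0.85, standard deviation 0.10, 40 observations per group; none of them from the chapter's dataset) and a helper named ci_from_summary that is introduced only for this illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical summary statistics (not from missions_clean.csv)
mean_a, mean_b = 0.80, 0.85
std_a = std_b = 0.10
n_a = n_b = 40

def ci_from_summary(mean, std, n, confidence=0.95):
    """t-based confidence interval for a mean, from summary statistics."""
    h = std / np.sqrt(n) * stats.t.ppf((1 + confidence) / 2, n - 1)
    return mean - h, mean + h

ci_a = ci_from_summary(mean_a, std_a, n_a)
ci_b = ci_from_summary(mean_b, std_b, n_b)
# The intervals overlap if group B's lower bound is below group A's upper bound
intervals_overlap = ci_b[0] <= ci_a[1]

# Welch's t-test computed from the same summary statistics
t_demo, p_demo = stats.ttest_ind_from_stats(
    mean_b, std_b, n_b, mean_a, std_a, n_a, equal_var=False
)

print(f"Group A 95% CI: ({ci_a[0]:.3f}, {ci_a[1]:.3f})")
print(f"Group B 95% CI: ({ci_b[0]:.3f}, {ci_b[1]:.3f})")
print(f"Intervals overlap: {intervals_overlap}")
print(f"Welch t-test p-value: {p_demo:.4f}")
```

With these numbers the two intervals overlap, yet the test on the difference gives p below 0.05. Overlapping intervals are therefore not evidence that a difference is absent.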

Hypothesis Testing

A formalized judgment process. You propose two competing hypotheses:

Null hypothesis H0: There is no difference in success rates between the two regimens (the observed difference comes from random fluctuation)
Alternative hypothesis H1: There is a difference in success rates between the two regimens

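Then use a t-test to calculate the p-value. The call below passes equal_var=False, which selects Welch's t-test and does not assume the two groups have equal variances: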

python

t_stat, p_value = stats.ttest_ind(new, old, equal_var=False)
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")

The p-value means: assuming the null hypothesis is true, the probability of observing a result at least as extreme as the one in your data. If the p-value is less than your preset significance level (usually 0.05), you reject the null hypothesis and call the difference statistically significant.
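
The phrase "at least as extreme, assuming the null hypothesis" can be made concrete by shuffling group labels. This is a permutation sketch, a different technique from the t-test used in this chapter, run on synthetic stand-in data (the names old_demo and new_demo and all their parameters are assumptions). If the regimen made no difference, the labels would be interchangeable, so the share of shuffles that produce a gap as large as the observed one estimates the p-value.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
# Synthetic stand-in data (assumed values, not the chapter's dataset)
old_demo = rng.normal(loc=0.78, scale=0.08, size=200)
new_demo = rng.normal(loc=0.83, scale=0.08, size=200)

observed = new_demo.mean() - old_demo.mean()
pooled = np.concatenate([old_demo, new_demo])

n_shuffles = 5000
count_extreme = 0
for _ in range(n_shuffles):
    shuffled = rng.permutation(pooled)       # under H0 the labels are interchangeable
    diff = shuffled[:200].mean() - shuffled[200:].mean()
    if abs(diff) >= abs(observed):           # two-sided: "at least as extreme"
        count_extreme += 1

# Add 1 to numerator and denominator so the estimate is never exactly zero
p_perm = (count_extreme + 1) / (n_shuffles + 1)
print(f"Observed difference: {observed:.3f}")
print(f"Permutation p-value: {p_perm:.4f}")
```

The t-test reaches a comparable answer with a formula instead of shuffling. The interpretation of the resulting number is the same.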

Four Pitfalls of the p-value

The p-value is NOT "the probability that the null hypothesis is true." This is the most common misunderstanding. The p-value describes "the probability of seeing data at least this extreme under the null hypothesis," not "the probability that the null hypothesis holds."
The p-value is heavily influenced by sample size. With a very large sample, even trivially small differences can reach p < 0.05. With a very small sample, meaningful differences may have p > 0.05.
Multiple comparison problem. If you run hypothesis tests on 100 columns and none of them has a real effect, you still expect about 5 to come out significant by chance (at a significance level of 0.05). Without a correction (for example Bonferroni, which compares each p-value against 0.05 / 100), the "significant" results you find may just be noise.
The p-value doesn't tell you effect size. p < 0.001 could correspond to an actual difference of only 0.01%. Statistical significance doesn't equal practical significance.
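
The second and fourth pitfalls can be seen side by side with a small calculation. The sketch below uses hypothetical summary statistics (means 0.801 and 0.800, standard deviation 0.10) and changes only the sample size. It uses scipy's ttest_ind_from_stats, which runs the test from means, standard deviations and counts without needing raw data.

```python
from scipy import stats

# Hypothetical summary statistics: identical means and spread, only n changes
mean_hi, mean_lo, std = 0.801, 0.800, 0.10

# Cohen's d; both groups share the same std, so the pooled std equals std
d_demo = (mean_hi - mean_lo) / std

p_by_n = {}
for n in (50, 1_000_000):
    _, p = stats.ttest_ind_from_stats(
        mean_hi, std, n, mean_lo, std, n, equal_var=False
    )
    p_by_n[n] = p
    print(f"n per group = {n:>9,}: p = {p:.3g}, Cohen's d = {d_demo:.3f}")
```

The difference and the effect size are identical in both runs. Only the sample size moves the p-value across the 0.05 line.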

Effect Size

The p-value tells you "whether a difference can be told apart from noise"; effect size tells you "how big the difference is."

python

# Cohen's d: difference between group means / pooled standard deviation
def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    s1, s2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    pooled_std = np.sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
    return (np.mean(group1) - np.mean(group2)) / pooled_std

d = cohens_d(new, old)
print(f"Cohen's d: {d:.3f}")

Rule of thumb: |d| ≈ 0.2 small effect, ≈ 0.5 medium, ≈ 0.8 large. With effect size, you can determine whether the difference is meaningful in the real-world context.

Decision Flow

Complete statistical inference decision flow:

Choose test method: two independent samples → t-test, paired samples → paired t-test, three+ groups → ANOVA, categorical variables → chi-squared test
Set significance level (usually α = 0.05)
Compute test statistic and p-value
Compute effect size
Report results including: direction of difference, statistical significance (p-value), practical significance (effect size), confidence interval
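
The test choices in the first step map onto SciPy calls roughly as follows. This is a sketch on small made-up arrays, and the counts in the chi-squared table (83 of 100 versus 79 of 100 completed missions) are hypothetical sample sizes chosen for illustration.

```python
import numpy as np
from scipy import stats

# Small made-up samples, for illustration only
a = np.array([0.78, 0.81, 0.75, 0.80, 0.79, 0.83])
b = np.array([0.84, 0.86, 0.80, 0.85, 0.82, 0.88])
c = np.array([0.90, 0.87, 0.91, 0.89, 0.86, 0.92])

# Two independent samples -> t-test (Welch's version here)
t_ind, p_ind = stats.ttest_ind(b, a, equal_var=False)

# Paired samples (same units measured twice, in the same order) -> paired t-test
t_rel, p_rel = stats.ttest_rel(b, a)

# Three or more groups -> one-way ANOVA
f_stat, p_anova = stats.f_oneway(a, b, c)

# Categorical outcome (completed / not completed) -> chi-squared test on counts
table = np.array([[83, 17],    # hypothetical: 83 of 100 completed
                  [79, 21]])   # hypothetical: 79 of 100 completed
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

print(f"independent t-test: p = {p_ind:.4f}")
print(f"paired t-test:      p = {p_rel:.4f}")
print(f"one-way ANOVA:      p = {p_anova:.4f}")
print(f"chi-squared:        p = {p_chi2:.4f} (dof = {dof})")
```

Which call is appropriate depends on how the data were collected, not on which one gives the smallest p-value.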

python

# Complete report
print("========== Statistical Inference Report ==========")
print(f"Old regimen success rate: {old.mean():.3f} (n={len(old)})")
print(f"New regimen success rate: {new.mean():.3f} (n={len(new)})")
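diff = new.mean() - old.mean()
# 95% CI of the difference in means (Welch: unequal variances allowed)
v_new, v_old = new.var(ddof=1) / len(new), old.var(ddof=1) / len(old)
dof = (v_new + v_old) ** 2 / (v_new ** 2 / (len(new) - 1) + v_old ** 2 / (len(old) - 1))
h = np.sqrt(v_new + v_old) * stats.t.ppf(0.975, dof)
print(f"Difference: {diff:.3f}")
print(f"95% CI of the difference: ({diff - h:.3f}, {diff + h:.3f})")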
print(f"t-test: t={t_stat:.3f}, p={p_value:.4f}")
print(f"Cohen's d: {d:.3f}")
if p_value < 0.05:
    print("Conclusion: difference is statistically significant, but evaluate practical significance with effect size")
else:
    print("Conclusion: current data insufficient to reject the null hypothesis")

Common Pitfalls

Looking only at p-value, ignoring effect size. p = 0.001 but you find the actual difference is only 0.1%—what would you do?
Running hypothesis tests without pre-registration. Testing data repeatedly until you find a significant p-value—this is "p-hacking."
Choosing the wrong test method. Using an independent-samples t-test on paired data ignores the pairing and usually reduces statistical power.
Treating "not rejecting the null hypothesis" as "the null hypothesis is true." With insufficient sample size, you may simply lack enough evidence.
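
The cost of choosing the wrong test can be seen on a small example. The data below are hypothetical: the same 8 squads measured before and after retraining, each with a small gain. Squads differ a lot from one another, which is exactly the variation a paired test removes.

```python
import numpy as np
from scipy import stats

# Hypothetical paired data: the same 8 squads, before and after retraining
before = np.array([0.70, 0.75, 0.80, 0.85, 0.90, 0.65, 0.78, 0.88])
gains = np.array([0.02, 0.03, 0.01, 0.02, 0.03, 0.02, 0.01, 0.02])
after = before + gains

# Mismatched choice: treats the two columns as unrelated groups
_, p_independent = stats.ttest_ind(after, before, equal_var=False)

# Matching choice: tests the per-squad differences
_, p_paired = stats.ttest_rel(after, before)

print(f"independent t-test: p = {p_independent:.4f}")
print(f"paired t-test:      p = {p_paired:.4f}")
```

Every squad improved, but the independent test buries that under the squad-to-squad spread and does not reach significance. The paired test looks only at the within-squad changes and does.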

Pass Challenges

Warm-up: Generate two random datasets yourself (with the same mean), run t-tests 100 times, and count how many times p < 0.05. How many do you expect?
Challenge: Find a real dataset with two groups, perform a complete analysis from descriptive statistics to hypothesis testing to effect size, and write a conclusion report.
Troubleshooting: Your hypothesis test shows p = 0.06, but the manager says, "It's close to significant—just treat it as significant." How do you explain why this is wrong?

Acceptance Criteria

Can explain the precise meaning of p-value and common misconceptions
Can choose the correct test method based on data type
Can report both p-value and effect size in hypothesis test results
Knows what p-hacking is and why it's harmful

Traveler's Notes

p-value gives the answer, effect size gives the magnitude, confidence interval gives the range. Look at all three together, never just one.

Next Chapter Preview

You can now infer the population from the sample. But if you want to predict—use existing data to predict the outcome of a new mission—you need the next chapter: Linear Regression.