Software Systems Atlas

Metadata Card

Prerequisites: Vol 5 Chapter 1 (Database Fundamentals)
Estimated time: 45 minutes
Core difficulty: Advanced
Reading mode: High focus
Completion: Able to use window functions, CTEs, and OLAP queries for multi-step analytical tasks

Your Progress

Through EDA you learned the general shape of the data—which columns are worth digging into, which are almost all empty. But the intelligence officer at the Prophecy Hall looked at your charts and shook his head: 'Nice charts, but I can't look up specific numbers from them. Help me find: over the past three months, which fortress had the highest mission completion rate?'

You realize that pandas feels awkward for cross-table aggregations like this. What you need is a proper analytical weapon—SQL's analysis patterns.

Your Task

In Vol 5, you used SQL for CRUD—insert, update, delete, simple queries. That was day-to-day MySQL operations, querying a few rows, at most a few hundred. Now your data is in the data warehouse, and each query scans millions of rows. You need cumulative statistics across groups, moving averages, and period-over-period comparisons. Regular SELECT-GROUP BY won't cut it anymore. This chapter gives you the analytical SQL arsenal.

From Row Operations to Set Operations

The difference between analytical SQL and transactional SQL is the mindset:

Transactional SQL: I need this row, modify this field
Analytical SQL: I need this group of data, compute a value across rows

In Vol 5 you used GROUP BY for grouped aggregation. But GROUP BY has a limitation: after grouping, each group's detail rows are collapsed into a single aggregate value. If you want to keep every row while adding the grouped aggregate result alongside it—that's when you need window functions.

Window Functions

Window functions don't collapse rows. They compute an aggregate value within a "window" alongside each row.

GROUP BY compresses rows into groups at the cost of losing detail. Window functions don't lose rows—they add a column beside each row, preserving the original records while displaying aggregate results, like the crystal panels of the Prophecy Hall.

sql

-- Basic syntax:
-- function_name() OVER (PARTITION BY group_column ORDER BY sort_column)

Look at a concrete example. Suppose you have a mission logs table:

sql

-- Create example table
CREATE TABLE mission_logs (
    mission_id INT,
    log_date DATE,
    resources_used INT
);

INSERT INTO mission_logs VALUES
(1, '2026-06-01', 100),
(1, '2026-06-02', 120),
(1, '2026-06-03', 110),
(2, '2026-06-01', 200),
(2, '2026-06-02', 220);

-- You want: daily usage per mission, plus running total for that mission
SELECT
    mission_id,
    log_date,
    resources_used,
    SUM(resources_used) OVER (
        PARTITION BY mission_id
        ORDER BY log_date
    ) AS running_total
FROM mission_logs;

Result:

mission_id | log_date   | resources_used | running_total
-----------+------------+----------------+--------------
1          | 2026-06-01 | 100            | 100
1          | 2026-06-02 | 120            | 220
1          | 2026-06-03 | 110            | 330
2          | 2026-06-01 | 200            | 200
2          | 2026-06-02 | 220            | 420

See the difference? GROUP BY can only give you two aggregate values (Mission 1 = 330, Mission 2 = 420). Window functions retain every row and add the running cumulative value on the right.

Common window function use cases:

Ranking, moving averages, period-over-period comparisons—these three patterns cover 80% of analytical needs. Each pattern uses OVER (PARTITION BY ... ORDER BY ...) to define the computing window's range.

sql

-- Ranking: resource usage rank within each mission
SELECT
    mission_id, log_date, resources_used,
    ROW_NUMBER() OVER (PARTITION BY mission_id ORDER BY resources_used DESC) AS rank
FROM mission_logs;

-- Moving average: average resource usage over the past 3 days
SELECT
    mission_id, log_date, resources_used,
    AVG(resources_used) OVER (
        PARTITION BY mission_id
        ORDER BY log_date
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS moving_avg_3d
FROM mission_logs;

-- Lag: take the previous day's value for period-over-period comparison
SELECT
    mission_id, log_date, resources_used,
    LAG(resources_used, 1) OVER (
        PARTITION BY mission_id
        ORDER BY log_date
    ) AS prev_day,
    resources_used - LAG(resources_used, 1) OVER (
        PARTITION BY mission_id
        ORDER BY log_date
    ) AS daily_change
FROM mission_logs;

CTE: Breaking Multi-Step Queries into Readable Steps

Analytical queries often can't be done in a single SELECT. You need to compute A first, then B based on A, then C based on B. CTEs (Common Table Expressions) let you write step by step, with a name for each step.

sql

WITH daily_usage AS (
    -- Step 1: compute total daily usage
    SELECT
        log_date,
        SUM(resources_used) AS total_resources
    FROM mission_logs
    GROUP BY log_date
),
weekly_avg AS (
    -- Step 2: compute 7-day moving average
    SELECT
        log_date,
        total_resources,
        AVG(total_resources) OVER (
            ORDER BY log_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS weekly_ma
    FROM daily_usage
)
-- Step 3: find anomaly days below 80% of the moving average
SELECT *
FROM weekly_avg
WHERE total_resources < weekly_ma * 0.8;

The biggest benefit of CTEs is that each step can be tested independently. Run daily_usage first, confirm the results, then add the next step.

OLAP Operations: ROLLUP and CUBE

When you want to see data at different dimension levels (total aggregate, by type, by type + region), you don't need to write three UNION ALLs. Use ROLLUP:

sql

SELECT
    mission_type,
    location,
    SUM(resources_used) AS total
FROM mission_logs
GROUP BY ROLLUP (mission_type, location);

This will return:

One row for total aggregate (mission_type = NULL, location = NULL)
Rows aggregated by mission_type (location = NULL)
Detailed rows aggregated by mission_type + location

CUBE is more thorough—it generates all possible dimension combinations.

Writing Analytical SQL in Python

Pandas' groupby + transform is equivalent to SQL window functions:

python

import pandas as pd
df = pd.read_sql("SELECT * FROM mission_logs", engine)

# Equivalent to SUM OVER (PARTITION BY mission_id ORDER BY log_date)
df["running_total"] = df.groupby("mission_id")["resources_used"].cumsum()

# Equivalent to moving average
df["moving_avg_3d"] = (
    df.groupby("mission_id")["resources_used"]
    .transform(lambda x: x.rolling(3, min_periods=1).mean())
)

Common Pitfalls

Forgetting PARTITION BY in window functions. Result is a global sort + aggregate, not the within-group computation you intended.
The default window for ORDER BY in window functions is from the start to the current row. If you only want the current row (e.g., ranking, moving average), you need to explicitly specify the window range.
Using non-reusable temporary table names in CTEs, making debugging difficult.

Pass Challenges

Warm-up: Write a window function on the missions table to compute each team member's average mission completion time.
Challenge: Use CTE + window functions to write a complete analytical query: find missions with abnormally high resource usage for 3 consecutive days (exceeding 2 standard deviations above the 30-day moving average).
Observation: Decompose a 5-step CTE query step by step, output each step to an intermediate table, and verify each step's intermediate results.

Acceptance Criteria

Can write window functions with PARTITION BY and ORDER BY
Can use CTEs to break multi-step analytical queries into readable steps
Understand the purpose of ROLLUP and CUBE
Can reproduce equivalent window calculations in Python with pandas

Traveler's Notes

SQL window functions are your handiest weapon for analytical problems. Master them, and your SQL ability goes from "can query" to "can analyze."

Next Chapter Preview

In the previous chapter, you saw patterns in the data. But are some patterns real or just coincidental? Next chapter—using statistical inference to make judgments.