Metadata Card
- Prerequisites: Vol 5 Chapter 1 (Database Fundamentals)
- Estimated time: 45 minutes
- Core difficulty: Advanced
- Reading mode: High focus
- Completion: Able to use window functions, CTEs, and OLAP queries for multi-step analytical tasks
Your Progress
Through EDA you learned the general shape of the data—which columns are worth digging into, which are almost all empty. But the intelligence officer at the Prophecy Hall looked at your charts and shook his head: 'Nice charts, but I can't look up specific numbers from them. Help me find: over the past three months, which fortress had the highest mission completion rate?'
You realize that pandas feels awkward for cross-table aggregations like this. What you need is a proper analytical weapon—SQL's analysis patterns.
Your Task
In Vol 5, you used SQL for CRUD—insert, update, delete, simple queries. That was day-to-day MySQL operations, querying a few rows, at most a few hundred. Now your data is in the data warehouse, and each query scans millions of rows. You need cumulative statistics across groups, moving averages, and period-over-period comparisons. Regular SELECT-GROUP BY won't cut it anymore. This chapter gives you the analytical SQL arsenal.
From Row Operations to Set Operations
The difference between analytical SQL and transactional SQL is the mindset:
- Transactional SQL: I need this row, modify this field
- Analytical SQL: I need this group of data, compute a value across rows
In Vol 5 you used GROUP BY for grouped aggregation. But GROUP BY has a limitation: after grouping, each group's detail rows are collapsed into a single aggregate value. If you want to keep every row while adding the grouped aggregate result alongside it—that's when you need window functions.
Window Functions
Window functions don't collapse rows. They compute an aggregate value within a "window" alongside each row.
GROUP BY compresses rows into groups at the cost of losing detail. Window functions don't lose rows—they add a column beside each row, preserving the original records while displaying aggregate results, like the crystal panels of the Prophecy Hall.
-- Basic syntax:
-- function_name() OVER (PARTITION BY group_column ORDER BY sort_column)Look at a concrete example. Suppose you have a mission logs table:
-- Create example table
CREATE TABLE mission_logs (
mission_id INT,
log_date DATE,
resources_used INT
);
INSERT INTO mission_logs VALUES
(1, '2026-06-01', 100),
(1, '2026-06-02', 120),
(1, '2026-06-03', 110),
(2, '2026-06-01', 200),
(2, '2026-06-02', 220);
-- You want: daily usage per mission, plus running total for that mission
SELECT
mission_id,
log_date,
resources_used,
SUM(resources_used) OVER (
PARTITION BY mission_id
ORDER BY log_date
) AS running_total
FROM mission_logs;Result:
mission_id | log_date | resources_used | running_total
-----------+------------+----------------+--------------
1 | 2026-06-01 | 100 | 100
1 | 2026-06-02 | 120 | 220
1 | 2026-06-03 | 110 | 330
2 | 2026-06-01 | 200 | 200
2 | 2026-06-02 | 220 | 420See the difference? GROUP BY can only give you two aggregate values (Mission 1 = 330, Mission 2 = 420). Window functions retain every row and add the running cumulative value on the right.
Common window function use cases:
Ranking, moving averages, period-over-period comparisons—these three patterns cover 80% of analytical needs. Each pattern uses OVER (PARTITION BY ... ORDER BY ...) to define the computing window's range.
-- Ranking: resource usage rank within each mission
SELECT
mission_id, log_date, resources_used,
ROW_NUMBER() OVER (PARTITION BY mission_id ORDER BY resources_used DESC) AS rank
FROM mission_logs;
-- Moving average: average resource usage over the past 3 days
SELECT
mission_id, log_date, resources_used,
AVG(resources_used) OVER (
PARTITION BY mission_id
ORDER BY log_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS moving_avg_3d
FROM mission_logs;
-- Lag: take the previous day's value for period-over-period comparison
SELECT
mission_id, log_date, resources_used,
LAG(resources_used, 1) OVER (
PARTITION BY mission_id
ORDER BY log_date
) AS prev_day,
resources_used - LAG(resources_used, 1) OVER (
PARTITION BY mission_id
ORDER BY log_date
) AS daily_change
FROM mission_logs;CTE: Breaking Multi-Step Queries into Readable Steps
Analytical queries often can't be done in a single SELECT. You need to compute A first, then B based on A, then C based on B. CTEs (Common Table Expressions) let you write step by step, with a name for each step.
WITH daily_usage AS (
-- Step 1: compute total daily usage
SELECT
log_date,
SUM(resources_used) AS total_resources
FROM mission_logs
GROUP BY log_date
),
weekly_avg AS (
-- Step 2: compute 7-day moving average
SELECT
log_date,
total_resources,
AVG(total_resources) OVER (
ORDER BY log_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS weekly_ma
FROM daily_usage
)
-- Step 3: find anomaly days below 80% of the moving average
SELECT *
FROM weekly_avg
WHERE total_resources < weekly_ma * 0.8;The biggest benefit of CTEs is that each step can be tested independently. Run daily_usage first, confirm the results, then add the next step.
OLAP Operations: ROLLUP and CUBE
When you want to see data at different dimension levels (total aggregate, by type, by type + region), you don't need to write three UNION ALLs. Use ROLLUP:
SELECT
mission_type,
location,
SUM(resources_used) AS total
FROM mission_logs
GROUP BY ROLLUP (mission_type, location);This will return:
- One row for total aggregate (mission_type = NULL, location = NULL)
- Rows aggregated by mission_type (location = NULL)
- Detailed rows aggregated by mission_type + location
CUBE is more thorough—it generates all possible dimension combinations.
Writing Analytical SQL in Python
Pandas' groupby + transform is equivalent to SQL window functions:
import pandas as pd
df = pd.read_sql("SELECT * FROM mission_logs", engine)
# Equivalent to SUM OVER (PARTITION BY mission_id ORDER BY log_date)
df["running_total"] = df.groupby("mission_id")["resources_used"].cumsum()
# Equivalent to moving average
df["moving_avg_3d"] = (
df.groupby("mission_id")["resources_used"]
.transform(lambda x: x.rolling(3, min_periods=1).mean())
)Common Pitfalls
- Forgetting
PARTITION BYin window functions. Result is a global sort + aggregate, not the within-group computation you intended. - The default window for
ORDER BYin window functions is from the start to the current row. If you only want the current row (e.g., ranking, moving average), you need to explicitly specify the window range. - Using non-reusable temporary table names in CTEs, making debugging difficult.
Pass Challenges
- Warm-up: Write a window function on the missions table to compute each team member's average mission completion time.
- Challenge: Use CTE + window functions to write a complete analytical query: find missions with abnormally high resource usage for 3 consecutive days (exceeding 2 standard deviations above the 30-day moving average).
- Observation: Decompose a 5-step CTE query step by step, output each step to an intermediate table, and verify each step's intermediate results.
Acceptance Criteria
- Can write window functions with
PARTITION BYandORDER BY - Can use CTEs to break multi-step analytical queries into readable steps
- Understand the purpose of
ROLLUPandCUBE - Can reproduce equivalent window calculations in Python with pandas
Traveler's Notes
SQL window functions are your handiest weapon for analytical problems. Master them, and your SQL ability goes from "can query" to "can analyze."
Next Chapter Preview
In the previous chapter, you saw patterns in the data. But are some patterns real or just coincidental? Next chapter—using statistical inference to make judgments.