Volume 12: Data Processing & Data Science
Refining gold from data—end-to-end data engineering and data science, from collection to governance.
Refining gold from field report data. I see the answers to the entire journey hidden in the data.
Prerequisites
Programming basics are required, along with familiarity with SQL and basic statistical concepts (mean, variance). Complete Vol 1 first.
Volume note: This volume was moved from its original position as Volume 14 to follow Compiler Principles (Vol 11) and precede AI/Machine Learning (Vol 13), because data preprocessing and feature engineering are prerequisites for ML.
What's Inside
Data is the new oil, but you first need to learn how to extract and refine it. This volume covers the complete data lifecycle, from collection and cleaning through analysis to governance. The division of labor with Data Systems (Vol 5) is as follows: Vol 5 covers how database kernels store and query data, while this volume covers how data is used, managed, and quality-assured.
Chapter Overview
| # | Chapter Title | Summary | Prerequisite |
|---|---|---|---|
| 1 | Data Lifecycle | End-to-end view: collection, storage, processing, analysis, archiving, destruction | — |
| 2 | Data Cleaning | Missing values, outliers, duplicates, format unification, data quality rules | — |
| 3 | EDA & Visualization | Descriptive statistics, distribution analysis, correlation, pandas-profiling | Math B |
| 4 | SQL for Analysis | Analytical SQL, window functions, OLAP cube | Vol 5 ch1 |
| 5 | Statistical Inference | Hypothesis testing, confidence intervals, p-value, effect size | Math B |
| 6 | Linear Regression & Model Diagnostics | OLS, residual analysis, multicollinearity, regularized regression | ch5, Math C |
| 7 | Feature Engineering | Encoding, scaling, binning, cross features, feature selection | ch4 |
| 8 | Sampling & Causal Inference | Randomized control, natural experiments, DAG, instrumental variables | ch5 |
| 9 | Distributed Data Processing | Pandas limits, Dask/Spark SQL, data parallelism | ch4 |
| 10 | Data Ethics | Bias, fairness, transparency, data rights | — |
| 11 | Data Governance Fundamentals | Quality dimensions, data catalog, monitoring, SLA | ch1-2 |
| 12 | Data Lineage & Metadata | Lineage tracing, metadata management, impact analysis | ch11 |
| 13 | Data Mesh & Data Products | Domain data ownership, data products, federated governance | ch11 |
| 14 | Privacy Compliance & Data Security | GDPR, data masking, differential privacy, access control | ch11, Vol 8 |
Prerequisite knowledge: Programming basics (Vol 1), Math B (probability/statistics), Database fundamentals (Vol 5 ch1), Python basics
Completion: Able to independently complete the full pipeline from raw data to analysis report, design a data quality monitoring system, and understand governance frameworks
This volume has 14 chapters, all completed
- Chapter 1: Data Lifecycle
- Chapter 2: Data Cleaning
- Chapter 3: EDA & Visualization
- Chapter 4: SQL for Analysis
- Chapter 5: Statistical Inference
- Chapter 6: Linear Regression & Model Diagnostics
- Chapter 7: Feature Engineering
- Chapter 8: Sampling & Causal Inference
- Chapter 9: Distributed Data Processing
- Chapter 10: Data Ethics
- Chapter 11: Data Governance Fundamentals
- Chapter 12: Data Lineage & Metadata
- Chapter 13: Data Mesh & Data Products
- Chapter 14: Privacy Compliance & Data Security