
Volume 12: Data Processing & Data Science

Refining gold from data—end-to-end data engineering and data science, from collection to governance.

Refining gold from field report data. I see the answers to the entire journey hidden in the data.

Prerequisites

Programming basics are required, along with familiarity with SQL and basic statistical concepts (mean, variance). Complete Vol 1 first.

Volume note: This volume was moved from its original position as Volume 14 to follow Compiler Principles (Vol 11) and precede AI/Machine Learning (Vol 13), because data preprocessing and feature engineering are prerequisites for ML.

What's Inside

Data is the new oil, but you first need to learn how to extract and refine it. This volume covers the complete data lifecycle, from collection and cleaning through analysis to governance. The division of labor with Data Systems (Vol 5) is as follows: Vol 5 covers how database kernels store and query data, while this volume covers how data is used, managed, and quality-assured.

Chapter Overview

| # | Chapter Title | Summary | Prerequisite |
|---|---------------|---------|--------------|
| 1 | Data Lifecycle | Collection, storage, processing, analysis, archiving, destruction end-to-end | |
| 2 | Data Cleaning | Missing values, outliers, duplicates, format unification, data quality rules | |
| 3 | EDA & Visualization | Descriptive statistics, distribution analysis, correlation, pandas-profiling | Math B |
| 4 | SQL for Analysis | Analytical SQL, window functions, OLAP cube | Vol 5 ch1 |
| 5 | Statistical Inference | Hypothesis testing, confidence intervals, p-value, effect size | Math B |
| 6 | Linear Regression & Model Diagnostics | OLS, residual analysis, multicollinearity, regularized regression | ch5, Math C |
| 7 | Feature Engineering | Encoding, scaling, binning, cross features, feature selection | ch4 |
| 8 | Sampling & Causal Inference | Randomized control, natural experiments, DAG, instrumental variables | ch5 |
| 9 | Distributed Data Processing | Pandas limits, Dask/Spark SQL, data parallelism | ch4 |
| 10 | Data Ethics | Bias, fairness, transparency, data rights | |
| 11 | Data Governance Fundamentals | Quality dimensions, data catalog, monitoring, SLA | ch1-2 |
| 12 | Data Lineage & Metadata | Lineage tracing, metadata management, impact analysis | ch11 |
| 13 | Data Mesh & Data Products | Domain data ownership, data products, federated governance | ch11 |
| 14 | Privacy Compliance & Data Security | GDPR, data masking, differential privacy, access control | ch11, Vol 8 |

Prerequisite knowledge: Programming basics (Vol 1), Math B (probability/statistics), Database fundamentals (Vol 5 ch1), Python basics

Completion: You can independently complete the full pipeline from raw data to analysis report, design a data quality monitoring system, and understand governance frameworks.

