
Metadata Card

  • Prerequisites: Chapter 4 (HDFS Distributed Storage), Vol 5 (SQL Queries)
  • Estimated Time: 45 minutes
  • Core Difficulty: Intermediate
  • Reading Mode: High focus
  • Completion Milestone: Explain MapReduce computation flow, understand Spark RDD lineage and fault tolerance, know DataFrame benefits over RDD, run a simple PySpark job

Your Progress

100 forward stations send battle reports to the data warehouse daily via message queue. After a year, your HDFS cluster stores 3PB of logs.

General Lin says: "This data isn't just for storage. I need a report by tomorrow: which theater had the fastest combat growth in the past month, supply consumption trends, and correlations between troop fluctuations at each station."

You can't run SELECT ... GROUP BY on your laptop: with 3PB of data, a single-machine SQL engine would take a month. You need to distribute the computation across machines, each processing its local data.

Your Task

Two milestones: MapReduce (the first widely adopted general-purpose distributed computing model) and Spark (an in-memory DAG execution engine).


MapReduce: Map (parallel, on each DataNode) → Shuffle (framework-automated sort/group) → Reduce (aggregate). Key insight: move computation to data, not data to computation.
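The three phases can be sketched in plain Python. This is a single-process toy, not Hadoop's API: the `partitions` list stands in for HDFS blocks on separate DataNodes, and the theater names and counts are made-up sample data.

```python
from collections import defaultdict

# Hypothetical battle reports as (theater, engagements) records, split the
# way HDFS would split a file into blocks on different DataNodes.
partitions = [
    [("north", 3), ("east", 1), ("north", 2)],   # block on DataNode 1
    [("east", 4), ("south", 5)],                 # block on DataNode 2
    [("north", 1), ("south", 2), ("east", 2)],   # block on DataNode 3
]

def map_phase(records):
    """Map: runs independently on each partition and emits (key, value) pairs."""
    for theater, engagements in records:
        yield theater, engagements

def shuffle_phase(mapped_partitions):
    """Shuffle: group every value by key and sort the keys (done by the framework)."""
    groups = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(grouped):
    """Reduce: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

mapped = [list(map_phase(p)) for p in partitions]   # one map task per block
grouped = shuffle_phase(mapped)
totals = reduce_phase(grouped)
print(totals)  # {'east': 7, 'north': 6, 'south': 7}
```

Only the shuffle moves data between machines in a real cluster. The map tasks read the block that already sits on their own node, which is the "move computation to data" idea in practice.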

Limitations: Intermediate results must be written to disk between stages. The programming model is restrictive: a multi-stage task has to be expressed as a chain of separate MapReduce jobs. It supports batch processing only.

Spark: In-memory DAG execution. RDD (Resilient Distributed Dataset): immutable, partitioned, recoverable via lineage. Transformations (map, filter) are lazy; Actions (collect, take) trigger computation.
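Lazy transformations and lineage-based recovery can be illustrated with a toy model. The `ToyRDD` class below is hypothetical and is not Spark's implementation or API; it only mimics the behavior described above: transformations record a step, actions replay the recorded steps over the source data.

```python
class ToyRDD:
    """A toy stand-in for an RDD: immutable, lazy, and rebuilt from lineage."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)     # durable input, e.g. an HDFS block
        self._lineage = tuple(lineage)   # recorded transformations, never mutated

    def map(self, fn):
        # Transformation: returns a new ToyRDD and does no work yet.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: also lazy, also returns a new ToyRDD.
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def collect(self):
        # Action: replay the whole lineage over the source data.
        data = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def take(self, n):
        # Action. Real Spark scans only as many partitions as it needs;
        # this toy version simply computes everything and slices.
        return self.collect()[:n]


calls = []  # records every element the map function actually processes

def double(x):
    calls.append(x)
    return x * 2

reports = ToyRDD([1, 2, 3, 4, 5])
big = reports.map(double).filter(lambda x: x > 4)
ran_before_action = len(calls)   # still 0: nothing has executed

first = big.collect()            # the action triggers the computation
again = big.collect()            # a "lost" result is recomputed from lineage
print(ran_before_action, first, again == first)
```

Because a `ToyRDD` never changes after construction, replaying its lineage always yields the same result. That is the property that lets Spark recompute a lost partition instead of replicating every intermediate dataset.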

DataFrame: Higher-level abstraction with the Catalyst optimizer (predicate pushdown, column pruning, join reordering) and the Tungsten engine (compact binary in-memory format, off-heap memory management, whole-stage code generation).
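A minimal PySpark sketch of a declarative DataFrame query follows. It assumes `pyspark` is installed and that the caller passes in an existing `SparkSession`. The function name, column names, and sample rows are all made up for illustration.

```python
def engagements_by_theater(spark, min_day):
    """Total engagements per theater on or after min_day, largest first.

    `spark` is an existing SparkSession. The rows are hypothetical sample data.
    """
    from pyspark.sql import functions as F  # imported lazily; requires pyspark

    rows = [
        ("north", 1, 3, "routine patrol"),
        ("north", 2, 5, "night raid"),
        ("east", 1, 2, "supply escort"),
        ("east", 3, 6, "river crossing"),
        ("south", 2, 4, "border skirmish"),
    ]
    df = spark.createDataFrame(rows, ["theater", "day", "engagements", "notes"])

    return (
        # With a file source such as Parquet, Catalyst can push this filter
        # down toward the scan (predicate pushdown).
        df.filter(F.col("day") >= min_day)
          .groupBy("theater")
          # "notes" is never referenced, so the optimizer is free to drop it
          # from the plan (column pruning).
          .agg(F.sum("engagements").alias("total"))
          .orderBy(F.desc("total"))
    )
```

With pyspark available, `spark = SparkSession.builder.master("local[*]").getOrCreate()` followed by `engagements_by_theater(spark, 2).show()` should run the job locally. Calling `.explain()` on the returned DataFrame prints the physical plan, which is a convenient way to check what the optimizer did with the query.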

Key comparison: MapReduce writes intermediate data to disk between jobs; Spark keeps it in memory where possible. MapReduce has fixed Map and Reduce phases; Spark executes an arbitrary DAG of stages. MapReduce jobs typically have minute-level latency; Spark jobs can reach second-level latency.


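Common Pitfalls

  • Using groupByKey on large data volumes: every record is shuffled. Prefer reduceByKey, which aggregates locally within each partition first.
  • Ignoring data skew: if one key holds 1000x more records than the others, the task that processes it becomes the bottleneck.
  • Serializing large objects in closures: the object is shipped with every task. Use broadcast variables instead.
  • Using RDD where DataFrame would do: RDD code bypasses the Catalyst optimizer and the Tungsten engine.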
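The groupByKey pitfall comes down to how many records cross the network. The plain-Python sketch below is a toy model rather than Spark code: the two function names are hypothetical, and each inner list stands in for one partition. It counts the records each strategy would shuffle.

```python
from collections import defaultdict

# Hypothetical (theater, 1) event records spread over two partitions.
partitions = [
    [("north", 1)] * 4 + [("east", 1)],
    [("north", 1)] * 3 + [("south", 1)] * 2,
]

def shuffle_group_by_key(parts):
    """groupByKey style: every record is shuffled, then summed on the reducer."""
    shuffled = [pair for part in parts for pair in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

def shuffle_reduce_by_key(parts):
    """reduceByKey style: combine inside each partition before shuffling."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())   # one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

grouped_totals, grouped_shuffled = shuffle_group_by_key(partitions)
reduced_totals, reduced_shuffled = shuffle_reduce_by_key(partitions)
print(grouped_shuffled, reduced_shuffled)  # 10 4
```

Both strategies produce the same totals, but local aggregation shuffles one record per key per partition instead of one per input record. The gap widens as the number of records per key grows.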


Traveler's Notes

Distributed computing is "sending computation to data." MapReduce pioneered the model but was constrained by disk IO. Spark pushed it to extremes with in-memory DAG + lazy evaluation + lineage recovery. DataFrame and SQL let you express distributed computation declaratively.


Next: Microservices Architecture (Chapter 6).

Built with VitePress | Software Systems Atlas