Chapter 9: Distributed Data Processing

Metadata Card

Prerequisites: Chapter 4 (SQL for Analysis), Python pandas basics
Estimated time: 50 minutes
Core difficulty: Advanced
Reading mode: High focus
Completion: Able to identify data scale scenarios where pandas can no longer handle the job, and choose the correct distributed tool

Your Progress

Your model ran fine on 100,000 records. Your daily work at the Data Prophecy Hall was smooth—load a CSV, run a groupby, draw a chart, all done in seconds.

But today was different. The intelligence officer walked up to your desk holding a storage crystal and slammed it down: "Here are the complete logs from 10 outposts over the past five years, about 500GB. Analyze the relationship between equipment wear rate and combat frequency."

You opened pandas, pd.read_csv()—waited 10 seconds, memory spiked, then OOM. You tried chunked reading—so slow it felt like crawling, and you had to manually merge cross-chunk groupby() results.

"The Hall could handle everything before," you said.

The intelligence officer replied: "That's because what you had before was intelligence. This is history."

You realized a harsh truth: your tools had reached their limits—not because your code was wrong, but because the data had grown beyond what a single machine could process at once. In the Data Prophecy Hall, data scientists work with data that fits in memory; on the real battlefield, data is larger than memory.

You need distributed data processing.

Your Task

Your dataset went from a single CSV to hundreds of CSVs, from memory-friendly to dozens of gigabytes. Your df.groupby() ran for an hour without finishing—you stared at the progress bar, motionless, starting to question your career. pd.concat() went straight to OOM.

Pandas was designed assuming data fits in memory—like the manual workbench in the Prophecy Hall, which can only hold so many scrolls. When this assumption breaks, you need distributed data processing: splitting scrolls across multiple workbenches for parallel processing. This chapter guides you from pandas' limits gracefully into Dask and Spark SQL.

Pandas' Limits and Alternatives

Pandas performs well when single-machine memory < 16 GB and data < half of available memory. When data exceeds this threshold, you have three directions:

Iteration/chunking — Don't load all data at once, process in chunks
Lazy evaluation — Build a computation graph, trigger on demand
Distributed — Parallel computation across multiple machines

python

import pandas as pd

# Direction 1: Chunked reading
chunk_size = 100_000
results = []
for chunk in pd.read_csv("large_dataset.csv", chunksize=chunk_size):
    # Aggregate each chunk
    chunk_result = chunk.groupby("mission_type")["resources"].sum()
    results.append(chunk_result)

# Merge chunk results
final = pd.concat(results).groupby(level=0).sum()

Chunked reading solves memory problems, but doesn't offer parallelism—each chunk is processed sequentially. When you need parallelism, switch to Dask.

Dask: Distributed Extension of Pandas

Dask has an almost identical DataFrame API to pandas, but with lazy evaluation underneath.

python

import dask.dataframe as dd

# Read multiple files — single Dask DataFrame
ddf = dd.read_csv("missions_*.csv")

# Same syntax as pandas, but not yet computed
result = ddf.groupby("mission_type")["resources"].sum()

# Trigger computation — this is when it actually runs
final = result.compute()
print(final)

What Dask does:

read_csv only reads file partition information, doesn't load data
groupby builds a computation graph (task graph), doesn't execute
compute() triggers computation, utilizing multi-core or cluster parallelism

Dask's limitation: not all pandas operations are supported. Especially complex index operations and some uses of groupby().transform(). You need to run through Dask first and handle unsupported parts with pandas.

python

# Common operations — Dask handles these well
ddf["duration_hours"] = ddf["duration_minutes"] / 60
ddf.dropna(subset=["success_rate"])
ddf[ddf["success_rate"] > 0.5].groupby("region").size().compute()

# Complex operations — may need to convert to pandas
subset = ddf[ddf["region"] == "north"].compute()

Spark SQL: From SQL to Distributed

If your team is already using Spark, or the data is too large for Dask (exceeding dozens of GB and requiring a cluster), Spark SQL is the most universal choice.

python

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data_analysis").getOrCreate()

# Read from file
df_spark = spark.read.csv("missions_*.csv", header=True, inferSchema=True)

# Use SQL for analysis
df_spark.createOrReplaceTempView("missions")
result_sql = spark.sql("""
    SELECT mission_type, AVG(success_rate) as avg_success
    FROM missions
    WHERE resources > 100
    GROUP BY mission_type
    HAVING avg_success > 0.6
""")
result_sql.show()

# Or use DataFrame API
from pyspark.sql import functions as F

result_df = (
    df_spark
    .filter(F.col("resources") > 100)
    .groupBy("mission_type")
    .agg(F.avg("success_rate").alias("avg_success"))
    .filter(F.col("avg_success") > 0.6)
)
result_df.show()

Spark SQL's lazy mechanism is similar to Dask—computation is only triggered when show() or write() is called.

Selection Principles

Scenario	Tool	Reason
Data < 16GB, single machine	pandas	Simplest, richest ecosystem
Data 16GB-100GB, single/small cluster	Dask	API closest to pandas, lowest migration cost
Data > 100GB, needs cluster	Spark SQL	Mature and stable, good resource management
Streaming data	Spark Streaming / Flink	Latency-sensitive scenarios

Universal Performance Optimization Principles

Regardless of your tool, these principles apply:

Filter early. Do filter before groupby to reduce downstream data volume.
Reduce Shuffle. groupby on non-partition keys triggers full data reshuffling. Try to group by partition keys.
Read only needed columns. read_csv(usecols=[...]) is much faster than reading everything then dropping.
Choose the right file format. Parquet is 5-10× faster than CSV because columnar storage only reads required columns.

python

# pandas reading parquet
df = pd.read_parquet("missions.parquet")

# Dask reading parquet
ddf = dd.read_parquet("missions_*.parquet")

# Spark reading parquet
df_spark = spark.read.parquet("missions/")

Common Pitfalls

Calling compute() frequently in Dask. Each compute() triggers full computation. You should build the entire computation graph and call compute() only once at the end.
Using default partition count. Too few means each partition is too large → OOM, too many means each partition is too small → scheduling overhead exceeds computation. Adjust npartitions based on data size and cluster resources.
Forgetting data locality. If compute nodes need to read data remotely, the network becomes the bottleneck. Store data on the same cluster as the compute nodes.

Pass Challenges

Warm-up: Process a large file that exceeds memory using pandas' chunked reading functionality.
Challenge: Migrate one of your pandas analysis scripts to Dask, documenting the code changes. How many lines need changing? Which operations can't be directly migrated?
Observation: Compare execution times on the same dataset using pandas (chunked reading), Dask, and Spark SQL.

Acceptance Criteria

Knows when to switch from pandas to distributed tools
Can use Dask for DataFrame-level distributed computation
Understands the difference between lazy computation and triggered computation
Can write analytical queries in Spark SQL

Traveler's Notes

Distributed isn't magic—it splits computation across multiple machines but also brings network overhead and scheduling complexity. First assess data size, then choose the tool.

Next Chapter Preview

We've covered a lot of technical methods. But data science isn't just a technical problem—next chapter, we pause to consider Data Ethics.