Metadata Card
- Prerequisites: Chapter 4 (SQL for Analysis), Python pandas basics
- Estimated time: 50 minutes
- Core difficulty: Advanced
- Reading mode: High focus
- Completion: Able to identify data scale scenarios where pandas can no longer handle the job, and choose the correct distributed tool
Your Progress
Your model ran fine on 100,000 records. Your daily work at the Data Prophecy Hall was smooth—load a CSV, run a groupby, draw a chart, all done in seconds.
But today was different. The intelligence officer walked up to your desk holding a storage crystal and slammed it down: "Here are the complete logs from 10 outposts over the past five years, about 500GB. Analyze the relationship between equipment wear rate and combat frequency."
You opened pandas, pd.read_csv()—waited 10 seconds, memory spiked, then OOM. You tried chunked reading—so slow it felt like crawling, and you had to manually merge cross-chunk groupby() results.
"The Hall could handle everything before," you said.
The intelligence officer replied: "That's because what you had before was intelligence. This is history."
You realized a harsh truth: your tools had reached their limits—not because your code was wrong, but because the data had grown beyond what a single machine could process at once. In the Data Prophecy Hall, data scientists work with data that fits in memory; on the real battlefield, data is larger than memory.
You need distributed data processing.
Your Task
Your dataset went from a single CSV to hundreds of CSVs, from memory-friendly to dozens of gigabytes. Your df.groupby() ran for an hour without finishing—you stared at the progress bar, motionless, starting to question your career. pd.concat() went straight to OOM.
Pandas was designed assuming data fits in memory—like the manual workbench in the Prophecy Hall, which can only hold so many scrolls. When this assumption breaks, you need distributed data processing: splitting scrolls across multiple workbenches for parallel processing. This chapter guides you from pandas' limits gracefully into Dask and Spark SQL.
Pandas' Limits and Alternatives
Pandas performs well when single-machine memory < 16 GB and data < half of available memory. When data exceeds this threshold, you have three directions:
- Iteration/chunking — Don't load all data at once, process in chunks
- Lazy evaluation — Build a computation graph, trigger on demand
- Distributed — Parallel computation across multiple machines
import pandas as pd
# Direction 1: Chunked reading
chunk_size = 100_000
results = []
for chunk in pd.read_csv("large_dataset.csv", chunksize=chunk_size):
# Aggregate each chunk
chunk_result = chunk.groupby("mission_type")["resources"].sum()
results.append(chunk_result)
# Merge chunk results
final = pd.concat(results).groupby(level=0).sum()Chunked reading solves memory problems, but doesn't offer parallelism—each chunk is processed sequentially. When you need parallelism, switch to Dask.
Dask: Distributed Extension of Pandas
Dask has an almost identical DataFrame API to pandas, but with lazy evaluation underneath.
import dask.dataframe as dd
# Read multiple files — single Dask DataFrame
ddf = dd.read_csv("missions_*.csv")
# Same syntax as pandas, but not yet computed
result = ddf.groupby("mission_type")["resources"].sum()
# Trigger computation — this is when it actually runs
final = result.compute()
print(final)What Dask does:
read_csvonly reads file partition information, doesn't load datagroupbybuilds a computation graph (task graph), doesn't executecompute()triggers computation, utilizing multi-core or cluster parallelism
Dask's limitation: not all pandas operations are supported. Especially complex index operations and some uses of groupby().transform(). You need to run through Dask first and handle unsupported parts with pandas.
# Common operations — Dask handles these well
ddf["duration_hours"] = ddf["duration_minutes"] / 60
ddf.dropna(subset=["success_rate"])
ddf[ddf["success_rate"] > 0.5].groupby("region").size().compute()
# Complex operations — may need to convert to pandas
subset = ddf[ddf["region"] == "north"].compute()Spark SQL: From SQL to Distributed
If your team is already using Spark, or the data is too large for Dask (exceeding dozens of GB and requiring a cluster), Spark SQL is the most universal choice.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("data_analysis").getOrCreate()
# Read from file
df_spark = spark.read.csv("missions_*.csv", header=True, inferSchema=True)
# Use SQL for analysis
df_spark.createOrReplaceTempView("missions")
result_sql = spark.sql("""
SELECT mission_type, AVG(success_rate) as avg_success
FROM missions
WHERE resources > 100
GROUP BY mission_type
HAVING avg_success > 0.6
""")
result_sql.show()
# Or use DataFrame API
from pyspark.sql import functions as F
result_df = (
df_spark
.filter(F.col("resources") > 100)
.groupBy("mission_type")
.agg(F.avg("success_rate").alias("avg_success"))
.filter(F.col("avg_success") > 0.6)
)
result_df.show()Spark SQL's lazy mechanism is similar to Dask—computation is only triggered when show() or write() is called.
Selection Principles
| Scenario | Tool | Reason |
|---|---|---|
| Data < 16GB, single machine | pandas | Simplest, richest ecosystem |
| Data 16GB-100GB, single/small cluster | Dask | API closest to pandas, lowest migration cost |
| Data > 100GB, needs cluster | Spark SQL | Mature and stable, good resource management |
| Streaming data | Spark Streaming / Flink | Latency-sensitive scenarios |
Universal Performance Optimization Principles
Regardless of your tool, these principles apply:
- Filter early. Do
filterbeforegroupbyto reduce downstream data volume. - Reduce Shuffle.
groupbyon non-partition keys triggers full data reshuffling. Try to group by partition keys. - Read only needed columns.
read_csv(usecols=[...])is much faster than reading everything then dropping. - Choose the right file format. Parquet is 5-10× faster than CSV because columnar storage only reads required columns.
# pandas reading parquet
df = pd.read_parquet("missions.parquet")
# Dask reading parquet
ddf = dd.read_parquet("missions_*.parquet")
# Spark reading parquet
df_spark = spark.read.parquet("missions/")Common Pitfalls
- Calling
compute()frequently in Dask. Eachcompute()triggers full computation. You should build the entire computation graph and callcompute()only once at the end. - Using default partition count. Too few means each partition is too large → OOM, too many means each partition is too small → scheduling overhead exceeds computation. Adjust
npartitionsbased on data size and cluster resources. - Forgetting data locality. If compute nodes need to read data remotely, the network becomes the bottleneck. Store data on the same cluster as the compute nodes.
Pass Challenges
- Warm-up: Process a large file that exceeds memory using pandas' chunked reading functionality.
- Challenge: Migrate one of your pandas analysis scripts to Dask, documenting the code changes. How many lines need changing? Which operations can't be directly migrated?
- Observation: Compare execution times on the same dataset using pandas (chunked reading), Dask, and Spark SQL.
Acceptance Criteria
- Knows when to switch from pandas to distributed tools
- Can use Dask for DataFrame-level distributed computation
- Understands the difference between lazy computation and triggered computation
- Can write analytical queries in Spark SQL
Traveler's Notes
Distributed isn't magic—it splits computation across multiple machines but also brings network overhead and scheduling complexity. First assess data size, then choose the tool.
Next Chapter Preview
We've covered a lot of technical methods. But data science isn't just a technical problem—next chapter, we pause to consider Data Ethics.