Metadata Card
- Prerequisites: Chapter 2 (Data Cleaning), Math B (probability/statistics)
- Estimated time: 45 minutes
- Core difficulty: Beginner
- Reading mode: Casual stroll
- Completion: Able to perform systematic EDA on a DataFrame and generate a visual report
Your Progress
You spent most of the day cleaning the data—missing values filled, outliers flagged, formats unified.
Now you face a pristine data table. But you discover a problem: this table has 50 columns and 100,000 rows, and you have absolutely no idea what's inside. What's the average temperature? Which column has the widest value distribution? Do certain columns seem related to each other?
Staring at numbers alone won't tell you. You need the data to speak for itself.
Your Task
The data is clean, and a neat table sits in front of you. But at 50 columns × 100,000 rows, you can't absorb it by reading row by row. EDA is your data reconnaissance method: use statistics and charts to quickly build intuition about the data, making its distributions, trends, relationships, and anomalies visible.
The Three Core Questions of EDA
Every round of EDA answers three questions:
- Univariate—What does this column look like? Distribution shape, central tendency, dispersion, missing values
- Bivariate—What is the relationship between two columns? Correlation, contrast, trend
- Multivariate—What patterns emerge from cross-examining multiple columns? Group comparison, dimension reduction
You start with question 1 and progressively go deeper to question 3. Don't skip steps.
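As a rough sketch of how the three questions map onto pandas calls, consider the toy table below. The values are invented for illustration; only the column names are borrowed from this chapter's examples.

```python
import pandas as pd

# Hypothetical toy data (invented values, chapter-style column names).
df = pd.DataFrame({
    "mission_type": ["scout", "scout", "scout", "escort", "escort", "escort"],
    "team_size": [2, 3, 4, 5, 6, 7],
    "success_rate": [0.60, 0.70, 0.80, 0.90, 0.85, 0.80],
})

# Question 1 (univariate): what does one column look like?
summary = df["success_rate"].describe()

# Question 2 (bivariate): how do two columns move together?
r_overall = df["team_size"].corr(df["success_rate"])

# Question 3 (multivariate): does that relationship change per group?
r_by_type = {
    name: group["team_size"].corr(group["success_rate"])
    for name, group in df.groupby("mission_type")
}

print(summary)
print("overall r:", round(r_overall, 3))
print("r by mission type:", {k: round(v, 3) for k, v in r_by_type.items()})
```

In this toy table the overall correlation is positive, yet within the escort group it is negative. That is the kind of detail you only see once you reach question 3.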
Univariate Analysis
The first step to looking at a single column: draw its distribution.
The crystal screens of the Prophecy Hall won't directly tell you what shape the data takes—you have to draw it yourself. The histogram is the first key: it turns a string of numbers into a wall, and the wall's height tells you where the data concentrates.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("missions_clean.csv")
# Numeric column—histogram
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
df["duration_minutes"].hist(bins=50, ax=axes[0])
axes[0].set_title("Distribution of Duration")
# Categorical column—bar chart
df["mission_type"].value_counts().plot(kind="bar", ax=axes[1])
axes[1].set_title("Mission Types")
plt.tight_layout()
plt.show()
The histogram tells you the shape of the data: Is it roughly symmetric and bell-shaped? Is it skewed, with a long tail to the left or right? Are there two peaks (a bimodal distribution, suggesting two different types of data mixed together)?
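To make "the wall's height" concrete, here is a small sketch of the binning behind a histogram: a set of bin edges and a count per bin. The durations are made up, and np.histogram is used only to expose the counts that a plotted histogram would draw as bars.

```python
import numpy as np

# Hypothetical durations: a cluster of short missions and a cluster of long ones.
durations = np.array([28, 30, 30, 31, 32, 33, 35,
                      118, 119, 120, 120, 121, 122, 125])

# Split the value range into 5 equal-width bins and count values per bin.
counts, edges = np.histogram(durations, bins=5)

# Bins that reach the maximum height; two separated peaks hint at bimodality.
peak_bins = [i for i, c in enumerate(counts) if c == counts.max()]

print("bin edges:", edges)
print("counts per bin:", counts.tolist())
print("tallest bins:", peak_bins)
```

Two tall bins separated by empty ones are what a bimodal histogram looks like in numbers. With real data the gap is rarely this clean, so try several bin counts before drawing conclusions.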
Beyond the picture, you need numerical summaries. Visual inspection gives you the shape; df.describe() gives you the precise numbers. It is the instinctive response of every data table in the Prophecy Hall: count, mean, std, min, quartiles, max.
print(df["duration_minutes"].describe())
Read the output row by row: count, mean, std, min, 25%, 50% (median), 75%, max. If the mean is significantly larger than the median, the data is probably right-skewed: a few extremely large values are pulling the mean up.
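The mean-versus-median check can be tried on a tiny example. The durations below are invented, with one deliberately extreme value; Series.skew() is added as one possible numeric cross-check, not as a required step.

```python
import pandas as pd

# Hypothetical durations in minutes; the last value is an extreme one.
durations = pd.Series([30, 32, 35, 36, 38, 40, 41, 43, 45, 600])

mean = durations.mean()      # 94.0, dragged upward by the single 600
median = durations.median()  # 39.0, barely affected by it
skewness = durations.skew()  # positive for a right-skewed sample

print(f"mean={mean}, median={median}, skew={skewness:.2f}")
if mean > median and skewness > 0:
    print("Looks right-skewed: inspect the largest values.")
```

Nine of the ten missions last under 46 minutes, yet the mean says 94. The median is the more honest "typical value" here.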
Bivariate Analysis
Now you want to see the relationship between two columns.
The story of a single column is just the prelude. True insight comes from the conversation between two columns. Each point on a scatter plot is one mission's record, and the direction of the cloud tells you: does a longer mission mean a higher success rate?
# Scatter plot—relationship between two columns
plt.figure(figsize=(6, 6))
plt.scatter(df["duration_minutes"], df["success_rate"], alpha=0.3)
plt.xlabel("Duration (min)")
plt.ylabel("Success Rate")
plt.title("Duration vs Success Rate")
plt.show()
# Correlation matrix
corr = df[["duration_minutes", "success_rate", "team_size", "resources"]].corr()
print(corr)
Each point on the scatter plot is a mission. The distribution pattern tells you:
- Upward slope → positive correlation: longer duration, higher success rate?
- Downward slope → negative correlation: longer duration, lower success rate?
- Random scatter → no correlation
corr() outputs a matrix of correlation coefficients (Pearson by default), each ranging from -1 to 1. But Pearson correlation only captures linear relationships. Two variables can have a coefficient near 0 and still be strongly related in a nonlinear way (e.g., a U-shaped relationship). So the correlation matrix cannot replace looking at the chart.
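A minimal sketch of that limitation, using synthetic data: y depends entirely on x, but through a symmetric U-shape, so the linear correlation comes out at essentially zero.

```python
import numpy as np

# Synthetic example: y is fully determined by x, but the relationship is a U-shape.
x = np.linspace(-3, 3, 101)
y = x ** 2

# Pearson correlation between x and y: close to 0 because the U is symmetric.
r_linear = np.corrcoef(x, y)[0, 1]

# Correlating against the transformed feature x**2 recovers the relationship.
r_transformed = np.corrcoef(x ** 2, y)[0, 1]

print(f"r(x, y)    = {r_linear:.6f}")
print(f"r(x**2, y) = {r_transformed:.6f}")
```

A scatter plot of x against y would reveal the U immediately, which is why the chart comes first and the coefficient second.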
Multivariate Analysis
Two columns aren't enough—you want to see "what's the relationship between team size and success rate under different mission types?"
Multivariate analysis breaks data apart and reassembles it—color-coded by mission type, what you see is no longer a tangled mess but clusters in different colors. Each mission type has its unique pattern, and box plots let you see the differences at a glance.
# Group comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Box plot: success rate distribution by mission type
df.boxplot(column="success_rate", by="mission_type", ax=axes[0])
axes[0].set_title("Success Rate by Mission Type")
# Color-coded scatter: duration vs success_rate, one color per mission type
for mt in df["mission_type"].unique():
    subset = df[df["mission_type"] == mt]
    axes[1].scatter(subset["duration_minutes"], subset["success_rate"],
                    label=mt, alpha=0.3)
axes[1].legend()
axes[1].set_xlabel("Duration")
axes[1].set_ylabel("Success Rate")
axes[1].set_title("By Mission Type")
plt.tight_layout()
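plt.show()
Box plots are a great tool for quickly comparing group distributions. They show the median (center line), the interquartile range (box), whiskers that by default reach the most extreme points within 1.5 × IQR of the box, and outliers (points beyond the whiskers). If the boxes of two groups don't overlap at all, the groups likely differ in a meaningful way, though a formal test is still needed to confirm it.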
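The pieces of a box plot can also be computed by hand, which helps when you want to list the outliers rather than just see them. This sketch assumes the common 1.5 × IQR whisker rule (the matplotlib default) and uses invented values.

```python
import pandas as pd

# Hypothetical measurements with one suspiciously large value.
values = pd.Series([10, 12, 13, 14, 15, 16, 17, 18, 20, 60])

q1 = values.quantile(0.25)   # lower edge of the box
q3 = values.quantile(0.75)   # upper edge of the box
iqr = q3 - q1                # height of the box

# Points beyond these fences are drawn as individual outlier markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)].tolist()

print(f"Q1={q1}, median={values.median()}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```

To get the same numbers per group, one option is to apply this logic inside df.groupby("mission_type"), which mirrors what the grouped box plot shows.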
EDA Automation
Manual charting is good for exploration. When you need a standardized report, use pandas-profiling (now called ydata-profiling) to generate one in a single command.
pip install ydata-profiling
from ydata_profiling import ProfileReport
report = ProfileReport(df, title="Data Profiling Report", explorative=True)
report.to_file("eda_report.html")
This HTML report includes: statistical summary for each column, distribution histograms, missing value matrix, correlation heatmap, and outlier annotations. You don't need to draw every chart from scratch on every project. First run an automated report, then dive deeper into suspicious columns.
The Rhythm of EDA
You don't need to finish all analysis in one go. The standard rhythm is:
- Run automated report → quick global overview
- Pick out "suspicious" columns from the report (weird distributions, high missing rates, high correlations)
- Manually draw charts and do cross-analysis on these columns
- Form hypotheses → record them
- After entering the modeling phase, return to EDA to validate model anomalies
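The "pick out suspicious columns" step can be partly scripted. The helper below is a hypothetical sketch, not part of pandas or ydata-profiling; the function name flag_suspicious and its three thresholds are assumptions you would tune for your own data.

```python
import pandas as pd

def flag_suspicious(df, missing_thresh=0.2, skew_thresh=1.0, corr_thresh=0.9):
    """Hypothetical helper: list columns worth a closer manual look."""
    numeric = df.select_dtypes(include="number")
    missing = df.isna().mean()   # fraction of missing values per column
    skew = numeric.skew()        # sample skewness per numeric column
    corr = numeric.corr().abs()  # absolute Pearson correlations

    pairs = []
    cols = list(corr.columns)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > corr_thresh:
                pairs.append((cols[i], cols[j]))

    return {
        "high_missing": missing[missing > missing_thresh].index.tolist(),
        "skewed": skew[skew.abs() > skew_thresh].index.tolist(),
        "correlated_pairs": pairs,
    }

# Invented toy table using this chapter's column names.
toy = pd.DataFrame({
    "duration_minutes": [30, 32, 35, 36, 38, 40, 41, 43, 45, 600],
    "team_size": [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    "resources": [4, 6, 8, 10, 12, 14, 16, 18, 20, 22],
    "success_rate": [0.5, None, None, None, 0.7, 0.6, 0.8, 0.55, 0.65, 0.75],
})
flags = flag_suspicious(toy)
print(flags)
```

The output is only a shortlist. Each flagged column still deserves a chart before you write down a hypothesis about it.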
Common Pitfalls
- Only looking at descriptive statistics, not charts. Four datasets can have nearly identical means and variances but completely different distribution shapes (Anscombe's Quartet).
- Treating correlation as causation. Two columns being highly correlated doesn't mean one causes the other.
- Doing EDA before data cleaning. Dirty data will ruin your distribution chart: a single -999 placeholder stretches the histogram's axis and squeezes the real values into one or two bars.
- Over-interpreting. Seeing an interesting pattern and immediately assuming it's a major discovery, forgetting it could be sampling error.
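The -999 pitfall is easy to reproduce. In this sketch the temperature readings are invented, and -999 is assumed to be the placeholder your source system uses for "missing".

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings; -999 marks a missing measurement.
temps = pd.Series([21.0, 22.5, -999, 23.0, 22.0, -999, 21.5])

# Summarizing before cleaning: the placeholder is treated as a real value.
raw_mean = temps.mean()

# Convert the placeholder to NaN so pandas skips it in statistics.
cleaned = temps.replace(-999, np.nan)
clean_mean = cleaned.mean()

print(f"mean with placeholders: {raw_mean:.2f}")
print(f"mean after cleaning:    {clean_mean:.2f}")
```

A quick look at min in df.describe() is often enough to catch placeholders like this before they reach a chart.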
Pass Challenges
- Warm-up: Use df.describe() to find a column that looks odd, then draw its histogram to verify your intuition.
- Challenge: Perform a complete EDA on a dataset (univariate distributions, bivariate scatter plots, a correlation matrix, grouped box plots) and form 3+ hypotheses about the data.
- Observation: Run an automated profiling report and find at least one data pattern you didn't notice manually.
Acceptance Criteria
- Can perform univariate, bivariate, and multivariate analysis on a dataset
- Can choose the right chart for the right data type (numeric → histogram/box plot, categorical → bar chart)
- Can interpret scatter plots and box plots
- Knows the limitations of the correlation matrix
Traveler's Notes
Visualization is not the final presentation—it's an interrogation of the data. You ask it questions, it answers, and you follow up.
Next Chapter Preview
You can now understand data through visualization and statistics. But when you need to ask more complex aggregate questions, you need a more powerful weapon—next chapter, Analytical SQL.