Skip to content

Metadata Card

  • Prerequisites: Chapter 1 (Data Lifecycle), Python / pandas basics
  • Estimated time: 45 minutes
  • Core difficulty: Beginner
  • Reading mode: High focus
  • Completion: Able to independently clean a real dataset with missing values, outliers, and duplicate rows

Your Progress

In the previous chapter you saw the big picture of data flow—acquisition, transfer, storage, processing, analysis, archiving, destruction. The pipeline diagrams in the Data Prophecy Hall looked beautiful. But when you actually started working, you hit the first roadblock immediately: the data you pulled was all dirty.

You sat down at the workbench in the Data Prophecy Hall and loaded the sensor logs sent back from the front lines. The archivist of the Data Prophecy Hall stood behind you and glanced at your screen—date formats were all over the place (2024-01-15, 01/15/2024, 15-Jan-2024), some fields were empty, and some values were clearly wrong—a temperature sensor reported 999°C.

"Not surprising," the archivist said, voice as flat as if saying "it rained today." "Raw data is always dirty. No dataset is born clean—sensors disconnect, people make typos, database fields change, encodings get corrupted. Every piece of data you receive needs to be cleaned before you can use it. That's the first lesson of data science: don't trust raw data."

Your Task

You loaded a CSV, called df.info(), and saw 1000 rows. But as you scrolled down, you noticed: some cells are empty, some date formats differ, a few rows look identical, and one column has values in an absurd range—a temperature reading of 999 degrees.

Raw data is almost always dirty, like the field reports from the front lines that always have ambiguities. This chapter teaches you how to systematically fix dirty data—not by staring at rows one by one, but by building a repeatable cleaning pipeline.


The Four Forms of Dirty Data

Almost all dirty data falls into these four categories:

  1. Missing values — Empty, or fillers like -999, N/A, None
  2. Outliers — Values outside the reasonable range
  3. Duplicate data — Fully duplicated or key-field duplicated
  4. Inconsistent formatting — Dates, string casing, encoding not uniform

Your toolbox doesn't need complex models. Start with your eyes, then quantify with simple statistics.


Step One: Overview of Data Health

python
import pandas as pd
import numpy as np

df = pd.read_csv("sensor_readings.csv")

# First look at the data overview
print(df.info())
print(df.describe())

df.info() tells you the non-null count for each column. If a column has significantly fewer rows than the total, that column has missing values. df.describe() shows statistics for numeric columns—if max is orders of magnitude above 75%, you have outliers.

Step Two: Handling Missing Values

Missing values arise from three causes:

  • Missing Completely at Random: e.g., sensors occasionally disconnect, records aren't written
  • Missing at Random (conditional): e.g., certain types of tasks don't produce a specific field
  • Not Missing at Random: e.g., sensors shut down under high temperatures, so high-temp readings are missing—this isn't random

Approach depends on the missing ratio and importance:

python
# Method 1: Delete — suitable for low missing percentage
df_clean = df.dropna(subset=["critical_field"])

# Method 2: Fill — suitable for numeric data
df["temperature"].fillna(df["temperature"].median(), inplace=True)

# Method 3: Forward fill — suitable for time series
df["status"].ffill(inplace=True)

# Method 4: Mark missing — preserve information
df["speed_missing"] = df["speed"].isna().astype(int)

A rule of thumb: if a column has more than 70% missing values, you should probably just drop that column. If you need to keep it, ask yourself: why is it so incomplete, and is it still valuable?

Step Three: Detecting Outliers—"Is this 999-degree sensor data a real anomaly or a broken sensor?"

Do you think machine learning can help you find outliers? Not necessarily. The simplest anomaly detection method isn't machine learning—it's the Interquartile Range (IQR). It requires no training, no parameter tuning—just sorting and quartiles.

The principle is simple: sort all values, find the middle 50% range (Q1 to Q3), then define a "normal range"—data falling outside Q1 - 1.5×IQR to Q3 + 1.5×IQR are potential outliers.

python
# IQR method
Q1 = df["measurement"].quantile(0.25)
Q3 = df["measurement"].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df["measurement"] < lower_bound) | (df["measurement"] > upper_bound)]
print(f"Detected {len(outliers)} outliers")

1.5× IQR is an empirical value, not a hard rule. If your data has specific physical constraints (like temperature can't exceed 200 degrees), domain knowledge takes priority. The Data Prophecy Hall archivist also tells you a rule of thumb: first check the data source, then decide whether to delete—if the sensor model's manual specifies a maximum range of 150 degrees, then 999 degrees is 100% a faulty reading. If it's just a bit higher than usual—it could be a real anomaly; keep and flag it.

Outlier Handling Strategies:

  • Values impossible in the domain → delete directly or set to null
  • Abnormal but physically possible values → keep and flag, don't easily delete
  • Boundary-oscillating data → use Winsorization to pull extreme values to the boundary
python
# Winsorize: pull values above the 99th percentile down to the 99th percentile
upper_limit = df["measurement"].quantile(0.99)
df["measurement_clipped"] = df["measurement"].clip(upper=upper_limit)

Step Four: Deduplication

Sometimes duplicates are obvious—entirely identical rows. Sometimes they're subtle—same key fields but different auxiliary fields.

python
# Full duplicates
df.drop_duplicates(inplace=True)

# Deduplication based on key fields (keep the latest record)
df.sort_values("timestamp", inplace=True)
df.drop_duplicates(subset=["mission_id", "sensor_id"], keep="last", inplace=True)

Step Five: Format Unification

String data often looks "different but has the same content" due to inconsistent casing, spacing, and encoding.

python
# Unify to lowercase + strip spaces
df["location"] = df["location"].str.strip().str.lower()

# Unify dates
df["date"] = pd.to_datetime(df["date"], errors="coerce")

errors="coerce" converts unparseable dates to NaT (missing), which you can handle uniformly in the next step.


The Iterative Nature of Cleaning—Data Is Never Cleaned Once

The Data Prophecy Hall archivist will tell you: don't expect to clean data in one pass. It's not realistic.

A typical cleaning session is like peeling an onion—each layer may reveal new problems:

  1. Load → view overview → discover missing values
  2. Fill missing → recheck → discover outliers
  3. Handle outliers → recheck → discover duplicates
  4. Deduplicate → recheck → discover format issues
  5. Fix formats → reach analysis stage → come back and realize some cleaning logic was wrong

This is normal—no, this is how data cleaning works. Not because your method is flawed, but because data itself has multiple layers of problems.

The right approach: build a cleaning script, add a rule each time you find a problem, instead of fixing things manually. Your script will gradually grow from dozens to hundreds of lines, eventually becoming a reusable data quality rule library. Next time you get a similar batch of data, just change the filename and run it.


Common Pitfalls

  • Modifying raw data directly. When you want to undo, there's no backup.
  • Filling highly skewed columns with the mean. The mean is sensitive to outliers; median is safer.
  • Assuming deduplication means "exact row matching." In real projects, it's almost always key-field deduplication.
  • Mixing cleaning and analysis in one script. Ensure you can rerun cleaning steps independently.

Pass Challenges

  • Warm-up: Find a CSV file, use df.info() and df.describe() to identify at least three data quality issues.
  • Challenge: Write a generic data cleaning function that accepts a DataFrame and config dictionary, automatically handling missing values, outliers, duplicates, and format issues.
  • Troubleshooting: Your cleaning script removed 20% of the rows. How do you determine if the cleaning rules are too strict or the data is genuinely that dirty?

Acceptance Criteria

  • Can systematically identify the four types of dirty data
  • Can choose the right strategy based on missing ratio
  • Can use the IQR method to detect outliers and distinguish between "keep" and "delete"
  • Can write reusable cleaning scripts

Traveler's Notes

Data cleaning has no magic. The time you spend on this stage directly pays off in the accuracy of your analysis stage.


Next Chapter Preview

Data is clean. Now you can start exploring it—next chapter, Exploratory Data Analysis and Visualization.

Built with VitePress | Software Systems Atlas