Software Systems Atlas

Metadata Card

Prerequisites: Chapter 1 (Data Lifecycle), Python / pandas basics
Estimated time: 45 minutes
Core difficulty: Beginner
Reading mode: High focus
Completion: Able to independently clean a real dataset with missing values, outliers, and duplicate rows

Your Progress

In the previous chapter you saw the big picture of data flow—acquisition, transfer, storage, processing, analysis, archiving, destruction. The pipeline diagrams in the Data Prophecy Hall looked beautiful. But when you actually started working, you hit the first roadblock immediately: the data you pulled was all dirty.

You sat down at the workbench in the Data Prophecy Hall and loaded the sensor logs sent back from the front lines. The archivist of the Data Prophecy Hall stood behind you and glanced at your screen—date formats were all over the place (2024-01-15, 01/15/2024, 15-Jan-2024), some fields were empty, and some values were clearly wrong—a temperature sensor reported 999°C.

"Not surprising," the archivist said, voice as flat as if saying "it rained today." "Raw data is always dirty. No dataset is born clean—sensors disconnect, people make typos, database fields change, encodings get corrupted. Every piece of data you receive needs to be cleaned before you can use it. That's the first lesson of data science: don't trust raw data."

Your Task

You loaded a CSV, called df.info(), and saw 1000 rows. But as you scrolled down, you noticed: some cells are empty, some date formats differ, a few rows look identical, and one column has values in an absurd range—a temperature reading of 999 degrees.

Raw data is almost always dirty, like the field reports from the front lines that always have ambiguities. This chapter teaches you how to systematically fix dirty data—not by staring at rows one by one, but by building a repeatable cleaning pipeline.

The Four Forms of Dirty Data

Almost all dirty data falls into these four categories:

Missing values — Empty, or fillers like -999, N/A, None
Outliers — Values outside the reasonable range
Duplicate data — Fully duplicated or key-field duplicated
Inconsistent formatting — Dates, string casing, encoding not uniform

Your toolbox doesn't need complex models. Start with your eyes, then quantify with simple statistics.

Step One: Overview of Data Health

python

import pandas as pd
import numpy as np

df = pd.read_csv("sensor_readings.csv")

# First look at the data overview
print(df.info())
print(df.describe())

df.info() tells you the non-null count for each column. If a column has significantly fewer rows than the total, that column has missing values. df.describe() shows statistics for numeric columns—if max is orders of magnitude above 75%, you have outliers.

Step Two: Handling Missing Values

Missing values arise from three causes:

Missing Completely at Random: e.g., sensors occasionally disconnect, records aren't written
Missing at Random (conditional): e.g., certain types of tasks don't produce a specific field
Not Missing at Random: e.g., sensors shut down under high temperatures, so high-temp readings are missing—this isn't random

Approach depends on the missing ratio and importance:

python

# Method 1: Delete — suitable for low missing percentage
df_clean = df.dropna(subset=["critical_field"])

# Method 2: Fill — suitable for numeric data
df["temperature"].fillna(df["temperature"].median(), inplace=True)

# Method 3: Forward fill — suitable for time series
df["status"].ffill(inplace=True)

# Method 4: Mark missing — preserve information
df["speed_missing"] = df["speed"].isna().astype(int)

A rule of thumb: if a column has more than 70% missing values, you should probably just drop that column. If you need to keep it, ask yourself: why is it so incomplete, and is it still valuable?

Step Three: Detecting Outliers—"Is this 999-degree sensor data a real anomaly or a broken sensor?"

Do you think machine learning can help you find outliers? Not necessarily. The simplest anomaly detection method isn't machine learning—it's the Interquartile Range (IQR). It requires no training, no parameter tuning—just sorting and quartiles.

The principle is simple: sort all values, find the middle 50% range (Q1 to Q3), then define a "normal range"—data falling outside Q1 - 1.5×IQR to Q3 + 1.5×IQR are potential outliers.

python

# IQR method
Q1 = df["measurement"].quantile(0.25)
Q3 = df["measurement"].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df["measurement"] < lower_bound) | (df["measurement"] > upper_bound)]
print(f"Detected {len(outliers)} outliers")

1.5× IQR is an empirical value, not a hard rule. If your data has specific physical constraints (like temperature can't exceed 200 degrees), domain knowledge takes priority. The Data Prophecy Hall archivist also tells you a rule of thumb: first check the data source, then decide whether to delete—if the sensor model's manual specifies a maximum range of 150 degrees, then 999 degrees is 100% a faulty reading. If it's just a bit higher than usual—it could be a real anomaly; keep and flag it.

Outlier Handling Strategies:

Values impossible in the domain → delete directly or set to null
Abnormal but physically possible values → keep and flag, don't easily delete
Boundary-oscillating data → use Winsorization to pull extreme values to the boundary

python

# Winsorize: pull values above the 99th percentile down to the 99th percentile
upper_limit = df["measurement"].quantile(0.99)
df["measurement_clipped"] = df["measurement"].clip(upper=upper_limit)

Step Four: Deduplication

Sometimes duplicates are obvious—entirely identical rows. Sometimes they're subtle—same key fields but different auxiliary fields.

python

# Full duplicates
df.drop_duplicates(inplace=True)

# Deduplication based on key fields (keep the latest record)
df.sort_values("timestamp", inplace=True)
df.drop_duplicates(subset=["mission_id", "sensor_id"], keep="last", inplace=True)

Step Five: Format Unification

String data often looks "different but has the same content" due to inconsistent casing, spacing, and encoding.

python

# Unify to lowercase + strip spaces
df["location"] = df["location"].str.strip().str.lower()

# Unify dates
df["date"] = pd.to_datetime(df["date"], errors="coerce")

errors="coerce" converts unparseable dates to NaT (missing), which you can handle uniformly in the next step.

The Iterative Nature of Cleaning—Data Is Never Cleaned Once

The Data Prophecy Hall archivist will tell you: don't expect to clean data in one pass. It's not realistic.

A typical cleaning session is like peeling an onion—each layer may reveal new problems:

Load → view overview → discover missing values
Fill missing → recheck → discover outliers
Handle outliers → recheck → discover duplicates
Deduplicate → recheck → discover format issues
Fix formats → reach analysis stage → come back and realize some cleaning logic was wrong

This is normal—no, this is how data cleaning works. Not because your method is flawed, but because data itself has multiple layers of problems.

The right approach: build a cleaning script, add a rule each time you find a problem, instead of fixing things manually. Your script will gradually grow from dozens to hundreds of lines, eventually becoming a reusable data quality rule library. Next time you get a similar batch of data, just change the filename and run it.

Common Pitfalls

Modifying raw data directly. When you want to undo, there's no backup.
Filling highly skewed columns with the mean. The mean is sensitive to outliers; median is safer.
Assuming deduplication means "exact row matching." In real projects, it's almost always key-field deduplication.
Mixing cleaning and analysis in one script. Ensure you can rerun cleaning steps independently.

Pass Challenges

Warm-up: Find a CSV file, use df.info() and df.describe() to identify at least three data quality issues.
Challenge: Write a generic data cleaning function that accepts a DataFrame and config dictionary, automatically handling missing values, outliers, duplicates, and format issues.
Troubleshooting: Your cleaning script removed 20% of the rows. How do you determine if the cleaning rules are too strict or the data is genuinely that dirty?

Acceptance Criteria

Can systematically identify the four types of dirty data
Can choose the right strategy based on missing ratio
Can use the IQR method to detect outliers and distinguish between "keep" and "delete"
Can write reusable cleaning scripts

Traveler's Notes

Data cleaning has no magic. The time you spend on this stage directly pays off in the accuracy of your analysis stage.

Next Chapter Preview

Data is clean. Now you can start exploring it—next chapter, Exploratory Data Analysis and Visualization.