
Metadata Card

  • Prerequisites: Chapters 1-2 (Data Lifecycle, Data Cleaning)
  • Estimated time: 40 minutes
  • Core difficulty: Advanced
  • Reading mode: Casual stroll
  • Completion: Able to design a data quality dimension framework, set up a basic data catalog and monitoring

Your Progress

Data ethics made you realize: data is both an asset and a risk. The Prophecy Hall was piling up more and more data, but no one could answer a basic question: which data is trustworthy? Who is using it? How long should it be kept?

You flipped open the archives' rulebook and found it filled with rules—data quality, data catalog, data lifecycle policies. Not the sexiest topic, but without it, the data in your hands is just a pile of risky numbers.

Your Task

Your team runs 100 data pipelines daily. This morning, one pipeline suddenly came up empty, causing every downstream report to error out. Someone should be watching these pipelines, you think, but people can't watch everything. Data governance is the systematic answer: define data quality, build a catalog, and automate monitoring so that data stays trustworthy and usable.


What Data Governance Is Not

Data governance is not "adding a bunch of approval processes so that data analysts can't get any work done." It is infrastructure, as natural a practice as CI/CD is for code. You want pipeline issues to be detected automatically, not reported by the business side two weeks later as "the report data is wrong."

Six Dimensions of Data Quality

A widely used data quality framework defines six dimensions:

Dimension | Definition | Check Method
Completeness | Whether data is missing | df.isnull().sum()
Accuracy | Whether data reflects reality | Validation rules (e.g., age >= 0)
Consistency | Whether the same data agrees across systems | Cross-system reconciliation
Timeliness | Whether data is available within the expected time | Check delay duration
Uniqueness | Whether there are duplicate records | df.duplicated().sum()
Validity | Whether data conforms to format specifications | Regex validation, enum value validation

python
# Data quality check function — basic version
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

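def quality_check(df, table_name):
    """Compute basic quality metrics for one table; returns a Series named after the table."""
    checks = {"row_count": len(df)}  # on an empty table the rates below are NaN

    # Completeness: share of missing values per column
    missing = df.isnull().sum()
    checks["missing_rate"] = (missing / len(df)).to_dict()

    # Accuracy: example rules, adapt them to your own columns
    if "age" in df.columns:
        checks["invalid_age"] = int((df["age"] < 0).sum())
    if "email" in df.columns:
        # Null emails count as invalid here because na=False
        checks["invalid_email"] = int(
            (~df["email"].str.contains("@", na=False)).sum()
        )

    # Uniqueness: share of fully duplicated rows
    checks["duplicate_rate"] = float(df.duplicated().mean())

    # Timeliness: seconds since the newest record was loaded
    # (assumes load_timestamp is a timezone-naive datetime column)
    if "load_timestamp" in df.columns:
        max_age = (datetime.now() - df["load_timestamp"].max()).total_seconds()
        checks["data_max_age_seconds"] = max_age

    return pd.Series(checks, name=table_name)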

# Run checks on each table (df_missions and df_logs are assumed to be loaded already)
results = pd.DataFrame([
    quality_check(df_missions, "missions"),
    quality_check(df_logs, "logs"),
])
print(results)

The key to automating quality rules is a repeatable pattern: run the same check function against every table and collect the results in one uniformly formatted DataFrame. Every piece of data passes through this sieve before entering the Prophecy Hall, so missing, duplicate, outdated, and anomalous entries are flagged at the door.
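
The check function above covers four of the six dimensions. Validity and consistency could be checked along the same lines. The following is only a sketch: the `status` column, the allowed-value set, the simplified email pattern, and the two sample tables keyed by `mission_id` are all hypothetical.

```python
import pandas as pd

# Hypothetical rules: a simplified email pattern and an allowed status set.
EMAIL_PATTERN = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"
ALLOWED_STATUS = {"open", "done", "failed"}

def validity_check(df):
    """Count values that break format or enum rules (nulls are left to the completeness check)."""
    emails = df["email"].dropna()
    statuses = df["status"].dropna()
    return {
        "invalid_email_format": int((~emails.str.match(EMAIL_PATTERN)).sum()),
        "invalid_status": int((~statuses.isin(ALLOWED_STATUS)).sum()),
    }

def consistency_check(df_a, df_b, key, column):
    """Compare one column between two systems' copies of the same records."""
    merged = df_a.merge(df_b, on=key, how="outer",
                        suffixes=("_a", "_b"), indicator=True)
    both = merged[merged["_merge"] == "both"]
    return {
        "only_in_a": int((merged["_merge"] == "left_only").sum()),
        "only_in_b": int((merged["_merge"] == "right_only").sum()),
        "mismatched": int((both[f"{column}_a"] != both[f"{column}_b"]).sum()),
    }

# Made-up sample data for illustration
warehouse = pd.DataFrame({
    "mission_id": [1, 2, 3],
    "email": ["a@example.com", "not-an-email", None],
    "status": ["open", "done", "archived"],
})
source_system = pd.DataFrame({
    "mission_id": [2, 3, 4],
    "status": ["done", "failed", "open"],
})

validity = validity_check(warehouse)
consistency = consistency_check(warehouse, source_system,
                                key="mission_id", column="status")
print(validity)
print(consistency)
```

The outer join with `indicator=True` separates two different problems: records that exist in only one system, and records that exist in both but disagree.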

However, these quality scores are only the starting point. The real governance lies in continuous tracking and alert response.

Each team's data quality rules differ. Your job is to define key rules and enforce checks with code, not rely on manual inspection.
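
One way to keep team-specific rules out of ad hoc scripts is to declare them as data and let a single function enforce them. This is a minimal sketch, and the rule names, columns, and thresholds are hypothetical.

```python
import pandas as pd

# Hypothetical rule set: each rule has a name, a metric function,
# and the maximum value that still counts as passing.
RULES = [
    {"name": "age_missing_rate", "metric": lambda df: df["age"].isna().mean(), "max": 0.05},
    {"name": "negative_age_count", "metric": lambda df: (df["age"] < 0).sum(), "max": 0},
    {"name": "duplicate_rate", "metric": lambda df: df.duplicated().mean(), "max": 0.01},
]

def enforce_rules(df, rules):
    """Evaluate every rule and return one row per rule with a pass/fail flag."""
    rows = []
    for rule in rules:
        value = float(rule["metric"](df))
        rows.append({
            "rule": rule["name"],
            "value": value,
            "max": rule["max"],
            "passed": value <= rule["max"],
        })
    return pd.DataFrame(rows)

# Made-up sample data: one negative age should fail exactly one rule.
sample = pd.DataFrame({"age": [34, -2, 51, 27], "name": ["a", "b", "c", "d"]})
report = enforce_rules(sample, RULES)
print(report)

failed = report.loc[~report["passed"], "rule"].tolist()
if failed:
    print("Failed rules:", failed)
```

Keeping the rules in a list means a team can add or tighten a rule without touching the enforcement code.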

Data Catalog

As data grows, no one remembers "where is the mission_status table" or "what does the total_resources field mean." A Data Catalog helps you answer these questions.

A basic data catalog records, for each column of each table: its data type, non-null count, number of unique values, a few sample values, a definition, and an owner.

Technical information can be extracted automatically from the data (column names, types, non-null counts), but the definition (the business meaning of the field) and the owner (who maintains it) need to be filled in manually. The catalog is like the book index of the Prophecy Hall: anyone who sees a field can immediately look up its meaning.

python
# Build data catalog structure
catalog_entries = []

for table_name, df in {"missions": df_missions, "logs": df_logs}.items():
    for col in df.columns:
        entry = {
            "table": table_name,
            "column": col,
            "dtype": str(df[col].dtype),
            "non_null_count": df[col].notna().sum(),
            "unique_values": df[col].nunique(),
            "sample_values": df[col].dropna().unique()[:3].tolist(),
            "definition": "",  # manual entry
            "owner": "",       # manual entry
        }
        catalog_entries.append(entry)

catalog = pd.DataFrame(catalog_entries)
catalog.to_csv("data_catalog.csv", index=False)

The most critical information in the catalog is definition (field meaning) and owner (who maintains it). Technical information can be auto-extracted from data; semantic information needs human maintenance.
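
Because the auto-extracted part can be regenerated on every run, the hand-written fields are easier to preserve if they live in a separate file and are merged back in. This is a sketch under that assumption; the manually maintained table and all of its values are hypothetical.

```python
import pandas as pd

# Auto-extracted part of the catalog (regenerated on every run); values are made up.
auto = pd.DataFrame({
    "table": ["missions", "missions", "logs"],
    "column": ["mission_status", "total_resources", "load_timestamp"],
    "dtype": ["object", "float64", "datetime64[ns]"],
})

# Manually maintained part; in practice this might be a reviewed CSV file.
manual = pd.DataFrame({
    "table": ["missions"],
    "column": ["mission_status"],
    "definition": ["Current state of the mission"],
    "owner": ["ops-team"],
})

# A left join keeps every extracted column, even when nobody has documented it yet.
catalog = auto.merge(manual, on=["table", "column"], how="left")

# Columns still missing a definition or an owner form the maintainers' to-do list.
undocumented = catalog[catalog["definition"].isna() | catalog["owner"].isna()]
print(undocumented[["table", "column"]])
```

Regenerating `auto` never overwrites the manual entries, and the `undocumented` list shows where the catalog is starting to go stale.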

Data Monitoring

Quality checks and catalogs are static views. But data changes continuously—you need to monitor whether newly loaded data suddenly degrades.

python
# Monitoring baseline — using row count as example
import numpy as np
from collections import deque

history = deque(maxlen=30)  # last 30 days' row counts; keep one deque per table

def monitor_row_count(table_name, current_count, history):
    """Return True (and print an alert) if the row count deviates from the baseline by more than 3 standard deviations."""
    if len(history) < 7:
        history.append(current_count)
        return False  # baseline establishment period, no alert

    mean_ = np.mean(history)
    std_ = np.std(history)

    if abs(current_count - mean_) > 3 * std_:
        print(f"ALERT: {table_name} row count anomaly!")
        print(f"  baseline: {mean_:.0f} +/- {3*std_:.0f}")
        print(f"  current:  {current_count}")
        return True  # anomalous counts are not recorded, so they cannot skew the baseline

    history.append(current_count)
    return False

This is the most basic version of monitoring: row count fluctuations. In practice, you also need to monitor missing rates, distribution changes, and freshness (how long since data was last updated).
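
The same baseline idea extends to other metrics. Below is a sketch of a freshness check and a missing-rate comparison. The function names, the 6-hour limit, and the 0.05 tolerance are illustrative assumptions, not recommended values.

```python
from datetime import datetime, timedelta

import pandas as pd

def check_freshness(df, ts_column, max_lag, now=None):
    """Return True if the newest record is older than max_lag (or if there is none)."""
    now = now or datetime.now()
    newest = df[ts_column].max()
    if pd.isna(newest):
        return True  # an empty table is treated as stale
    return (now - newest) > max_lag

def check_missing_rate_drift(df, baseline_rates, tolerance=0.05):
    """Return columns whose missing rate rose more than `tolerance` above the baseline."""
    current = df.isna().mean()
    return [col for col, base in baseline_rates.items()
            if col in current.index and current[col] - base > tolerance]

# Made-up batch: the newest record is 8 hours old and half of `status` is missing.
now = datetime(2024, 1, 1, 12, 0)
batch = pd.DataFrame({
    "load_timestamp": pd.to_datetime(["2024-01-01 03:00", "2024-01-01 04:00"]),
    "status": ["open", None],
})

stale = check_freshness(batch, "load_timestamp", timedelta(hours=6), now=now)
drifted = check_missing_rate_drift(batch, {"status": 0.02, "load_timestamp": 0.0})
print("stale:", stale)
print("drifted columns:", drifted)
```

Passing `now` explicitly makes the freshness check reproducible in tests. In production it would default to the current time.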

SLAs and Service Levels

The core output of data governance is an SLA record for each table:

  • Data availability time: data should be available by 8:00 AM daily
  • Minimum quality requirements: missing rate < 5%, duplicate rate < 1%
  • Maximum lag: data delay should not exceed 6 hours
  • Alert response time: P0 issues responded to within 1 hour
python
# Record SLAs
sla_dashboard = pd.DataFrame({
    "table": ["missions", "logs", "reports"],
    "availability_time": ["08:00", "07:00", "09:00"],
    "max_missing_rate": [0.05, 0.03, 0.01],
    "max_duplicate_rate": [0.01, 0.01, 0.00],
    "max_lag_hours": [6, 4, 12],
    "p0_response_minutes": [60, 60, 120],
})
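
An SLA table becomes useful once measured values are compared against it. This is a sketch only: the `measured` table of latest metrics is hypothetical, and the SLA columns are a subset of the ones above.

```python
import pandas as pd

sla = pd.DataFrame({
    "table": ["missions", "logs"],
    "max_missing_rate": [0.05, 0.03],
    "max_duplicate_rate": [0.01, 0.01],
    "max_lag_hours": [6, 4],
})

# Hypothetical measurements from the latest run.
measured = pd.DataFrame({
    "table": ["missions", "logs"],
    "missing_rate": [0.02, 0.04],
    "duplicate_rate": [0.00, 0.00],
    "lag_hours": [7.5, 1.0],
})

# A left join keeps every table that has an SLA; a table with no measurement
# gets NaN, which compares as False below and therefore counts as a breach.
status = sla.merge(measured, on="table", how="left")
status["missing_ok"] = status["missing_rate"] <= status["max_missing_rate"]
status["duplicate_ok"] = status["duplicate_rate"] <= status["max_duplicate_rate"]
status["lag_ok"] = status["lag_hours"] <= status["max_lag_hours"]
status["sla_met"] = status[["missing_ok", "duplicate_ok", "lag_ok"]].all(axis=1)

print(status[["table", "missing_ok", "duplicate_ok", "lag_ok", "sla_met"]])
```

In this made-up run, `missions` breaches its lag limit and `logs` breaches its missing-rate limit, so neither table meets its SLA.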

Common Pitfalls

  • One-time cleanup without ongoing monitoring. Data quality degradation is continuous—you fix it today, it may break again tomorrow.
  • Quality rules too strict. Rejecting 100% of a batch is worse than accepting data that is 90% clean. Accept first, flag anomalies, then gradually raise standards.
  • Catalog built and then abandoned. The catalog should be continuously updated—with both automated imports and manual maintenance channels.
  • Too many alerts become noise. Every small fluctuation triggers an email, and within a week no one reads the alerts. Layer them: P0 phone call, P1 email, P2 dashboard.
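
The layering in the last pitfall can be expressed as a small routing table. This is a sketch: the severity thresholds and channel names are hypothetical, and the notifier only prints instead of calling a real paging or email service.

```python
# Hypothetical mapping from severity to notification channel.
CHANNELS = {"P0": "phone", "P1": "email", "P2": "dashboard"}

def classify(deviation_in_std, table_is_empty):
    """Assign a severity from how far a metric strayed from its baseline."""
    if table_is_empty or deviation_in_std > 6:
        return "P0"
    if deviation_in_std > 3:
        return "P1"
    return "P2"

def route_alert(table_name, deviation_in_std, table_is_empty=False):
    """Pick the channel for an alert and return a (severity, channel) pair."""
    severity = classify(deviation_in_std, table_is_empty)
    channel = CHANNELS[severity]
    print(f"[{severity}] {table_name}: notify via {channel}")
    return severity, channel

# Made-up examples: an empty table, a moderate deviation, a small fluctuation.
routed = [
    route_alert("missions", 0.0, table_is_empty=True),
    route_alert("logs", 4.2),
    route_alert("reports", 1.1),
]
```

Small fluctuations land on the dashboard only, which leaves the phone and email channels for alerts that need a person to act.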

Pass Challenges

  • Warm-up: Pick a dataset you commonly use and list its scores on all six data quality dimensions.
  • Challenge: Build a complete quality monitoring system for a set of data pipelines: define check rules → establish baselines → run daily checks → alert notifications.
  • Troubleshooting: A pipeline keeps triggering the "too few rows" alert. How do you determine whether the data is genuinely insufficient or the code is wrong?

Acceptance Criteria

  • Can define and apply the six dimensions of data quality
  • Can set up a basic data catalog
  • Can build simple data quality monitoring
  • Knows how to layer alert thresholds

Traveler's Notes

Data quality is not a one-time project. You need continuous checking, continuous improvement—like testing code.


Next Chapter Preview

After knowing whether data has problems, the next question is: where does the data come from, what transformations does it go through, and which downstream systems does it affect? Next chapter, Data Lineage & Metadata.

Built with VitePress | Software Systems Atlas