Software Systems Atlas

Metadata Card

Prerequisites: Programming basics (Vol 1)
Estimated time: 40 minutes
Core difficulty: Beginner
Reading mode: Casual stroll
Completion: Able to describe the complete chain from data creation to destruction, and identify the key engineering decisions at each stage

Refining gold from field report data. I see the answers to the entire journey hidden in the data.

Your Progress

You emerge from the Spell Tower, walk through a long underground passage, and step into a vast hall—the Data Prophecy Hall.

The hall's walls are covered with glowing crystal screens, each displaying different numbers: sensor readings from all corners of the battlefield, communication traffic on the magic relay routes, query logs from the data fortresses, build status from the Artisan City—every corner of the entire system constantly generates data.

As you watch this flowing information, a question occurs to you: data arrives, but then what?

Your Task

You have a batch of raw data on hand—from system logs, sensor readings, user operation records. They're scattered across various places, in different formats, and some are even corrupted. You want to extract insights from them, but don't know where to start. This chapter helps you build a global perspective: what stages data goes through, what you should and shouldn't do at each stage.

The Data Journey

Data doesn't appear out of thin air, nor does it clean itself. From creation to serving decisions, it follows a clear chain. Understanding this chain allows you to locate the right stage when you encounter specific problems.

Typical stages of a data lifecycle:

Acquisition — Where data comes from, how it enters
Transfer — From source to storage system
Storage — How and for how long to store
Processing — Cleaning, transforming, aggregating
Analysis — Exploration, modeling, visualization
Archival — Cold storage, long-term retention
Destruction — Compliant deletion, secure erasure

You don't need to follow each step linearly. In real projects, data might be analyzed right after acquisition, or repeatedly processed after storage. But this map helps you answer a key question: Which stage am I at now, and what should I do next?

Stage One: Acquisition

This is the origin of data. The method of acquisition determines the quality of data you can obtain.

The three most common acquisition patterns:

Batch Import: Periodically pull data from external systems, e.g., exporting the previous day's task data from MySQL to CSV at midnight, then loading it into the analysis environment with Python. High latency, but high throughput.
Streaming: Data arrives one record at a time, e.g., user click behavior sent in real-time through Kafka. Low latency, but requires additional infrastructure.
Log Ingestion: Collecting server logs with tools such as tail, syslog, or Fluentd. This is the most common "passive" collection method.

Whether exporting CSV from MySQL or accessing a real-time stream through Kafka, you need to land it in Python. Below is your first moment of contact with this data—pd.read_csv is your key to the Prophecy Hall.

python

# Batch acquisition — loading from CSV
import pandas as pd

# You received a batch of daily snapshots from an external system
df = pd.read_csv("missions_snapshot_2026-06-24.csv")
df.info()  # info() prints its summary directly and returns None, so no print() is needed

The output shows column names, non-null counts, and data types. This is your first contact with the data and your starting point for judging whether it's "dirty."
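Batch loading reads the whole file at once. For contrast, the record-at-a-time pattern behind streaming and log ingestion can be sketched with a plain Python generator. This is a minimal sketch, not a description of how Kafka or Fluentd work internally; the "timestamp level message" line format and the helper names `parse_line` and `ingest` are assumptions made for illustration.

```python
from typing import Iterable, Iterator, Optional


def parse_line(line: str) -> Optional[dict]:
    """Parse one 'timestamp level message' line; return None if malformed."""
    parts = line.strip().split(" ", 2)
    if len(parts) != 3:
        return None
    timestamp, level, message = parts
    return {"timestamp": timestamp, "level": level, "message": message}


def ingest(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed records one at a time, skipping malformed lines."""
    for line in lines:
        record = parse_line(line)
        if record is not None:
            yield record


# Hypothetical sample lines; `lines` could equally be an open file handle.
sample = [
    "2026-06-24T00:00:01 INFO mission started",
    "corrupted-line",
    "2026-06-24T00:00:05 ERROR sensor timeout",
]
records = list(ingest(sample))
print(len(records), "records parsed,", len(sample) - len(records), "skipped")
```

Because `ingest` is a generator, each record is handled as soon as it arrives and nothing forces the whole input into memory, which is the essential difference from the batch load.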

Stage Two: Transfer

The path data takes from source to target system. Key constraints:

Bandwidth limitations: the network is not infinitely fast
Packet loss and retries: records may be dropped, duplicated, or corrupted during transfer
Ordering guarantees: the arrival order may differ from the sending order

The most common problem in the transfer stage is "data loss." You sent 1 million records at the source but only received 990,000 at the target. Where did those 10,000 records go? Debugging this is extremely painful. A rule of thumb: perform count validation at both endpoints of the transfer.
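The count validation rule can be sketched as follows. This is a minimal sketch under stated assumptions: records are plain strings, both endpoints can run the same `summarize` function, and the names `summarize` and `validate_transfer` are hypothetical. Besides the row count, it adds an order-independent digest (a sum of per-record SHA-256 hashes), so reordered records still match while lost or altered ones do not.

```python
import hashlib

MODULUS = 2 ** 256  # keep the combined digest at a fixed size


def summarize(records):
    """Return (row count, order-independent digest) for a batch of records."""
    count = 0
    digest = 0
    for record in records:
        row_hash = hashlib.sha256(record.encode("utf-8")).digest()
        # Addition is commutative, so arrival order does not change the result.
        digest = (digest + int.from_bytes(row_hash, "big")) % MODULUS
        count += 1
    return count, digest


def validate_transfer(sent, received):
    """Compare the summaries computed at the source and at the target."""
    sent_count, sent_digest = summarize(sent)
    recv_count, recv_digest = summarize(received)
    return {
        "count_match": sent_count == recv_count,
        "content_match": sent_digest == recv_digest,
        "missing": sent_count - recv_count,
    }


# Hypothetical sample batches
source = ["m1,scout,ok", "m2,relay,ok", "m3,scout,failed"]
reordered = ["m3,scout,failed", "m1,scout,ok", "m2,relay,ok"]
lossy = ["m1,scout,ok", "m3,scout,failed"]

ok_report = validate_transfer(source, reordered)
bad_report = validate_transfer(source, lossy)
print(ok_report)
print(bad_report)
```

The check says that something went missing, not which record; finding the missing rows still requires comparing keys on both sides.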

Stage Three: Storage

Where data is placed once it arrives. You choose different storage systems based on the usage scenario:

Storage Type	Suitable For	Typical Tools
Relational Database	Structured, requires transactions	PostgreSQL
Data Warehouse	Analytical queries, large table aggregation	ClickHouse, Snowflake
Object Storage	Files, backups, raw data	S3, MinIO
Columnar Storage	Column scans, aggregations	Parquet
Cache	High-frequency reads	Redis

Your selection principle is simple: query patterns determine storage format. If you frequently aggregate by day, don't use row-based storage; if you randomly query individual records, don't use columnar storage.
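To see why the query pattern matters, here is a toy in-memory comparison of the two layouts. It is only a sketch: real row stores and columnar formats add indexes, compression, and on-disk layout, and the field names and function names below are made up for illustration.

```python
from collections import defaultdict

# Row-oriented layout: one complete record after another.
rows = [
    {"day": "2026-06-22", "mission": "scout", "bytes": 120},
    {"day": "2026-06-22", "mission": "relay", "bytes": 300},
    {"day": "2026-06-23", "mission": "scout", "bytes": 80},
]

# Column-oriented layout: one list per field, aligned by position.
columns = {
    "day": [r["day"] for r in rows],
    "mission": [r["mission"] for r in rows],
    "bytes": [r["bytes"] for r in rows],
}


def total_by_day_rows(rows):
    """Aggregation over rows: every record is loaded in full."""
    totals = defaultdict(int)
    for record in rows:
        totals[record["day"]] += record["bytes"]
    return dict(totals)


def total_by_day_columns(columns):
    """Aggregation over columns: only 'day' and 'bytes' are read."""
    totals = defaultdict(int)
    for day, size in zip(columns["day"], columns["bytes"]):
        totals[day] += size
    return dict(totals)


def lookup_columns(columns, index):
    """Point lookup in the columnar layout: one read per column."""
    return {name: values[index] for name, values in columns.items()}


print(total_by_day_columns(columns))
print(lookup_columns(columns, 1))
```

Both layouts give the same answers. The difference is how much data each query has to touch: the columnar aggregation skips the `mission` column entirely, while the columnar point lookup has to visit every column to rebuild one record that the row layout already holds in one place.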

Stage Four: Processing

This is where you'll spend the most time. Processing includes:

Cleaning—filling missing values, deduplication, fixing formats
Transformation—field splitting, type conversion, derived fields
Aggregation—summarizing by time/category
Feature extraction—preparing inputs for modeling

This process typically requires repeated iteration. You find one cleaning rule isn't enough, so you add another; you discover a field hides outliers, so you fix it again. It's rarely a one-shot effort.
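One pass through cleaning, transformation, and aggregation might look like the sketch below. The column names and sample values are invented for illustration, and filling missing durations with the median is just one possible rule, not a recommendation for every dataset.

```python
import pandas as pd

# Hypothetical raw data: one duplicated row, one missing duration,
# and numbers and timestamps that arrived as strings.
raw = pd.DataFrame({
    "mission_id": [1, 2, 2, 3, 4],
    "started_at": ["2026-06-22 08:00", "2026-06-22 09:30", "2026-06-22 09:30",
                   "2026-06-23 10:15", "2026-06-23 11:00"],
    "mission_type": ["scout", "relay", "relay", "scout", "relay"],
    "duration_s": ["120", "300", "300", None, "90"],
})

# Cleaning: drop exact duplicate rows.
clean = raw.drop_duplicates().copy()

# Transformation: convert strings to proper types.
clean["started_at"] = pd.to_datetime(clean["started_at"])
clean["duration_s"] = pd.to_numeric(clean["duration_s"])

# Cleaning: fill the missing duration with the median (an assumed rule).
clean["duration_s"] = clean["duration_s"].fillna(clean["duration_s"].median())

# Transformation: derive a 'day' field from the timestamp.
clean["day"] = clean["started_at"].dt.strftime("%Y-%m-%d")

# Aggregation: mean duration per day and mission type.
daily = clean.groupby(["day", "mission_type"], as_index=False)["duration_s"].mean()
print(daily)
```

Each step is a single line, which makes it easy to add, remove, or reorder rules as later iterations uncover new problems.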

Stage Five: Analysis

Data has been processed to a usable state. Now you can ask questions:

"What is the most data-intensive mission type?"
"Is there a periodic pattern in the success rate?"
"What is the trend of a certain metric over the past 30 days?"

Analysis can be exploratory (EDA) or confirmatory (hypothesis testing, modeling). At this stage, you use Jupyter Notebook, SQL queries, and visualization tools to interrogate the data repeatedly.
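As an example of the exploratory side, the question about a periodic pattern in the success rate could be approached by grouping results by weekday. The sketch below uses only the standard library and a handful of invented records; with so few rows it demonstrates the mechanics, not a real finding.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical processed records: (day, whether the mission succeeded)
records = [
    (date(2026, 6, 15), True),
    (date(2026, 6, 16), True),
    (date(2026, 6, 20), False),
    (date(2026, 6, 22), True),
    (date(2026, 6, 27), False),
    (date(2026, 6, 27), True),
]

# Count attempts and successes per weekday.
attempts = defaultdict(int)
successes = defaultdict(int)
for day, succeeded in records:
    name = WEEKDAYS[day.weekday()]
    attempts[name] += 1
    if succeeded:
        successes[name] += 1

# Success rate per weekday, only for weekdays that have data.
rates = {name: successes[name] / attempts[name] for name in attempts}
for name in WEEKDAYS:
    if name in rates:
        print(f"{name}: {rates[name]:.0%} of {attempts[name]} missions")
```

A lower rate on one weekday is a hypothesis to test on more data, which is where the confirmatory side of analysis takes over.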

Stage Six: Archival

Not all data needs to be instantly queryable. Historical data from months ago is accessed far less often. Move it to cheaper storage to free up hot storage space.

Archival decisions:

Retention policy: Raw data retained for 30 days, aggregated data retained for 1 year
Storage migration: Move from SSD to HDD or S3 Glacier
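Compression: gzip/zstd compression can often reduce the size of repetitive text data such as logs and CSV by 80% or more; the actual ratio depends on the content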
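The retention and compression decisions can be sketched together. This sketch assumes that "retained for 30 days" means files older than 30 days leave hot storage, uses the file's modification time as its age, and uses gzip from the standard library; `is_expired` and `archive` are hypothetical helper names, and the log content is synthetic.

```python
import gzip
import os
import shutil
import tempfile
import time

RETENTION_DAYS = 30  # assumed policy for raw data


def is_expired(path, now, retention_days=RETENTION_DAYS):
    """True if the file's last modification is older than the retention window."""
    age_seconds = now - os.path.getmtime(path)
    return age_seconds > retention_days * 24 * 3600


def archive(path):
    """Compress the file with gzip, remove the original, return the new path."""
    archived = path + ".gz"
    with open(path, "rb") as src, gzip.open(archived, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return archived


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "raw.log")
    with open(path, "w", encoding="utf-8") as f:
        for i in range(10_000):
            f.write(f"2026-06-24T00:00:00 INFO sensor={i % 10} status=ok\n")
    original_size = os.path.getsize(path)

    # Pretend the file was last written 31 days ago.
    old = time.time() - 31 * 24 * 3600
    os.utime(path, (old, old))

    expired = is_expired(path, time.time())
    archived = archive(path)
    compressed_size = os.path.getsize(archived)
    saved = 1 - compressed_size / original_size
    print(f"expired={expired}, space saved: {saved:.0%}")
```

Synthetic log lines like these are far more repetitive than most real data, so treat the printed ratio as an upper-end illustration and measure on your own files before planning capacity.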

Stage Seven: Destruction

Data is not kept forever. Compliance requirements (like GDPR) can require that certain data be deleted once it is no longer needed or when its retention period ends. Destruction isn't just "deleting files": filesystem deletion only marks the storage blocks as free for reuse, so the data can still be recovered until those blocks are overwritten. True destruction requires:

Overwriting (with random data; less reliable on SSDs, where wear leveling can leave old copies behind)
Cryptographic erasure (delete the key, data becomes unrecoverable)
Physical destruction (shredding hard drives)
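The overwriting approach can be sketched as follows. This is an illustration of the idea only, and `overwrite_and_delete` is a hypothetical helper: on SSDs, copy-on-write or journaling filesystems, and anything with snapshots or backups, overwriting in place does not guarantee the old bytes are gone, so real compliance work should rely on vetted tools or cryptographic erasure.

```python
import os
import tempfile


def overwrite_and_delete(path, passes=1):
    """Overwrite a file's contents with random bytes, then remove it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            remaining = size
            while remaining > 0:
                chunk = min(remaining, 1024 * 1024)  # write at most 1 MiB at a time
                f.write(os.urandom(chunk))
                remaining -= chunk
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the writes to the device
    os.remove(path)


# Demonstration on a throwaway temporary file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"secret mission data")

overwrite_and_delete(path)
print("still exists:", os.path.exists(path))
```

Compared with a plain rm, the difference is the overwrite step before the unlink: the blocks that get marked as free no longer hold the original content, at least on storage that writes in place.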

Not Every Time Goes Through the Full Process

In real projects, you'll encounter various simplified paths:

One-time analysis tasks: jump from acquisition directly to analysis, no archiving needed
Continuous monitoring pipelines: discard raw data after processing
Debugging scenarios: only pull a few rows from storage for inspection

This lifecycle model is a map, not a mandatory route. Knowing where each path leads helps you make choices at the fork.

Common Pitfalls

Skipping transfer validation. 1000 rows at source, 999 at destination—you think it doesn't matter, but that missing row could be a critical anomaly.
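Choosing the wrong storage format. Storing column-scan data as JSON can make analytical queries many times slower than a columnar format such as Parquet, because every query has to parse whole records to read a few fields.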
Archiving too early. You just archived a batch of data and need to query it the next day—you'll regret the 20-minute decompression wait.
Never planning for destruction. The data has been stored for five years and is now non-compliant, and you discover you have no secure deletion tools.

Pass Challenges

Warm-up: List the data you've come across in the past week. Which lifecycle stages did they go through?
Challenge: Take a real dataset and track its full journey from raw CSV to analysis report. Record how long each stage takes.
Observation: In an existing data processing pipeline, identify missing transfer validation points.

Acceptance Criteria

Can draw a complete chain from data acquisition to destruction
Can select a suitable storage system for a given scenario
Knows why data destruction cannot simply use rm

Traveler's Notes

The data lifecycle is a map. It doesn't tell you every step of the way, but it tells you where you are and what lies ahead.

Next Chapter Preview

With the global map in hand, the next chapter gets down to the toughest part of working with real data—cleaning.