
Metadata Card

  • Prerequisites: Programming basics (Vol 1)
  • Estimated time: 40 minutes
  • Core difficulty: Beginner
  • Reading mode: Casual stroll
  • Completion: Able to describe the complete chain from data creation to destruction, and identify the key engineering decisions at each stage

Refining gold from field report data. I see the answers to the entire journey hidden in the data.

Your Progress

You emerge from the Spell Tower, walk through a long underground passage, and step into a vast hall—the Data Prophecy Hall.

The hall's walls are covered with glowing crystal screens, each displaying different numbers: sensor readings from all corners of the battlefield, communication traffic on the magic relay routes, query logs from the data fortresses, build status from the Artisan City—every corner of the entire system constantly generates data.

As you watch this flowing information, a question occurs to you: data arrives, but then what?

Your Task

You have a batch of raw data on hand—from system logs, sensor readings, user operation records. They're scattered across various places, in different formats, and some are even corrupted. You want to extract insights from them, but don't know where to start. This chapter helps you build a global perspective: what stages data goes through, what you should and shouldn't do at each stage.


The Data Journey

Data doesn't appear out of thin air, nor does it clean itself. From creation to serving decisions, it follows a clear chain. Understanding this chain allows you to locate the right stage when you encounter specific problems.

Typical stages of a data lifecycle:

  1. Acquisition — Where data comes from, how it enters
  2. Transfer — From source to storage system
  3. Storage — How and for how long to store
  4. Processing — Cleaning, transforming, aggregating
  5. Analysis — Exploration, modeling, visualization
  6. Archival — Cold storage, long-term retention
  7. Destruction — Compliant deletion, secure erasure

You don't need to follow each step linearly. In real projects, data might be analyzed right after acquisition, or repeatedly processed after storage. But this map helps you answer a key question: Which stage am I at now, and what should I do next?


Stage One: Acquisition

This is the origin of data. The method of acquisition determines the quality of data you can obtain.

The three most common acquisition patterns:

  • Batch Import: Periodically pull data from external systems, e.g., exporting the previous day's task data from MySQL to CSV at midnight, then loading it into the analysis environment with Python. High latency, but high throughput.
  • Streaming: Data arrives one record at a time, e.g., user click behavior sent in real-time through Kafka. Low latency, but requires additional infrastructure.
  • Log Ingestion: Collecting server logs with tools such as tail, syslog, or Fluentd. This is the most common "passive" collection method.
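To make the difference between batch and record-at-a-time handling concrete, here is a minimal sketch of log ingestion using only the standard library. The log format (timestamp, level, message separated by spaces) and the function name `parse_log_lines` are assumptions made for illustration, not a format used by any particular tool.

```python
import io

def parse_log_lines(stream):
    """Yield one dict per well-formed line; skip lines that do not parse."""
    for line in stream:
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) != 3:
            continue  # corrupted or truncated line
        timestamp, level, message = parts
        yield {"timestamp": timestamp, "level": level, "message": message}

# Hypothetical log content; a real pipeline would read from a file or socket.
log_stream = io.StringIO(
    "2026-06-24T12:00:01 INFO relay started\n"
    "corrupted-line\n"
    "2026-06-24T12:00:02 ERROR relay timeout\n"
)

seen = 0
error_count = 0
for record in parse_log_lines(log_stream):
    # Records are handled one at a time; nothing forces the whole log into memory.
    seen += 1
    if record["level"] == "ERROR":
        error_count += 1

print(f"parsed={seen} errors={error_count}")
```

Because the parser is a generator, the same loop works whether the source is a small file or an endless stream.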

Whether exporting CSV from MySQL or accessing a real-time stream through Kafka, you need to land it in Python. Below is your first moment of contact with this data—pd.read_csv is your key to the Prophecy Hall.

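```python
# Batch acquisition: loading from CSV
import pandas as pd

# You received a batch of daily snapshots from an external system
df = pd.read_csv("missions_snapshot_2026-06-24.csv")

# info() prints its summary directly and returns None,
# so it does not need to be wrapped in print()
df.info()
```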

The output shows column names, non-null counts, and data types. This is your first contact with the data and your starting point for judging whether it's "dirty."
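A quick way to turn that first impression into numbers is to count missing values and duplicated rows. The sketch below uses a small in-memory CSV so it runs on its own; the column names (`mission_id`, `status`, `duration_s`) are assumptions for illustration and may not match your snapshot file.

```python
import io
import pandas as pd

# Hypothetical sample standing in for a daily snapshot file.
raw_csv = io.StringIO(
    "mission_id,status,duration_s\n"
    "1,success,120\n"
    "2,failed,\n"
    "3,success,95\n"
    "3,success,95\n"
    "4,,80\n"
)
df = pd.read_csv(raw_csv)

# Missing values per column: a non-zero count flags a column that needs cleaning.
missing_per_column = df.isna().sum()

# Fully duplicated rows: often a sign of a repeated export or a retried transfer.
duplicate_rows = int(df.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

These two checks do not prove the data is clean, but they tell you where to look first.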

Stage Two: Transfer

The path data takes from source to target system. Key constraints:

  • Bandwidth limitations—the network is not infinitely fast
  • Packet loss and retries: records may be dropped in transit, or delivered twice after a retry
  • Ordering guarantees—the arrival order may differ from the sending order

The most common problem in the transfer stage is "data loss." You sent 1 million records at the source but only received 990,000 at the target. Where did those 10,000 records go? Debugging this is extremely painful. A rule of thumb: perform count validation at both endpoints of the transfer.
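Count validation can be sketched as a function that is run on both sides of the transfer and compared. The version below also adds an order-independent digest, so that reordering does not raise a false alarm while a changed record still does. The function name `fingerprint` and the sample records are hypothetical.

```python
import hashlib

def fingerprint(records):
    """Return (count, digest) for an iterable of string records.

    The digest sums per-record SHA-256 values modulo 2**256, so it does
    not depend on the order in which records arrive.
    """
    count = 0
    combined = 0
    for record in records:
        digest = hashlib.sha256(record.encode("utf-8")).digest()
        combined = (combined + int.from_bytes(digest, "big")) % (2 ** 256)
        count += 1
    return count, combined

# Hypothetical records at the source and two possible outcomes at the target.
source = ["id=1,ok", "id=2,ok", "id=3,failed"]
received_reordered = ["id=3,failed", "id=1,ok", "id=2,ok"]
received_lossy = ["id=1,ok", "id=2,ok"]

source_fp = fingerprint(source)
matches_reordered = fingerprint(received_reordered) == source_fp
matches_lossy = fingerprint(received_lossy) == source_fp

print(matches_reordered, matches_lossy)
```

A mismatch only says that something went wrong, not which record is affected; narrowing that down usually means repeating the comparison per partition or per time window.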

Stage Three: Storage

Where data is placed once it arrives. You choose different storage systems based on the usage scenario:

| Storage Type | Suitable For | Typical Tools |
| --- | --- | --- |
| Relational Database | Structured, requires transactions | PostgreSQL |
| Data Warehouse | Analytical queries, large table aggregation | ClickHouse, Snowflake |
| Object Storage | Files, backups, raw data | S3, MinIO |
| Columnar Storage | Column scans, aggregations | Parquet |
| Cache | High-frequency reads | Redis |
Your selection principle is simple: query patterns determine storage format. If you mostly aggregate over a few columns (daily totals, for example), prefer columnar storage; if you mostly look up individual records, prefer row-based storage.
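The row versus column trade-off can be illustrated with plain Python structures. This is a toy model, not how any real storage engine is implemented, and the field names and helper functions are assumptions for illustration.

```python
from collections import defaultdict

# Row-oriented layout: all fields of one record are stored together.
rows = [
    {"mission_id": 1, "day": "2026-06-23", "bytes_sent": 500},
    {"mission_id": 2, "day": "2026-06-23", "bytes_sent": 700},
    {"mission_id": 3, "day": "2026-06-24", "bytes_sent": 300},
]

# Column-oriented layout: all values of one field are stored together.
columns = {name: [r[name] for r in rows] for name in rows[0]}

def find_mission(records, mission_id):
    """Point lookup: the row layout returns the whole record in one step."""
    for record in records:
        if record["mission_id"] == mission_id:
            return record
    return None

def bytes_per_day(cols):
    """Aggregation: the column layout touches only the two fields it needs."""
    totals = defaultdict(int)
    for day, sent in zip(cols["day"], cols["bytes_sent"]):
        totals[day] += sent
    return dict(totals)

mission = find_mission(rows, 2)
daily_totals = bytes_per_day(columns)

print(mission)
print(daily_totals)
```

With three fields the difference is invisible, but with hundreds of columns the aggregation reads a small fraction of the data in the column layout, while a point lookup there has to gather one value from every column.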

Stage Four: Processing

This is where you'll spend the most time. Processing includes:

  • Cleaning—filling missing values, deduplication, fixing formats
  • Transformation—field splitting, type conversion, derived fields
  • Aggregation—summarizing by time/category
  • Feature extraction—preparing inputs for modeling

This process typically requires repeated iteration. You find one cleaning rule isn't enough, so you add another; you discover a field hides outliers, so you fix it again. It's rarely a one-shot effort.
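One pass of cleaning, transformation, and aggregation might look like the sketch below. The dataset is made up, and the rules (lowercase the status, label missing status as "unknown") are assumptions chosen for illustration; your own data will dictate different rules.

```python
import pandas as pd

# Hypothetical raw records with a duplicate row, mixed case, a missing
# value, and numbers stored as text.
raw = pd.DataFrame({
    "mission_id": [1, 2, 2, 3, 4],
    "started_at": ["2026-06-23 08:00", "2026-06-23 09:30", "2026-06-23 09:30",
                   "2026-06-24 10:15", "2026-06-24 11:00"],
    "status": ["success", "FAILED", "FAILED", "success", None],
    "duration_s": ["120", "45", "45", "95", "80"],
})

# Cleaning: drop exact duplicates, normalize case, label missing status.
clean = raw.drop_duplicates().copy()
clean["status"] = clean["status"].str.lower().fillna("unknown")

# Transformation: convert types and derive a calendar-day field.
clean["started_at"] = pd.to_datetime(clean["started_at"])
clean["duration_s"] = pd.to_numeric(clean["duration_s"])
clean["day"] = clean["started_at"].dt.date

# Aggregation: summarize duration per day.
daily = clean.groupby("day")["duration_s"].agg(["count", "mean"])

print(daily)
```

Keeping each rule on its own line makes it easier to add or revise one when the next iteration reveals another problem.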

Stage Five: Analysis

Data has been processed to a usable state. Now you can ask questions:

  • "What is the most data-intensive mission type?"
  • "Is there a periodic pattern in the success rate?"
  • "What is the trend of a certain metric over the past 30 days?"

Analysis can be exploratory (EDA) or confirmatory (hypothesis testing, modeling). At this stage, you use Jupyter Notebook, SQL queries, and visualization tools to interrogate the data repeatedly.
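As an exploratory example, the first two questions above can be phrased as pandas group-by operations. The table and its columns (`mission_type`, `bytes_sent`, `success`) are hypothetical.

```python
import pandas as pd

# Hypothetical processed data: one row per mission.
missions = pd.DataFrame({
    "day": pd.to_datetime([
        "2026-06-01", "2026-06-01", "2026-06-02",
        "2026-06-02", "2026-06-03", "2026-06-03",
    ]),
    "mission_type": ["scout", "relay", "scout", "relay", "scout", "relay"],
    "bytes_sent": [100, 900, 150, 800, 120, 1000],
    "success": [True, True, False, True, True, False],
})

# Which mission type moves the most data?
bytes_by_type = missions.groupby("mission_type")["bytes_sent"].sum()
heaviest_type = bytes_by_type.idxmax()

# How does the success rate change from day to day?
# The mean of a boolean column is the fraction of True values.
daily_success = missions.groupby("day")["success"].mean()

print(heaviest_type)
print(daily_success)
```

Three days are far too few to claim a periodic pattern; the point is only the shape of the query, which stays the same on a longer history.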

Stage Six: Archival

Not all data needs to be instantly queryable. Historical data from months ago has sharply declining access frequency. Move it to cheaper storage to free up hot storage space.

Archival decisions:

  • Retention policy: for example, keep raw data for 30 days and aggregated data for 1 year
  • Storage migration: Move from SSD to HDD or S3 Glacier
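  • Compression: gzip/zstd can shrink repetitive text such as logs or CSV by 80% or more; the ratio depends heavily on the data, and already-compressed formats gain little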
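Two of these decisions are easy to check in code: whether a partition has aged out of the retention window, and how much a given kind of data actually compresses. The sketch below uses only the standard library; the 30-day window mirrors the example above, and the log lines are synthetic, so the measured ratio says nothing about your own data.

```python
import gzip
from datetime import date, timedelta

RAW_RETENTION = timedelta(days=30)  # assumed policy, taken from the example above

def is_due_for_archival(partition_day, today, retention=RAW_RETENTION):
    """True when a daily partition is older than the retention window."""
    return today - partition_day > retention

today = date(2026, 6, 24)
due_old = is_due_for_archival(date(2026, 5, 1), today)      # 54 days old
due_recent = is_due_for_archival(date(2026, 6, 20), today)  # 4 days old

# Synthetic, highly repetitive log text; real logs will compress differently.
log_text = "".join(
    f"2026-06-24 12:00:{i % 60:02d} INFO relay-7 heartbeat ok seq={i}\n"
    for i in range(5000)
).encode("utf-8")

compressed = gzip.compress(log_text)
ratio = len(compressed) / len(log_text)

print(due_old, due_recent)
print(f"{len(log_text)} -> {len(compressed)} bytes ({ratio:.1%} of original)")
```

Measuring on a sample of your own data before committing to a format is cheap and avoids relying on a rule-of-thumb percentage.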

Stage Seven: Destruction

Data is not kept forever. Compliance requirements (like GDPR) require that personal data be deleted once it is no longer needed for its purpose, or when the person it describes validly requests erasure. Destruction isn't just "deleting files": filesystem deletion typically only marks the file's blocks as free, so the data can often still be recovered. True destruction requires:

  • Overwriting (with random data)
  • Cryptographic erasure (delete the key, data becomes unrecoverable)
  • Physical destruction (shredding hard drives)
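The overwriting approach can be sketched with the standard library as follows. Treat it as an illustration of the idea rather than a compliant erasure tool: on SSDs (wear leveling), on copy-on-write or journaling filesystems, and wherever backups or replicas exist, old copies of the data may survive an in-place overwrite. The function name `overwrite_and_delete` is hypothetical.

```python
import os
import tempfile

CHUNK_SIZE = 1024 * 1024  # overwrite in 1 MiB pieces to bound memory use

def overwrite_and_delete(path, passes=1):
    """Overwrite a file in place with random bytes, then remove it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            remaining = size
            while remaining > 0:
                chunk = min(remaining, CHUNK_SIZE)
                f.write(os.urandom(chunk))
                remaining -= chunk
            # Ask the OS to push the overwritten bytes to the device.
            f.flush()
            os.fsync(f.fileno())
    os.remove(path)

# Usage on a throwaway temporary file.
fd, tmp_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"secret sensor readings")

overwrite_and_delete(tmp_path)
still_exists = os.path.exists(tmp_path)

print("file still exists:", still_exists)
```

Because of those caveats, cryptographic erasure is often the more practical option for data that was encrypted from the start: destroying the key is a small, verifiable operation.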

Not Every Time Goes Through the Full Process

In real projects, you'll encounter various simplified paths:

  • One-time analysis tasks: jump from acquisition directly to analysis, no archiving needed
  • Continuous monitoring pipelines: discard raw data after processing
  • Debugging scenarios: only pull a few rows from storage for inspection

This lifecycle model is a map, not a mandatory route. Knowing where each path leads helps you make choices at the fork.


Common Pitfalls

  • Skipping transfer validation. 1000 rows at source, 999 at destination—you think it doesn't matter, but that missing row could be a critical anomaly.
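  • Choosing the wrong storage format. Storing column-scan data as JSON forces every query to parse whole records, which can make aggregations an order of magnitude slower than on a columnar format.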
  • Archiving too early. You just archived a batch of data and need to query it the next day—you'll regret the 20-minute decompression wait.
  • Never planning for destruction. Data stored for five years, now non-compliant, and you discover you have no secure deletion tools.

Pass Challenges

  • Warm-up: List the datasets you've come across in the past week. Which lifecycle stages did each one go through?
  • Challenge: Take a real dataset and track its full journey from raw CSV to analysis report. Record how long each stage takes.
  • Observation: In an existing data processing pipeline, identify missing transfer validation points.

Acceptance Criteria

  • Can draw a complete chain from data acquisition to destruction
  • Can select the correct storage medium for a given scenario
  • Knows why data destruction cannot simply use rm

Traveler's Notes

The data lifecycle is a map. It doesn't tell you every step of the way, but it tells you where you are and what lies ahead.


Next Chapter Preview

With the global map in hand, the next chapter gets down to the toughest part of working with real data—cleaning.

Built with VitePress | Software Systems Atlas