Software Systems Atlas

Metadata Card

Prerequisites: Chapter 6 (Microservices), Chapter 7 (Containerization & K8s)
Estimated Time: 45 minutes
Core Difficulty: Intermediate
Reading Mode: High focus
Completion Milestone: Understand Metrics/Tracing/Logging differences and relationships, configure Prometheus metric collection and Grafana dashboards in K8s, understand OpenTelemetry architecture and span propagation

Your Progress

20 microservices running on K8s, deployment standardized. But a new problem: a user reports slow report generation. You start debugging — kubectl logs shows no errors, kubectl top pods shows normal CPU/memory. Then you freeze — you don't know which services the request traversed, or how long each took.

General Lin points at the command map: "You can see where each fortress is, but you don't know what route the signal took, or where it got delayed. You need a global battlefield awareness system."

In distributed systems, this awareness system is called observability.

Your Task

Three pillars: Metrics (aggregated, numerical system state), Tracing (a trace ID plus spans that together reconstruct the complete path of one request), Logging (structured text records with context).

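Metrics: Prometheus (pull model: it scrapes each target's metrics endpoint on an interval and stores the samples in a time-series DB) + Grafana (visualization). Metric types covered here: Counter (only increases, e.g. total requests served), Gauge (goes up and down, e.g. requests in flight), Histogram (distribution of observations across buckets, e.g. request latency).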
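The three metric types can be sketched in plain Python to show how they differ. This is an illustrative model only, not the Prometheus client library: the class names and the `cumulative()` helper are assumptions made for this example. The histogram mirrors the cumulative "less than or equal" bucket convention that Prometheus histograms use.

```python
import bisect
import math


class Counter:
    """Monotonic count, e.g. total requests served. It can only increase."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Point-in-time value that moves both ways, e.g. requests in flight."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


class Histogram:
    """Sorts observations into buckets defined by upper bounds."""

    def __init__(self, bounds):
        # A final +Inf bucket catches everything above the largest bound.
        self.bounds = sorted(bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)

    def observe(self, value):
        # bisect_left returns the index of the first bound >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1

    def cumulative(self):
        """Return (upper_bound, observations <= upper_bound) pairs."""
        out, running = [], 0
        for bound, count in zip(self.bounds, self.counts):
            running += count
            out.append((bound, running))
        return out


latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 2.0):
    latency.observe(seconds)
```

After the four observations above, `latency.cumulative()` returns `[(0.1, 1), (0.5, 3), (1.0, 3), (inf, 4)]`: one request finished within 0.1 s, three within 0.5 s, and one fell into the overflow bucket.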

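Tracing: OpenTelemetry, a vendor-neutral standard (API, SDKs, and the OTLP wire protocol) for generating, collecting, and exporting telemetry data. The trace ID propagates across service boundaries in HTTP headers or gRPC metadata, by default using the W3C Trace Context `traceparent` header.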
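To make header propagation concrete, here is a minimal sketch of writing and reading a W3C Trace Context `traceparent` header (format: version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags, joined by dashes). The function names `inject` and `extract` and the returned dict layout are assumptions for this example. In practice the OpenTelemetry SDK's propagators do this for you. The sketch also assumes header names are already lowercased and accepts only the exact four-field layout.

```python
import re
import secrets

# version-traceid-parentid-flags, all lowercase hex
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_trace_id():
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex characters


def new_span_id():
    return secrets.token_hex(8)  # 8 random bytes -> 16 hex characters


def inject(headers, trace_id, span_id, sampled=True):
    """Write the current span's context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Read the caller's context from incoming headers; None if absent or invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec treats version ff and all-zero IDs as invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


# Service A starts a trace and calls service B.
outgoing = inject({}, new_trace_id(), new_span_id())

# Service B continues the same trace with a new span of its own.
incoming = extract(outgoing)
child_span = {
    "trace_id": incoming["trace_id"],
    "span_id": new_span_id(),
    "parent_span_id": incoming["parent_span_id"],
}
```

The trace ID stays constant across every hop while each service mints its own span ID, which is what lets a tracing backend stitch the spans back into one request path.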

Logging: Structured (JSON format with trace_id, span_id, service, level, message). ELK/Loki for storage and search.
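A structured log line with the fields listed above can be produced with the standard `logging` module alone. This is a sketch under assumptions: `JsonFormatter`, `build_logger`, the service name `report-service`, and the added `timestamp` field are choices made for this example, not a prescribed schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,
            # trace_id / span_id arrive through the `extra` argument, if given.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": record.getMessage(),
        })


def build_logger(service, stream):
    """Return a logger that writes JSON lines for `service` to `stream`."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger("report-service", buffer)
log.info(
    "report generated in %d ms",
    1150,
    extra={
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
        "span_id": "00f067aa0ba902b7",
    },
)
```

Because every line carries `trace_id`, a search in ELK or Loki for that one value returns the log lines from all services that handled the request.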

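Three-pillar integration: OpenTelemetry unifies the API across metrics, traces, and logs. On K8s, the OpenTelemetry Operator can deploy and manage the OTel Collector, whose pipelines run in three stages: collect (receivers), process (processors), export (exporters).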
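The collect, process, export flow can be pictured as a small pipeline. The sketch below is a toy model of that shape, not the Collector's actual implementation or configuration: `BatchProcessor`, `Pipeline`, and the use of plain callables as exporters are all assumptions made for illustration.

```python
class BatchProcessor:
    """Buffer incoming items and release them in fixed-size batches."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []

    def process(self, item):
        """Add one item; return a list holding a full batch, or an empty list."""
        self.buffer.append(item)
        if len(self.buffer) >= self.batch_size:
            return self.flush()
        return []

    def flush(self):
        """Release whatever is buffered, even if the batch is not full."""
        batch, self.buffer = self.buffer, []
        return [batch] if batch else []


class Pipeline:
    """Collect -> process -> export, with fan-out to several exporters."""

    def __init__(self, processor, exporters):
        self.processor = processor
        self.exporters = exporters  # callables that accept one batch

    def receive(self, item):
        for batch in self.processor.process(item):
            self._export(batch)

    def shutdown(self):
        # Drain the buffer so a partial batch is not lost on exit.
        for batch in self.processor.flush():
            self._export(batch)

    def _export(self, batch):
        for export in self.exporters:
            export(batch)


exported = []  # stand-in for a real backend
pipeline = Pipeline(BatchProcessor(batch_size=2), [exported.append])
for span_name in ("gateway", "report-service", "database"):
    pipeline.receive(span_name)
pipeline.shutdown()
```

Separating the stages is what lets one Collector receive data once and fan it out to several backends, for example traces to Jaeger and metrics to Prometheus.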

Jaeger: Visualizes traces, shows service dependencies and latencies per span.

Application: Observability enables (1) daily monitoring — Grafana dashboard shows P99 latency spike, Jaeger pinpoints the slow service, Prometheus reveals resource saturation; (2) fault localization; (3) performance optimization.
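Pinpointing the slow service from a trace amounts to asking where time was spent that was not spent waiting on a downstream call. The sketch below computes that "self time" per span. The span dictionaries, service names, and durations are invented for this example, and the calculation assumes child spans run one after another (with parallel children, simple subtraction would understate the parent's self time).

```python
from collections import defaultdict


def self_times(spans):
    """Self time = a span's duration minus the durations of its direct children."""
    child_total = defaultdict(float)
    for span in spans:
        if span["parent_id"] is not None:
            child_total[span["parent_id"]] += span["duration_ms"]
    return {
        span["span_id"]: span["duration_ms"] - child_total[span["span_id"]]
        for span in spans
    }


def slowest_service(spans):
    """Return (service, total self time in ms) for the largest contributor."""
    own = self_times(spans)
    per_service = defaultdict(float)
    for span in spans:
        per_service[span["service"]] += own[span["span_id"]]
    return max(per_service.items(), key=lambda pair: pair[1])


# Hypothetical trace for one slow report request.
trace = [
    {"span_id": "a", "parent_id": None, "service": "gateway", "duration_ms": 1200},
    {"span_id": "b", "parent_id": "a", "service": "auth", "duration_ms": 40},
    {"span_id": "c", "parent_id": "a", "service": "report-service", "duration_ms": 1150},
    {"span_id": "d", "parent_id": "c", "service": "database", "duration_ms": 900},
]
```

For this trace, `slowest_service(trace)` returns `("database", 900.0)`: the gateway span is the longest at 1200 ms, but only 10 ms of that is its own work. This is the same reading you make visually from a Jaeger trace timeline.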

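Common Pitfalls: Deploying observability only in production. Logs that are unstructured, or so heavily structured that they become hard to read. Recording every trace in a high-throughput system (use head sampling, decided when the trace starts, or tail sampling, decided after the trace completes). Putting high-cardinality values such as user IDs or request IDs into Prometheus labels: each unique label combination becomes its own time series.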
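Two of these pitfalls come down to simple arithmetic. The sketch below shows a deterministic head-sampling decision derived from the trace ID, so that every service reaches the same verdict without coordination, and a worst-case estimate of how many time series a set of labels can create. Both functions and the label counts are assumptions for illustration. Real samplers in the OpenTelemetry SDKs follow the same idea but differ in detail.

```python
import math


def should_sample(trace_id_hex, ratio):
    """Head sampling: keep roughly `ratio` of traces, decided from the ID alone."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the trace ID as a uniform random number.
    bucket = int(trace_id_hex[-16:], 16)
    return bucket < ratio * 2**64


def max_series(label_cardinalities):
    """Upper bound on time series: the product of distinct values per label."""
    return math.prod(label_cardinalities.values())


# Hypothetical label sets for one request-latency metric.
safe_labels = {"service": 20, "endpoint": 15, "status": 5}
risky_labels = dict(safe_labels, user_id=100_000)
```

Here `max_series(safe_labels)` is 1,500, while adding a `user_id` label with 100,000 distinct values raises the bound to 150,000,000. Identifiers like that belong in traces and logs, not in metric labels.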

Traveler's Notes

Observability is the distributed system's "eyes." Metrics tell you "something happened," Tracing tells you "where," Logging tells you "why." OpenTelemetry provides a unified context system — every request carries its "ID bracelet" across 10 services without losing the trail.

Next: SRE Basics (Chapter 9).