Metadata Card
- Prerequisites: Chapter 6 (Microservices), Chapter 7 (Containerization & K8s)
- Estimated Time: 45 minutes
- Core Difficulty: Intermediate
- Reading Mode: High focus
- Completion Milestone: Understand Metrics/Tracing/Logging differences and relationships, configure Prometheus metric collection and Grafana dashboards in K8s, understand OpenTelemetry architecture and span propagation
Your Progress
20 microservices running on K8s, deployment standardized. But a new problem: a user reports slow report generation. You start debugging — kubectl logs shows no errors, kubectl top pods shows normal CPU/memory. Then you freeze — you don't know which services the request traversed, or how long each took.
General Lin points at the command map: "You can see where each fortress is, but you don't know what route the signal took, or where it got delayed. You need a global battlefield awareness system."
In distributed systems, this awareness system is called observability.
Your Task
Three pillars: Metrics (aggregated numerical measurements of system state), Tracing (a trace ID plus spans that reconstruct a request's complete path across services), Logging (structured text records with context).
Metrics: Prometheus (pull model, time-series DB) + Grafana (visualization). Counter (increases only), Gauge (up/down), Histogram (distribution).
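The three metric types differ mainly in which updates they allow. The sketch below illustrates those semantics in plain Python. It is not the Prometheus client library: the class names mirror the metric types, but the methods and the bucket bounds are assumptions made for illustration.

```python
import bisect


class Counter:
    """Value that only increases, e.g. total requests served."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Value that can go up or down, e.g. requests currently in flight."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


class Histogram:
    """Sorts observations into buckets and tracks their sum and count."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One slot per upper bound, plus a final catch-all (+Inf) bucket.
        self.bucket_counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left finds the first bound that is >= value.
        self.bucket_counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value
        self.count += 1

    def cumulative(self):
        """Return (upper_bound, observations <= upper_bound) pairs."""
        pairs, running = [], 0
        for bound, n in zip(self.bounds + [float("inf")], self.bucket_counts):
            running += n
            pairs.append((bound, running))
        return pairs


requests_total = Counter()
in_flight = Gauge()
latency_seconds = Histogram([0.1, 0.5, 1.0])  # hypothetical bucket bounds

for seconds in [0.05, 0.5, 0.7, 2.0]:
    in_flight.inc()
    requests_total.inc()
    latency_seconds.observe(seconds)
    in_flight.dec()

print(requests_total.value, in_flight.value, latency_seconds.cumulative())
```

The cumulative bucket view is what makes it possible to ask questions such as "what fraction of requests finished within 0.5 seconds" without storing every individual observation.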
Tracing: OpenTelemetry — standardized API for generating, collecting, and exporting telemetry data. Trace ID propagates across service boundaries via HTTP/gRPC headers.
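To make header propagation concrete, here is a minimal sketch based on the W3C Trace Context `traceparent` header (version, 32-hex-digit trace ID, 16-hex-digit parent span ID, flags). It uses only the standard library rather than the OpenTelemetry SDK, and the function names `inject`, `extract`, and `handle_request` are hypothetical helpers chosen for this example.

```python
import re
import secrets

# version "00" - trace id (32 hex) - parent span id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def inject(headers, trace_id, span_id, sampled=True):
    """Write the trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Read the trace context from incoming headers, or return None."""
    match = TRACEPARENT_RE.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C format.
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def handle_request(incoming_headers):
    """Simulate one service hop: continue the trace or start a new one."""
    context = extract(incoming_headers)
    trace_id = context["trace_id"] if context else secrets.token_hex(16)
    span_id = secrets.token_hex(8)  # every hop creates its own span
    # Headers this service would attach to its downstream call.
    return inject({}, trace_id, span_id)


gateway_out = handle_request({})            # no context yet: new trace
report_out = handle_request(gateway_out)    # downstream hop: same trace
print(gateway_out["traceparent"])
print(report_out["traceparent"])
```

Both printed headers share the same trace ID while the span ID changes at each hop, which is what lets a tracing backend stitch the spans into one request path.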
Logging: Structured (JSON format with trace_id, span_id, service, level, message). ELK/Loki for storage and search.
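A structured log line with these fields might be produced as in the sketch below, which uses only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `build_logger` helper, the service name, and the ID values are all assumptions for illustration; a real setup would normally take `trace_id` and `span_id` from the active trace context instead of passing them by hand.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,
            # Set through the `extra` argument; None when absent.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(service, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # stands in for stdout or a log shipper
log = build_logger("report-service", buffer)
log.info(
    "report generated in %d ms",
    4200,
    extra={
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
        "span_id": "00f067aa0ba902b7",
    },
)
print(buffer.getvalue(), end="")
```

Because every line carries the trace ID, a search in the log store for that ID returns the log lines of every service the request touched.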
Three-pillar integration: OpenTelemetry unifies the API across all three signals. On K8s, an operator can deploy the OTel Collector stack, which collects, processes, and exports telemetry.
Jaeger: Visualizes traces, shows service dependencies and latencies per span.
Application: Observability enables (1) daily monitoring — Grafana dashboard shows P99 latency spike, Jaeger pinpoints the slow service, Prometheus reveals resource saturation; (2) fault localization; (3) performance optimization.
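Why a dashboard watches P99 rather than the average can be shown with a small calculation. The sketch below uses the nearest-rank percentile on raw samples with made-up latency numbers. Prometheus itself estimates quantiles from histogram buckets rather than from raw samples, so this illustrates the concept, not its algorithm.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical data: 98 fast requests and 2 very slow ones.
latencies_ms = [50.0] * 98 + [4000.0] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms, P50={p50_ms} ms, P99={p99_ms} ms")
```

With this data the median stays at 50 ms and the mean rises only to 129 ms, while P99 jumps to 4000 ms. The slow requests that users complain about are visible only in the tail.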
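Common Pitfalls: Deploying observability only in production. Leaving logs unstructured, or over-structuring them. Tracing every request in a high-throughput system instead of applying head or tail sampling. Putting high-cardinality values (such as user IDs or request IDs) into Prometheus labels, where every distinct label combination becomes its own time series.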
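Two of these pitfalls are easier to see with numbers. The sketch below shows one possible head sampler (decide up front from the trace ID alone), one possible tail sampler (decide after the trace is complete), and a worst-case count of time series for a set of labels. The function names, the 1000 ms threshold, and the label counts are assumptions for illustration, not the behaviour of any specific OpenTelemetry or Prometheus component.

```python
import math


def head_sample(trace_id, ratio):
    """Decide at the start of a trace, using only the trace ID.

    Every service derives the same answer from the same ID, so a trace
    is kept or dropped as a whole.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    threshold = int(ratio * (1 << 64))
    return int(trace_id[-16:], 16) < threshold


def tail_sample(spans, slow_ms=1000.0):
    """Decide after the trace has finished: keep errors and slow requests."""
    return any(s["error"] or s["duration_ms"] > slow_ms for s in spans)


def series_count(label_cardinalities):
    """Worst-case number of time series for one metric name."""
    return math.prod(label_cardinalities.values())


fast_trace = [{"duration_ms": 20.0, "error": False}]
slow_trace = fast_trace + [{"duration_ms": 4200.0, "error": False}]

print(head_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.1))
print(tail_sample(fast_trace), tail_sample(slow_trace))
print(series_count({"method": 5, "status": 10}))
print(series_count({"method": 5, "status": 10, "user_id": 100_000}))
```

Head sampling is cheap but blind to the outcome, while tail sampling can keep exactly the slow or failed traces at the cost of buffering spans until the trace ends. The last two lines show how adding a single `user_id` label can turn 50 series into 5,000,000.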
Traveler's Notes
Observability is the distributed system's "eyes." Metrics tell you "something happened," Tracing tells you "where," Logging tells you "why." OpenTelemetry provides a unified context system — every request carries its "ID bracelet" across 10 services without losing the trail.
Next: SRE Basics (Chapter 9).