Software Systems Atlas

Metadata Card

Prerequisites: Chapter 8 (Observability Basics), Chapter 9 (SRE Basics)
Estimated Time: 45 minutes
Core Difficulty: Intermediate
Reading Mode: High focus
Completion Milestone: Understand the four golden signals and USE/RED methodology, master advanced PromQL and cardinality management, design alert strategies and Grafana panels, understand Observability as Code and RUM/synthetic monitoring

Your Progress

The Expeditionary Army's observability stack is deployed. Prometheus collects metrics, Grafana shows dashboards, and the alert system wakes you up at night.

But after two months, new problems surface:

Too many alerts. Each new PrometheusRule creates more alerts. You once received 47 alert notifications in one night, and 46 of them cascaded from the same root cause. Your Grafana panels also grew out of control, and someone wrote a bad PromQL query that took down Grafana itself.

General Lin says: "Your alert system sends more warnings than battle reports. A soldier reported 'enemy spotted' 47 times, but it was all shrapnel from a single assault."

Your Task

Go from "able to use observability tools" to "doing observability engineering well."

Four Golden Signals: Latency, Traffic, Errors, Saturation.

USE Method (Brendan Gregg): For each resource — Utilization, Saturation, Errors. Infrastructure-level diagnosis.

RED Method (Tom Wilkie): For each service — Rate, Errors, Duration. Microservice-level diagnosis.
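To make the three RED numbers concrete, here is a minimal sketch that computes them from in-memory request records. The Request type, the red_summary function, and the choice to count HTTP status 500 and above as errors are assumptions made for illustration. In a real Prometheus setup these values would normally come from counters and histograms rather than raw records.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # how long the request took, in seconds
    status: int        # HTTP status code


def red_summary(requests, window_s):
    """Compute Rate, Errors, Duration for one service over one time window."""
    total = len(requests)
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    if durations:
        # Nearest-rank 95th percentile, using integer math to avoid float rounding.
        rank = (95 * total + 99) // 100
        p95 = durations[rank - 1]
    else:
        p95 = None
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "p95_duration_s": p95,
    }
```

Running this per service, per window, gives one row of the RED matrix that a service-level dashboard would show.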

Cardinality Management: High-cardinality fields (user IDs, IPs, session IDs) don't belong as Prometheus labels. Use recording rules for pre-aggregation.
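A quick back-of-the-envelope calculation shows why. The sketch below estimates the worst-case number of time series for a single metric as the product of each label's distinct values. The label names and counts are invented for illustration, and real series counts are usually lower because not every combination occurs.

```python
from math import prod


def series_count(label_values: dict) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical label sets for a single HTTP request counter.
base = {"method": 5, "status": 6, "instance": 20}
with_user = {**base, "user_id": 100_000}

print(series_count(base))       # 600
print(series_count(with_user))  # 60000000
```

Adding one user-ID label turns 600 series into up to 60 million, which is why such fields belong in logs or traces rather than in metric labels.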

Advanced PromQL: Subqueries, recording rules, multi-value aggregations. Recording rules pre-compute complex expressions.

Alert Fatigue: Deduplication (aggregate alerts, not per-instance). Inhibition (suppress cascading alerts). Routing (P0 → PagerDuty, P1 → Slack on-call, P2 → team channel, P3 → ticket).
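The three techniques can be sketched in a few lines of Python. This is an illustrative model only, loosely in the spirit of Alertmanager's grouping and inhibition, not its actual implementation. The alert dictionaries, the function names, the route names, and the fallback route for an unknown severity are all assumptions.

```python
from collections import defaultdict

# Routing table taken from the text; the receiver names are hypothetical.
ROUTES = {"P0": "pagerduty", "P1": "slack-oncall", "P2": "team-channel", "P3": "ticket"}


def group_alerts(alerts, group_by=("alertname", "cluster")):
    """Deduplication: one notification per group instead of one per instance."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(k, "") for k in group_by)
        groups[key].append(alert)
    return dict(groups)


def inhibit(alerts, source_name, target_names, equal=("cluster",)):
    """Inhibition: drop target alerts while a source alert with matching labels is active."""
    sources = [a for a in alerts if a["labels"].get("alertname") == source_name]
    kept = []
    for alert in alerts:
        labels = alert["labels"]
        suppressed = labels.get("alertname") in target_names and any(
            all(labels.get(k) == s["labels"].get(k) for k in equal) for s in sources
        )
        if not suppressed:
            kept.append(alert)
    return kept


def route(alert):
    """Routing: pick a receiver by severity (assumed fallback: ticket)."""
    return ROUTES.get(alert["labels"].get("severity"), "ticket")
```

With one NodeDown alert and several HighLatency alerts in the same cluster, inhibit(alerts, "NodeDown", {"HighLatency"}) leaves only the root-cause alert for that cluster, which is the behavior that would have turned 47 notifications into one.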

Grafana Panel Design: Three-tier architecture: Level 1 (executive summary, 6-8 panels), Level 2 (service dashboards — RED matrix per service), Level 3 (deep-dive for troubleshooting). Variables for dimension switching.

Observability as Code: Alert rules, Grafana dashboards, notification routes managed as code in Git. Argo CD syncs to cluster.

RUM (Real User Monitoring): Client-side performance data (page load, JS errors, geography-based latency). OpenTelemetry Browser SDK + Grafana Faro.

Synthetic Monitoring: Scripted user behavior simulation, periodic external probing. Catches CDN, DNS, SSL, API gateway issues before real users are affected.
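A single synthetic check might look like the sketch below. The probe function, its result fields, and the one-second latency budget are assumptions for illustration. The HTTP call is injected as a fetch callable so the logic can be exercised without network access.

```python
import time
from typing import Callable


def probe(url: str, fetch: Callable[[str], int], max_latency_s: float = 1.0) -> dict:
    """Run one external check: call fetch(url), time it, and classify the result."""
    start = time.monotonic()
    try:
        status = fetch(url)  # fetch returns an HTTP status code or raises
        error = None
    except Exception as exc:  # DNS, TLS, timeouts, refused connections, HTTP errors
        status = None
        error = type(exc).__name__
    latency = time.monotonic() - start
    ok = error is None and 200 <= status < 400 and latency <= max_latency_s
    return {"url": url, "ok": ok, "status": status, "latency_s": latency, "error": error}


# One possible fetch built on the standard library (not executed here):
#   import urllib.request
#   fetch = lambda u: urllib.request.urlopen(u, timeout=5).status
```

A scheduler would run such probes periodically from outside the cluster and export the ok and latency_s values as metrics, so that failures surface even when no real user is active.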

Common Pitfalls: Cardinality explosion (each new label multiplies the number of time series by that label's count of distinct values). Single-threshold alerts (day and night traffic patterns differ, so one static threshold fits neither). Too many panels (30+ panels = no panels). Ignoring client-side monitoring. Alert "for" duration too short (network jitter triggers false alarms).
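The last pitfall is easy to see in a small simulation. The sketch below models an alert that fires only after its condition has held continuously for a given duration, similar in spirit to the "for" clause of a Prometheus alerting rule. The function name, the sample data, and the 15-second evaluation interval are assumptions.

```python
def firing_at(samples, threshold, for_s):
    """Return the timestamps at which the alert is firing.

    samples: list of (timestamp_s, value) pairs in time order.
    The alert becomes pending when value > threshold and fires once
    the condition has held for at least for_s seconds without a gap.
    """
    pending_since = None
    firing = []
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts
            if ts - pending_since >= for_s:
                firing.append(ts)
        else:
            pending_since = None  # condition cleared, reset the timer
    return firing


# Error ratio sampled every 15 s: one jitter spike at t=15, a real incident from t=60.
values = [0.1, 0.9, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.9]
samples = [(i * 15, v) for i, v in enumerate(values)]

print(firing_at(samples, threshold=0.5, for_s=0))   # [15, 60, 75, 90, 105, 120]
print(firing_at(samples, threshold=0.5, for_s=30))  # [90, 105, 120]
```

With no waiting period the jitter spike at t=15 pages someone. With a 30-second duration only the sustained incident fires, at the cost of a 30-second delay.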

Traveler's Notes

Observability is "useful but easy to abuse." Prometheus struggles with high cardinality, too many Grafana panels means nobody reads them, and too many alerts means nobody trusts them. From three pillars to four golden signals, from USE to RED: these methodologies help you upgrade from "graphing all the data" to "purposefully observing system health." Alert deduplication, Observability as Code, RUM, and synthetic monitoring are engineering wisdom built on top of the basics.

This concludes Volume VII.
You learned RPC, consensus algorithms, distributed storage and computing, microservices, container orchestration, SRE reliability quantification, IaC and GitOps, production K8s, and deep observability. Your toolbox is full.

Beyond lies Volume VIII: Security. To continue the expedition, you must learn to protect your systems and data.