Software Systems Atlas

Metadata Card

Prerequisites: Chapter 8 (Observability), Basic K8s knowledge
Estimated Time: 40 minutes
Core Difficulty: Intermediate
Reading Mode: High focus
Completion Milestone: Understand SLI/SLO/Error Budget definitions and significance, design a reasonable SLO and derive error budget, understand toil and automation classification

Your Progress

You've established your position in the Expeditionary Army. The command system runs on K8s, Prometheus collects metrics, Jaeger traces each recon request. Two months of stable operation — then problems appear in unexpected places.

The Intelligence Analysis Service fails every two weeks, and each failure means an all-night debugging session. The cause is different each time: once a full disk, once a downstream tower timeout, once a misconfigured rollout that caused a cascading failure.

General Lin asks: "How reliable is our system? Can it handle 3x traffic for next month's new theater operations?"

You can't answer. You don't know the system's upper reliability limit or how much "safety margin" remains.

Your Task

Learn the three core SRE mechanisms: SLI (quantifies service quality), SLO (defines the reliability target), and Error Budget (balances reliability against development velocity).

SLI (Service Level Indicator): A quantifiable measure of service quality. Common SLIs are availability, latency, throughput, error rate, and saturation.
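To make "quantifiable" concrete, here is a minimal sketch of two SLIs computed as good-events-over-total-events ratios. The RequestSample type, the "5xx counts as a failure" rule, and the 300 ms latency threshold are illustrative assumptions, not part of this chapter's command system.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RequestSample:
    """One observed request (hypothetical shape, for illustration only)."""
    status_code: int
    latency_ms: float


def availability_sli(samples: Sequence[RequestSample]) -> Optional[float]:
    """Fraction of requests that did not end in a server error (5xx)."""
    if not samples:
        return None  # no traffic in the window: the SLI is undefined
    good = sum(1 for s in samples if s.status_code < 500)
    return good / len(samples)


def latency_sli(samples: Sequence[RequestSample], threshold_ms: float) -> Optional[float]:
    """Fraction of requests answered within threshold_ms."""
    if not samples:
        return None
    fast = sum(1 for s in samples if s.latency_ms <= threshold_ms)
    return fast / len(samples)


samples = [
    RequestSample(200, 120.0),
    RequestSample(200, 80.0),
    RequestSample(503, 950.0),
    RequestSample(200, 310.0),
]
print(availability_sli(samples))      # 0.75 (3 of 4 requests succeeded)
print(latency_sli(samples, 300.0))    # 0.5  (2 of 4 requests within 300 ms)
```

Expressing every SLI as a ratio between 0 and 1 keeps the later SLO and error budget arithmetic uniform, whatever the underlying signal is.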

SLO (Service Level Objective): The target value for an SLI over a stated window. Example: "99.9% availability over a 30-day rolling window."
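A rolling window means the oldest day drops out as each new day arrives. The sketch below checks a 99.9% target over 30 days using daily good/total request counts. The RollingSLO class and the traffic numbers are hypothetical. In practice these counts would likely come from a metrics system such as Prometheus.

```python
from collections import deque
from typing import Optional


class RollingSLO:
    """Tracks daily (good, total) counts and compares the window SLI to a target."""

    def __init__(self, target: float, window_days: int = 30) -> None:
        self.target = target
        # deque with maxlen discards the oldest day automatically when full
        self.days = deque(maxlen=window_days)

    def record_day(self, good: int, total: int) -> None:
        self.days.append((good, total))

    def sli(self) -> Optional[float]:
        good = sum(g for g, _ in self.days)
        total = sum(t for _, t in self.days)
        return good / total if total else None

    def is_met(self) -> bool:
        value = self.sli()
        return value is not None and value >= self.target


slo = RollingSLO(target=0.999, window_days=30)
for _ in range(29):
    slo.record_day(good=10_000, total=10_000)
slo.record_day(good=9_800, total=10_000)   # one bad day: 200 failed requests

print(slo.is_met())   # True: 299,800 / 300,000 is about 99.93%
```

A second bad day of the same size would push the window to 299,600 / 300,000 (about 99.87%), below the target, which is why the SLO is judged on the whole window and not on any single day.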

Error Budget: 1 - SLO. If the SLO is 99.9%, the error budget is 0.1%, which is 43.2 minutes of full downtime out of the 43,200 minutes in 30 days. While budget remains, deploy freely. Once the budget is exhausted, freeze non-critical changes.
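The arithmetic and the deploy/freeze rule can be sketched in a few lines. This is a simplified time-based view that treats the budget as minutes of full downtime. The function names and the two-state policy are assumptions for illustration, and real policies often add intermediate thresholds.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed minutes of full downtime in the window: (1 - SLO) * window length."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def budget_remaining(slo_target: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the budget still unspent (negative once it is overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - bad_minutes) / budget


def change_policy(remaining_fraction: float) -> str:
    """Two-state policy from the text: deploy while budget remains, else freeze."""
    if remaining_fraction > 0:
        return "deploy freely"
    return "freeze non-critical changes"


print(round(error_budget_minutes(0.999), 1))          # 43.2
print(change_policy(budget_remaining(0.999, 30.0)))   # deploy freely
print(change_policy(budget_remaining(0.999, 50.0)))   # freeze non-critical changes
```

With 30 bad minutes already spent, about 31% of the budget is left, so changes continue. At 50 bad minutes the budget is overspent and the freeze applies.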

Toil: Work that is manual, repetitive, automatable, and of no long-term value. SRE teams should keep toil below 50% of their working time.
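Tracking the 50% ceiling requires classifying where time actually goes. The sketch below computes a toil fraction from one week of logged hours. The category names and hours are made-up sample data, and which categories count as toil is a judgment each team has to make for itself.

```python
from typing import Dict, Set


def toil_fraction(hours_by_category: Dict[str, float], toil_categories: Set[str]) -> float:
    """Share of logged hours spent in categories classified as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(h for c, h in hours_by_category.items() if c in toil_categories)
    return toil / total


# Hypothetical week for one engineer (hours)
week = {
    "manual restarts": 6,
    "ticket triage": 8,
    "disk cleanup": 4,
    "automation project": 14,
    "capacity planning": 8,
}
TOIL = {"manual restarts", "ticket triage", "disk cleanup"}

fraction = toil_fraction(week, TOIL)
print(f"{fraction:.0%}")                                 # 45%
print("within limit" if fraction < 0.5 else "over limit")  # within limit
```

Here 18 of 40 hours are toil. Automating the disk cleanup alone would free 4 hours a week, which is the kind of trade this measurement is meant to surface.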

On-call: Rotation (primary/secondary), escalation by severity (P0-P4), and runbooks (executable incident response procedures). Postmortem principle: stay blameless and find the root cause.

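Common Pitfalls: Chasing 100% reliability (each extra nine cuts the error budget tenfold, from 43.2 to 4.32 minutes per 30 days when moving from 99.9% to 99.99%, and as a rule of thumb costs roughly 10x more). Setting an SLO without measuring first. Over-alerting, which causes alert fatigue. Using SLOs for punishment or performance reviews. Leaving runbooks unmaintained.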

Traveler's Notes

SRE's core isn't making systems 100% reliable, because that's impossible. SRE's core is using a quantifiable metric (error budget) to decide when to invest in stability and when to accelerate development. SLI gives you data, SLO gives you targets, error budget gives you strategy.

Next: IaC & GitOps (Chapter 10).