Software Systems Atlas

Metadata Card

Prerequisites: all of ch11-ch13
Estimated time: 50 minutes
Core difficulty: Advanced
Reading mode: High focus
Completion: Able to design a Cache-Augmented Generation system, understand multi-agent collaboration patterns, use evaluation frameworks for LLM applications

Your Progress

You've walked the complete path from classical search to LLMs in the Model Workshop. On the workshop walls hang every system you built along the way.

But the old craftsman of the Model Workshop looked at your work and asked the most practical question: "How do you maintain this system in production? Will it crash when user concurrency spikes? How do you continuously monitor response quality?"

The final lesson's teaching tool isn't an algorithm—it's a system architecture diagram.

Your Task

A single LLM call is just the starting point. Real-world AI systems need caching layers for efficiency, multi-agent collaboration for complex tasks, and systematic ongoing evaluation mechanisms. This chapter presents three battle-tested patterns.

Chapter Layers
Required: Cache-Augmented Generation, Multi-Agent Collaboration Architecture, Evaluation Frameworks
Optional: Design trade-offs between semantic cache and exact cache
Advanced: LLM system observability, A/B evaluation design

Breaking Through · Tracing the Origin

Your RAG system has been running for three months. Users say, "Every time I ask the same question, it goes back to the vector database and retrieves the same thing." This is wasteful—identical queries, returning the same documents, producing the same answers, but running through the full retrieval + generation pipeline every time.

When your Agent system handles complex tasks, you find that a single LLM frequently makes errors (hallucination, forgotten context, no ability to self-correct). You need multiple models/agents working together, each playing its role.

You need a reliable evaluation system—not just measuring once before release, but tracking every metric's change during continuous operation.

This is where system patterns come in. They don't answer "how to train the model"—they answer "how to design the system."

Cache-Augmented Generation

Caching strategies aren't new—but in LLM systems, caching can occur at multiple levels.

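Semantic caching is more flexible than exact caching: the question does not have to match character for character, because any query whose embedding is close enough to a cached one returns the cached result. Different phrasings of the same meaning, such as 'What's the task completion rate' and 'What is the task completion rate', sit very close together in embedding space and hit the cache directly. The similarity threshold controls the trade-off: set it too low and users receive the answer to a different question.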

python

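# Semantic cache: matches semantically similar queries, not only exact strings
import time

import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


class SemanticCache:
    def __init__(self, embedding_model, similarity_threshold=0.95):
        self.embedding_model = embedding_model  # assumed to expose embed(text) -> vector
        self.threshold = similarity_threshold
        self.cache = {}  # embedding (as tuple) -> (response, metadata)

    def get(self, query):
        query_emb = self.embedding_model.embed(query)
        # Linear scan: fine for a small cache; use a vector index at scale
        for cached_emb, (response, meta) in self.cache.items():
            similarity = cosine_similarity(query_emb, cached_emb)
            if similarity > self.threshold:
                return response
        return None

    def set(self, query, response):
        query_emb = self.embedding_model.embed(query)
        self.cache[tuple(query_emb)] = (response, {"timestamp": time.time()})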

Multi-layer caching strategy:

Layer	Match criterion	Typical hit scenario	Effect
Exact cache	Identical question text	FAQs, repeated queries	Near-instant hit (key lookup only)
Semantic cache	Semantically similar question	Different phrasings of the same meaning	Fast hit (one embedding call plus a similarity search)
Retrieval cache	Same retrieval results	Multi-turn conversations that keep referencing the same knowledge	Fewer vector database queries
Generation cache	Same context, same output	Template generation, fixed-format answers	Fewer LLM calls
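
The first two layers of the table can be chained in front of the full pipeline. The following is a minimal sketch, not a definitive implementation: `LayeredCache`, `normalize`, and the `pipeline` callable are hypothetical names introduced here, and the semantic layer is assumed to be any object with `get(query)` and `set(query, response)` methods, such as the SemanticCache class shown earlier.

```python
import hashlib
from typing import Callable


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings share a key."""
    return " ".join(query.lower().split())


class LayeredCache:
    """Check the exact cache, then an optional semantic cache, then run the pipeline."""

    def __init__(self, pipeline: Callable[[str], str], semantic_cache=None):
        self.pipeline = pipeline              # hypothetical retrieval + generation function
        self.semantic_cache = semantic_cache  # assumed interface: get(query) / set(query, response)
        self.exact = {}                       # sha256 of normalized query -> response
        self.stats = {"exact": 0, "semantic": 0, "miss": 0}

    def _key(self, query: str) -> str:
        return hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()

    def answer(self, query: str) -> str:
        key = self._key(query)
        if key in self.exact:                 # layer 1: exact match
            self.stats["exact"] += 1
            return self.exact[key]
        if self.semantic_cache is not None:   # layer 2: semantic match
            cached = self.semantic_cache.get(query)
            if cached is not None:
                self.stats["semantic"] += 1
                self.exact[key] = cached      # promote so the next lookup is exact
                return cached
        self.stats["miss"] += 1               # miss: run the full pipeline
        response = self.pipeline(query)
        self.exact[key] = response
        if self.semantic_cache is not None:
            self.semantic_cache.set(query, response)
        return response
```

The `stats` counters give the per-layer hit rates that the Operations dimension of the evaluation section asks for. Checking the cheapest layer first means an embedding call is only paid for when the exact lookup fails.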

Multi-Agent Collaboration

Complex tasks require multiple Agents with different specializations to collaborate. Standard pattern:

python

# Multi-Agent collaboration framework (simplified)
class Orchestrator:
    def __init__(self):
        self.agents = {}  # name -> Agent
        self.workflows = {}  # task_type -> [agent_sequence]

    def add_agent(self, name, agent):
        self.agents[name] = agent

    def register_workflow(self, task_type, agent_sequence):
        """Register a workflow: call Agents in sequence"""
        self.workflows[task_type] = agent_sequence

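    def classify_task(self, task):
        """Map a task to a registered workflow name.

        Placeholder: assumes the task is a dict carrying a "type" field.
        Replace with a rule-based or LLM-based classifier.
        """
        return task.get("type")

    def execute(self, task):
        task_type = self.classify_task(task)
        workflow = self.workflows.get(task_type)
        if not workflow:
            raise ValueError(f"No workflow registered for task type: {task_type!r}")

        # Shared context: every agent sees the task and all earlier results
        context = {"task": task, "history": []}
        for agent_name in workflow:
            agent = self.agents[agent_name]
            result = agent.run(context)
            context["history"].append({
                "agent": agent_name,
                "result": result
            })

        return context["history"][-1]["result"]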

Common multi-agent collaboration patterns:

Orchestrator-Worker: one coordinator dispatches sub-tasks, multiple worker nodes each play their role
Debate: multiple Agents each propose solutions, then discuss, rebut, and improve
Reflection: one Agent executes, another checks output quality
Pipeline: A's output feeds as B's input, progressively refining results

python

# Reflection pattern: generation + quality check
class ReflectiveAgent:
    def __init__(self, generator, critic):
        self.generator = generator
        self.critic = critic

    def generate_with_review(self, prompt, max_iterations=3):
        current_prompt = prompt
        response = None
        for _ in range(max_iterations):
            response = self.generator(current_prompt)
            review = self.critic.evaluate(prompt, response)

            if review.passed:
                return response
            # Rebuild from the original prompt so failed attempts don't pile up
            current_prompt = f"{prompt}\nPrevious attempt: {response}\nImprove: {review.feedback}"

        return response  # fallback: last attempt, which did not pass review
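
The Debate pattern from the list above can be sketched in the same spirit. This is an illustrative sketch under stated assumptions: `Debater`, `run_debate`, and the three callable signatures (`propose`, `revise`, `judge`) are hypothetical and would each wrap an LLM call in a real system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Debater:
    name: str
    propose: Callable[[str], str]                 # task -> first answer
    revise: Callable[[str, str, List[str]], str]  # task, own answer, others' answers -> new answer


def run_debate(task: str, debaters: List[Debater],
               judge: Callable[[str, List[str]], int], rounds: int = 2) -> str:
    """Each debater proposes, then revises after seeing the others; a judge picks the winner."""
    answers = [d.propose(task) for d in debaters]
    for _ in range(rounds):
        # Everyone revises against a snapshot of the previous round,
        # so the result does not depend on the order of the debaters
        previous = list(answers)
        answers = [
            d.revise(task, previous[i], previous[:i] + previous[i + 1:])
            for i, d in enumerate(debaters)
        ]
    # judge returns the index of the answer it prefers
    return answers[judge(task, answers)]
```

Each round costs one call per debater, and every call carries all other answers in its input, so token usage grows quickly with both the number of debaters and the number of rounds. A small `rounds` value is usually the first thing to tune.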

Evaluation Framework

"If you can't measure it, you can't improve it." Production-grade LLM evaluation needs multiple dimensions, automation, and continuous tracking.

python

import numpy as np

# Production-grade LLM evaluation framework
class LLMEvaluationFramework:
    def __init__(self):
        self.metrics = {}

    def register_metric(self, name, func):
        """Register an evaluation dimension"""
        self.metrics[name] = func

    def evaluate(self, dataset, model_func):
        """Run all registered evaluations on the dataset"""
        results = {name: [] for name in self.metrics}

        for example in dataset:
            model_output = model_func(example["input"])
            for name, metric_func in self.metrics.items():
                score = metric_func(example["expected"], model_output)
                results[name].append(score)

        return {
            name: {
                "mean": np.mean(scores),
                "std": np.std(scores),
                "p5": np.percentile(scores, 5),   # low percentile (worst case)
                "p95": np.percentile(scores, 95)   # high percentile
            }
            for name, scores in results.items()
        }

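# Usage example
# exact_match, check_faithfulness, check_safety and llm_as_judge are assumed to be
# defined elsewhere, each with the signature metric(expected, model_output) -> float
eval_framework = LLMEvaluationFramework()
eval_framework.register_metric("accuracy", exact_match)
eval_framework.register_metric("faithfulness", check_faithfulness)  # whether faithful to context
eval_framework.register_metric("safety", check_safety)
eval_framework.register_metric("usefulness", llm_as_judge)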

Production evaluation dimensions should cover at least:

Quality: accuracy, faithfulness, usefulness
Performance: latency (P50/P95/P99), throughput
Cost: cost per token, cost per call
Safety: harmful content rate, jailbreak success rate
Operations: error rate, degradation rate, cache hit rate
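
The performance, cost, and operations dimensions above can be computed from per-call logs. Here is a minimal sketch, assuming each log record is a dict with the hypothetical fields `latency_ms`, `tokens`, `error`, and `cache_hit`, and that `price_per_1k_tokens` is a number you supply for your own provider. Percentiles use the nearest-rank method to keep the example free of dependencies.

```python
import math
from typing import Dict, List


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_calls(call_logs: List[dict], price_per_1k_tokens: float) -> Dict[str, float]:
    """Aggregate per-call log records into performance, cost, and operations metrics."""
    n = len(call_logs)
    latencies = [c["latency_ms"] for c in call_logs]
    total_tokens = sum(c["tokens"] for c in call_logs)
    return {
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "latency_p99_ms": percentile(latencies, 99),
        "cost_per_call": total_tokens / 1000 * price_per_1k_tokens / n,
        "error_rate": sum(1 for c in call_logs if c["error"]) / n,
        "cache_hit_rate": sum(1 for c in call_logs if c["cache_hit"]) / n,
    }
```

Tail percentiles only become meaningful with enough samples. With a handful of calls, P95 and P99 simply collapse to the slowest request.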

Integrate evaluation into the CI/CD pipeline: every change to model/prompt/system architecture automatically triggers full evaluation, with results fed into a dashboard.
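
One way to wire this into CI is a regression gate that compares a candidate's evaluation report against a stored baseline and fails the job when a metric drops too far. The sketch below assumes both reports have the shape returned by `LLMEvaluationFramework.evaluate` (metric name mapped to a dict with a `"mean"` entry) and that higher scores are better. `find_regressions`, `ci_exit_code`, and the default tolerance of 0.02 are hypothetical choices, not part of any standard tool.

```python
from typing import Dict, Optional


def find_regressions(baseline: Dict[str, dict], candidate: Dict[str, dict],
                     tolerances: Optional[Dict[str, float]] = None,
                     default_tolerance: float = 0.02) -> Dict[str, Optional[float]]:
    """Return {metric: drop in mean} for metrics that fell by more than their tolerance.

    Assumes higher scores are better. A metric missing from the candidate
    report is flagged with a drop of None.
    """
    tolerances = tolerances or {}
    regressions: Dict[str, Optional[float]] = {}
    for name, base in baseline.items():
        if name not in candidate:
            regressions[name] = None
            continue
        drop = base["mean"] - candidate[name]["mean"]
        if drop > tolerances.get(name, default_tolerance):
            regressions[name] = drop
    return regressions


def ci_exit_code(baseline: Dict[str, dict], candidate: Dict[str, dict], **kwargs) -> int:
    """Return 0 if the candidate is acceptable, 1 otherwise (pass the result to sys.exit)."""
    return 1 if find_regressions(baseline, candidate, **kwargs) else 0
```

Per-metric tolerances let a team be strict about safety while allowing small fluctuations in noisier scores such as LLM-as-judge usefulness.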

Epilogue: The Model Workshop's Final Lesson

"Moving programs from rule execution to data-driven decision-making." That's the full story of Vol 13. You started from zero in the Model Workshop: search algorithms let the machine find paths, knowledge reasoning let it think, reinforcement learning let it learn by trial and error. Then came the climb from classical ML's linear models to tree models, from neural network layers to the Transformer's attention revolution, and finally to today's LLM ecosystem: alignment, retrieval augmentation, multi-agent collaboration.

This journey isn't a stack of technical terms. Each layer answers the same question: how do we make machines autonomously discover patterns from data and use them to make effective decisions. From manually writing rules to pre-training trillion-parameter models, the source of "intelligence" has shifted from the engineer's mind to the statistical structure of massive data.

The Model Workshop's door is now open. Your tools are no longer just if-else and for loops—they're datasets, model weights, vector indices, and Attention matrices. The letter Ahua sent from afar—you can finally open it and read: yes, machines can learn on their own.

Common Pitfalls

Cache expiration policies are easily overlooked: once outdated knowledge is cached, users keep getting three-month-old answers. Use a TTL or versioned caching.
Communication cost in multi-agent systems is non-negligible: each message exchange consumes tokens, so keep the context passed between Agents lean.
Trade-offs between evaluation metrics: safety and usefulness often conflict (excessive safety leads to "Sorry, I can't answer this question" everywhere).
The quality of the evaluation set determines the reliability of evaluations—if the test set itself has bias or is outdated, evaluation results are meaningless.
When Agent system observability is insufficient, troubleshooting a bad response may require backtracking through dozens of interaction steps.
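
The first pitfall above mentions TTL and versioned caching. A minimal sketch of both ideas combined follows, assuming a hypothetical `knowledge_version` string that you bump whenever the underlying documents or prompts change. The `clock` parameter is an assumption added here so that expiry can be tested without waiting.

```python
import time
from typing import Callable, Optional


class TTLVersionedCache:
    """Entries expire after ttl_seconds or as soon as the knowledge version changes."""

    def __init__(self, ttl_seconds: float, knowledge_version: str,
                 clock: Callable[[], float] = time.time):
        self.ttl = ttl_seconds
        self.version = knowledge_version
        self.clock = clock
        self._store = {}  # key -> (response, stored_at, version)

    def set(self, key: str, response: str) -> None:
        self._store[key] = (response, self.clock(), self.version)

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        response, stored_at, version = entry
        expired = self.clock() - stored_at > self.ttl
        outdated = version != self.version
        if expired or outdated:
            del self._store[key]  # evict lazily on read
            return None
        return response

    def bump_version(self, new_version: str) -> None:
        """Call after re-indexing documents; older entries become misses on their next read."""
        self.version = new_version
```

The TTL bounds how stale an answer can get when nobody remembers to invalidate, while the version bump gives immediate invalidation after a known knowledge update.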

Pass Challenges

Warm-up (15 min): Design a three-layer caching strategy for your RAG system. Draw a flowchart, annotate expected hit rates and latency for each layer.
Challenge (45 min): Implement an "Orchestrator-Worker" multi-agent system: one Agent for planning (decomposing task into sub-steps), two worker Agents (one for search, one for generation), one quality-check Agent to review the final output.
Final Challenge (60 min): Build a complete LLM application evaluation pipeline. It should include test set construction, multi-dimensional metric computation (accuracy + faithfulness + latency), and performance degradation detection (compared to a baseline version). Present results visually.

Traveler's Notes

AI system design isn't just "deploy the model to a server." CAG makes the system efficient, multi-agent makes the system smart, and evaluation frameworks make the system trustworthy. Combined, these three patterns give you a true production-grade AI system. The Model Workshop's curriculum is now complete—but the thinking patterns you've learned here will continue to serve you in the wider world.

-> Next Stop

Vol 13 concludes here. The AI and Machine Learning journey went from classical search to LLM systems engineering. The road ahead is long—but you now have both a map and a compass.