
Metadata Card

  • Prerequisites: all of ch11-ch13
  • Estimated time: 50 minutes
  • Core difficulty: Advanced
  • Reading mode: High focus
  • Completion: Able to design a Cache-Augmented Generation system, explain multi-agent collaboration patterns, and use evaluation frameworks for LLM applications

Your Progress

You've walked the complete path from classical search to LLMs in the Model Workshop. On the workshop walls hang every system you built along the way.

But the old craftsman of the Model Workshop looked at your work and asked the most practical question: "How do you maintain this system in production? Will it crash when user concurrency spikes? How do you continuously monitor response quality?"

The final lesson's teaching tool isn't an algorithm—it's a system architecture diagram.

Your Task

A single LLM call is just the starting point. Real-world AI systems need caching layers for efficiency, multi-agent collaboration for complex tasks, and systematic ongoing evaluation mechanisms. This chapter presents three battle-tested patterns.

Chapter Layers

  • Required: Cache-Augmented Generation, Multi-Agent Collaboration Architecture, Evaluation Frameworks
  • Optional: Design trade-offs between semantic cache and exact cache
  • Advanced: LLM system observability, A/B evaluation design

Breaking Through · Tracing the Origin

Your RAG system has been running for three months. Users say, "Every time I ask the same question, it goes back to the vector database and retrieves the same thing." This is wasteful—identical queries, returning the same documents, producing the same answers, but running through the full retrieval + generation pipeline every time.

When your Agent system handles complex tasks, you find that a single LLM frequently makes errors (hallucinating, forgetting context, failing to self-correct). You need multiple models or agents working together, each playing its role.

You also need a reliable evaluation system: one that doesn't just measure once before release, but tracks how every metric changes during continuous operation.

This is where system patterns come in. They don't answer "how to train the model"—they answer "how to design the system."

Cache-Augmented Generation

Caching strategies aren't new—but in LLM systems, caching can occur at multiple levels.

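Semantic caching is more flexible than exact caching: the question doesn't have to match character for character, because semantically similar queries return the cached result. Different phrasings of the same meaning, such as 'What's the task completion rate' and 'What is the task completion rate', sit very close together in embedding space and hit the cache directly. The similarity threshold is the main trade-off: set it too low and users receive the answer to a different question; set it too high and the cache rarely hits.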

python
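# Semantic cache: matches semantically similar queries, not just identical strings
import time

import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


class SemanticCache:
    def __init__(self, embedding_model, similarity_threshold=0.95):
        self.embedding_model = embedding_model  # assumed to expose .embed(text) -> vector
        self.threshold = similarity_threshold
        self.cache = {}  # embedding (as tuple) -> (response, metadata)

    def get(self, query):
        query_emb = self.embedding_model.embed(query)
        # Linear scan for the closest entry above the threshold;
        # swap in a vector index once the cache grows large
        best_response, best_similarity = None, self.threshold
        for cached_emb, (response, meta) in self.cache.items():
            similarity = cosine_similarity(query_emb, cached_emb)
            if similarity > best_similarity:
                best_response, best_similarity = response, similarity
        return best_response

    def set(self, query, response):
        query_emb = self.embedding_model.embed(query)
        self.cache[tuple(query_emb)] = (response, {"timestamp": time.time()})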

Multi-layer caching strategy:

| Layer | Strategy | Hit Scenario | Effect |
| --- | --- | --- | --- |
| Exact cache | Exactly identical questions | FAQs, repeated queries | Near-0 ms hit |
| Semantic cache | Similar questions | Different phrasings of the same meaning | Millisecond-level hit |
| Retrieval cache | Same retrieval results | Multi-turn conversations repeatedly referencing the same knowledge | Reduces vector queries |
| Generation cache | Same context, same output | Template generation, fixed-format answers | Reduces LLM calls |
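
The layers in the table can be chained so that the cheapest check runs first. The following is a minimal sketch under stated assumptions, not a reference implementation: `LayeredCache`, `normalize`, and the `run_pipeline` and `semantic_lookup` callables are hypothetical names introduced here for illustration.

```python
import hashlib
from typing import Callable, Optional


def normalize(query: str) -> str:
    """Collapse case and whitespace so trivially different strings share a key."""
    return " ".join(query.lower().split())


class LayeredCache:
    """Exact cache first, then an optional semantic lookup, then the full pipeline."""

    def __init__(
        self,
        run_pipeline: Callable[[str], str],
        semantic_lookup: Optional[Callable[[str], Optional[str]]] = None,
    ):
        self.run_pipeline = run_pipeline        # hypothetical: retrieval + generation
        self.semantic_lookup = semantic_lookup  # hypothetical: e.g. a semantic cache's get()
        self.exact = {}                         # sha256(normalized query) -> response
        self.stats = {"exact": 0, "semantic": 0, "miss": 0}

    def _key(self, query: str) -> str:
        return hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()

    def answer(self, query: str) -> str:
        key = self._key(query)
        if key in self.exact:                   # layer 1: hash lookup
            self.stats["exact"] += 1
            return self.exact[key]
        if self.semantic_lookup is not None:    # layer 2: embedding similarity
            cached = self.semantic_lookup(query)
            if cached is not None:
                self.stats["semantic"] += 1
                self.exact[key] = cached        # promote to the exact layer
                return cached
        self.stats["miss"] += 1                 # layer 3: run the full pipeline
        response = self.run_pipeline(query)
        self.exact[key] = response
        return response
```

The `stats` counters give a per-layer hit rate, which is one of the operations metrics worth tracking. Writing fresh responses back into the semantic layer is deliberately left out of this sketch.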

Multi-Agent Collaboration

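Complex tasks require multiple Agents with different specializations to collaborate. A common pattern is a central orchestrator that runs registered Agents in sequence and passes a shared context between them: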

python
# Multi-Agent collaboration framework (simplified)
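class Orchestrator:
    def __init__(self):
        self.agents = {}  # name -> Agent (any object with a run(context) method)
        self.workflows = {}  # task_type -> [agent_sequence]

    def add_agent(self, name, agent):
        self.agents[name] = agent

    def register_workflow(self, task_type, agent_sequence):
        """Register a workflow: call Agents in sequence"""
        self.workflows[task_type] = agent_sequence

    def classify_task(self, task):
        """Map a task to a workflow key.

        Placeholder assumption: the task is a dict carrying a "type" field.
        In practice this could be a rule set or an LLM-based router.
        """
        return task["type"]

    def execute(self, task):
        task_type = self.classify_task(task)
        workflow = self.workflows.get(task_type)
        if not workflow:
            raise ValueError(f"No workflow registered for task type: {task_type!r}")

        # Shared context: each Agent can read the task and all earlier results
        context = {"task": task, "history": []}
        for agent_name in workflow:
            agent = self.agents[agent_name]
            result = agent.run(context)
            context["history"].append({
                "agent": agent_name,
                "result": result
            })

        return context["history"][-1]["result"]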

Common multi-agent collaboration patterns:

  • Orchestrator-Worker: one coordinator dispatches sub-tasks, and each worker handles the sub-task it specializes in
  • Debate: multiple Agents each propose solutions, then discuss, rebut, and improve
  • Reflection: one Agent executes, another checks output quality
  • Pipeline: A's output becomes B's input, progressively refining the result

python
# Reflection pattern: generation + quality check
class ReflectiveAgent:
    def __init__(self, generator, critic):
        self.generator = generator  # callable: prompt -> response
        self.critic = critic        # object with evaluate(prompt, response) -> review

    def generate_with_review(self, prompt, max_iterations=3):
        current_prompt = prompt
        response = None
        for _ in range(max_iterations):
            response = self.generator(current_prompt)
            review = self.critic.evaluate(prompt, response)

            if review.passed:
                return response
            # Rebuild from the original prompt so the context does not grow every round
            current_prompt = f"{prompt}\nPrevious attempt: {response}\nImprove: {review.feedback}"

        return response  # fall back to the last (unapproved) generation
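
The Debate pattern from the list above can be sketched in the same style. This is a minimal illustration under assumptions: each Agent is a plain callable that maps a prompt string to a proposal string, and `judge` is a callable that returns the index of the winning proposal. The names `debate`, `agents`, and `judge` are hypothetical and do not come from any library.

```python
from typing import Callable, List, Sequence

Agent = Callable[[str], str]


def debate(
    task: str,
    agents: Sequence[Agent],
    judge: Callable[[str, List[str]], int],
    rounds: int = 2,
) -> str:
    """Run a simple debate: propose, revise after seeing peers, then let a judge pick."""
    if not agents:
        raise ValueError("debate needs at least one agent")

    # Round 0: every Agent proposes independently
    proposals = [agent(task) for agent in agents]

    # Later rounds: each Agent sees the other proposals and revises its own
    for _ in range(rounds - 1):
        revised = []
        for i, agent in enumerate(agents):
            others = "\n".join(p for j, p in enumerate(proposals) if j != i)
            prompt = (
                f"Task: {task}\n"
                f"Your previous proposal: {proposals[i]}\n"
                f"Other proposals:\n{others}\n"
                "Rebut or incorporate them and give an improved proposal."
            )
            revised.append(agent(prompt))
        proposals = revised

    # The judge returns the index of the winning proposal
    return proposals[judge(task, proposals)]
```

Every round costs one call per Agent and each prompt carries all peer proposals, so token usage grows quickly with the number of Agents and rounds. Keeping `rounds` small is a reasonable default.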

Evaluation Framework

"If you can't measure it, you can't improve it." Production-grade LLM evaluation needs multiple dimensions, automation, and continuous tracking.

python
# Production-grade LLM evaluation framework
import numpy as np

class LLMEvaluationFramework:
    def __init__(self):
        self.metrics = {}

    def register_metric(self, name, func):
        """Register an evaluation dimension"""
        self.metrics[name] = func

    def evaluate(self, dataset, model_func):
        """Run all registered evaluations on the dataset"""
        results = {name: [] for name in self.metrics}

        for example in dataset:
            model_output = model_func(example["input"])
            for name, metric_func in self.metrics.items():
                score = metric_func(example["expected"], model_output)
                results[name].append(score)

        return {
            name: {
                "mean": np.mean(scores),
                "std": np.std(scores),
                "p5": np.percentile(scores, 5),   # low percentile (worst case)
                "p95": np.percentile(scores, 95)   # high percentile
            }
            for name, scores in results.items()
        }

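# Usage example: exact_match, check_faithfulness, check_safety, and llm_as_judge are
# metric functions you supply, each taking (expected, model_output) and returning a score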
eval_framework = LLMEvaluationFramework()
eval_framework.register_metric("accuracy", exact_match)
eval_framework.register_metric("faithfulness", check_faithfulness)  # whether faithful to context
eval_framework.register_metric("safety", check_safety)
eval_framework.register_metric("usefulness", llm_as_judge)

Production evaluation dimensions should cover at least:

  • Quality: accuracy, faithfulness, usefulness
  • Performance: latency (P50/P95/P99), throughput
  • Cost: cost per token, cost per call
  • Safety: harmful content rate, jailbreak success rate
  • Operations: error rate, degradation rate, cache hit rate
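
The performance, cost, and operations dimensions can be computed from per-call logs. The sketch below assumes each call record carries `latency_ms`, `prompt_tokens`, `completion_tokens`, and `cache_hit` fields and that pricing is quoted per 1,000 tokens. The record layout, the function names, and any prices passed in are illustrative assumptions, not real rates.

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of an already sorted list."""
    rank = (len(sorted_values) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def summarize_calls(calls, price_per_1k_prompt, price_per_1k_completion):
    """Aggregate latency percentiles, mean cost per call, and cache hit rate."""
    if not calls:
        raise ValueError("no calls to summarize")

    latencies = sorted(c["latency_ms"] for c in calls)
    costs = [
        c["prompt_tokens"] / 1000 * price_per_1k_prompt
        + c["completion_tokens"] / 1000 * price_per_1k_completion
        for c in calls
    ]
    return {
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "latency_p99_ms": percentile(latencies, 99),
        "mean_cost_per_call": sum(costs) / len(costs),
        "cache_hit_rate": sum(1 for c in calls if c["cache_hit"]) / len(calls),
    }
```

Tail percentiles matter more than the mean here: a healthy average can hide a P99 that makes the slowest requests unusable.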

Integrate evaluation into the CI/CD pipeline: every change to model/prompt/system architecture automatically triggers full evaluation, with results fed into a dashboard.
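
One way to wire this into CI is a regression gate that compares the current run against a stored baseline and fails the build when a metric worsens beyond a tolerance. This is a minimal sketch: `check_regression`, the relative tolerance of 2%, and the summary shape (`{"metric": {"mean": ...}}`, the same shape the evaluation framework above returns) are assumptions made for illustration.

```python
def check_regression(baseline, current, tolerance=0.02, lower_is_better=()):
    """Compare metric means against a baseline.

    Returns a list of human-readable failures; an empty list means the gate passes.
    `tolerance` is relative to the baseline mean (0.02 = 2%).
    """
    failures = []
    for name, base in baseline.items():
        if name not in current:
            failures.append(f"{name}: missing from current run")
            continue
        base_mean = base["mean"]
        cur_mean = current[name]["mean"]
        # For metrics such as latency, an increase is the regression
        if name in lower_is_better:
            worsened_by = cur_mean - base_mean
        else:
            worsened_by = base_mean - cur_mean
        if worsened_by > tolerance * abs(base_mean):
            failures.append(f"{name}: {base_mean:.3f} -> {cur_mean:.3f}")
    return failures
```

A CI step could call this after the evaluation run and exit with a non-zero status when the returned list is non-empty, which blocks the change until someone reviews the regression.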

Epilogue: The Model Workshop's Final Lesson

"Moving programs from rule execution to data-driven decision-making." That's the full story of Vol 13. You started from zero in the Model Workshop: search algorithms let it find paths, knowledge reasoning let it think, reinforcement learning let it learn by trial and error. Then came the move from classical ML's linear models to tree models, from neural network layers to the Transformer's attention revolution, and finally to today's LLM ecosystem: alignment, retrieval augmentation, multi-agent collaboration.

This journey isn't a stack of technical terms. Each layer answers the same question: how do we make machines autonomously discover patterns from data and use them to make effective decisions. From manually writing rules to pre-training trillion-parameter models, the source of "intelligence" has shifted from the engineer's mind to the statistical structure of massive data.

The Model Workshop's door is now open. Your tools are no longer just if-else and for loops—they're datasets, model weights, vector indices, and Attention matrices. The letter Ahua sent from afar—you can finally open it and read: yes, machines can learn on their own.

Common Pitfalls

  • Cache expiration policies are easily overlooked: outdated knowledge gets cached, and users get three-month-old answers. Use a TTL or versioned cache keys.
  • Communication cost in multi-agent systems is non-negligible: each message exchange consumes tokens. Pass each Agent only the context it needs.
  • Trade-offs between evaluation metrics: safety and usefulness often conflict (excessive safety leads to "Sorry, I can't answer this question" everywhere).
  • The quality of the evaluation set determines the reliability of evaluations—if the test set itself has bias or is outdated, evaluation results are meaningless.
  • When Agent system observability is insufficient, troubleshooting a bad response may require backtracking through dozens of interaction steps.
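
The first pitfall can be addressed by storing a timestamp and a knowledge-base version with every cache entry. The sketch below is one possible shape, assuming a single-process in-memory cache: `VersionedTTLCache`, `bump_version`, and the injectable `clock` parameter are hypothetical names chosen for illustration.

```python
import time


class VersionedTTLCache:
    """Cache whose entries expire after ttl_seconds or when the knowledge version changes."""

    def __init__(self, ttl_seconds, version, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.version = version  # assumption: bumped whenever the knowledge base is re-indexed
        self.clock = clock      # injectable so tests can control time
        self._entries = {}      # key -> (response, stored_at, version)

    def set(self, key, response):
        self._entries[key] = (response, self.clock(), self.version)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        response, stored_at, version = entry
        expired = self.clock() - stored_at > self.ttl_seconds
        if expired or version != self.version:
            del self._entries[key]  # evict the stale entry so it is never served again
            return None
        return response

    def bump_version(self, new_version):
        """Invalidate entries stored under older versions (lazily, on their next read)."""
        self.version = new_version
```

TTL bounds how old an answer can be when nothing signals a change, while the version check invalidates immediately when the underlying documents are known to have changed. Using both covers the two cases.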

Pass Challenges

  • Warm-up (15 min): Design a three-layer caching strategy for your RAG system. Draw a flowchart, annotate expected hit rates and latency for each layer.
  • Challenge (45 min): Implement an "Orchestrator-Worker" multi-agent system: one Agent for planning (decomposing task into sub-steps), two worker Agents (one for search, one for generation), one quality-check Agent to review the final output.
  • Final Challenge (60 min): Build a complete LLM application evaluation pipeline. Include test set construction, multi-dimensional metric computation (accuracy + faithfulness + latency), and performance degradation detection against a baseline version. Present the results visually.

Traveler's Notes

AI system design isn't just "deploy the model to a server." Cache-Augmented Generation (CAG) makes the system efficient, multi-agent collaboration makes it smart, and evaluation frameworks make it trustworthy. Combined, these three patterns give you a true production-grade AI system. The Model Workshop's curriculum is now complete, but the thinking patterns you've learned here will continue to serve you in the wider world.

-> Next Stop

Vol 13 concludes here. The AI and Machine Learning journey went from classical search to LLM systems engineering. The road ahead is long—but you now have both a map and a compass.
