Metadata Card

  • Prerequisites: ch11 Pre-training & Fine-tuning
  • Estimated time: 50 minutes
  • Core difficulty: Advanced/In-depth
  • Reading mode: High focus
  • Completion: Able to explain the difference between RLHF and DPO, build a basic RAG system, understand the Agent workflow loop

Your Progress

In the Model Workshop, your Transformer completed pre-training on a vast amount of text. It can continue sentences, answer questions, even write code. But you notice a problem—it "doesn't obey."

You ask it to output in JSON format, and it writes you a paragraph of prose. You tell it not to fabricate facts, and it confidently gives a wrong answer.

The last few machines in the Model Workshop are labeled: Alignment, Retrieval-Augmented Generation, Agent Loop—making models both knowledgeable and truly usable.

Your Task

For an LLM to go from pre-trained to usable, it must overcome three hurdles: Alignment (making the model speak like a human, not lie), Knowledge Updates (the model is frozen after training, but knowledge goes stale—RAG solves this), and Capability Expansion (letting the model call tools, query databases, plan multi-step tasks—Agent). This chapter unpacks each one.

Chapter Layers

  • Required: RLHF/DPO alignment principles, RAG retrieval-augmented generation, Agent loop
  • Optional: LLM Evaluation benchmarks, hallucination detection
  • Advanced: Reward hacking problem in RLHF

Breaking Through · Tracing the Origin

You have a powerful language model. You ask it a simple geography question:

You: "What is the capital of Tibet?" Model: "Shigatse."

—Wrong. The correct answer is Lhasa. You frown. Try another question:

You: "What is 1+1?" Model: "That depends on an infinite number of possibilities depending on the base system you're using. In binary 1+1=10, in ternary 1+1=2..."

—Nonsensical. You asked about elementary arithmetic; it gave you a general mathematics lecture.

This isn't the model being "not smart enough." During pre-training, it memorized vast amounts of knowledge—including the fact that "1+1 differs across different bases." But it doesn't know your real intent is "give me a concise answer."

It hasn't been aligned to human communication norms: it doesn't know when to be brief and when to be detailed, and it doesn't know that "factual questions should be answered accurately" matters more than "showing off my knowledge." Pre-training only taught it to predict the next word, not how to converse with humans.

Alignment: RLHF and DPO

RLHF (Reinforcement Learning from Human Feedback) is the alignment method OpenAI used in InstructGPT. After an initial round of supervised fine-tuning, it has two stages:

  1. Use human annotators to compare model outputs, train a "reward model"
  2. Use PPO (Proximal Policy Optimization) to fine-tune the LLM using the reward model as signal
python
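# RLHF pseudocode: one simplified policy-gradient step with a KL penalty.
# Real PPO adds a clipped probability ratio and a value baseline.
# `generate`, `score`, and `log_prob` are placeholder method names.
def rlhf_step(policy_model, reward_model, ref_model, optimizer, prompts, beta=0.1):
    """One simplified RLHF update over a batch of prompts."""
    optimizer.zero_grad()
    for prompt in prompts:
        response = policy_model.generate(prompt)
        reward = reward_model.score(prompt, response)  # scalar, carries no gradient
        # Log-probability of the sampled response under each model
        logp = policy_model.log_prob(prompt, response)
        ref_logp = ref_model.log_prob(prompt, response).detach()
        # KL penalty: new policy shouldn't deviate too far from the reference
        kl = logp.detach() - ref_logp
        # Raise the log-probability of responses with a high penalized reward
        loss = -(reward - beta * kl) * logp
        (loss / len(prompts)).backward()  # accumulate gradients over the batch
    optimizer.step()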
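
Step 1 above says annotators compare outputs to train a reward model, but not which loss turns comparisons into a training signal. A common choice is a pairwise logistic (Bradley-Terry style) loss, and that choice is an assumption here since the text does not name one. The sketch below works on plain floats; the scores are hypothetical reward-model outputs, not real data.

```python
import math

def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    # log(1 + exp(-margin)), written so that exp() never overflows
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical reward-model scores for two responses to the same prompt
good_ranking = pairwise_reward_loss(2.0, 0.5)  # preferred response already scores higher
bad_ranking = pairwise_reward_loss(0.5, 2.0)   # preferred response scores lower
print(good_ranking, bad_ranking)
```

The loss is small when the reward model already ranks the annotator's choice higher and grows when the ranking is reversed, which is what pushes the scores apart during training.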

DPO (Direct Preference Optimization) simplifies this: without explicitly training a reward model, directly optimize the LLM using human preference data.

Key difference: RLHF first learns a reward function then optimizes with policy gradients; DPO directly optimizes the policy using preference data.

# DPO loss intuition:
# If humans prefer A over B, increase P(A) and decrease P(B)
# while not deviating too far from the original model
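
The intuition above can be made concrete with a small numeric sketch. It assumes the sequence log-probabilities of the preferred and rejected responses have already been computed under both the policy and the frozen reference model; the function name, argument names, and the default `beta=0.1` are illustrative choices, not taken from the text.

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more (in log space) the policy favours each response than the reference does
    chosen_shift = policy_logp_chosen - ref_logp_chosen
    rejected_shift = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(logits)), written so that exp() never overflows
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# Policy identical to the reference: the loss sits at log(2)
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has raised the preferred response and lowered the rejected one
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(baseline, improved)
```

Because both shifts are measured relative to the reference model, the loss only rewards moving probability toward the preferred response beyond what the original model already did, which is how the "not deviating too far" constraint enters without a separate KL term.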

RAG: Retrieval-Augmented Generation

A pre-trained model's knowledge is frozen at its training-data cutoff. RAG retrieves relevant documents from an external knowledge base at generation time and injects them into the prompt as context.

python
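# Standard RAG pipeline
# Note: these import paths match older LangChain releases; newer releases
# move them into packages such as langchain_community and langchain_openai.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

# `documents` is assumed to be a list of LangChain Document objects,
# already loaded and split into chunks.

# 1. Build vector database
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)

# 2. Query: retrieval + generation
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    # Retrieve the top 3 chunks; `k` is passed through `search_kwargs`
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)

query = "Based on our latest sales data, how was Q3 performance?"
answer = qa_chain.run(query)
print(answer)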

RAG pipeline:

User input → Vectorize → Retrieve Top-K relevant documents
→ Combine documents + user input into augmented prompt
→ LLM generates answer based on augmented context

RAG solves three pressing needs for LLMs: knowledge updates (no need to retrain), traceable sources (can see which documents were cited), and domain adaptation (enterprise private data).
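
To see the retrieve-then-augment steps without any external service, here is a dependency-free sketch. It swaps the embedding model for a toy bag-of-words vector and stops at the augmented prompt rather than calling an LLM; the sample documents, function names, and prompt wording are all made up for illustration.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Combine retrieved documents and the user question into one augmented prompt."""
    numbered = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(context))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you don't know.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {query}\nAnswer:"
    )

# Made-up knowledge base for illustration
docs = [
    "The warehouse in Lyon ships orders on weekdays only.",
    "Refunds are processed within five business days.",
    "The support desk is open from 9am to 5pm.",
]
question = "How long do refunds take?"
top_docs = retrieve(question, docs, k=2)
prompt = build_prompt(question, top_docs)
print(prompt)
```

Numbering the context entries is one simple way to get the traceable sources mentioned above: the model can be asked to cite `[1]` or `[2]`, and each number maps back to a stored document.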

Agent

An agent lets an LLM do more than "answer questions": it lets the model actively execute multi-step tasks. The core is the ReAct (Reasoning + Acting) pattern:

Loop:
  1. Thought: current state, what to do next
  2. Action: call a tool (search, compute, read file)
  3. Observation: see what the tool returned
  4. Repeat until task is complete
python
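# Simplified Agent loop
# Assumes `llm(messages)` returns an object with `.content`, `.finished`,
# and `.tool_calls` (each call has `.name` and `.args`); adapt to your client.
class SimpleAgent:
    def __init__(self, llm, tools):
        self.llm = llm
        self.tools = {t.name: t for t in tools}

    def run(self, task, max_steps=10):
        messages = [{"role": "user", "content": task}]
        for step in range(max_steps):
            response = self.llm(messages)
            # Record the model's own turn so later steps can see it
            messages.append({"role": "assistant", "content": response.content})

            if response.finished:  # model decides to output final answer
                return response.content

            # Execute each requested tool call and feed the result back
            for call in response.tool_calls:
                tool = self.tools.get(call.name)
                if tool is None:
                    result = f"Error: unknown tool '{call.name}'"
                else:
                    try:
                        result = tool.run(**call.args)
                    except Exception as exc:  # report failures so the model can re-plan
                        result = f"Error: {exc}"
                messages.append({"role": "tool",
                                  "content": str(result),
                                  "name": call.name})
        return "Max steps reached"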

Key capabilities of an Agent:

  • Tool use: model needs to learn when to call tools and what arguments to pass
  • Planning: decomposing complex tasks into sub-steps
  • Memory: conversation history / long-term memory (can use vector databases)
  • Error recovery: re-plan after a tool call fails
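
The tool-use and error-recovery points can be exercised without a real model. The sketch below replaces the LLM with a scripted stand-in that replays fixed replies, so the loop is deterministic; `ScriptedLLM`, `Reply`, `ToolCall`, and the `divide` tool are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    args: dict

@dataclass
class Reply:
    content: str
    finished: bool = False
    tool_calls: list = field(default_factory=list)

class Divider:
    """A single tool the agent may call."""
    name = "divide"
    def run(self, a: float, b: float) -> float:
        return a / b

class ScriptedLLM:
    """Stand-in for a real model: replays a fixed list of replies in order."""
    def __init__(self, replies):
        self._replies = iter(replies)
    def __call__(self, messages):
        return next(self._replies)

def run_agent(llm, tools, task, max_steps=5):
    registry = {t.name: t for t in tools}
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = llm(messages)                       # Thought (+ requested Action)
        messages.append({"role": "assistant", "content": reply.content})
        if reply.finished:
            return reply.content, messages
        for call in reply.tool_calls:               # Action
            try:
                result = registry[call.name].run(**call.args)
            except Exception as exc:                # failure becomes an Observation
                result = f"Error: {exc!r}"
            messages.append({"role": "tool", "name": call.name, "content": str(result)})
    return "Max steps reached", messages

llm = ScriptedLLM([
    Reply("Thought: divide 10 by 0.", tool_calls=[ToolCall("divide", {"a": 10, "b": 0})]),
    Reply("Thought: that failed; the divisor should be 4.",
          tool_calls=[ToolCall("divide", {"a": 10, "b": 4})]),
    Reply("10 / 4 = 2.5", finished=True),
])
answer, trace = run_agent(llm, [Divider()], "What is 10 divided by 4?")
print(answer)
for message in trace:
    print(message["role"], "->", message["content"])
```

The first tool call raises an exception, which the loop turns into an observation instead of crashing; a real model would read that observation and re-plan, which the scripted second reply imitates.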

Evaluation

LLM evaluation is harder than traditional ML—correct answers are often not unique. Evaluation dimensions:

  • Usefulness: does the answer satisfy the user's needs (human scoring, LLM-as-judge)
  • Safety: does it produce harmful content
  • Hallucination rate: proportion of fabricated facts
  • Alignment: does it follow instruction format
python
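# LLM-as-judge: use another LLM to evaluate generation quality
# `eval_llm` is assumed to be a callable that takes a prompt string and returns text.
def judge_response(eval_llm, query, response):
    """Build the grading prompt and return the judge model's raw output."""
    eval_prompt = f"""
Task: Evaluate if the assistant's response is helpful, accurate, and follows instructions.

User query: {query}
Assistant response: {response}

Score 1-5 for: helpfulness, accuracy, instruction-following.
"""
    return eval_llm(eval_prompt)

# `query` and `response` come from the system under evaluation
score = judge_response(eval_llm, query, response)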
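
A judge model returns free-form text, so the scores still have to be extracted and aggregated before they are useful as a metric. The sketch below assumes the judge writes each criterion as `name: N` or `name = N`; that output format and the sample judge outputs are assumptions for illustration, not something the prompt above guarantees.

```python
import re

def parse_scores(judge_output: str) -> dict:
    """Extract 'criterion: N' pairs (N from 1 to 5) from free-form judge text."""
    pattern = r"(helpfulness|accuracy|instruction-following)\s*[:=]\s*([1-5])\b"
    return {name.lower(): int(value)
            for name, value in re.findall(pattern, judge_output, flags=re.IGNORECASE)}

def mean_scores(all_scores: list[dict]) -> dict:
    """Average each criterion over the responses that reported it."""
    totals, counts = {}, {}
    for scores in all_scores:
        for name, value in scores.items():
            totals[name] = totals.get(name, 0) + value
            counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}

# Hypothetical judge outputs for two responses
outputs = [
    "Helpfulness: 4\nAccuracy: 5\nInstruction-following: 3",
    "helpfulness = 2, accuracy = 3, instruction-following = 5",
]
parsed = [parse_scores(text) for text in outputs]
averages = mean_scores(parsed)
print(averages)
```

In practice it is usually more robust to ask the judge for a fixed structured format and to treat unparseable outputs as missing rather than as zero, which is why `mean_scores` only averages over the criteria that were actually reported.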

Common Pitfalls

  • Reward hacking in RLHF: the model learns superficial strategies that score well with the reward model rather than actually doing the task well. The reward model needs continuous updates.
  • RAG's retrieval quality caps answer quality: if the correct answer isn't in the Top-K results, the LLM either fabricates or admits it doesn't know.
  • The Agent loop easily enters an infinite loop: the LLM repeatedly calls the same tool. Set max_steps and explicit termination conditions.
  • Token consumption: Agents consume massive tokens in multi-step interactions (each step repeats the history), requiring careful budgeting.
  • Hallucination isn't a "bug," it's a "feature"—the pre-training language modeling objective itself encourages the model to fill in the most probable words, not fact-check.
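
The token-consumption pitfall is easier to budget for with a quick calculation. The sketch below assumes every model call resends the full history and that each step adds a fixed number of tokens (model output plus tool observation); the figures 200 and 300 are arbitrary illustrative values.

```python
def agent_prompt_tokens(task_tokens: int, tokens_per_step: int, steps: int) -> int:
    """Total prompt tokens sent over `steps` model calls when each call resends the history."""
    total = 0
    history = task_tokens
    for _ in range(steps):
        total += history             # the whole history is sent as the prompt
        history += tokens_per_step   # this step's output and observation join the history
    return total

for steps in (1, 5, 10, 20):
    print(steps, agent_prompt_tokens(task_tokens=200, tokens_per_step=300, steps=steps))
```

Under these assumptions the total is `steps * task_tokens + tokens_per_step * steps * (steps - 1) / 2`, so doubling the number of steps roughly quadruples the prompt tokens once the history dominates.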

Pass Challenges

  • Warm-up (10 min): Load a dialogue model from HuggingFace transformers (e.g., microsoft/DialoGPT-medium or a LLaMA variant), ask the same question 3 times with sampling enabled, and observe how the outputs differ because of randomness.
  • Challenge (45 min): Use LangChain + Chroma to build a RAG system. Split a local document (PDF or Markdown) into paragraphs, build an index, and do Q&A.
  • Observation: Construct a multi-step reasoning question (e.g., "Xiao Ming traveled from Beijing to Shanghai to Guangzhou. Where was the last stop?"), compare answer quality between zero-shot and CoT prompting.

Traveler's Notes

Pre-training gives LLMs knowledge, alignment gives them "how to speak," RAG gives them "the ability to acquire new knowledge," and Agents give them "the ability to proactively do things." Combined, these four components transform LLMs from internet chatbots into genuine work assistants.

-> Next Chapter Preview

Greater capability means greater responsibility. LLMs can produce bias, leak privacy, and be maliciously exploited—next chapter discusses AI Ethics & Safety.

Built with VitePress | Software Systems Atlas