Chapter 11: Pre-training & Fine-tuning

Metadata Card

Prerequisites: ch10 Transformer, Vol 12 Text Data Processing
Estimated time: 50 minutes
Core difficulty: Advanced/In-depth
Reading mode: High focus
Completion: Able to explain Tokenization principles, understand the difference between BERT vs GPT, complete a Prompt Engineering experiment

Your Progress

The attention-based Transformer is running in the workshop. But training a Transformer requires massive amounts of data—far more than you could collect.

You were worrying about this when Ahua sent a hard drive from afar. Inside was a pre-trained base model, with a note attached: "I trained a basic version—you should be able to just adapt it for your needs."

This is the pre-training and fine-tuning paradigm.

Your Task

Pre-training and Fine-tuning took deep learning from "train separately for each task" to "a single base model serving multiple tasks." You start with Tokenization, go through BERT's encoder paradigm to GPT's decoder paradigm, and finally master Prompt Engineering—a new way of interacting with pre-trained models.

Chapter Layers
Required: Tokenization, BERT/GPT pre-training objectives, Prompt Engineering patterns
Optional: LoRA parameter-efficient fine-tuning, RLHF alignment introduction
Advanced: Pre-training data composition and deduplication strategies

Breaking Through · Tracing the Origin

Before you are dozens of TBs of text—articles, books, code from the entire internet. This data has no labels (nobody tagged them "what this article is about"), but they contain the full structure of language.

If you let the model first "self-study" on this data (pre-training), it can learn the basic rules of language—word meanings, grammar, common sense. Then you only need a small number of "teaching examples" (fine-tuning) to adapt it to specific tasks. This is the birth of the "Foundation Model."

Tokenization

Computers don't understand "text," only numbers. Tokenization converts text into integer sequences.

The first step in the Model Workshop for processing natural language isn't training—it's chopping up human-written text into fragments the model can chew on. Each token is a numeric ID, and the entire text becomes an array of integers.

python

# Using Hugging Face Tokenizers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "The cat sat on the mat."
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)

print(f"Original: {text}")
print(f"Tokens: {tokens}")
print(f"IDs:   {ids}")

Three types of tokenizers:

Word-based: Split by spaces/punctuation. Simple but huge vocabulary—"run""runs""running" are three different words.
Character-based: Split by characters. Small vocabulary but longer sequences, lower learning efficiency.
Subword (BPE/WordPiece/SentencePiece): Statistically find the most common subword segments as tokenization units. High-frequency words stay complete, rare words get split. BERT uses WordPiece, GPT uses BPE.

The first step at the large language model workbench in the Model Workshop is converting text into numbers. Choosing which tokenization method directly affects the vocabulary range the model can handle—Subword is the current mainstream choice.

# BPE tokenization example: "unhappiness" → ["un", "happiness"]
# "playing" → ["play", "##ing"]  (WordPiece's ## means continuation)

BERT: Bidirectional Encoder Pre-training

BERT (2018) uses Transformer Encoder, seeing both left and right context in all layers. Pre-training objectives:

MLM (Masked Language Model): randomly mask 15% of tokens, let the model predict what's masked
NSP (Next Sentence Prediction): whether two text segments are adjacent (found less important in later research)

The hard drive Ahua sent contained exactly this kind of pre-trained model. BERT learns general language representations from unlabeled text—what word is masked? Do adjacent sentences connect? These aren't 'tasks' in the traditional sense—they're self-supervised methods for the model to understand language structure.

python

from transformers import BertForSequenceClassification, BertTokenizer

model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

text = "This movie was fantastic!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs)
print(f"Logits: {outputs.logits}")

BERT fine-tuning: add a classifier head on top of the pre-trained model, do a small amount of full parameter updates on labeled data. Since the model already learned language representations during pre-training, fine-tuning typically only needs a few hundred to a few thousand labeled samples.

GPT: Autoregressive Decoder Pre-training

GPT (2018~2023 series) uses Transformer Decoder (or decoder-only causal language model), unidirectionally predicting the next token from left to right. The pre-training objective is standard Language Modeling (LM): given the preceding context, predict the next token.

# GPT's pre-training loss:
# L = -sum_t log P(token_t | token_1, ..., token_{t-1})

Key differences:

Feature	BERT	GPT
Architecture	Encoder-only	Decoder-only
Context direction	Bidirectional (full visibility)	Unidirectional (left only)
Pre-training objective	MLM + NSP	Autoregressive LM
Suitable tasks	Classification/Understanding/Labeling	Generation/Dialogue/Continuation
Typical models	BERT, RoBERTa, DeBERTa	GPT-2/3/4, LLaMA, Mistral

Prompt Engineering

The new paradigm brought by GPT series models: you don't fine-tune the model, you design the input (Prompt) to guide the generated output.

python

from transformers import pipeline

generator = pipeline('text-generation', model='gpt2')

# Zero-shot: ask directly
prompt = "Translate to French: 'The cat is on the table.'"
result = generator(prompt, max_length=50, num_return_sequences=1)
print(result[0]['generated_text'])

# Few-shot: give examples then ask
prompt = """
English: Hello -> French: Bonjour
English: Goodbye -> French: Au revoir
English: Thank you -> French:"""

result = generator(prompt, max_length=30, num_return_sequences=1)
print(result[0]['generated_text'])

Core patterns of Prompt Engineering:

Zero-shot: direct instruction, no examples given
Few-shot / In-Context Learning: put a few examples in the prompt, the model "learns on the spot"—no weight updates
Chain-of-Thought (CoT): "Let's think step by step" → guide the model to output its reasoning process
System Prompt: set the model's role and behavior guidelines

This paradigm upends the traditional "train → deploy" flow—you can now control model behavior with just strings.

Common Pitfalls

Implicit bias in Tokenization: different languages have different tokenization efficiency (English: 1~2 tokens per word, Chinese: ~1 token per character), which affects inference cost.
BERT's MLM pre-training needs to compute [MASK] positions and can't handle generation tasks. GPT uses the same autoregressive objective for both generation and fine-tuning, which is simpler.
Prompts are extremely sensitive to wording. Changing "Translate this to Chinese" to "Please translate the following sentence" might yield completely different outputs.
Limited context window (GPT-3: 2048 tokens → GPT-4: 32K → newer models: 128K~1M), content beyond the window is completely invisible.
Few-shot quality matters more than quantity: 3 carefully chosen examples may be more effective than 10 random ones.

Pass Challenges

Warm-up (10 min): Use HuggingFace's AutoTokenizer to tokenize the same English text with bert-base-uncased and gpt2. Compare token count differences.
Challenge (40 min): Find a text classification dataset on huggingface.co/datasets (e.g., IMDB sentiment analysis), fine-tune DistilBERT, reaching 90%+ accuracy.
Observation: Design 3 different prompts for GPT-2 to complete the same task (e.g., summarizing a text). Observe which prompt is more stable.

Traveler's Notes

The "pre-training + fine-tuning" paradigm evolved NLP from training separate models for each task to sharing a single large foundation model. Tokenization is the first step, BERT vs GPT are two different pre-training strategies, and Prompt Engineering creates a new interaction paradigm of "conversing with models." The stronger the model, the less fine-tuning needed.

-> Next Chapter Preview

Pre-training and fine-tuning are just the beginning. Deploying LLMs to the real world requires alignment, knowledge updates, and tool use—RLHF, RAG, and Agents are the answers.