Software Systems Atlas

Metadata Card

Prerequisites: Chapter 8 — Probability Theory Basics, Chapter 11 — Calculus Basics
Estimated time: 50 minutes
Core difficulty: Intermediate
Completion marker: Understand the concept of information entropy, understand sources of floating-point error

Your Progress

The top floor of the Applied Tower — and the pinnacle of the entire Mathematics Tower. There are no symbols, no graphics here — just a single slowly rotating number. "Information can be measured. Error can be controlled. The final lesson of this tower is to help you understand — mathematical computation on real computers is not perfect," says the Librarian.

Your Task

How much "information" does a message contain? Why is the limit of lossless compression a specific theoretical value? In your program, 0.1 + 0.2 != 0.3: what is the mathematical root of this behavior? This chapter closes the tower with two topics. Information theory tells you that information is "the reduction of uncertainty," a measurable quantity. Numerical analysis tells you that under finite precision, mathematical formulas need re-examination.

Chapter Layers
Required reading: Information entropy, floating-point principles and error
Optional reading: Cross-entropy and KL divergence, numerical stability

Breakthrough · Origin Story

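Your deep learning model training keeps diverging. The loss decreases for a while and then becomes NaN. You scour the code and find the cause: gradient explosion. Why does an exploding gradient end in NaN? Because floating-point numbers have a finite range. Mathematically, the loss can approach 0 and intermediate values can grow without bound, but in floating-point arithmetic a value that is too small underflows to 0 and a value that is too large overflows to infinity. Operations on those results, such as infinity minus infinity or 0 times infinity, then produce NaN.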
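A minimal Python sketch of how these limits surface with ordinary double-precision floats. The constants are chosen only for illustration and are not taken from any particular training run:

```python
import math

big = 1e308 * 10           # exceeds the largest double, so it overflows to inf
tiny = 1e-200 * 1e-200     # smaller than any representable double, underflows to 0.0

# Once inf is in play, ordinary arithmetic can produce NaN.
nan_from_sub = big - big   # inf - inf has no meaningful value
nan_from_mul = big * tiny  # inf * 0.0 has no meaningful value either

print(big, tiny, nan_from_sub, nan_from_mul)
print(math.isnan(nan_from_sub), math.isnan(nan_from_mul))
```

NaN then propagates through every later operation, which is why a single overflow can poison an entire loss value.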

Information entropy H(X) is the core quantity of information theory — it measures the "uncertainty" of a random variable X:

H(X) = -Σ P(xᵢ) × log₂ P(xᵢ)

The unit is bits. Entropy is highest when all outcomes are equally probable, because that is the state of maximum uncertainty. The "most random coin" is the fair one (50/50), with entropy = 1 bit. A "biased coin" (90/10) has lower entropy, about 0.47 bits, because the outcome is more predictable.
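The two coins can be checked directly. This is a small sketch of the formula above; the function name entropy_bits is chosen here for illustration:

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution given as probabilities."""
    # Outcomes with probability 0 contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair = entropy_bits([0.5, 0.5])
biased = entropy_bits([0.9, 0.1])

print(fair)              # 1.0
print(round(biased, 3))  # 0.469
```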

The formula also matches intuition about forecasts. If the forecast says "90% chance of rain tomorrow," then learning tomorrow's actual weather gives you less information on average than it would under a forecast of "50% chance of rain," because under the 90% forecast the outcome is already largely predictable.

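Cross-entropy H(P, Q) = -Σ P(x) × log₂ Q(x) is the standard loss function for classification problems in ML. It compares the true distribution P (usually one-hot encoded) with the predicted distribution Q. With a one-hot P, the sum collapses to -log₂ Q(correct class), so the loss is small only when the model assigns high probability to the correct class. The smaller the cross-entropy, the better the prediction. In practice, ML libraries usually use the natural logarithm, which only changes the unit from bits to nats.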
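As a rough illustration, the sketch below evaluates the cross-entropy formula for a three-class problem. The function name and the two example predictions are made up for this example. Note that it would raise an error if Q assigned probability 0 to the correct class, which is one reason real implementations clip or work with log-probabilities.

```python
import math

def cross_entropy_bits(p, q):
    """H(P, Q) = -sum P(x) * log2 Q(x), skipping outcomes where P(x) is 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

true_dist = [0.0, 1.0, 0.0]     # one-hot: the middle class is correct
good_pred = [0.05, 0.90, 0.05]  # confident and correct
poor_pred = [0.40, 0.20, 0.40]  # puts little mass on the correct class

ce_good = cross_entropy_bits(true_dist, good_pred)
ce_poor = cross_entropy_bits(true_dist, poor_pred)
print(ce_good, ce_poor)         # the better prediction has the smaller value
```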

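How floating-point numbers work in computers: The IEEE 754 standard represents a single-precision floating-point number with 1 sign bit, 8 exponent bits, and 23 mantissa bits. Double precision, which Python's float uses, follows the same layout with 1 sign bit, 11 exponent bits, and 52 mantissa bits.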

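While most programmers don't need to manually compute floating-point encodings, understanding the formula below explains a bug you've definitely encountered: why 0.1 + 0.2 != 0.3. The answer lies in the alignment of exponents and the rounding of significant digits to a fixed number of bits. The formula applies to normal single-precision values: mantissa is the 23 stored bits read as a binary fraction in [0, 1), exponent is the stored 8-bit integer, and 127 is the exponent bias.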

value = (-1)^sign × (1 + mantissa) × 2^(exponent - 127)
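To see the formula at work, the following sketch pulls the three fields out of a single-precision value with the standard struct module and rebuilds the number from them. The helper names are hypothetical, and rebuild_normal deliberately ignores zeros, subnormals, infinities, and NaN:

```python
import struct

def decode_float32(x):
    """Round x to single precision and return (sign, exponent, fraction) as integers."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF  # 8 stored exponent bits
    fraction = bits & 0x7FFFFF      # 23 stored mantissa bits
    return sign, exponent, fraction

def rebuild_normal(sign, exponent, fraction):
    """Apply the formula above; valid only when 0 < exponent < 255."""
    mantissa = fraction / 2**23     # stored bits read as a fraction in [0, 1)
    return (-1) ** sign * (1 + mantissa) * 2.0 ** (exponent - 127)

fields = decode_float32(0.15625)    # 0.15625 = 1.25 * 2**-3
print(fields)                       # (0, 124, 2097152)
print(rebuild_normal(*fields))      # 0.15625
```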

The main problem with this representation: numbers are not uniformly distributed on the number line. They are very dense near zero and increasingly sparse away from zero.
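The uneven spacing can be measured. The sketch below uses math.ulp, which needs Python 3.9 or later, and works on Python's double-precision floats rather than the single-precision layout described above. The principle is the same in both:

```python
import math

# math.ulp(x) is the gap between x and the next representable double.
for x in (1e-10, 1.0, 1e10, 1e16):
    print(x, math.ulp(x))

# Near 1e16 the gap is 2.0, so adding 1.0 changes nothing.
print(1e16 + 1.0 == 1e16)  # True
```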

The truth about 0.1 + 0.2 ≠ 0.3: 0.1 in binary is an infinitely repeating fraction (just as 1/3 in decimal is 0.333...), and so are 0.2 and 0.3. The computer must round each one to the nearest representable value. When the two rounded values are added, the sum is rounded again, and the result lands on a different representable number than the one nearest to 0.3.

Try this in a Python interactive terminal. Your first reaction might be that the language has a bug, but the same result appears in every language that uses IEEE 754 binary floating point. It is a consequence of the representation, not a defect in Python.

# Python
>>> 0.1 + 0.2
0.30000000000000004
>>> 0.1 + 0.2 == 0.3
False

The underlying fact is simple: the "real numbers" in a computer are actually a finite set of rational numbers, each with limited precision.
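One way to inspect those rational numbers is the standard fractions module, which converts a float to the exact fraction it stores. A brief sketch:

```python
from fractions import Fraction

# Fraction(float) converts the stored binary value exactly, with no rounding.
stored = Fraction(0.1)
print(stored)                     # an integer over a power of two
print(stored == Fraction(1, 10))  # False: the stored value is not exactly 1/10

# The computed sum and the literal 0.3 are two different stored values.
gap = Fraction(0.1 + 0.2) - Fraction(0.3)
print(gap, float(gap))
```

The gap is tiny but not zero, which is exactly what the == comparison detects.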

Numerical stability teaches you how to avoid amplifying errors during computation:

Avoid subtracting nearly equal numbers: a - b when a ≈ b dramatically amplifies relative error (catastrophic cancellation).
Avoid division by zero or very small quantities.
Shift inputs before exponentiating: softmax implementations typically subtract the maximum value (x_i - max(x)) before computing e^x, which prevents overflow in the exponential without changing the result.
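The first rule is easiest to appreciate with numbers. The sketch below computes sqrt(x + 1) - sqrt(x) in two algebraically equivalent ways. The expression and the value of x are chosen only to illustrate cancellation:

```python
import math

def naive(x):
    # Subtracts two nearly equal square roots, so most significant digits cancel.
    return math.sqrt(x + 1) - math.sqrt(x)

def stable(x):
    # Same quantity after multiplying by the conjugate: the subtraction is gone.
    return 1 / (math.sqrt(x + 1) + math.sqrt(x))

x = 1e15
print(naive(x))   # only the first digit or so is trustworthy
print(stable(x))  # accurate to nearly full double precision
```

Rewriting a formula so that the dangerous subtraction never happens is usually more effective than trying to repair the result afterward.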

Common Pitfalls

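Directly comparing floating-point numbers for equality. Use a tolerance instead of a == b: abs(a - b) < epsilon works when the values are near 1, while a relative tolerance (for example, Python's math.isclose) scales with the magnitude of the values.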
Equating information-theoretic "entropy" with thermodynamic entropy. They are related but distinct concepts. Information entropy was defined by Claude Shannon in his 1948 paper "A Mathematical Theory of Communication," and it does not involve physical temperature.
Thinking "double precision is enough." For iterative algorithms (such as numerical integration or ODE solving), rounding errors can accumulate to unacceptable levels. Compensated summation algorithms such as Kahan summation can mitigate this but not eliminate it.
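The first and third pitfalls can be sketched together. The code below is an illustrative version of Kahan summation, not a tuned implementation, and it uses math.fsum from the standard library as the reference value. A plain loop stands in for naive summation because recent Python versions may already compensate inside the built-in sum():

```python
import math

def naive_sum(values):
    total = 0.0
    for v in values:
        total += v  # every addition rounds, and the errors pile up
    return total

def kahan_sum(values):
    """Compensated summation: carry the lost low-order bits into the next step."""
    total = 0.0
    compensation = 0.0                  # running estimate of the rounding error
    for v in values:
        y = v - compensation            # apply the correction from the last step
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was just lost
        total = t
    return total

values = [0.1] * 1_000_000
reference = math.fsum(values)           # correctly rounded sum
naive_err = abs(naive_sum(values) - reference)
kahan_err = abs(kahan_sum(values) - reference)
print(naive_err, kahan_err)

# Tolerance-based comparison instead of ==
print(math.isclose(0.1 + 0.2, 0.3))     # True
```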

Challenge Questions

Compute the information entropy of a fair 4-sided die.
In Python, verify the floating-point error of 0.1 + 0.2, and use the Decimal type from the decimal module to implement precise decimal arithmetic.
Implement a numerically stable softmax function: given an input vector x, output the component-wise e^x_i / Σ e^x_j. Be sure to subtract max(x) to prevent overflow.

Traveler's Notes

Information entropy defines the theoretical limit of "how much you can compress." Floating-point arithmetic is the error tax you pay with every computation. Numerical analysis teaches you how to prevent error from destroying the entire result — this is the final lesson from theory to practice.

→ The Mathematics Tower ends here. You started from logic and sets, passed through proof, recursion, relations and functions, then combinatorics, graph theory, probability, and number theory, finally ascending to linear algebra, calculus, information theory, and numerical computation. The Librarian stands at the peak, gazing into the distance: "Mathematics doesn't end with Volume 12 — every new tower will need it. When you're ready, continue onward."