Software Systems Atlas

Metadata Card

Prerequisites: ch04 ML Foundations, Vol 10 Linear Algebra
Estimated time: 45 minutes
Core difficulty: Advanced
Reading mode: High focus
Completion: Able to implement K-means and PCA, understand the basic principles of dimensionality reduction and anomaly detection

Your Progress

You've spent so much time in the Model Workshop that you're used to preparing labeled data for every training session—every record tagged with the correct answer. But in the corner of the workshop, there's a pile of never-labeled archives: tens of thousands of behavioral logs, thousands of unclassified drawings, a heap of ungrouped task records.

Ahua wrote in her letter: "You can see patterns on your own too, right?"

You look toward the workshop's "unsupervised" workbench.

Your Task

Unsupervised learning mines intrinsic patterns in data without labels. You learn three core tasks: clustering (discovering natural groupings), dimensionality reduction (compressing data without losing structure), and anomaly detection (finding outliers). All three share the same belief: data itself contains structure—it doesn't need external answers to define it.

Chapter Layers
Required: K-means clustering, PCA principal component analysis, simple anomaly detection
Optional: DBSCAN density clustering, t-SNE visualization
Advanced: Spectral clustering and Laplacian eigenmaps

Breaking Through · Tracing the Origin

You have a warehouse filled with thousands of items. No labels telling you "this is a tool" or "this is food"—but you notice iron items mostly sit in one pile, wooden ones in another. They "clustered" on their own.

Clustering is discovering these natural aggregations in feature space. And if you find an item far away from all piles—it might be broken, or simply doesn't belong here (anomaly detection).

K-means Clustering

K-means is the most intuitive clustering algorithm. You specify K groups, and the algorithm iteratively optimizes:

Randomly initialize K centroids
Assign each point to the nearest centroid
Update each centroid to the mean of its assigned points
Repeat until convergence

The workshop corner has piles of unlabeled data. K-means is the first tool for finding intrinsic groupings: initialize centers, assign, update—two alternating steps until convergence. No labels needed to see data automatically split into clusters.

python

import numpy as np

class KMeans:
    def __init__(self, n_clusters=3, max_iter=100):
        self.n_clusters = n_clusters
        self.max_iter = max_iter

    def fit(self, X):
        # Randomly initialize centroids
        idx = np.random.choice(len(X), self.n_clusters, replace=False)
        self.centroids = X[idx]

        for _ in range(self.max_iter):
            # Assignment step
            distances = np.array([
                np.sum((X - c) ** 2, axis=1) for c in self.centroids
            ])
            self.labels = np.argmin(distances, axis=0)

            # Update step
            new_centroids = np.array([
                X[self.labels == k].mean(axis=0)
                for k in range(self.n_clusters)
            ])

            if np.allclose(self.centroids, new_centroids):
                break
            self.centroids = new_centroids

    def predict(self, X):
        distances = np.array([
            np.sum((X - c) ** 2, axis=1) for c in self.centroids
        ])
        return np.argmin(distances, axis=0)

K-means problem: K must be specified in advance. The elbow method (plotting within-cluster sum of squares for different K) and silhouette score can help choose K. If clusters aren't convex (e.g., ring-shaped), K-means completely fails—that's when you need DBSCAN.

python

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)
score = silhouette_score(X, labels)
print(f"Silhouette score: {score:.3f}")  # closer to 1 is better (compact clusters, well-separated)

PCA: Principal Component Analysis

The core motivation for dimensionality reduction: high-dimensional data has two problems—the curse of dimensionality (required sample size grows exponentially with dimensions) and visualization difficulty (hard to plot in high dimensions).

PCA finds the directions of maximum variance (principal components) and projects data onto lower dimensions.

The mathematical core of PCA is singular value decomposition of the covariance matrix. Take the top K eigenvectors as new coordinate axes, and the data is reduced to K dimensions.

python

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Use PCA to reduce to 2 dimensions for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
# plt.show()

# View variance explained by each principal component
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Cumulative explained variance: {pca.explained_variance_ratio_.cumsum()}")

The mathematical core of PCA: perform SVD on the covariance matrix, take the eigenvectors corresponding to the top K eigenvalues. These eigenvectors are the directions of maximum variance in the data.

PCA is not feature selection—it's feature transformation. Each principal component is a linear combination of original features—interpretability decreases, but information retention is higher.

Anomaly Detection

Anomaly detection is finding "different" samples. A simple method: if a point's distance to its nearest cluster center exceeds a threshold, it might be an anomaly.

python

# Simple distance-based anomaly detection
from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(contamination=0.1, random_state=42)
anomaly_labels = iso_forest.fit_predict(X)  # -1 for anomalies
anomaly_score = iso_forest.decision_function(X)

n_anomalies = sum(anomaly_labels == -1)
print(f"Detected {n_anomalies} anomaly points")

Isolation Forest is a more efficient anomaly detector. Its idea is clever: anomalous points are easier to isolate. In randomly partitioned feature space, anomalous points typically need fewer splits to be isolated.

Common Pitfalls

K-means is sensitive to initialization. Different initial centroids yield different results. Run multiple times and pick the best (k-means++ initialization helps).
K-means assumes clusters are of similar size—a large cluster next to a small one will "absorb" the small cluster.
PCA assumes that directions of maximum variance are most informative—but small variance can also contain important discriminative information.
What contamination parameter to set in anomaly detection? If the true anomaly rate is 1% but you set 10%, you'll get many false positives.
Visualization after dimensionality reduction can be misleading—PCA only preserves variance information, not all structural relationships in the original data.

Pass Challenges

Warm-up (10 min): Use sklearn.datasets.make_blobs to generate 3 clusters of data, run K-means, print the silhouette score. Try K=2,3,4,5 and see which K is best.
Challenge (30 min): Run PCA on MNIST (handwritten digits). Reduce to 2 dimensions and visualize—observe whether points of the same digit cluster together and different digits separate.
Observation: In anomaly detection, change a normal point's feature value to an extreme value (e.g., from 0.5 to 50) and observe whether it gets flagged as anomalous.

Traveler's Notes

Unsupervised learning is learning without standard answers. Clustering reveals grouping structures in data, dimensionality reduction compresses information while preserving core semantics, and anomaly detection draws a fuzzy line between "normal" and "anomalous." All three share a common prerequisite: the data itself must have enough structure.

-> Next Chapter Preview

No matter what model you use, you face the same question: how do you know if the model is good? Next chapter, the toolbox for evaluation and tuning.