Software Systems Atlas

Metadata Card

Prerequisites: Chapter 1 (encryption & hashing), Chapter 3 (authentication & authorization), basic legal concepts
Estimated time: 45 minutes
Core difficulty: Advanced
Completion mark: Can explain GDPR's core principles and legal bases for data processing, can distinguish pseudonymization from anonymization, can design a privacy-compliant data retention strategy

Your Progress

You've come a long way from encryption spells, through station application security, operating formation isolation, and magic conduit monitoring. You've erected every line of defense on the magic courier routes.

But there's one issue more subtle than SQL injection: the data you store—coordinate information in teleport logs, messenger names and contact details, timestamps in communication records—this data has value in itself, and it's associated with specific individuals.

This isn't a spell vulnerability problem. It's a privacy protection problem. If the courier routes are breached, attackers gain not just system control but all the personal data stored there. Worse, even if the routes are never breached, collecting data you have no right to collect already violates the privacy regulations of the Wizards' Council.

Your Task

Understand the core legal framework for data privacy (GDPR), technical implementation methods (encrypted storage, pseudonymization, anonymization), and data lifecycle management (collection, use, retention, deletion). These aren't just "legal team matters"—engineering decisions directly affect data protection capabilities.

Chapter Layers
Required: GDPR's seven principles, data subject rights, pseudonymization vs anonymization, data retention strategies
Optional: Data Protection Impact Assessment (DPIA), Data Portability
Advanced: Differential Privacy basics, Homomorphic Encryption basics

Breaking Ground · Tracing the Origin

Problem: Your patrol report system stores GPS coordinates for each patrol route, the executing sentry's badge ID, report time, and notes. You think this is "normal business need"—until one day a sentry asks: "Why are you storing my patrol routes? Can I delete data older than 3 months?"

You don't know the answer. You can't figure it out—the data is stored properly, doesn't take up much space, why delete it?

But the issue isn't that simple:

This data is associated with specific individuals (through badge ID -> real name)
You can't answer "why store this data"
You can't answer "how long is this data kept"
You can't answer "how to process a user's deletion request"
If the database is leaked, this data directly reveals specific individuals' daily activity patterns

First Piece: GDPR Core Principles

GDPR (General Data Protection Regulation) is the EU's data protection regulation; it has applied since 25 May 2018. Its reach extends far beyond the EU: under Art.3 it also covers organizations outside the EU that offer goods or services to people in the EU or monitor their behavior.

GDPR's seven core principles (Art.5):

Principle	Meaning	Engineering Impact
Lawfulness, fairness, transparency	Clear legal basis for collecting data; tell users what you're collecting	Identify a legal basis (consent is only one option), write a privacy policy
Purpose limitation	Data only used for the purpose it was collected	Don't use data across purposes
Data minimization	Only collect data relevant to the purpose	Store less, not more
Accuracy	Data should be accurate and up to date	Provide data modification features
Storage limitation	Data only kept for as long as necessary	Design automatic data deletion policies
Integrity and confidentiality	Protect data with appropriate security measures	Encryption, access control, audit logs
Accountability	Can demonstrate compliance with above principles	Record data processing activities, do DPIA

Legal Bases (Art.6):

Under GDPR, you must have a valid legal basis to process personal data. Art.6 lists six (the two not shown here are vital interests and public task); the four you will meet most often:

Legal Basis	Applicable Scenario	Example
Consent	User actively agrees	Subscribing to email notifications
Contract	Data processing is part of a contract	Assigning patrol tasks requires sentry's badge ID
Legal Obligation	Law requires retention	Tax records kept for the statutory period (varies by jurisdiction)
Legitimate Interest	The controller's interest, balanced against the individual's rights	Anti-fraud detection (cannot be abused)

For non-essential data collection, you generally need the user's explicit consent. And consent must be:

Active: Not a pre-checked box
Informed: The user knows what they're consenting to
Revocable: The user can withdraw consent at any time
Easy to revoke: Revoking consent shouldn't be harder than giving it
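
The four consent properties above can be sketched as a small consent ledger. This is a minimal in-memory illustration, not a production design: `ConsentRecord`, `ConsentLedger`, and the field names are hypothetical, and a real system would persist the records and keep their history.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ConsentRecord:
    """One consent decision for one user and one purpose (hypothetical schema)."""
    user_id: str
    purpose: str            # e.g. "email_notifications"
    policy_version: str     # which notice text the user saw (informed)
    granted_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    withdrawn_at: Optional[datetime] = None

    @property
    def active(self) -> bool:
        return self.withdrawn_at is None


class ConsentLedger:
    def __init__(self) -> None:
        self._records: dict = {}  # (user_id, purpose) -> ConsentRecord

    def grant(self, user_id: str, purpose: str, policy_version: str) -> ConsentRecord:
        # Active: a record exists only after the user explicitly opts in
        record = ConsentRecord(user_id, purpose, policy_version)
        self._records[(user_id, purpose)] = record
        return record

    def withdraw(self, user_id: str, purpose: str) -> None:
        # Easy to revoke: one call, same shape as grant()
        record = self._records.get((user_id, purpose))
        if record is not None and record.active:
            record.withdrawn_at = datetime.now(timezone.utc)

    def has_consent(self, user_id: str, purpose: str) -> bool:
        # No record means no consent: there is no pre-checked default
        record = self._records.get((user_id, purpose))
        return record is not None and record.active
```

Checking `has_consent()` per purpose, rather than keeping one global flag, keeps consent tied to the specific purpose the user agreed to.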

Second Piece: Data Subject Rights

GDPR grants users eight data rights (relevant articles in parentheses):

1. Right to be informed (Art.13-14): Know what data is collected, why, and how long it's kept
2. Right of access (Art.15): Give me a copy of my own data
3. Right to rectification (Art.16): My age/phone is wrong, correct it
4. Right to erasure / "Right to be forgotten" (Art.17): Delete my data
5. Right to restrict processing (Art.18): Stop processing my data pending verification
6. Right to data portability (Art.20): Transfer my data to another service (structured, common, machine-readable format)
7. Right to object (Art.21): I don't want personalized recommendations
8. Rights related to automated decision-making (Art.22): Don't let algorithms decide for me automatically
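
Rights 2 and 6 (access and portability) both come down to producing a complete, machine-readable copy of one person's data. A minimal sketch, assuming every store can be read as a list of dictionaries carrying a `user_id` field; `export_user_data` and the `stores` layout are hypothetical:

```python
import json
from datetime import datetime, timezone


def export_user_data(user_id: str, stores: dict) -> str:
    """Collect everything held about one user into a single JSON document.

    `stores` maps a store name to a list of record dicts, each with a "user_id".
    """
    export = {
        "user_id": user_id,
        "exported_at": datetime.now(timezone.utc).isoformat(),
        "data": {
            name: [record for record in records if record.get("user_id") == user_id]
            for name, records in stores.items()
        },
    }
    # default=str keeps the export from failing on datetimes and similar values
    return json.dumps(export, indent=2, default=str)
```

The hard part in practice is not the serialization but knowing the full list of stores, which is the same inventory the erasure flow needs.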

As an engineer, you need to ensure the system can fulfill each of these rights. The right to erasure deserves particular attention, because it touches every place a copy of the data lives.

When a sentry requests deletion of all their patrol data, you can't just delete one record in the main database—backups, logs, caches, and search indexes, every copy must be handled:

python

# Designing data deletion (at the API level)
# Assumes FastAPI; `db`, `require_auth`, and `get_current_user` are
# application-specific helpers defined elsewhere.

from fastapi import Depends, FastAPI, HTTPException

app = FastAPI()

@app.delete("/api/v1/users/{user_id}/data")
@require_auth
async def delete_user_data(user_id: str, current_user = Depends(get_current_user)):
    """Implement GDPR 'Right to be forgotten'"""
    
    # Confirm the requester is the user themselves
    # (supporting an authorized agent would need an extra check here)
    if current_user.id != user_id:
        raise HTTPException(status_code=403, detail="Can only delete your own data")
    
    # 1. Delete personal data from production database
    db.execute("DELETE FROM user_profiles WHERE user_id = %s", (user_id,))
    
    # 2. Record the erasure so it can be re-applied whenever a backup is restored
    db.execute(
        "INSERT INTO deletion_queue (user_id, scheduled_for) VALUES (%s, NOW())",
        (user_id,)
    )
    
    # 3. Redact from logs (logs may need to be retained, but redact PII)
    db.execute(
        "UPDATE audit_logs SET user_id = 'ANONYMIZED', details = 'ANONYMIZED' "
        "WHERE user_id = %s",
        (user_id,)
    )
    
    # 4. Delete from search index (left commented out: depends on your search client)
    # await search_index.delete_document(user_id)
    
    # 5. Delete from cache (left commented out: depends on your cache client)
    # await cache.delete(f"user:{user_id}")
    
    # 6. Log deletion operation (for compliance audit); keep this record minimal
    db.execute(
        "INSERT INTO compliance_audit (action, target_user, timestamp) "
        "VALUES ('GDPR_DELETION', %s, NOW())",
        (user_id,)
    )
    
    return {"status": "deleted", "note": "Data will be permanently deleted after backup retention period"}

Note the challenge here: data in backups. You won't rewrite your backup tapes just because one user requests data deletion. GDPR's text does not single out backups, and guidance from regulators such as the UK ICO generally accepts that erased data may remain in backups until they expire on their normal schedule, provided it is put beyond use in the meantime. In practice this means ensuring that after a backup is restored, the deleted data does not reappear in production. The usual practice is to maintain a deletion manifest and clean up based on it after every restore.
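
A post-restore cleanup step might look like the sketch below. It uses `sqlite3` only so the example runs anywhere; the function name `apply_deletion_manifest` is hypothetical, and the table names are borrowed from elsewhere in this chapter. One assumption matters more than the code: the manifest has to be stored outside the backup being restored, otherwise restoring an old backup also rolls back the manifest.

```python
import sqlite3


def apply_deletion_manifest(conn, erased_user_ids, tables):
    """Re-apply recorded erasures to a freshly restored database.

    erased_user_ids: user IDs from the deletion manifest (kept outside the backup)
    tables: fixed allow-list of tables holding a user_id column, never user input
    Returns the number of rows removed.
    """
    removed = 0
    for table in tables:
        for user_id in erased_user_ids:
            cursor = conn.execute(
                f"DELETE FROM {table} WHERE user_id = ?", (user_id,)
            )
            removed += cursor.rowcount
    conn.commit()
    return removed


if __name__ == "__main__":
    # Simulate a restored backup that still contains an erased user
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user_profiles (user_id TEXT, name TEXT)")
    conn.execute("INSERT INTO user_profiles VALUES ('badge_10086', 'A. Sentry')")
    count = apply_deletion_manifest(conn, ["badge_10086"], ("user_profiles",))
    print(f"Removed {count} restored rows for erased users")
```

Running this as a mandatory step of the restore procedure, before production traffic is switched over, keeps the window in which erased data is visible as short as possible.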

Third Piece: Pseudonymization & Anonymization

The core tension in data protection: you want to analyze data, but the analysis should not be able to tie the data to specific individuals. Two technical approaches:

Pseudonymization:

python

# Pseudonymization: replace direct identifiers with stable pseudonyms
# Whoever holds the secret key (or a mapping table) can re-link pseudonyms to people

import hashlib
import hmac
import secrets

class Pseudonymizer:
    def __init__(self, key: bytes):
        # Secret key, unique per project. Load it from a secrets manager so it
        # survives restarts; a fresh key per process would change every pseudonym.
        self.key = key
    
    def pseudonymize(self, identifier: str) -> str:
        """Convert a user identifier into a keyed pseudonym"""
        # HMAC ensures same input gets same pseudonym (can correlate same user's records)
        # Without the key, the original value cannot be recovered or guessed from the
        # pseudonym, even when the algorithm is known
        return hmac.new(self.key, identifier.encode(), hashlib.sha256).hexdigest()[:16]

p = Pseudonymizer(key=secrets.token_bytes(32))  # demo only: key generated in-process

# Original: badge_10086, badge_10087, badge_10088
# After pseudonymization: three unrelated 16-character hex strings

# Advantage: You can still analyze based on the same user's records (same pseudonym)
# Disadvantage: If the key leaks, pseudonyms can be re-linked by hashing every candidate badge ID
#              If the attacker has multiple records, they can build a user profile
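
Tokenization is another pseudonymization technique: instead of deriving the pseudonym from the identifier, it assigns a random token and keeps an explicit mapping table. The sketch below is an in-memory illustration only; `TokenVault` and its methods are hypothetical names, and a real vault would be a separately secured, access-controlled store.

```python
import secrets


class TokenVault:
    """Reversible pseudonymization: random tokens plus a lookup table."""

    def __init__(self) -> None:
        self._to_token = {}
        self._to_identifier = {}

    def tokenize(self, identifier: str) -> str:
        # Reuse the existing token so the same user keeps the same pseudonym
        if identifier not in self._to_token:
            token = "tok_" + secrets.token_hex(8)
            self._to_token[identifier] = token
            self._to_identifier[token] = identifier
        return self._to_token[identifier]

    def detokenize(self, token: str) -> str:
        # Only code with access to the vault can go back to the real identifier
        return self._to_identifier[token]

    def forget(self, identifier: str) -> None:
        # Dropping the mapping leaves the token in datasets but removes
        # the vault's ability to re-link it
        token = self._to_token.pop(identifier, None)
        if token is not None:
            self._to_identifier.pop(token, None)
```

Because the tokens are random, nothing can be brute-forced from them; the whole risk concentrates in the vault. Note that calling `forget()` does not by itself make the remaining records anonymous, since quasi-identifiers in those records may still single a person out.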

Anonymization:

Anonymization is irreversible—processed data can no longer be linked back to a specific individual. Once truly anonymized, GDPR no longer applies.

python

# Aggregation + generalization is the most common anonymization method in practical engineering

from collections import defaultdict

def anonymize_location_data(records, min_k=5):
    """
    k-anonymity: release an (area, date) group only if at least k distinct users fall in it
    
    Input: user_id + precise GPS coordinates + timestamps
    Output: grid area identifier + date (no user, precise time, or precise location)
    """
    def generalize(lat, lng):
        # Snap coordinates to a 0.01-degree grid: about 1.1km north-south,
        # narrower east-west away from the equator
        grid_lat = round(lat, 2)
        grid_lng = round(lng, 2)
        return f"{grid_lat}_{grid_lng}"
    
    def generalize_time(ts):
        # Keep only the date, discard hours/minutes/seconds
        return ts.date().isoformat()
    
    # Group by generalized (area, date), remembering which users appear in each group
    users_in_group = defaultdict(set)
    generalized = []
    for record in records:
        key = (generalize(record["lat"], record["lng"]), generalize_time(record["timestamp"]))
        users_in_group[key].add(record["user_id"])
        generalized.append(key)
    
    # Enforce the k-anonymity condition: suppress groups with fewer than min_k
    # distinct users. Counting records is not enough, because one sentry can
    # produce many records in the same area on the same day.
    return [
        {"area": area, "date": date}  # user_id and other identifiers are dropped
        for area, date in generalized
        if len(users_in_group[(area, date)]) >= min_k
    ]

Pseudonymization vs Anonymization:

Feature	Pseudonymization	Anonymization
Reversibility	Recoverable (has mapping table or salt)	Irreversible
GDPR Applicable	Still applies (still personal data)	Does not apply
Analysis Use	Can track same user across records	Aggregate statistics only
Typical Techniques	Keyed hashing (HMAC), encryption, tokenization	k-anonymity, l-diversity, differential privacy
Risk	Key or salt leak = data re-identification	Re-identification attacks (multiple data sources correlation)
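
The re-identification risk in the last row can be measured before a dataset is released. A minimal sketch, with a hypothetical helper `smallest_group` and made-up sample records: it reports the k actually achieved, that is, the size of the smallest group of records sharing the same quasi-identifier values.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values.

    A result of 1 means at least one record is unique and can be singled out.
    """
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values())


# Made-up records with names already removed
records = [
    {"age_band": "30-39", "zip3": "100", "role": "sentry"},
    {"age_band": "30-39", "zip3": "100", "role": "sentry"},
    {"age_band": "30-39", "zip3": "100", "role": "captain"},
    {"age_band": "40-49", "zip3": "100", "role": "sentry"},
    {"age_band": "40-49", "zip3": "100", "role": "sentry"},
]

print(smallest_group(records, ["zip3"]))                      # 5
print(smallest_group(records, ["age_band", "zip3"]))          # 2
print(smallest_group(records, ["age_band", "zip3", "role"]))  # 1
```

Each added column shrinks the groups, and with all three the single captain is unique even though no name or badge ID is present.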

Differential Privacy:

A powerful version of anonymization—adding carefully calibrated noise to query results so that an attacker cannot infer whether a single record exists.

Apple, Google, and the U.S. Census Bureau already use differential privacy when collecting or publishing statistics.

python

# Differential privacy: adding Laplacian noise

import numpy as np

def epsilon_dp_query(true_count, epsilon=1.0, sensitivity=1):
    """
    epsilon differential privacy
    - Smaller epsilon = stronger privacy protection (more noise)
    - epsilon 0.1-1.0: Strong protection
    - epsilon 10+: Weak protection
    """
    # Laplacian noise
    noise = np.random.laplace(0, sensitivity / epsilon)
    noisy_count = true_count + noise
    return max(0, round(noisy_count))  # Count can't be negative

# Original data: how many sentries passed through the fortress north gate between 14:00-15:00 today?
true_count = 47  # Exact number

# When a user queries, return the noisy result
print(f"Actual count: {true_count}")
print(f"Differential privacy result: {epsilon_dp_query(true_count, epsilon=0.5)}")
print(f"Differential privacy result: {epsilon_dp_query(true_count, epsilon=0.5)}")
# Each query returns a different result, centered on the true value
# Caution: averaging many answers to the same query cancels the noise, so repeated
# queries must be limited or charged against a cumulative privacy budget

The cost of differential privacy: precision loss. The smaller the epsilon, the larger the noise, the less accurate the statistics.
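
There is a second cost: every answered query spends privacy. Under basic sequential composition the epsilons of successive queries add up, so a system usually tracks a total budget and refuses queries once it is spent. A minimal sketch, assuming a single global budget; `PrivacyBudget` and its method names are hypothetical:

```python
import numpy as np


class PrivacyBudget:
    """Track cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon: float, seed=None) -> None:
        self.remaining = total_epsilon
        self._rng = np.random.default_rng(seed)

    def noisy_count(self, true_count: int, epsilon: float, sensitivity: float = 1.0) -> int:
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        # Charge the budget before answering
        self.remaining -= epsilon
        noise = self._rng.laplace(0.0, sensitivity / epsilon)
        # Clamping and rounding are post-processing and do not weaken the guarantee
        return max(0, int(round(float(true_count + noise))))


if __name__ == "__main__":
    budget = PrivacyBudget(total_epsilon=1.0)
    print(budget.noisy_count(47, epsilon=0.5))
    print(budget.noisy_count(47, epsilon=0.5))
    # A third query at epsilon=0.5 would raise RuntimeError: the budget is spent
```

An alternative for frequently repeated questions is to compute the noisy answer once and cache it, so asking again reveals nothing new.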

Fourth Piece: Data Lifecycle Management

Data doesn't become more valuable the longer it's kept. Keeping it too long increases risk instead.

Collection → Use → Archive → Deletion
   │          │       │         │
   │          │       │         └── Compliant deletion (irrecoverable)
   │          │       │
   │          │       └── Cold storage (not used for daily operations)
   │          │
   │          └── Active use (normal business operations)
   │
   └── Data classification + legal consent

Data Retention Policy Example (with GDPR context):

python

# Data retention policy pseudocode
DATA_RETENTION_POLICY = {
    "patrol_reports": {
        "retention_days": 365,  # Keep at most one year
        "anonymize_after_days": 90,  # Auto-anonymize after 90 days
        "legal_basis": "legitimate_interest",
        "purpose": "Patrol route analysis",
    },
    "audit_logs": {
        "retention_days": 730,  # Security audit logs kept for 2 years
        "anonymize_after_days": None,  # Don't anonymize—audit needs integrity
        "legal_basis": "legal_obligation",
        "purpose": "Security audit and compliance",
    },
    "user_sessions": {
        "retention_days": 30,  # Session logs kept at most one month
        "anonymize_after_days": 7,
        "legal_basis": "consent",
        "purpose": "User activity records",
    },
}

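def apply_retention_policy():
    """Scheduled task: clean up expired data (MySQL-flavored SQL; `db` is defined elsewhere)"""
    for table, policy in DATA_RETENTION_POLICY.items():
        # Table names come from the constant above, never from user input
        # Delete data past retention period. This is a hard delete: a soft-delete
        # flag would leave the personal data in place.
        if policy["retention_days"]:
            db.execute(f"""
                DELETE FROM {table}
                WHERE created_at < NOW() - INTERVAL %s DAY
            """, (policy["retention_days"],))
        
        # Redact identifiers from records past the threshold
        # (assumes the table has user_id, ip_address, and details columns)
        if policy.get("anonymize_after_days"):
            # [.] matches a literal dot without backslash escaping;
            # %% is a literal percent sign under %s-style parameters
            db.execute(f"""
                UPDATE {table}
                SET user_id = 'ANONYMIZED',
                    ip_address = NULL,
                    details = REGEXP_REPLACE(
                        details,
                        '([a-zA-Z0-9._%%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{{2,}}|'
                        '[0-9]{{1,3}}[.][0-9]{{1,3}}[.][0-9]{{1,3}}[.][0-9]{{1,3}})',
                        '[REDACTED]'
                    )
                WHERE created_at < NOW() - INTERVAL %s DAY
                  AND user_id != 'ANONYMIZED'
            """, (policy["anonymize_after_days"],))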

This scheduled task can run in cron or periodic workflows. The key point: automate execution, don't rely on manual operations.
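
Automation only helps if the policy itself is sound, so it can be worth checking the policy dictionary before the job is allowed to run. A minimal sketch, assuming the same dictionary layout as the retention policy in this section; `validate_retention_policy` and `ALLOWED_LEGAL_BASES` are hypothetical names, and the allowed values mirror the six legal bases of Art.6:

```python
ALLOWED_LEGAL_BASES = {
    "consent", "contract", "legal_obligation",
    "vital_interests", "public_task", "legitimate_interest",
}


def validate_retention_policy(policy: dict) -> list:
    """Return human-readable problems; an empty list means the policy passes."""
    problems = []
    for table, rule in policy.items():
        retention = rule.get("retention_days")
        anonymize = rule.get("anonymize_after_days")
        if not retention or retention <= 0:
            problems.append(f"{table}: no finite retention period")
        elif anonymize is not None and anonymize >= retention:
            problems.append(f"{table}: anonymization would run after deletion")
        if rule.get("legal_basis") not in ALLOWED_LEGAL_BASES:
            problems.append(f"{table}: unknown legal basis {rule.get('legal_basis')!r}")
        if not rule.get("purpose"):
            problems.append(f"{table}: no documented purpose")
    return problems
```

Running such a check in CI means a new table cannot be added to the policy without a retention period, a legal basis, and a stated purpose.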

Data Protection Impact Assessment (DPIA):

GDPR (Art.35) requires a DPIA to be conducted in advance for data processing that is likely to pose high risks to individuals' rights and freedoms.

The DPIA answers these questions:

What data is being processed? How?
Necessity and proportionality (why is this data being processed?)
What are the risks to individual privacy?
What mitigation measures are in place?

Scenarios requiring DPIA:

Automated evaluation of personal characteristics (e.g., credit scoring)
Large-scale processing of special category data (health, religion, etc.)
Systematic large-scale monitoring of publicly accessible areas
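
The three scenarios above can be turned into a first-pass screening check for new features. This is a rough sketch and not a legal determination: `ProcessingActivity` and `dpia_triggers` are hypothetical names, the flags must be answered honestly by whoever proposes the feature, and other high-risk processing outside these three cases can also call for a DPIA.

```python
from dataclasses import dataclass


@dataclass
class ProcessingActivity:
    """Self-declared description of a planned processing activity (hypothetical)."""
    name: str
    automated_evaluation_of_individuals: bool = False
    large_scale_special_category_data: bool = False
    large_scale_public_monitoring: bool = False


def dpia_triggers(activity: ProcessingActivity) -> list:
    """Return which of the three listed scenarios apply; empty means none matched."""
    triggers = []
    if activity.automated_evaluation_of_individuals:
        triggers.append("automated evaluation of personal characteristics")
    if activity.large_scale_special_category_data:
        triggers.append("large-scale processing of special category data")
    if activity.large_scale_public_monitoring:
        triggers.append("systematic large-scale monitoring of a publicly accessible area")
    return triggers
```

An empty result only means none of the listed scenarios matched; it does not prove that a DPIA is unnecessary.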

Common Pitfalls

"Data minimization" ignored. The most common violation: collecting all possible data because "it might be useful later." GDPR explicitly requires only collecting data necessary for the processing purpose.
Forgetting backups and logs during deletion. A user requests data deletion; you delete the record in the production database, but backup tapes and access logs still contain the original data. A true "right to be forgotten" implementation must cover all data copies.
Thinking anonymization is just removing names and emails. Removing only explicit identifiers (name, email) is not anonymization. Combining quasi-identifiers (age, gender, zip code, job title) can re-identify individuals with high probability. Netflix released "anonymized" rating data in 2006, and researchers re-identified users by correlating it with IMDb data.
Treating pseudonymization and encryption as "foolproof." Pseudonymized data is still personal data under GDPR. If an attacker can correlate it back to individuals through other data sources, pseudonymization loses its meaning.
No records of data processing. GDPR requires organizations to document data processing activities (Art.30 record-keeping obligation). Not knowing what data is in your system, where it flows, and who can access it—that's non-compliance.
Privacy policies written as legal documents. Users rely on privacy policies to understand how their data is processed, but many policies are written as "disclaimers"—obscure and hard to understand. GDPR requires "clear and plain language" (Art.12).

Pass Challenges

Warm-up: Review a project you recently participated in. List all places in the system that store personal data (database tables, logs, caches, search indexes, backups). Does each have a corresponding retention policy?
Challenge: Design a complete data lifecycle management plan for your current project. Include: (1) Data classification (which columns are sensitive?) (2) Retention schedule (how long each type is kept?) (3) Auto-cleanup script (Python or SQL scheduled task) (4) Pseudonymization strategy (how to handle sensitive data in logs and backups?)
Troubleshoot: A user requests deletion of all personal data. Your system includes: PostgreSQL (primary + hot standby + cold backup for 30 days), MongoDB (operation logs), Elasticsearch (full-text search), S3 (user-uploaded files), Redis (24-hour cache expiry). What should your deletion plan cover? How do you handle "data exists but I don't know where it is"?
Observe: In Developer Tools > Application > Storage > Cookies and Local Storage, view what data a website stores. Which data is necessary (session token) and which might be unnecessary (tracking, analytics)?

Traveler's Notes

GDPR's seven principles: centered on data minimization and storage limitation—store less, be clear about purpose, delete when time comes
Users have the right to request deletion of personal data; the system must cover all data copies (production, backups, logs, caches)
Pseudonymization reduces risk (but still falls under GDPR); true anonymization takes data outside GDPR (but is much harder to achieve)
Differential privacy uses noise to protect individuals, suitable for statistical scenarios
Data lifecycle management must be automated—manual cleanup will eventually be forgotten
Privacy is not the exclusive domain of legal teams—engineering decisions directly affect privacy protection capabilities

Next Stop Preview

The defenses of the Border Fortress are finally complete. You started from the basics of cryptography, established encrypted communications; gained trusted identity verification through PKI; managed authorization with OAuth and OIDC; fixed multiple web application vulnerabilities; hardened defenses at the OS and network levels; embedded security into the development process; and ensured user data is protected at the privacy level.

Feel the weight of your armor: it's much heavier than when you started, but every layer has its meaning.