
Metadata Card

  • Prerequisites: Chapter 11 (Data Governance Fundamentals), Vol 8 (Security Fundamentals)
  • Estimated time: 45 minutes
  • Core difficulty: Advanced
  • Reading mode: High focus
  • Completion: Able to identify sensitive information in data, apply basic security controls, and understand the principles of differential privacy

Your Progress

The final topic of the Data Prophecy Hall is not a technical problem, but a regulatory one.

You received an encrypted letter from the border fortress: 'Intelligence indicates the enemy is collecting behavioral trajectory data on our sentries. Please inspect all personnel data stored in the Prophecy Hall—who has access? Has the data been anonymized? What is the retention period?'

Data privacy and security aren't optional polish—they are your legal obligations as a data steward.

Your Task

You have a dataset containing user log data. You promised this data would only be used for analysis and would not leak, but you can't guarantee that everyone who gets access to it will follow the same standards. This chapter addresses three questions: how to protect personal privacy while analyzing data, how to do data science within a compliance framework, and where security controls tend to fail when a breach occurs.


Identifying Sensitive Data

The first step is identifying which data needs protection.

The encrypted letter from the border fortress mentioned that personnel data might be stolen. Every visitor to the Prophecy Hall must run a sensitive field scan before accessing data. Column name matching is the first filter—when a column is named email, id_card, or ip_address, there's a clear obligation to protect it.

python
import pandas as pd

def classify_sensitive_columns(df):
    """Classify DataFrame columns by sensitivity"""
    classes = {"PII": [], "sensitive": [], "non_sensitive": []}
    
    for col in df.columns:
        col_lower = col.lower()
        
        # Direct personal identifiable information (PII)
        if any(kw in col_lower for kw in [
            "name", "email", "phone", "id_card", "passport",
            "ip_address", "device_id", "address"
        ]):
            classes["PII"].append(col)
        
        # Sensitive attributes
        elif any(kw in col_lower for kw in [
            "salary", "income", "health", "diagnosis",
            "religion", "political", "biometric"
        ]):
            classes["sensitive"].append(col)
        
        else:
            classes["non_sensitive"].append(col)
    
    return classes

df = pd.read_csv("user_data.csv")
sensitivity = classify_sensitive_columns(df)
print("PII fields:", sensitivity["PII"])
print("Sensitive fields:", sensitivity["sensitive"])

Column names are a first pass. A more accurate method is to match actual values with regex patterns—like email formats, phone number formats, ID card formats.

python
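def detect_pii_in_values(series):
    """Scan the first 100 non-null values of a column for PII-like content"""
    sample = series.dropna().astype(str).head(100)
    if sample.empty:
        # Empty or all-null column: avoid dividing by zero below
        return {"email_ratio": 0.0, "phone_ratio": 0.0}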
    
    # Email
    email_count = sample.str.contains(
        r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    ).sum()
    
    # Phone number (this pattern covers the 11-digit mainland China
    # mobile format only; adapt it for other regions)
    phone_count = sample.str.contains(
        r'^1[3-9]\d{9}$'
    ).sum()
    
    return {
        "email_ratio": email_count / len(sample),
        "phone_ratio": phone_count / len(sample),
    }
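
To combine the two passes, the value scan can be run over every column, and any column whose match ratio crosses a threshold can be flagged for review. The sketch below is one way to do that using only the standard library. The helper name `flag_pii_columns`, the 0.5 threshold, and the dict-of-lists input shape are illustrative assumptions, not part of the pipeline above; the patterns are the same two used in `detect_pii_in_values`.

```python
import re

# Same patterns as detect_pii_in_values; fullmatch replaces the ^...$ anchors.
PII_PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "phone": re.compile(r"1[3-9]\d{9}"),
}

def flag_pii_columns(columns, threshold=0.5, sample_size=100):
    """Return {column_name: pii_type} for columns whose sampled values
    mostly match a PII pattern.

    columns: mapping of column name -> list of values (hypothetical
    input shape chosen for this sketch).
    """
    flagged = {}
    for name, values in columns.items():
        # Drop missing values, then sample the first few as strings
        sample = [str(v) for v in values if v is not None][:sample_size]
        if not sample:
            continue
        for pii_type, pattern in PII_PATTERNS.items():
            hits = sum(1 for v in sample if pattern.fullmatch(v))
            if hits / len(sample) >= threshold:
                flagged[name] = pii_type
                break
    return flagged

columns = {
    "contact": ["alice@example.com", "bob@example.org", None],
    "mobile": ["13812345678", "15900001111"],
    "region": ["north", "south"],
}
print(flag_pii_columns(columns))
```

A scan like this catches PII hiding under an innocuous column name such as `contact`, which the name-based filter would miss. The threshold is a judgment call: a lower value flags more columns for manual review.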

Data Masking

Once sensitive data is identified, you need to apply masking measures. Masking isn't just "delete and be done"—you need to balance privacy protection with data usability.

python
import hashlib

def anonymize_data(df):
    df = df.copy()
    
    # 1. Masking — show only partial information
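    # Keep at most 3 characters of the local part, so a short address
    # such as "ab@x.com" does not pull the "@" into the visible prefix
    df["email_masked"] = df["email"].apply(
        lambda x: x.split("@")[0][:3] + "***" + x[x.index("@"):]
        if isinstance(x, str) and "@" in x else x
    )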
    
    # 2. Tokenization (hashing): preserves uniqueness. An unkeyed hash of
    #    guessable IDs can be reversed by brute force, so prefer a keyed hash.
    df["user_id_token"] = df["user_id"].apply(
        lambda x: hashlib.sha256(str(x).encode()).hexdigest()[:16]
    )
    
    # 3. Generalization — reduce precision
    # Bins are right-closed: (0, 18], (18, 30], ...; include_lowest keeps age 0
    df["age_group"] = pd.cut(df["age"],
        bins=[0, 18, 30, 45, 60, 120],
        labels=["0-18", "19-30", "31-45", "46-60", "61+"],
        include_lowest=True)
    
    # 4. Drop original PII columns
    pii_cols = ["email", "user_id", "phone", "address"]
    df.drop(columns=[c for c in pii_cols if c in df.columns], inplace=True)
    
    return df
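
One caveat about the hashing step: when the ID space is small and guessable, an unkeyed SHA-256 token can be reversed by hashing every candidate and comparing. The sketch below illustrates that attack and one common mitigation, a keyed hash (HMAC-SHA256). The function names, the 10,000-ID range, and `SECRET_KEY` are hypothetical and exist only for this illustration; key storage and rotation are assumed to be handled elsewhere.

```python
import hashlib
import hmac

def plain_token(user_id):
    """Unkeyed hash, the same construction as in anonymize_data."""
    return hashlib.sha256(str(user_id).encode()).hexdigest()[:16]

def keyed_token(user_id, key):
    """HMAC-SHA256 token; it cannot be recomputed without the key."""
    return hmac.new(key, str(user_id).encode(), hashlib.sha256).hexdigest()[:16]

def dictionary_attack(token, candidate_ids):
    """Try to recover an ID from a token by hashing every candidate."""
    for candidate in candidate_ids:
        if plain_token(candidate) == token:
            return candidate
    return None

# Placeholder key for illustration only; load a random key from a
# secrets manager in a real pipeline.
SECRET_KEY = b"replace-with-a-randomly-generated-key"

leaked_plain = plain_token(4217)
leaked_keyed = keyed_token(4217, SECRET_KEY)

# The attacker knows IDs are integers below 10,000 but not the key.
print(dictionary_attack(leaked_plain, range(10_000)))  # recovers 4217
print(dictionary_attack(leaked_keyed, range(10_000)))  # finds nothing
```

Both tokens still preserve uniqueness, so joins and per-user counts keep working. The difference is that the keyed token is only as exposed as the key itself.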

Differential Privacy

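Traditional masking has a problem: an attacker can still infer one person's data by combining the answers to several queries, for example by comparing a count that includes that person with one that does not. Differential privacy guarantees that, whether or not the dataset includes any particular person, the distribution of query results is nearly identical.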
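
For reference, "nearly identical" has a precise meaning. The statement below is the standard definition of epsilon-differential privacy; the symbols (M for the randomized query mechanism, D and D' for two datasets that differ in one person's record, S for any set of possible outputs) are notation introduced for this note.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
\quad \text{for all neighboring } D, D' \text{ and all output sets } S
```

For small epsilon, the factor e^epsilon is approximately 1 + epsilon, so adding or removing one person changes the probability of any outcome by only a few percent.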

python
import numpy as np

def laplace_mechanism(true_value, epsilon, sensitivity=1):
    """Add Laplace noise to query results for differential privacy
    epsilon: privacy budget (smaller = more private, less accurate)
    sensitivity: global sensitivity of the query
    """
    scale = sensitivity / epsilon
    noise = np.random.laplace(0, scale)
    return true_value + noise

# Example: add noise to a count query
true_count = 1000
epsilon = 0.5  # moderate privacy protection
noisy_count = laplace_mechanism(true_count, epsilon)
print(f"True value: {true_count}, Noisy result: {noisy_count:.0f}")

# Larger epsilon = less noise
for eps in [0.1, 0.5, 1.0, 5.0]:
    noisy = laplace_mechanism(true_count, eps)
    print(f"  epsilon={eps:.1f}: {noisy:.0f}")

The key design parameter of differential privacy is epsilon. As a rough guide:

  • epsilon < 0.1: Strong privacy protection, high noise
  • epsilon ≈ 1: Moderate protection
  • epsilon > 10: Weak protection, close to raw answer
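
To put numbers on these ranges: for Laplace noise the expected absolute error equals the scale, sensitivity / epsilon. The sketch below checks that empirically. It uses the standard library instead of NumPy and draws Laplace noise as the difference of two exponential samples, a substitute for `np.random.laplace` chosen to keep the example dependency-free; `sample_laplace` and `mean_abs_error` are illustrative names.

```python
import random

def sample_laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential
    draws, each with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def mean_abs_error(epsilon, sensitivity=1.0, trials=20_000, seed=0):
    """Estimate the average absolute noise added at a given epsilon."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    total = sum(abs(sample_laplace(scale, rng)) for _ in range(trials))
    return total / trials

for eps in [0.1, 1.0, 10.0]:
    theory = 1.0 / eps  # expected |noise| equals the Laplace scale
    measured = mean_abs_error(eps)
    print(f"epsilon={eps:>4}: theory={theory:.2f}, measured={measured:.2f}")
```

For a count query with sensitivity 1, this means a typical error of about 10 at epsilon = 0.1 and about 0.1 at epsilon = 10. Whether an error of 10 is acceptable depends on the size of the true count.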

You cannot repeat a query for free and average the noise away. Under sequential composition, each query consumes a portion of the epsilon budget, and the epsilons of all answered queries add up. Once the budget is exhausted, no more queries can be answered.
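
The budget bookkeeping can be sketched as a small accountant that charges each query before answering it. This is a minimal illustration under basic sequential composition (the epsilons simply add up). The `PrivacyBudget` class and `noisy_count` function are hypothetical names, and the noise is drawn with the standard library (difference of two exponentials) instead of NumPy.

```python
import random

class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point sums can reach the total exactly
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

def noisy_count(true_count, epsilon, budget, rng):
    """Answer a count query (sensitivity 1), charging the budget first."""
    budget.spend(epsilon)
    scale = 1.0 / epsilon
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise

budget = PrivacyBudget(total_epsilon=1.0)
rng = random.Random(42)
for i in range(3):
    try:
        answer = noisy_count(1000, 0.4, budget, rng)
        print(f"query {i + 1}: {answer:.0f} (spent {budget.spent:.1f})")
    except RuntimeError as err:
        print(f"query {i + 1}: refused, {err}")
```

With a total budget of 1.0 and 0.4 per query, the first two queries are answered and the third is refused. Charging before answering matters: a refused query must not release any result.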

Access Control

Who can access what data is managed through Access Control Lists (ACLs).

Data Classification:
  Public     — Can be openly shared (aggregate reports, statistical summaries)
  Internal   — Accessible within the company (on-demand authorization)
  Restricted — Limited to designated team members (raw user data)
  Confidential — Limited to a few people (unmasked data with full PII)

At the code level, a common control is column-level permissions:

python
# Sketch: column-level access control by role (assumes pandas is imported as pd)
ACCESS_CONTROL = {
    "role_analyst": {
        "can_view": ["mission_id", "region", "success_rate", "duration"],
        "cannot_view": ["user_id", "email"],
    },
    "role_data_scientist": {
        "can_view": ["mission_id", "region", "success_rate", "duration",
                     "user_id_token", "age_group"],
        "cannot_view": ["email", "phone", "address"],
    },
    "role_admin": {
        "can_view": "__all__",
    },
}

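def filter_columns(df, role):
    """Filter visible columns based on role; unknown roles see nothing"""
    acl = ACCESS_CONTROL.get(role, {})
    if acl.get("can_view") == "__all__":
        return df
    # Start from the allow-list, then apply the deny-list as a safeguard
    allowed = set(acl.get("can_view", [])) - set(acl.get("cannot_view", []))
    return df[[c for c in df.columns if c in allowed]]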

Data Security Baseline

Finally, document a few minimum standards for data security:

  • Encryption in transit: All data must use TLS during transmission
  • Encryption at rest: Sensitive data must be encrypted (AES-256) when stored
  • Minimization: Collect and retain only necessary data (data minimization principle)
  • Least privilege: Data accessible by a role should be limited to what's needed for the task
  • Audit logs: Who accessed what data at what time should be traceable and recorded
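
The least-privilege and audit-log items can be combined in one access path: every request is checked against the role's allow-list and recorded, whether or not it is granted. The sketch below is a minimal in-memory illustration. `ROLE_COLUMNS`, `request_columns`, and the log record fields are assumptions made for this example; a real system would write to append-only storage, not a Python list.

```python
import json
from datetime import datetime, timezone

AUDIT_LOG = []  # in-memory stand-in for append-only audit storage

ROLE_COLUMNS = {  # hypothetical role -> allowed columns mapping
    "role_analyst": {"mission_id", "region", "success_rate", "duration"},
}

def request_columns(user, role, requested):
    """Return the columns the role may read and record the attempt."""
    allowed = ROLE_COLUMNS.get(role, set())  # unknown roles get nothing
    granted = [c for c in requested if c in allowed]
    denied = [c for c in requested if c not in allowed]
    AUDIT_LOG.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "granted": granted,
        "denied": denied,
    })
    return granted

print(request_columns("sentry_07", "role_analyst", ["region", "email"]))
print(json.dumps(AUDIT_LOG[-1], indent=2))
```

Recording denied columns as well as granted ones is a deliberate choice here: repeated denied requests for `email` are often the first visible sign of misuse.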

Common Pitfalls

  • Thinking "de-identification" is sufficient security. With auxiliary information, de-identified records can often be re-identified.
  • Masking only at export time. If raw PII is still used in processing pipelines, leakage risk remains.
  • Setting differential privacy epsilon too small, making query results completely unusable. Test different epsilon values to find the balance between privacy and accuracy.
  • Believing compliance is solely the legal team's concern. Implementing data compliance requires engineering team involvement in the design.

Pass Challenges

  • Warm-up: Scan a dataset you commonly use and identify fields that may need protection.
  • Challenge: Implement a complete masking pipeline—identify PII → apply masking strategies → verify masking effectiveness. Record differences in data usability before and after masking.
  • Troubleshooting: An analyst complains that "the data is unusable" after a query was answered with differential-privacy noise. How do you determine whether epsilon is too small or the query itself is the problem?

Acceptance Criteria

  • Can identify PII and sensitive fields in a dataset
  • Can apply at least one of the four masking techniques (masking, tokenization, generalization, dropping columns) and explain the pros and cons of each
  • Understands the basic principles of differential privacy
  • Knows how to implement access control in a data pipeline

Traveler's Notes

Data and privacy are not a binary choice. You can analyze data, protect privacy, and comply with regulations—all three can coexist. But you need to design for it from the start, not patch it on at the end.


Next Chapter Preview

The data journey concludes here. Next stop, the Model Workshop (Vol 13)—from data analysis to machine learning, letting the data speak for itself.

Built with VitePress | Software Systems Atlas