Metadata Card
- Prerequisites: Chapter 11 (Data Governance Fundamentals), Vol 8 (Security Fundamentals)
- Estimated time: 45 minutes
- Core difficulty: Advanced
- Reading mode: High focus
- Completion: Able to identify sensitive information in data, apply basic security controls, and understand the principles of differential privacy
Your Progress
The final topic of the Data Prophecy Hall is not a technical problem, but a regulatory one.
You received an encrypted letter from the border fortress: 'Intelligence indicates the enemy is collecting behavioral trajectory data on our sentries. Please inspect all personnel data stored in the Prophecy Hall—who has access? Has the data been anonymized? What is the retention period?'
Data privacy and security aren't optional polish—they are your legal obligations as a data steward.
Your Task
You have a dataset of user logs. You promised it would be used only for analysis and would never leak, but you can't guarantee that everyone with access will hold to the same standard. This chapter covers how to protect personal privacy while analyzing data, how to do data science within a compliance framework, and where security controls break down when data is breached.
Identifying Sensitive Data
The first step is identifying which data needs protection.
The encrypted letter from the border fortress mentioned that personnel data might be stolen. Every visitor to the Prophecy Hall must run a sensitive field scan before accessing data. Column name matching is the first filter—when a column is named email, id_card, or ip_address, there's a clear obligation to protect it.

import pandas as pd
import re

def classify_sensitive_columns(df):
    """Classify DataFrame columns by sensitivity, based on column names."""
    classes = {"PII": [], "sensitive": [], "non_sensitive": []}
    for col in df.columns:
        col_lower = col.lower()
        # Direct personally identifiable information (PII).
        # Substring match: "name" also flags "username" and "filename".
        if any(kw in col_lower for kw in [
            "name", "email", "phone", "id_card", "passport",
            "ip_address", "device_id", "address"
        ]):
            classes["PII"].append(col)
        # Sensitive attributes
        elif any(kw in col_lower for kw in [
            "salary", "income", "health", "diagnosis",
            "religion", "political", "biometric"
        ]):
            classes["sensitive"].append(col)
        else:
            classes["non_sensitive"].append(col)
    return classes

df = pd.read_csv("user_data.csv")
sensitivity = classify_sensitive_columns(df)
print("PII fields:", sensitivity["PII"])
print("Sensitive fields:", sensitivity["sensitive"])

Column names are only a first pass. A more accurate method is to match the actual values against regex patterns for email addresses, phone numbers, and ID card numbers.
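
def detect_pii_in_values(series):
    """Scan the first 100 non-null values of a column for possible PII."""
    sample = series.dropna().astype(str).head(100)
    # An empty or all-null column has nothing to match (and avoids dividing by zero)
    if sample.empty:
        return {"email_ratio": 0.0, "phone_ratio": 0.0}
    # Email
    email_count = sample.str.contains(
        r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    ).sum()
    # Phone number (11-digit mainland China mobile format)
    phone_count = sample.str.contains(
        r'^1[3-9]\d{9}$'
    ).sum()
    return {
        "email_ratio": email_count / len(sample),
        "phone_ratio": phone_count / len(sample),
    }

Data Masking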
Once sensitive data is identified, you need to apply masking measures. Masking isn't just "delete and be done"—you need to balance privacy protection with data usability.
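
import hashlib
import pandas as pd

def mask_email(value):
    """Keep at most 3 leading characters of the local part; hide the rest."""
    if not isinstance(value, str) or "@" not in value:
        return value
    local, _, domain = value.partition("@")
    return local[:3] + "***@" + domain

def anonymize_data(df):
    df = df.copy()
    # 1. Masking: show only partial information
    df["email_masked"] = df["email"].apply(mask_email)
    # 2. Tokenization (hashing): deterministic, so joins and uniqueness survive.
    #    Caution: an unsalted hash of a guessable ID can be reversed by brute
    #    force; a keyed hash (HMAC) with a secret key is safer in production.
    df["user_id_token"] = df["user_id"].apply(
        lambda x: hashlib.sha256(str(x).encode()).hexdigest()[:16]
    )
    # 3. Generalization: reduce precision
    df["age_group"] = pd.cut(
        df["age"],
        bins=[0, 18, 30, 45, 60, 120],
        labels=["0-18", "19-30", "31-45", "46-60", "61+"],
        include_lowest=True,  # without this, age 0 falls outside the first bin
    )
    # 4. Drop the original columns, including raw age now that it is generalized
    pii_cols = ["email", "user_id", "phone", "address", "age"]
    return df.drop(columns=[c for c in pii_cols if c in df.columns])

Differential Privacy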
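Traditional masking has a weakness: an attacker can still infer an individual's data by combining the results of multiple queries, for example by subtracting two aggregate counts that differ by exactly one person. Differential privacy guarantees that the distribution of query results is nearly identical whether or not the dataset includes any particular person.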
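One common way to state this guarantee formally is the definition of ε-differential privacy. The symbols below are standard notation rather than anything defined earlier in this chapter: $\mathcal{M}$ is the randomized query mechanism, $D$ and $D'$ are any two datasets that differ in one person's record, and $S$ is any set of possible outputs.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
```

The smaller $\varepsilon$ is, the closer $e^{\varepsilon}$ is to 1, and the harder it becomes to tell from the output whether a given person is in the data. For a numeric query with global sensitivity $\Delta f$, adding Laplace noise with scale $\Delta f / \varepsilon$ is one standard mechanism that satisfies this bound.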

import numpy as np

def laplace_mechanism(true_value, epsilon, sensitivity=1):
    """Add Laplace noise to a query result for differential privacy.

    epsilon: privacy budget (smaller = more private, less accurate)
    sensitivity: global sensitivity of the query
    """
    scale = sensitivity / epsilon
    noise = np.random.laplace(0, scale)
    return true_value + noise

# Example: add noise to a count query
true_count = 1000
epsilon = 0.5  # moderate privacy protection
noisy_count = laplace_mechanism(true_count, epsilon)
print(f"True value: {true_count}, Noisy result: {noisy_count:.0f}")

# Larger epsilon = less noise
for eps in [0.1, 0.5, 1.0, 5.0]:
    noisy = laplace_mechanism(true_count, eps)
    print(f"  epsilon={eps:.1f}: {noisy:.0f}")

The key design parameter of differential privacy is epsilon:
- epsilon < 0.1: Strong privacy protection, high noise
- epsilon ≈ 1: Moderate protection
- epsilon > 10: Weak protection, close to raw answer
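To make these ranges concrete, it can help to look at how large the noise is expected to be. The sketch below relies on a known property of the Laplace distribution: for noise drawn from Laplace(0, b), the expected absolute value is b, which here equals sensitivity / epsilon. The function names and the example count of 1000 are illustrative assumptions, not part of the chapter's code.

```python
def expected_abs_noise(epsilon, sensitivity=1.0):
    """Expected |noise| for Laplace(0, b) is b = sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def expected_relative_error(true_value, epsilon, sensitivity=1.0):
    """Expected absolute noise as a fraction of the true answer."""
    if true_value == 0:
        raise ValueError("relative error is undefined for a true value of 0")
    return expected_abs_noise(epsilon, sensitivity) / abs(true_value)


# Hypothetical example: a count query (sensitivity 1) whose true answer is 1000
for eps in [0.05, 1.0, 10.0]:
    noise = expected_abs_noise(eps)
    rel = expected_relative_error(1000, eps)
    print(f"epsilon={eps}: expected |noise| = {noise:.1f} ({rel:.2%} of the answer)")
```

The same epsilon can be harmless for a large count and ruinous for a small one: at epsilon = 0.05 the expected noise is about 20, which is 2% of a count of 1000 but 20% of a count of 100.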
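Noise cannot be reduced for free by asking more questions. If an attacker could repeat the same query and average the answers, the noise would cancel out, so each query consumes a portion of the epsilon budget, and the epsilons of successive queries add up. Once the budget is exhausted, no more queries can be answered.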
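Budget tracking can be sketched as a small accountant object that refuses queries once the total is spent. This is a minimal illustration assuming basic sequential composition (the epsilons simply add up); the `PrivacyBudget` class is hypothetical, and it draws Laplace noise with the standard library's `random` module instead of NumPy so that it has no dependencies.

```python
import random


class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition (hypothetical helper)."""

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self):
        return self.total - self.spent

    def noisy_answer(self, true_value, epsilon, sensitivity=1.0):
        """Answer one query with Laplace noise, charging epsilon to the budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not reject an exact fit
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        scale = sensitivity / epsilon
        # The difference of two i.i.d. exponentials with mean `scale`
        # follows a Laplace(0, scale) distribution
        noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
        return true_value + noise


budget = PrivacyBudget(total_epsilon=1.0)
print(budget.noisy_answer(1000, epsilon=0.5))
print(budget.noisy_answer(1000, epsilon=0.5))
try:
    budget.noisy_answer(1000, epsilon=0.5)  # would bring the total to 1.5
except RuntimeError as exc:
    print("Refused:", exc)
```

Charging the budget before returning the answer means a caller cannot obtain a result without paying for it. Production systems typically use tighter composition accounting than simple addition, which this sketch does not attempt.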
Access Control
Who can access what data is managed through Access Control Lists (ACLs).
Data Classification:
- Public: Can be openly shared (aggregate reports, statistical summaries)
- Internal: Accessible within the company (on-demand authorization)
- Restricted: Limited to designated team members (raw user data)
- Confidential: Limited to a few people (unmasked data with full PII)

At the code level, the most common control is column-level permissions:

# Column-level access control (a simple in-memory sketch)
ACCESS_CONTROL = {
    "role_analyst": {
        "can_view": ["mission_id", "region", "success_rate", "duration"],
        "cannot_view": ["user_id", "email"],
    },
    "role_data_scientist": {
        "can_view": ["mission_id", "region", "success_rate", "duration",
                     "user_id_token", "age_group"],
        "cannot_view": ["email", "phone", "address"],
    },
    "role_admin": {
        "can_view": "__all__",
    },
}

def filter_columns(df, role):
    """Filter visible columns based on role.

    Only the "can_view" allowlist is enforced; "cannot_view" documents intent.
    An unknown role gets no columns (deny by default).
    """
    acl = ACCESS_CONTROL.get(role, {})
    if acl.get("can_view") == "__all__":
        return df
    allowed = acl.get("can_view", [])
    return df[[c for c in df.columns if c in allowed]]

Data Security Baseline
Finally, document a few minimum standards for data security:
- Encryption in transit: All data must use TLS during transmission
- Encryption at rest: Sensitive data must be encrypted (AES-256) when stored
- Minimization: Collect and retain only necessary data (data minimization principle)
- Least privilege: Data accessible by a role should be limited to what's needed for the task
- Audit logs: Who accessed what data at what time should be traceable and recorded
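Of these baselines, audit logging is the easiest to sketch in a few lines. The example below is one possible approach using only the standard library: a decorator that records who touched which dataset and when. The names `audited`, `read_personnel`, and the `data_audit` logger are hypothetical, and the decorated function stands in for a real data read.

```python
import functools
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("data_audit")


def audited(dataset_name):
    """Decorator that records who accessed which dataset, and when."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(user, *args, **kwargs):
            # Log before the call so failed attempts are recorded too
            audit_logger.info(
                "user=%s dataset=%s action=%s at=%s",
                user, dataset_name, func.__name__,
                datetime.now(timezone.utc).isoformat(),
            )
            return func(user, *args, **kwargs)
        return wrapper
    return decorator


@audited("personnel_records")
def read_personnel(user, region):
    # Placeholder for a real data read
    return [{"region": region, "headcount": 12}]


logging.basicConfig(level=logging.INFO)
rows = read_personnel("analyst_01", region="north")
```

In practice the audit log should go to append-only storage that the audited users cannot modify; writing to the console here only keeps the sketch self-contained.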
Common Pitfalls
- Thinking "de-identification" is sufficient security. With auxiliary information, de-identified data can often be re-identified.
- Masking only at export time. If raw PII is still used in processing pipelines, leakage risk remains.
- Setting differential privacy epsilon too small, making query results completely unusable. Test different epsilon values to find the balance between privacy and accuracy.
- Believing compliance is solely the legal team's concern. Implementing data compliance requires engineering team involvement in the design.
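The first pitfall can be checked mechanically. One rough test, sketched below under the assumption that an attacker knows a person's age group and region, is to count how many records share each combination of such quasi-identifiers; the smallest group size is the k in k-anonymity. The function name and the sample records are made up for illustration.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(
        tuple(row[col] for col in quasi_identifiers) for row in records
    )
    return min(groups.values()) if groups else 0


# Hypothetical masked records: no names or emails, yet some rows are still unique
records = [
    {"age_group": "19-30", "region": "north", "success_rate": 0.91},
    {"age_group": "19-30", "region": "north", "success_rate": 0.78},
    {"age_group": "31-45", "region": "north", "success_rate": 0.85},
    {"age_group": "46-60", "region": "south", "success_rate": 0.66},
]

k = smallest_group_size(records, ["age_group", "region"])
print(f"k = {k}")  # k = 1: at least one person is unique on these two columns
```

A result of k = 1 means anyone who knows that person's age group and region can pick out their row, even though every direct identifier has been removed. Coarser bins or suppressing rare combinations raise k at the cost of analytical detail.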
Pass Challenges
- Warm-up: Scan a dataset you commonly use and identify fields that may need protection.
- Challenge: Implement a complete masking pipeline—identify PII → apply masking strategies → verify masking effectiveness. Record differences in data usability before and after masking.
- Troubleshooting: An analyst complains that "the data is unusable" after a query with differential privacy noise applied. How do you determine whether epsilon is too small or the query itself is the problem?
Acceptance Criteria
- Can identify PII and sensitive fields in a dataset
- Can apply at least one of the four masking techniques and explain the pros and cons of each
- Understands the basic principles of differential privacy
- Knows how to implement access control in a data pipeline
Traveler's Notes
Data and privacy are not a binary choice. You can analyze data, protect privacy, and comply with regulations—all three can coexist. But you need to design for it from the start, not patch it on at the end.
Next Chapter Preview
The data journey concludes here. Next stop, the Model Workshop (Vol 13)—from data analysis to machine learning, letting the data speak for itself.