Metadata Card
- Prerequisites: Vol 3 Computer Systems (processes, memory, system calls), basic Linux operations
- Estimated time: 55 minutes
- Core difficulty: Advanced
- Completion mark: Can explain the difference between DAC and MAC, can use seccomp to restrict process system calls, understand the permission splitting concept of Linux Capabilities
Your Progress
You've reinforced the spell protections at the station application layer. But all protection spells run on top of the wizard tower's operating formation. If a dark mage breaks through the station application process's permissions and the operating formation doesn't have adequate built-in defense runes, they could directly access the entire tower's magic flow, rune library, and other processes.
You recall the diagram from Vol 3: the operating system is the guardian that manages all hardware resources. If the guardian itself lacks security design, all upper-level protections are castles built on sand.
Your Task
Understand the core mechanisms of operating system security, from the most basic Unix permission model (DAC) to Mandatory Access Control (MAC), Capabilities, and system call filtering (seccomp). Together, these mechanisms form the infrastructure of isolation and least privilege.
Chapter Layers
- Required: DAC (Unix permission model), Linux Capabilities, seccomp
- Optional: SELinux / AppArmor policy writing
- Advanced: LSM (Linux Security Module) framework, namespaces and container isolation
Breaking Ground · Tracing the Origin
Problem: The fortress patrol report system runs on a server. If an attacker gets a shell through SQL injection or file upload vulnerabilities, what can they do?
The answer: it depends on what identity the web application process is running as.
If it's running as root (as many early configurations and dev environments do), the attacker gets full control of the entire machine. If it's running as the www-data user, the attacker can read files that user can read and write to places that user can write to.
This is the most basic Discretionary Access Control (DAC).
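To make the "what identity is the process running as" question concrete, here is a minimal Python sketch of a startup guard. The function name check_service_identity is hypothetical and chosen only for this example; it illustrates the idea and is not how any particular web server implements it.

```python
import os
import pwd


def check_service_identity(euid: int) -> str:
    """Return the account name for euid, refusing to continue as root."""
    if euid == 0:
        # UID 0 is root: a compromised process would own the whole machine.
        raise PermissionError("refusing to run the web app as root")
    try:
        return pwd.getpwuid(euid).pw_name
    except KeyError:
        # The UID has no entry in the password database.
        return f"uid {euid}"


if __name__ == "__main__":
    try:
        print("running as", check_service_identity(os.geteuid()))
    except PermissionError as exc:
        print("startup blocked:", exc)
```

The check uses the effective UID because that is the identity the kernel consults for DAC decisions.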
Layer 1: Unix Permission Model (DAC)
Every file in the fortress is like a military document: who can read it, who can write it, and who can execute it are all determined by the file's owner. Check a file's basic permissions:
-rw-r--r-- 1 www-data www-data 2048 Jun 24 10:00 patrol_reports.db
Interpretation:
- First char -: regular file (d = directory, l = link)
- rw-: owner (www-data) has read and write permission
- r--: group (www-data) has read permission
- r--: others have read permission only
Permission bits explained:
| Permission | r (4) | w (2) | x (1) |
|---|---|---|---|
| File | Read content | Modify content | Execute (script/binary) |
| Directory | List files | Create/delete files | Enter directory (cd) |
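As a small illustration of the permission bits above, the following Python sketch decodes a mode value with the standard stat module. The helper name explain_mode is an assumption made for this example, and the mode is built by hand rather than read from a real file.

```python
import stat


def explain_mode(mode: int) -> dict:
    """Split an st_mode value into a symbolic string and per-class octal digits."""
    perms = stat.S_IMODE(mode)  # keep only the permission bits, e.g. 0o644
    return {
        "symbolic": stat.filemode(mode),   # same form that ls -l prints
        "octal": format(perms, "03o"),
        "owner": (perms >> 6) & 0o7,       # r=4, w=2, x=1 added together
        "group": (perms >> 3) & 0o7,
        "others": perms & 0o7,
    }


# A regular file with mode 644, like patrol_reports.db in the listing.
info = explain_mode(stat.S_IFREG | 0o644)
print(info["symbolic"], info["octal"])
```

For a real file, the same function can be fed os.stat(path).st_mode.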
The name DAC is accurate: Discretionary—the file owner can decide who can access their files at their discretion. The owner can set file permissions to 777 to let everyone write.
The problem is DAC's semantics are too coarse. You either have permission or you don't. You can't say "this process can only bind to port 80, but can't do other root operations."
Layer 2: Linux Capabilities
In traditional Unix, the root user (UID 0) has all permissions. If you want a program to bind to port 80, you have to give it full root privileges—like giving the sentry the keys to the entire city just to open one small door. Capabilities split root's privileges into smaller units:
| Capability | Meaning | Use Case |
|---|---|---|
| CAP_NET_BIND_SERVICE | Bind to ports < 1024 | Web server |
| CAP_SYS_TIME | Modify system clock | NTP service |
| CAP_DAC_OVERRIDE | Bypass file permission checks | Backup tools |
| CAP_NET_RAW | Use RAW sockets | ping / traceroute |
| CAP_SYS_ADMIN | System administration | Most sensitive operations (dangerous) |
| CAP_KILL | Send signals to any process | |
| CAP_SETUID | Set user ID | login / sudo |
Give nginx only the permissions it needs—don't give it the whole set of keys:
# Don't need nginx to run as root
# Give it CAP_NET_BIND_SERVICE and that's enough
# Set at runtime (in systemd)
# In service file:
# AmbientCapabilities=CAP_NET_BIND_SERVICE
# Use setcap to set on a binary
sudo setcap 'cap_net_bind_service=+ep' /usr/sbin/nginx
# View set capabilities
getcap /usr/sbin/nginx
# Output: /usr/sbin/nginx = cap_net_bind_service+ep
# (newer libcap versions print: /usr/sbin/nginx cap_net_bind_service=ep)
# View a process's capabilities
cat /proc/<pid>/status | grep Cap
View the actual capabilities of a running process:
# Using capsh tool
capsh --decode=$(grep CapEff /proc/1/status | awk '{print $2}')
Meaning of +ep (and the related i flag):
- e (Effective): Currently active
- p (Permitted): Maximum set allowed to use
- i (Inheritable): Can be carried across execve into a new program
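The Cap lines in /proc/<pid>/status are hexadecimal bitmasks, one bit per capability. The sketch below shows roughly what capsh --decode does, assuming the bit positions from linux/capability.h. The map is deliberately partial (only the capabilities from the table above), and decode_caps is a name made up for this example.

```python
# Bit positions as defined in <linux/capability.h>.
# Partial on purpose: only the capabilities discussed in this chapter.
CAP_BITS = {
    1: "CAP_DAC_OVERRIDE",
    5: "CAP_KILL",
    7: "CAP_SETUID",
    10: "CAP_NET_BIND_SERVICE",
    13: "CAP_NET_RAW",
    21: "CAP_SYS_ADMIN",
    25: "CAP_SYS_TIME",
}


def decode_caps(hex_mask: str) -> list:
    """Turn a CapEff/CapPrm/CapInh hex string into a list of capability names."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask >> bit:              # stop once no higher bits are set
        if (mask >> bit) & 1:
            # Bits missing from the partial map are reported by position.
            names.append(CAP_BITS.get(bit, f"bit {bit}"))
        bit += 1
    return names


print(decode_caps("0000000000000400"))
```

A mask of all zeros decodes to an empty list, which is what an ordinary unprivileged process usually shows for CapEff.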
Layer 3: Mandatory Access Control (MAC)
The problem with DAC is: root user can bypass all permission rules. Root can read any file, kill any process.
MAC (Mandatory Access Control) completely changes the model:
- System administrators define global security policy
- Even the root user cannot violate the policy
- Every subject (process) and object (file, port, device) has a security label
SELinux
SELinux (Security-Enhanced Linux) is a MAC implementation developed by the NSA (open-sourced in 2000, merged into the Linux mainline in 2003).
SELinux security context:
# View a process's security context
ps -Z
# Output: system_u:system_r:httpd_t:s0
# View a file's security context
ls -Z /var/www/html/
# Output: system_u:object_r:httpd_sys_content_t:s0
Format: user:role:type:sensitivity
The core of SELinux is Type Enforcement. If a process of type httpd_t can only read files of type httpd_sys_content_t, then even if the web server is compromised (running as httpd_t), the attacker can't write to /etc/shadow (type shadow_t).
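Type Enforcement can be pictured as a lookup in a table of allow rules keyed by the types of the process and the object. The following is a toy model only: the ALLOW set and the function names are invented for illustration, and real SELinux policy is compiled and evaluated inside the kernel, not in Python.

```python
from typing import NamedTuple


class Context(NamedTuple):
    user: str
    role: str
    type: str
    level: str


def parse_context(label: str) -> Context:
    """Split a label of the form user:role:type:sensitivity."""
    # maxsplit=3 because the level may itself contain colons (e.g. s0:c0.c1023).
    return Context(*label.split(":", 3))


# Hypothetical allow rules: (source type, target type, object class, permission).
ALLOW = {
    ("httpd_t", "httpd_sys_content_t", "file", "read"),
}


def is_allowed(proc_label: str, obj_label: str, obj_class: str, perm: str) -> bool:
    """Anything without a matching allow rule is denied."""
    src = parse_context(proc_label).type
    tgt = parse_context(obj_label).type
    return (src, tgt, obj_class, perm) in ALLOW


web = "system_u:system_r:httpd_t:s0"
print(is_allowed(web, "system_u:object_r:httpd_sys_content_t:s0", "file", "read"))
print(is_allowed(web, "system_u:object_r:shadow_t:s0", "file", "read"))
```

Note that the decision never looks at the UID: a process in httpd_t is denied access to shadow_t objects even if it runs as root.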
# Check what interactions SELinux policy allows
sesearch --allow --source httpd_t --target shadow_t
# If not allowed, the operation is blocked by SELinux (logged in audit.log)
# ausearch -m avc -ts recent
AppArmor
AppArmor is another MAC implementation, more path-based than SELinux. It's like writing a route map for each important program—nginx can only go to certain paths and read certain files; anything outside the route map is denied:
# /etc/apparmor.d/usr.sbin.nginx
#include <tunables/global>
/usr/sbin/nginx {
#include <abstractions/base>
#include <abstractions/nameservice>
/usr/sbin/nginx mr,
/var/log/nginx/*.log w,
/etc/nginx/** r,
/var/www/html/** r,
/run/nginx.pid w,
# Deny these
deny /etc/shadow r,
deny /bin/bash r,
}
AppArmor learning mode:
# First run in complain mode, logging all accesses
sudo aa-complain /etc/apparmor.d/usr.sbin.nginx
# View logs, generate a reasonable policy
sudo tail -f /var/log/syslog | grep nginx
# After confirming the policy covers all legitimate operations, switch to enforce mode
sudo aa-enforce /etc/apparmor.d/usr.sbin.nginx
SELinux vs AppArmor:
| Dimension | SELinux | AppArmor |
|---|---|---|
| Granularity | Security context (type) | Path + permission |
| Ease of use | More complex | More intuitive |
| Distribution | Default on RHEL/CentOS/Fedora | Default on Ubuntu/Debian/openSUSE |
| Policy model | Global type enforcement | Per-program policy |
| Learning curve | Steep | Moderate |
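To make the "Path + permission" row concrete, here is a simplified Python model of how path rules like those in the nginx profile could be matched. It assumes only the two most common AppArmor globs (* stays within one path component, ** crosses directory boundaries). It ignores deny rules, character classes, and alternations, and the names RULES and is_permitted are made up for this sketch.

```python
import re


def glob_to_regex(pattern: str) -> "re.Pattern":
    """Translate a simplified AppArmor-style glob into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")        # ** matches across '/' boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")     # * never matches a '/'
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


# (path glob, permitted access letters), modeled on the nginx profile.
RULES = [
    ("/var/log/nginx/*.log", "w"),
    ("/etc/nginx/**", "r"),
    ("/var/www/html/**", "r"),
]


def is_permitted(path: str, access: str) -> bool:
    """Default deny: access is granted only if some rule matches path and mode."""
    return any(
        access in perms and glob_to_regex(pat).fullmatch(path) is not None
        for pat, perms in RULES
    )


print(is_permitted("/etc/nginx/conf.d/site.conf", "r"))
print(is_permitted("/etc/shadow", "r"))
```

Because the model is default-deny, /etc/shadow is refused without needing an explicit deny rule. In a real profile, explicit deny lines still matter because they take precedence over allow rules and silence logging.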
Layer 4: seccomp (System Call Filtering)
Going one layer deeper—system calls are the interface between processes and the kernel. Does a web server really need to call execve (execute new programs) or socket (create new network connections)? seccomp lets you set up a whitelist for a process—a list of system calls it's allowed to make; anything else kills the process:
// C language: restrict system calls with seccomp
// Compile: gcc -o sandbox sandbox.c -lseccomp
#include <seccomp.h>
#include <stdio.h>
#include <unistd.h>
int main(void) {
    // Default action: kill on any system call that is not explicitly allowed
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (!ctx) return 1;
    // Allow the most basic system calls; rc stays 0 only if every rule was added
    int rc = 0;
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 0);
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(munmap), 0);
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(brk), 0);
    // No explicit deny rules are needed:
    // open/openat, execve, socket, clone all fall through to the default kill action
    if (rc != 0 || seccomp_load(ctx) != 0) {
        seccomp_release(ctx);
        return 1;
    }
    seccomp_release(ctx);
    // From now on, can only use the system calls listed above.
    // write() is called directly because printf may issue fstat on first use,
    // and fstat is not on the allow list.
    static const char msg[] = "Hello from sandbox!\n";
    if (write(STDOUT_FILENO, msg, sizeof msg - 1) < 0) return 1;
    // Try to open a file: the process will be killed by the kernel
    // FILE *f = fopen("/etc/passwd", "r"); // KILLED
    return 0;
}
In Docker and Chrome's sandbox, seccomp is a core isolation mechanism. Docker's default seccomp configuration blocks over 40 unnecessary system calls:
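# Docker's default seccomp configuration
# (the default profile is built into the Docker daemon; the path below exists only
# if a copy of default.json from the moby repository has been saved there)
python3 -m json.tool /etc/docker/seccomp/default.json | head -30
# Run container with custom seccomp config
docker run --security-opt seccomp=my-custom-profile.json my-app
Using seccomp in Python via prctl or the libseccomp binding: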
# Using the libseccomp Python bindings, imported as seccomp
# (pip install python-seccomp, or your distribution's package such as python3-seccomp)
import seccomp
def setup_sandbox():
    # Default action: kill on any system call that is not explicitly allowed
    flt = seccomp.SyscallFilter(seccomp.KILL)
    flt.add_rule(seccomp.ALLOW, "read")
    flt.add_rule(seccomp.ALLOW, "write")
    flt.add_rule(seccomp.ALLOW, "exit_group")
    flt.add_rule(seccomp.ALLOW, "futex")
    flt.add_rule(seccomp.ALLOW, "mmap")
    flt.add_rule(seccomp.ALLOW, "munmap")
    flt.add_rule(seccomp.ALLOW, "brk")
    flt.add_rule(seccomp.ALLOW, "clock_gettime")
    # ... more allowed calls
    flt.load()  # Cannot be undone once set
Comprehensive Defense: Defense in Depth
A secure system doesn't defend at a single layer, but stacks multiple layers:
┌───────────────────────────────────────────────────┐
│ Application Layer (Auth, Authorization, Encoding) │
├───────────────────────────────────────────────────┤
│ seccomp (System Call Whitelist)                   │
├───────────────────────────────────────────────────┤
│ Linux Capabilities (Least Privilege Splitting)    │
├───────────────────────────────────────────────────┤
│ MAC (SELinux / AppArmor Policies)                 │
├───────────────────────────────────────────────────┤
│ DAC (Unix User/Group/Permissions)                 │
├───────────────────────────────────────────────────┤
│ Hardware / Virtualization Isolation               │
└───────────────────────────────────────────────────┘
Even if an attacker breaks through the web application (the top layer), seccomp may prevent them from calling execve (running a shell), Capabilities may prevent them from modifying the system time, SELinux may prevent them from reading /etc/shadow, and DAC limits them to the files their user account can access.
Each layer is a door in the wall. Breaking through one doesn't mean breaking through all.
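The stacking idea can be sketched as a chain of independent checks, where an action succeeds only if every layer allows it. This is a toy model with made-up policy sets and a hypothetical first_blocking_layer helper. The order follows the diagram, not the kernel's real evaluation order (in the kernel, for instance, DAC checks generally run before the MAC hooks).

```python
from typing import Optional

# Hypothetical policies for a confined web server process.
ALLOWED_SYSCALLS = {"read", "write", "openat", "socket", "exit_group"}
HELD_CAPS = {"CAP_NET_BIND_SERVICE"}
MAC_READABLE_TYPES = {"httpd_sys_content_t"}
DAC_READABLE_PATHS = {"/var/www/html/index.html"}


def first_blocking_layer(action: dict) -> Optional[str]:
    """Return the name of the first layer that denies the action, or None."""
    layers = [
        ("seccomp", lambda a: a["syscall"] in ALLOWED_SYSCALLS),
        ("capabilities",
         lambda a: a.get("needs_cap") is None or a["needs_cap"] in HELD_CAPS),
        ("MAC",
         lambda a: a.get("target_type") is None
         or a["target_type"] in MAC_READABLE_TYPES),
        ("DAC",
         lambda a: a.get("path") is None or a["path"] in DAC_READABLE_PATHS),
    ]
    for name, allows in layers:
        if not allows(action):
            return name
    return None


print(first_blocking_layer({"syscall": "execve"}))
print(first_blocking_layer(
    {"syscall": "openat", "target_type": "shadow_t", "path": "/etc/shadow"}))
```

Removing any one policy set still leaves the others in force, which is the point of stacking the layers.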
Common Pitfalls
- Running as root then dropping privileges. Many programs start as root, bind ports, then drop to a regular user. This can work, but ensure the drop is thorough: setgroups + setgid + setuid (in that order, because the group calls fail once root is given up) + drop capabilities. Many programs don't drop privileges thoroughly (for example, calling only setuid, which leaves the group ID and supplementary groups untouched).
- Relying only on DAC. Many workloads effectively rely on DAC alone: even where SELinux or AppArmor ships enabled, custom services often run unconfined until someone writes a policy for them. If you run production servers, configuring MAC is worthwhile.
- Giving containers full --privileged permissions. This is almost equivalent to giving the container root privileges on the host. Explicitly specifying a --cap-add list (ideally after --cap-drop=ALL) is more secure.
- Using incomplete seccomp configurations. Some system calls have non-obvious risks (e.g., userfaultfd has been used to make kernel race conditions easier to exploit). Start with Docker's default seccomp profile and modify from there.
- Disabling SELinux. "I don't know how to configure this thing, so setenforce 0" is the most common but worst practice. The denial messages SELinux logs when a permission check fails tell you which permissions are missing; use audit2allow to generate fix policies.
- UNIX SUID binaries. chmod u+s makes a binary run as its owner. Every SUID root binary is a potential attack surface (like sudo, passwd). Check and clean unnecessary SUID:
find / -perm -4000 -type f 2>/dev/null
Pass Challenges
- Warm-up: On a Linux system, run cat /proc/<pid>/status | grep Cap and interpret the output. Check the corresponding permission meanings for CapEff. Find a non-critical process and think about whether it has unnecessary permissions.
- Challenge: Write a Python script using seccomp that can compute PI to 10000 digits but cannot read any files or establish network connections. Note what system calls your script needs before setup_sandbox to load the Python interpreter.
- Observe: Run docker run --rm alpine ping 8.8.8.8, then run docker run --rm --cap-drop=NET_RAW alpine ping 8.8.8.8, and observe the difference. Use strace to trace the system call differences.
- Troubleshoot: Your web app runs fine on Ubuntu 18.04, but after migrating to RHEL 9, Nginx returns 403 even though files and permissions look correct. Debug this issue (consider SELinux context migration).
Traveler's Notes
- DAC (Unix permissions) is the most basic line of defense, but too coarse-grained and has no constraint on root
- Linux Capabilities split root's privileges into small units (like CAP_NET_BIND_SERVICE)
- MAC (SELinux/AppArmor) cannot be bypassed even by root; it is globally enforced
- seccomp provides whitelist filtering at the system call level, the innermost defense
- Defense in depth: each layer cannot fully trust the layer below or above it
Next Stop Preview
Operating system isolation lets you constrain the capabilities of individual processes. But how do processes communicate? How does data travel across the network? Network attacks go beyond application-layer XSS and SQLi—firewalls, VPNs, IDS/IPS await us. Next chapter, we enter the realm of network security.