Metadata Card

  • Prerequisites: Vol 3 Computer Systems (processes, memory, system calls), basic Linux operations
  • Estimated time: 55 minutes
  • Core difficulty: Advanced
  • Completion mark: Can explain the difference between DAC and MAC, can use seccomp to restrict a process's system calls, and understand how Linux Capabilities split root's privileges

Your Progress

You've reinforced the spell protections at the station application layer. But all protection spells run on top of the wizard tower's operating formation. If a dark mage breaks through the station application process's permissions and the operating formation doesn't have adequate built-in defense runes, they could directly access the entire tower's magic flow, rune library, and other processes.

You recall the diagram from Vol 3: the operating system is the guardian that manages all hardware resources. If the guardian itself lacks security design, all upper-level protections are castles built on sand.

Your Task

Understand the core mechanisms of operating system security, from the most basic Unix permission model (DAC) to Mandatory Access Control (MAC), Capabilities, and system call filtering (seccomp). Together, these mechanisms form the infrastructure of isolation and least privilege.

Chapter Layers

  • Required: DAC (Unix permission model), Linux Capabilities, seccomp
  • Optional: SELinux / AppArmor policy writing
  • Advanced: LSM (Linux Security Module) framework, namespaces and container isolation

Breaking Ground · Tracing the Origin

Problem: The fortress patrol report system runs on a server. If an attacker gets a shell through SQL injection or file upload vulnerabilities, what can they do?

The answer: it depends on what identity the web application process is running as.

If it's running as root (as many early configurations and dev environments do), the attacker gets full control of the entire machine. If it's running as the www-data user, the attacker can read files that user can read and write to places that user can write to.

This is Discretionary Access Control (DAC), the most basic layer of operating system access control.

Layer 1: Unix Permission Model (DAC)

Every file in the fortress is like a military document: who can read it, who can write it, and who can execute it are all determined by the file's owner. Check a file's basic permissions with ls -l:

-rw-r--r--    1 www-data www-data   2048 Jun 24 10:00 patrol_reports.db

Interpretation:

  • First char -: regular file (d = directory, l = link)
  • rw-: owner (www-data) has read and write permission
  • r--: group (www-data) has read permission
  • r--: others have read permission only

Permission bits explained:

| Permission | r (4) | w (2) | x (1) |
| --- | --- | --- | --- |
| File | Read content | Modify content | Execute (script/binary) |
| Directory | List files | Create/delete files | Enter directory (cd) |
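
The octal digits and the rwx string are two views of the same bits. A minimal Python sketch, using the standard stat module, that decodes the mode from the listing above. The helper name describe_mode is hypothetical, and 0o100644 is assumed to be the raw mode of a regular file with rw-r--r-- permissions:

```python
import stat

def describe_mode(mode: int) -> dict:
    """Split a raw st_mode value into its symbolic string and octal digits."""
    return {
        "string": stat.filemode(mode),   # e.g. "-rw-r--r--"
        "owner": (mode >> 6) & 0o7,      # r=4, w=2, x=1 added together
        "group": (mode >> 3) & 0o7,
        "others": mode & 0o7,
    }

info = describe_mode(0o100644)
print(info["string"])                                 # -rw-r--r--
print(info["owner"], info["group"], info["others"])   # 6 4 4
```

For a real file, pass os.stat(path).st_mode instead of the literal.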

The name DAC is accurate: Discretionary—the file owner can decide who can access their files at their discretion. The owner can set file permissions to 777 to let everyone write.

The problem is DAC's semantics are too coarse. You either have permission or you don't. You can't say "this process can only bind to port 80, but can't do other root operations."

Layer 2: Linux Capabilities

In traditional Unix, the root user (UID 0) has all permissions. If you want a program to bind to port 80, you have to give it full root privileges—like giving the sentry the keys to the entire city just to open one small door. Capabilities split root's privileges into smaller units:

| Capability | Meaning | Use Case |
| --- | --- | --- |
| CAP_NET_BIND_SERVICE | Bind to ports < 1024 | Web server |
| CAP_SYS_TIME | Modify system clock | NTP service |
| CAP_DAC_OVERRIDE | Bypass file permission checks | Backup tools |
| CAP_NET_RAW | Use RAW sockets | ping / traceroute |
| CAP_SYS_ADMIN | System administration | Most sensitive operations (dangerous) |
| CAP_KILL | Send signals to any process | |
| CAP_SETUID | Set user ID | login / sudo |

Give nginx only the permissions it needs—don't give it the whole set of keys:

bash
# Don't need nginx to run as root
# Give it CAP_NET_BIND_SERVICE and that's enough

# Set at runtime (in systemd)
# In service file:
# AmbientCapabilities=CAP_NET_BIND_SERVICE

# Use setcap to set on a binary
sudo setcap 'cap_net_bind_service=+ep' /usr/sbin/nginx

# View set capabilities
getcap /usr/sbin/nginx
# Output: /usr/sbin/nginx = cap_net_bind_service+ep
# (newer libcap versions print: /usr/sbin/nginx cap_net_bind_service=ep)

# View a process's capabilities
grep Cap /proc/<pid>/status

View the actual capabilities of a running process:

bash
# Using capsh tool
capsh --decode=$(grep CapEff /proc/1/status | awk '{print $2}')
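
The CapEff value in /proc/<pid>/status is a hexadecimal bitmask with one bit per capability, which is what capsh --decode expands. A rough Python sketch of the same decoding follows. The bit positions come from <linux/capability.h>, but the table is deliberately incomplete (only the capabilities named in this chapter), and decode_caps is a hypothetical helper name:

```python
# Bit positions from <linux/capability.h>. Only the capabilities
# mentioned in this chapter are listed, so this table is incomplete.
CAP_BITS = {
    1: "CAP_DAC_OVERRIDE",
    5: "CAP_KILL",
    7: "CAP_SETUID",
    10: "CAP_NET_BIND_SERVICE",
    13: "CAP_NET_RAW",
    21: "CAP_SYS_ADMIN",
    25: "CAP_SYS_TIME",
}

def decode_caps(hex_mask: str) -> list:
    """Return the names of the capabilities whose bits are set in hex_mask."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask >> bit:                       # stop once no higher bits remain
        if mask & (1 << bit):
            # Unknown bits are reported by number instead of being dropped
            names.append(CAP_BITS.get(bit, f"bit_{bit}"))
        bit += 1
    return names

print(decode_caps("0000000000000400"))  # ['CAP_NET_BIND_SERVICE']
print(decode_caps("0000000000002400"))  # ['CAP_NET_BIND_SERVICE', 'CAP_NET_RAW']
```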

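Meaning of +ep: the + adds the listed flags to the file's capability sets, which are applied when the binary is executed.

  • e (Effective): the permitted capabilities are active as soon as the program starts
  • p (Permitted): the capability enters the process's permitted set, the maximum it is allowed to use
  • i (Inheritable): not set above; the capability is granted only if the executing process also holds it in its own inheritable set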
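
How the file flags combine with the process's own sets at execve is defined by the formulas in the capabilities(7) manual page. The sketch below transcribes those formulas into Python for a non-root process. It is a simplified model: set-user-ID and set-group-ID bits, root's special treatment, and securebits are ignored, and the class and function names are hypothetical:

```python
from dataclasses import dataclass

NET_BIND = 1 << 10            # CAP_NET_BIND_SERVICE
ALL_CAPS = (1 << 64) - 1      # sketch assumption: bounding set allows everything

@dataclass(frozen=True)
class ProcCaps:
    permitted: int = 0
    effective: int = 0
    inheritable: int = 0
    ambient: int = 0
    bounding: int = ALL_CAPS

@dataclass(frozen=True)
class FileCaps:
    permitted: int = 0
    inheritable: int = 0
    effective: bool = False   # the "e" flag is a single bit, not a set

def caps_after_exec(p: ProcCaps, f: FileCaps) -> ProcCaps:
    """Apply the execve transformation from capabilities(7), simplified."""
    # A file with any capabilities attached counts as privileged
    privileged = bool(f.permitted or f.inheritable or f.effective)
    ambient = 0 if privileged else p.ambient
    permitted = (p.inheritable & f.inheritable) | (f.permitted & p.bounding) | ambient
    effective = permitted if f.effective else ambient
    return ProcCaps(permitted, effective, p.inheritable, ambient, p.bounding)

# setcap 'cap_net_bind_service=+ep' on the binary, started by a plain user
after = caps_after_exec(ProcCaps(), FileCaps(permitted=NET_BIND, effective=True))
print(after.effective == NET_BIND)   # True
```

With only +p and no e flag, the capability would land in the permitted set but stay inactive until the program raises it itself.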

Layer 3: Mandatory Access Control (MAC)

The problem with DAC is that the root user bypasses all permission rules. Root can read any file and kill any process.

MAC (Mandatory Access Control) completely changes the model:

  • System administrators define global security policy
  • Even the root user is bound by the policy, as long as the policy confines the process in question
  • Every subject (process) and object (file, port, device) has a security label

SELinux

SELinux (Security-Enhanced Linux) is a MAC implementation developed by the NSA (open-sourced in 2000, merged into the Linux mainline in 2003).

SELinux security context:

bash
# View a process's security context
ps -eZ | grep httpd
# Output (label column): system_u:system_r:httpd_t:s0

# View a file's security context
ls -Z /var/www/html/
# Output: system_u:object_r:httpd_sys_content_t:s0

Format: user:role:type:level (the level carries the MLS/MCS sensitivity, such as s0)

The core of SELinux is Type Enforcement. If a process of type httpd_t can only read files of type httpd_sys_content_t, then even if the web server is compromised (running as httpd_t), the attacker can't write to /etc/shadow (type shadow_t).

bash
# Check what interactions SELinux policy allows
sesearch --allow --source httpd_t --target shadow_t

# If not allowed, the operation is blocked by SELinux (logged in audit.log)
# ausearch -m avc -ts recent
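
Type Enforcement boils down to a lookup: take the type field of the process label and of the object label, then check whether an allow rule covers the requested permission. The Python sketch below models only that idea. The ALLOW table is a hypothetical one-rule policy, nothing like a real SELinux policy, and roles, users, and MLS levels are ignored:

```python
from typing import NamedTuple

class Context(NamedTuple):
    user: str
    role: str
    type: str
    level: str

def parse_context(label: str) -> Context:
    # maxsplit=3 keeps a level such as "s0-s0:c0.c1023" in one piece
    return Context(*label.split(":", 3))

# Toy allow rules: (source type, target type, object class) -> permissions.
# Hypothetical and far smaller than any real policy.
ALLOW = {
    ("httpd_t", "httpd_sys_content_t", "file"): {"read", "getattr", "open"},
}

def is_allowed(source: str, target: str, obj_class: str, perm: str) -> bool:
    """Default deny: only an explicit allow rule grants access."""
    src, tgt = parse_context(source), parse_context(target)
    return perm in ALLOW.get((src.type, tgt.type, obj_class), set())

httpd = "system_u:system_r:httpd_t:s0"
print(is_allowed(httpd, "system_u:object_r:httpd_sys_content_t:s0", "file", "read"))  # True
print(is_allowed(httpd, "system_u:object_r:shadow_t:s0", "file", "read"))             # False
```

Note that the UID of the process never appears in the check, which is why a compromised httpd_t process stays confined even when it runs as root.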

AppArmor

AppArmor is another MAC implementation, more path-based than SELinux. It's like writing a route map for each important program—nginx can only go to certain paths and read certain files; anything outside the route map is denied:

# /etc/apparmor.d/usr.sbin.nginx
#include <tunables/global>

/usr/sbin/nginx {
  #include <abstractions/base>
  #include <abstractions/nameservice>

  /usr/sbin/nginx mr,
  /var/log/nginx/*.log w,
  /etc/nginx/** r,
  /var/www/html/** r,
  /run/nginx.pid w,
  
  # Deny these
  deny /etc/shadow r,
  deny /bin/bash r,
}
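
To see how the path rules above are evaluated, here is a simplified Python model of AppArmor's matching: * matches within one path component, ** also crosses /, deny rules win over allow rules, and anything unmatched is denied. It is only a sketch. Other glob forms (?, [], {}) and edge cases of the real matcher are not handled, and the RULES list and function names are hypothetical:

```python
import re

def glob_to_regex(pattern: str) -> "re.Pattern":
    """Translate a simplified AppArmor glob into a regular expression."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")        # ** may cross directory separators
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")     # * stays inside one path component
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))

# (pattern, permissions, is_deny), taken from the profile above
RULES = [
    ("/etc/nginx/**", "r", False),
    ("/var/www/html/**", "r", False),
    ("/var/log/nginx/*.log", "w", False),
    ("/etc/shadow", "r", True),
]

def path_allowed(path: str, perm: str, rules=RULES) -> bool:
    hits = [deny for pat, perms, deny in rules
            if perm in perms and glob_to_regex(pat).fullmatch(path)]
    if any(hits):          # a matching deny rule always wins
        return False
    return bool(hits)      # no matching rule at all means denied

print(path_allowed("/etc/nginx/conf.d/site.conf", "r"))  # True
print(path_allowed("/home/user/.ssh/id_rsa", "r"))       # False
```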

AppArmor learning mode:

bash
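# First run in complain mode: policy violations are logged but not blocked
sudo aa-complain /etc/apparmor.d/usr.sbin.nginx

# Review the logged events and update the profile interactively
sudo aa-logprof
# Or watch the raw log entries (the log location varies by distribution)
sudo tail -f /var/log/syslog | grep nginx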

# After confirming the policy covers all legitimate operations, switch to enforce mode
sudo aa-enforce /etc/apparmor.d/usr.sbin.nginx

SELinux vs AppArmor:

| Dimension | SELinux | AppArmor |
| --- | --- | --- |
| Granularity | Security context (type) | Path + permission |
| Ease of use | More complex | More intuitive |
| Distribution | Default on RHEL/CentOS/Fedora | Default on Ubuntu/Debian/openSUSE |
| Policy model | Global type enforcement | Per-program policy |
| Learning curve | Steep | Moderate |

Layer 4: seccomp (System Call Filtering)

Going one layer deeper: system calls are the interface between processes and the kernel. Once it is up and serving requests, does a web server really need to call execve (execute new programs) or socket (create new sockets)? seccomp lets you set up a whitelist for a process, a list of system calls it is allowed to make; anything else is blocked, in the example below by killing the process:

c
// C language: restrict system calls with seccomp
// Compile: gcc -o sandbox sandbox.c -lseccomp

#include <seccomp.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // Default action: kill on any system call that is not explicitly allowed
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (!ctx) return 1;

    // Allow the most basic system calls
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(munmap), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(brk), 0);
    // No explicit deny rules are needed:
    // open, execve, socket, clone are all killed by default

    // Install the filter; bail out if the kernel rejects it
    if (seccomp_load(ctx) < 0) {
        seccomp_release(ctx);
        return 1;
    }
    seccomp_release(ctx);

    // From now on, only the system calls listed above are available.
    // write() is called directly because printf may issue system calls
    // that are not on the list (for example fstat on stdout).
    const char msg[] = "Hello from sandbox!\n";
    if (write(STDOUT_FILENO, msg, sizeof(msg) - 1) < 0) return 1;

    // Try to open a file: the process will be killed by the kernel
    // FILE *f = fopen("/etc/passwd", "r"); // KILLED

    return 0;
}

seccomp is a core isolation mechanism in Docker and in Chrome's sandbox. Docker's default seccomp profile blocks over 40 unnecessary system calls:

bash
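# Docker's default seccomp profile is built into the daemon. Its JSON source
# is profiles/seccomp/default.json in the moby repository; assuming you have
# saved a copy as default.json:
python3 -m json.tool default.json | head -30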

# Run container with custom seccomp config
docker run --security-opt seccomp=my-custom-profile.json my-app
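
A custom profile is a JSON document with a default action and a list of rules. The Python sketch below builds a tiny whitelist profile and reads it back. The field names defaultAction, syscalls, names, and action follow Docker's profile format, but the syscall list is illustrative only and far too small to start a real container, and build_profile and allowed_syscalls are hypothetical helpers:

```python
import json

def build_profile(allowed_names) -> dict:
    """Whitelist profile: everything fails with an error unless listed."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"names": sorted(allowed_names), "action": "SCMP_ACT_ALLOW"},
        ],
    }

def allowed_syscalls(profile: dict) -> set:
    """Collect every syscall name that some rule explicitly allows."""
    names = set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            names.update(rule.get("names", []))
    return names

profile = build_profile({"read", "write", "exit_group", "brk", "mmap", "munmap"})
text = json.dumps(profile, indent=2)   # what would be saved as my-custom-profile.json
print("execve" in allowed_syscalls(json.loads(text)))   # False
print("read" in allowed_syscalls(json.loads(text)))     # True
```

Running allowed_syscalls over a copy of Docker's default profile is a quick way to check whether a particular system call is permitted before you start trimming.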

Using seccomp in Python via prctl or the libseccomp binding:

python
# Using the libseccomp Python binding (imported as seccomp).
# Package names vary; on Debian/Ubuntu it is packaged as python3-seccomp.
import seccomp

def setup_sandbox():
    # Default action: kill on any system call without an ALLOW rule
    flt = seccomp.SyscallFilter(seccomp.KILL)

    flt.add_rule(seccomp.ALLOW, "read")
    flt.add_rule(seccomp.ALLOW, "write")
    flt.add_rule(seccomp.ALLOW, "exit_group")
    flt.add_rule(seccomp.ALLOW, "futex")
    flt.add_rule(seccomp.ALLOW, "mmap")
    flt.add_rule(seccomp.ALLOW, "munmap")
    flt.add_rule(seccomp.ALLOW, "brk")
    flt.add_rule(seccomp.ALLOW, "clock_gettime")
    # ... more allowed calls

    flt.load()  # Cannot be undone once set

Comprehensive Defense: Defense in Depth

A secure system doesn't defend at a single layer, but stacks multiple layers:

┌─────────────────────────────────────────────────────┐
│  Application Layer (Auth, Authorization, Encoding)  │
├─────────────────────────────────────────────────────┤
│  seccomp (System Call Whitelist)                    │
├─────────────────────────────────────────────────────┤
│  Linux Capabilities (Least Privilege Splitting)     │
├─────────────────────────────────────────────────────┤
│  MAC (SELinux / AppArmor Policies)                  │
├─────────────────────────────────────────────────────┤
│  DAC (Unix User/Group/Permissions)                  │
├─────────────────────────────────────────────────────┤
│  Hardware / Virtualization Isolation                │
└─────────────────────────────────────────────────────┘

Even if an attacker breaks through the web application (the top layer), seccomp may prevent them from calling execve (running a shell), Capabilities may prevent them from modifying the system time, SELinux may prevent them from reading /etc/shadow, and DAC limits them to the files the service user can access.

Each layer is a door in the wall. Breaking through one doesn't mean breaking through all.


Common Pitfalls

  • Running as root then dropping privileges. Many programs start as root, bind ports, then drop to a regular user. This can work, but the drop must be thorough and in the right order: setgroups, then setgid, then setuid, and finally drop any capabilities that are no longer needed. Many programs drop privileges incompletely, for example by calling setuid but forgetting setgroups, or by keeping capabilities they no longer need.
  • Relying only on DAC. DAC is the only layer present on every Linux system. Many distributions ship SELinux or AppArmor enabled, but the default policies confine only some services, and your own applications usually need additional configuration. If you run production servers, configuring MAC is worthwhile.
  • Giving containers full --privileged permissions. This is almost equivalent to giving the container root privileges on the host. Dropping everything with --cap-drop=ALL and adding back only what is needed with an explicit --cap-add list is more secure.
  • Using incomplete seccomp configurations. Some system calls have non-obvious risks (e.g., userfaultfd has been used to make kernel race-condition exploits more reliable). Start with Docker's default seccomp profile and modify from there.
  • Disabling SELinux. "I don't know how to configure this thing, so setenforce 0"—this is the most common but worst practice. SELinux's error messages when a permission check fails can tell you what permissions are missing; use audit2allow to generate fix policies.
  • UNIX SUID binaries. chmod u+s makes a binary run as its owner. Every SUID root binary is a potential attack surface (like sudo, passwd). Check and clean unnecessary SUID:
bash
find / -perm -4000 -type f 2>/dev/null
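
The drop order from the first pitfall can be sketched in Python. The order matters because each step needs privileges that the next step removes. This is a minimal sketch under stated assumptions: the process starts as root, no capabilities were deliberately retained, and the ops parameter is a hypothetical hook that exists only so the call order can be tested without root:

```python
import os

def drop_privileges(uid: int, gid: int, ops=os) -> None:
    """Drop from root to uid/gid, then verify root cannot be regained."""
    ops.setgroups([])   # 1. clear root's supplementary groups (needs root)
    ops.setgid(gid)     # 2. change the group while we still may
    ops.setuid(uid)     # 3. last: as root this sets real, effective and saved UID
    try:
        ops.setuid(0)   # must fail now; success means the drop was incomplete
    except PermissionError:
        return
    raise RuntimeError("privilege drop incomplete: setuid(0) still succeeds")

# Usage, after binding the privileged port (assumes www-data is UID/GID 33):
# drop_privileges(33, 33)
```

Calling setuid first would make the later setgroups and setgid calls fail, leaving the process with root's groups.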

Pass Challenges

  • Warm-up: On a Linux system, run cat /proc/<pid>/status | grep Cap and interpret the output. Check the corresponding permission meanings for CapEff. Find a non-critical process and think about whether it has unnecessary permissions.
  • Challenge: Write a Python script using seccomp that can compute PI to 10000 digits but cannot read any files or establish network connections. Note what system calls your script needs before setup_sandbox to load the Python interpreter.
  • Observe: Run docker run --rm alpine ping 8.8.8.8, then run docker run --rm --cap-drop=NET_RAW alpine ping 8.8.8.8, and observe the difference. Use strace to trace the system call differences.
  • Troubleshoot: Your web app runs fine on Ubuntu 18.04, but after migrating to RHEL 9, Nginx returns 403 even though files and permissions look correct. Debug this issue (consider SELinux context migration).

Traveler's Notes

  • DAC (Unix permissions) is the most basic line of defense, but too coarse-grained and has no constraint on root
  • Linux Capabilities split root's privileges into small units (like CAP_NET_BIND_SERVICE)
  • MAC (SELinux/AppArmor) enforces a system-wide policy that binds even root
  • seccomp provides whitelist filtering at the system call level, the innermost defense
  • Defense in depth: each layer cannot fully trust the layer below or above it

Next Stop Preview

Operating system isolation lets you constrain the capabilities of individual processes. But how do processes communicate? How does data travel across the network? Network attacks go beyond application-layer XSS and SQLi—firewalls, VPNs, IDS/IPS await us. Next chapter, we enter the realm of network security.

Built with VitePress | Software Systems Atlas