
Linux containers in 500 lines of code, Hacker News

I’ve used Linux containers directly and indirectly for years, but I wanted to become more familiar with them. So I wrote some code. This used to be 500 lines of code, I swear, but I’ve revised it some since publishing; I’ve ended up with somewhat more.

I wanted specifically to find a minimal set of restrictions to run untrusted code. This isn’t how you should approach containers on anything with any exposure: you should restrict everything you can. But I think it’s important to know which permissions are categorically unsafe! I’ve tried to back up things I’m saying with links to code or people I trust, but I’d love to know if I missed anything.

This is a noweb-style piece of literate code. References named <<x>> will be expanded to the code block named x. You can find the tangled source here. This document is an orgmode document; you can find its source here. This document and this code are licensed under the GPLv3; you can find its source here.

Container setup

There are several complementary and overlapping mechanisms that make up modern Linux containers. Roughly,

  • namespaces are used to group kernel objects into different sets that can be accessed by specific process trees. For example, pid namespaces limit the view of the process list to the processes within the namespace. There are a couple of different kinds of namespaces. I'll go into this more later.
  • capabilities are used here to set some coarse limits on what uid 0 can do.
  • cgroups is a mechanism to limit use of resources like memory, disk io, and cpu-time.
  • setrlimit is another mechanism for limiting resource usage. It's older than cgroups, but can do some things cgroups can't.

These are all Linux kernel mechanisms. Seccomp, capabilities, and setrlimit are all done with system calls. cgroups is accessed through a filesystem.

There's a lot here, and the scope of each mechanism is pretty unclear. They overlap a lot and it's tricky to find the best way to limit things. User namespaces are somewhat new, and promise to unify a lot of this behavior. But unfortunately compiling the kernel with user namespaces enabled complicates things. Compiling with user namespaces changes the semantics of system-wide capabilities, which could cause more problems or at least confusion (1). There have been a large number of privilege-escalation bugs exposed by user namespaces. “Understanding and Hardening Linux Containers” explains:

"Despite the large upsides the user namespace provides in terms of security, due to the sensitive nature of the user namespace, somewhat conflicting security models and large amount of new code, several serious vulnerabilities have been discovered and new vulnerabilities have unfortunately continued to be discovered. These deal with both the implementation of user namespaces itself or allow the illegitimate or unintended use of the user namespace to perform a privilege escalation. Often these issues present themselves on systems where containers are not being used, and where the kernel version is recent enough to support user namespaces."

It's turned off by default in Linux at the time of this writing (2), but many distributions apply patches to turn it on in a limited way (3).

But all of these issues apply to hosts with user namespaces compiled in; it doesn't really matter whether we use user namespaces or not, especially since I'll be preventing nested user namespaces. So I'll only use a user namespace if they're available.

(The user-namespace handling in this code was originally pretty broken. Jann Horn in particular gave great feedback. Thanks!)

contained.c

    This program can be used like this, to run /misc/img/bin/sh in /misc/img as root:

    [lizzie@empress l-c-i-500-l]$ sudo ./contained -m ~/misc/busybox-img/ -u 0 -c /bin/sh
    => validating Linux version...4.7.-1-grsec on x86_64.
    => setting cgroups...memory...cpu...pids...blkio...done.
    => setting rlimit...done.
    => remounting everything with MS_PRIVATE...remounted.
    => making a temp directory and a bind mount there...done.
    => pivoting root...done.
    => unmounting /oldroot.oQ5jOY...done.
    => trying a user namespace...writing /proc/<pid>/uid_map...writing /proc/<pid>/gid_map...done.
    => switching to uid 0 / gid 0...done.
    => dropping capabilities...bounding...inheritable...done.
    => filtering syscalls...done.
    / # whoami
    root
    / # hostname
    19fe5c-three-of-pentacles
    / # exit
    => cleaning cgroups...done.

       So, a skeleton for it:   

    Listing 7: contained.c

    /* -*- compile-command: "gcc -Wall -Werror -lcap -lseccomp contained.c -o contained" -*- */
    /* This code is licensed under the GPLv3. You can find its text here:
       https://www.gnu.org/licenses/gpl-3.0.en.html */

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <grp.h>
    #include <pwd.h>
    #include <sched.h>
    #include <seccomp.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/capability.h>
    #include <sys/mount.h>
    #include <sys/prctl.h>
    #include <sys/resource.h>
    #include <sys/socket.h>
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <sys/utsname.h>
    #include <sys/wait.h>
    #include <linux/capability.h>
    #include <linux/limits.h>

    struct child_config {
        int argc;
        uid_t uid;
        int fd;
        char *hostname;
        char **argv;
        char *mount_dir;
    };

    int main(int argc, char **argv)
    {
        struct child_config config = {0};
        int err = 0;
        int option = 0;
        int sockets[2] = {0};
        pid_t child_pid = 0;
        int last_optind = 0;
        while ((option = getopt(argc, argv, "c:m:u:"))) {
            switch (option) {
            case 'c':
                /* everything after -c is the command to run */
                config.argc = argc - last_optind - 1;
                config.argv = &argv[argc - config.argc];
                goto finish_options;
            case 'm':
                config.mount_dir = optarg;
                break;
            case 'u':
                if (sscanf(optarg, "%d", &config.uid) != 1) {
                    fprintf(stderr, "badly-formatted uid: %s\n", optarg);
                    goto usage;
                }
                break;
            default:
                goto usage;
            }
            last_optind = optind;
        }
    finish_options:
        if (!config.argc) goto usage;
        if (!config.mount_dir) goto usage;

        char hostname[256] = {0};
        if (choose_hostname(hostname, sizeof(hostname)))
            goto error;
        config.hostname = hostname;

        goto cleanup;
    usage:
        fprintf(stderr, "Usage: %s -u -1 -m . -c /bin/sh ~\n", argv[0]);
    error:
        err = 1;
    cleanup:
        if (sockets[0]) close(sockets[0]);
        if (sockets[1]) close(sockets[1]);
        return err;
    }

      Since I'll be blacklisting system calls and capabilities, it's important to make sure there aren't any new ones.   

    Listing 8:

    fprintf(stderr, "=> validating Linux version...");
    struct utsname host = {0};
    if (uname(&host)) {
        fprintf(stderr, "failed: %m\n");
        goto cleanup;
    }
    int major = -1;
    int minor = -1;
    if (sscanf(host.release, "%u.%u.", &major, &minor) != 2) {
        fprintf(stderr, "weird release format: %s\n", host.release);
        goto cleanup;
    }
    if (major != 4 || (minor != 7 && minor != 8)) {
        fprintf(stderr, "expected 4.7.x or 4.8.x: %s\n", host.release);
        goto cleanup;
    }
    if (strcmp("x86_64", host.machine)) {
        fprintf(stderr, "expected x86_64: %s\n", host.machine);
        goto cleanup;
    }
    fprintf(stderr, "%s on %s.\n", host.release, host.machine);
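The same release-string check can be tried in isolation. This is only a sketch: `release_supported` is a hypothetical helper name, not part of contained.c, and it assumes the same "major.minor." prefix format that the listing expects from uname.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if `release` looks like 4.7.x or 4.8.x,
 * 0 otherwise. It mirrors the sscanf-based check in the listing. */
int release_supported(const char *release)
{
    unsigned major = 0;
    unsigned minor = 0;
    /* sscanf returns the number of conversions that succeeded; the
     * trailing '.' in the format does not add to that count. */
    if (sscanf(release, "%u.%u.", &major, &minor) != 2)
        return 0;
    return major == 4 && (minor == 7 || minor == 8);
}
```

Because the allowed versions are listed explicitly, a newer kernel is rejected rather than silently accepted, which is the point of the check.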

      (This had a bug. captainjey on reddit let me know. Thanks!)

      And I wasn't quite at 500 lines of code, so I thought I had some space to build nice hostnames.

    Listing 9:

    int choose_hostname(char *buff, size_t len)
    {
        static const char *suits[] = { "swords", "wands", "pentacles", "cups" };
        static const char *minor[] = {
            "ace", "two", "three", "four", "five", "six", "seven", "eight",
            "nine", "ten", "page", "knight", "queen", "king"
        };
        static const char *major[] = {
            "fool", "magician", "high-priestess", "empress", "emperor",
            "hierophant", "lovers", "chariot", "strength", "hermit",
            "wheel", "justice", "hanged-man", "death", "temperance",
            "devil", "tower", "star", "moon", "sun", "judgment", "world"
        };
        struct timespec now = {0};
        clock_gettime(CLOCK_MONOTONIC, &now);
        size_t ix = now.tv_nsec % 90;
        if (ix

    Namespaces

    clone is the system call behind fork() et al. It's also the key to all of this. Conceptually we want to create a process with different properties than its parent: it should be able to mount a different /, set its own hostname, and do other things. We'll specify all of this by passing flags to clone (4).

    The child needs to send some messages to the parent, so we'll initialize a socketpair, and then make sure the child only receives access to one.

    Listing:

    if (socketpair(AF_LOCAL, SOCK_SEQPACKET, 0, sockets)) {
        fprintf(stderr, "socketpair failed: %m\n");
        goto error;
    }
    if (fcntl(sockets[0], F_SETFD, FD_CLOEXEC)) {
        fprintf(stderr, "fcntl failed: %m\n");
        goto error;
    }
    config.fd = sockets[1];
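To get a feel for how the parent and child exchange small fixed-size messages over this socketpair, here is a minimal single-process sketch. `roundtrip_int` is a hypothetical helper written for this illustration; it is not part of contained.c, and it only makes sense for non-negative values since it uses -1 to report errors.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: sends `value` over a fresh SOCK_SEQPACKET
 * socketpair and returns what arrives at the other end, or -1 on error. */
int roundtrip_int(int value)
{
    int sockets[2] = {0};
    int received = -1;
    if (socketpair(AF_LOCAL, SOCK_SEQPACKET, 0, sockets))
        return -1;
    /* SOCK_SEQPACKET preserves message boundaries, so one write of
     * sizeof(int) bytes is read back as one whole message. */
    if (write(sockets[1], &value, sizeof(value)) != sizeof(value)
        || read(sockets[0], &received, sizeof(received)) != sizeof(received))
        received = -1;
    close(sockets[0]);
    close(sockets[1]);
    return received;
}
```

In the real program the two ends live in different processes: the parent keeps sockets[0] (marked close-on-exec) and the child gets sockets[1] through config.fd.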

      But first we need to set up room for a stack. We'll execve later, which will actually set up the stack again, so this is only temporary. (5)

    Listing:

    #define STACK_SIZE (32627)
    char *stack = 0;
    if (!(stack = malloc(STACK_SIZE))) {
        fprintf(stderr, "=> malloc failed, out of memory?\n");
        goto error;
    }

      We'll also prepare the cgroup for this process tree. More on this later.   

    Listing:

    if (resources(&config)) {
        err = 1;
        goto clear_resources;
    }

      We'll namespace the mounts, pids, IPC data structures, network devices, and hostname / domain name. I'll go into these more in the code for capabilities, cgroups, and syscalls.   

    Listing:

    int flags = CLONE_NEWNS
        | CLONE_NEWCGROUP
        | CLONE_NEWPID
        | CLONE_NEWIPC
        | CLONE_NEWNET
        | CLONE_NEWUTS;

      Stacks on x86, and almost everything else Linux runs on, grow downwards, so we'll add STACK_SIZE to get a pointer just below the end. (6) We also OR the flags with SIGCHLD so that we can wait on the child.

    Listing:

    if ((child_pid = clone(child, stack + STACK_SIZE, flags | SIGCHLD, &config)) == -1) {
        fprintf(stderr, "=> clone failed! %m\n");
        err = 1;
        goto clear_resources;
    }
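The stack-pointer arithmetic and the SIGCHLD flag can be exercised without any namespace flags, which needs no privileges. This is a sketch under assumptions: `DEMO_STACK_SIZE`, `demo_child`, and `run_on_new_stack` are hypothetical names for this illustration, and the stack size is arbitrary.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

#define DEMO_STACK_SIZE (64 * 1024) /* hypothetical size for this sketch */

/* Child entry point: exits with the int that `arg` points to. */
static int demo_child(void *arg)
{
    return *(int *) arg;
}

/* Hypothetical helper: clones a child onto a malloc'd stack and returns
 * the child's exit status, or -1 on failure. */
int run_on_new_stack(int code)
{
    char *stack = malloc(DEMO_STACK_SIZE);
    int status = 0;
    pid_t pid;
    if (!stack)
        return -1;
    /* Pass the high end of the buffer: the stack grows downwards.
     * SIGCHLD as the exit signal lets the parent wait on the child. */
    pid = clone(demo_child, stack + DEMO_STACK_SIZE, SIGCHLD, &code);
    if (pid == -1) {
        free(stack);
        return -1;
    }
    if (waitpid(pid, &status, 0) != pid) {
        free(stack);
        return -1;
    }
    free(stack);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Without CLONE_VM the child runs in a copy of the parent's address space, so the pointer passed as the last argument is still valid there.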

      Close and zero the child's socket, so that if something breaks then we don't leave an open fd, possibly causing the child or the parent to hang.

    Listing:

    close(sockets[1]);
    sockets[1] = 0;

      The parent process will configure the child's user namespace and then pause until the child process tree exits  (7) .   

    Listing:

    #define USERNS_OFFSET
    #define USERNS_COUNT

    int handle_child_uid_map(pid_t child_pid, int fd)
    {
        int uid_map = 0;
        int has_userns = -1;
        if (read(fd, &has_userns, sizeof(has_userns)) != sizeof(has_userns)) {
            fprintf(stderr, "couldn't read from child!\n");
            return -1;
        }
        if (has_userns) {
            char path[PATH_MAX] = {0};
            for (char **file = (char *[]) { "uid_map", "gid_map", 0 }; *file; file++) {
                if (snprintf(path, sizeof(path), "/proc/%d/%s", child_pid, *file)
                    > sizeof(path)) {
                    fprintf(stderr, "snprintf too big? %m\n");
                    return -1;
                }
                fprintf(stderr, "writing %s...", path);
                if ((uid_map = open(path, O_WRONLY)) == -1) {
                    fprintf(stderr, "open failed: %m\n");
                    return -1;
                }
                if (dprintf(uid_map, "0 %d %d\n", USERNS_OFFSET, USERNS_COUNT) == -1) {
                    fprintf(stderr, "dprintf failed: %m\n");
                    close(uid_map);
                    return -1;
                }
                close(uid_map);
            }
        }
        if (write(fd, &(int) { 0 }, sizeof(int)) != sizeof(int)) {
            fprintf(stderr, "couldn't write: %m\n");
            return -1;
        }
        return 0;
    }
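Each line written to uid_map or gid_map has the form "inside outside count": ids starting at `inside` within the namespace map to ids starting at `outside` on the host, for `count` consecutive ids. A small sketch of building such a line follows; `format_id_map` is a hypothetical helper, not part of contained.c, and any numbers passed to it here are illustrative rather than the values this program uses.

```c
#include <stdio.h>

/* Hypothetical helper: formats one "inside outside count" mapping line
 * of the kind written to /proc/<pid>/uid_map. Returns 0 on success,
 * -1 if the buffer is too small. */
int format_id_map(char *buf, size_t len,
                  unsigned inside, unsigned outside, unsigned count)
{
    int n = snprintf(buf, len, "%u %u %u\n", inside, outside, count);
    /* snprintf returns the length it wanted to write, excluding the
     * terminating NUL, so n >= len means the output was truncated. */
    if (n < 0 || (size_t) n >= len)
        return -1;
    return 0;
}
```

For example, `format_id_map(buf, sizeof(buf), 0, 100000, 1000)` would describe a namespace where uid 0 is host uid 100000 and 1000 ids are mapped in total (made-up numbers).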

      The child process will send a message to the parent process about whether it should set uid and gid mappings. If that works, it will setgroups, setresgid, and setresuid. Both setgroups and setresgid are necessary here since there are two separate group mechanisms on Linux (9). I'm also assuming here that every uid has a corresponding gid, which is common but not necessarily universal.

    Listing:

    int userns(struct child_config *config)
    {
        fprintf(stderr, "=> trying a user namespace...");
        int has_userns = !unshare(CLONE_NEWUSER);
        if (write(config->fd, &has_userns, sizeof(has_userns)) != sizeof(has_userns)) {
            fprintf(stderr, "couldn't write: %m\n");
            return -1;
        }
        int result = 0;
        if (read(config->fd, &result, sizeof(result)) != sizeof(result)) {
            fprintf(stderr, "couldn't read: %m\n");
            return -1;
        }
        if (result) return -1;
        if (has_userns) {
            fprintf(stderr, "done.\n");
        } else {
            fprintf(stderr, "unsupported? continuing.\n");
        }
        fprintf(stderr, "=> switching to uid %d / gid %d...", config->uid, config->uid);
        if (setgroups(1, &(gid_t) { config->uid }) ||
            setresgid(config->uid, config->uid, config->uid) ||
            setresuid(config->uid, config->uid, config->uid)) {
            fprintf(stderr, "%m\n");
            return -1;
        }
        fprintf(stderr, "done.\n");
        return 0;
    }

      And this is where the child process from  clone  will end up. We'll perform all of our setup, switch users and groups, and then load the executable. The order is important here: we can't change mounts without certain capabilities, we can't  unshare   after we limit the syscalls, etc.   

    Listing:

    int child(void *arg)
    {
        struct child_config *config = arg;
        if (sethostname(config->hostname, strlen(config->hostname))
            || mounts(config)
            || userns(config)
            || capabilities()
            || syscalls()) {
            close(config->fd);
            return -1;
        }
        if (close(config->fd)) {
            fprintf(stderr, "close failed: %m\n");
            return -1;
        }
        if (execve(config->argv[0], config->argv, NULL)) {
            fprintf(stderr, "execve failed! %m.\n");
            return -1;
        }
        return 0;
    }

      Capabilities

      capabilities subdivide the property of "being root" on Linux. It's useful to compartmentalize privileges so that, for example, a process can allocate network devices (CAP_NET_ADMIN) but not read all files (CAP_DAC_OVERRIDE). I'll use them here to drop the ones we don't want.

      But not all of "being root" is subdivided into capabilities. For example, writing to parts of procfs is allowed by root even after having dropped capabilities. There are a lot of things like this: this is part of why we need other restrictions besides capabilities.

      It's also important to think about how we're dropping capabilities. man 7 capabilities has an algorithm for us:

      During an execve(2), the kernel calculates the new capabilities of the process using the following algorithm:

          P'(ambient) = (file is privileged) ? 0 : P(ambient)
          P'(permitted) = (P(inheritable) & F(inheritable)) | (F(permitted) & cap_bset) | P'(ambient)
          P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
          P'(inheritable) = P(inheritable) [i.e., unchanged]

      where:

          P denotes the value of a thread capability set before the execve(2)
          P' denotes the value of a thread capability set after the execve(2)
          F denotes a file capability set
          cap_bset is the value of the capability bounding set (described below).
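The quoted algorithm is just bit arithmetic, so it can be modelled directly to see what clearing a set does. The sketch below is a simplified model, not kernel code: `struct cap_sets` and `caps_after_execve` are hypothetical names, each capability is one bit with arbitrary numbering, and "file is privileged" is approximated as "the file has any file capabilities" (set-user-ID and set-group-ID bits are ignored).

```c
#include <stdint.h>

/* One bit per capability; the bit numbering here is hypothetical. */
struct cap_sets {
    uint64_t permitted, inheritable, effective, ambient;
};

/* Hypothetical model of the execve(2) transformation quoted from
 * man 7 capabilities. `file_effective` is the single file effective bit. */
struct cap_sets caps_after_execve(struct cap_sets p,
                                  uint64_t f_permitted, uint64_t f_inheritable,
                                  int file_effective, uint64_t cap_bset)
{
    struct cap_sets n;
    /* Simplification: only file capabilities make a file "privileged". */
    int file_is_privileged = f_permitted || f_inheritable || file_effective;
    n.ambient = file_is_privileged ? 0 : p.ambient;
    n.permitted = (p.inheritable & f_inheritable)
                | (f_permitted & cap_bset)
                | n.ambient;
    n.effective = file_effective ? n.permitted : n.ambient;
    n.inheritable = p.inheritable;
    return n;
}
```

With an empty inheritable set, an empty ambient set, and an empty bounding set, every term of P'(permitted) is zero no matter what the file carries; with a non-empty bounding set, the file's permitted capabilities leak through.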

      We'd like P'(ambient) and P(inheritable) to be empty, and P'(permitted) and P(effective) to only include the capabilities above. This is achievable by doing the following:

  • Clearing our own inheritable set. This clears the ambient set; man 7 capabilities says "The ambient capability set obeys the invariant that no capability can ever be ambient if it is not both permitted and inheritable." This also clears the child's inheritable set.
  • Clearing the bounding set. This limits the file capabilities we'll gain when we execve, and the rest are limited by clearing the inheritable and ambient sets.

      If we were to only drop our own effective, permitted and inheritable sets, we'd regain the permissions in the child file's capabilities. This is how bash can call ping, for example. (24)

    Dropped capabilities

    Listing:

    int capabilities()
    {
        fprintf(stderr, "=> dropping capabilities...");

       CAP_AUDIT_CONTROL, _READ, and _WRITE allow access to the audit system of the kernel (i.e. functions like audit_set_enabled, usually used with auditctl). The kernel prevents messages that normally require CAP_AUDIT_CONTROL outside of the first pid namespace, but it does allow messages that would require CAP_AUDIT_READ and CAP_AUDIT_WRITE from any namespace. (25)

    Listing:

    int drop_caps[] = {
    CAP_AUDIT_CONTROL,
    CAP_AUDIT_READ,
    CAP_AUDIT_WRITE,

    CAP_FSETID,

       CAP_IPC_LOCK can be used to lock more of a process' own memory than would normally be allowed (31), which could be a way to deny service.

    Listing:

    CAP_IPC_LOCK,

       CAP_MAC_ADMIN and CAP_MAC_OVERRIDE are used by the mandatory access control systems AppArmor, SELinux, and SMACK to restrict access to their settings. These aren't namespaced, so they could be used by the contained programs to circumvent system-wide access control.

    Listing:

    CAP_MAC_ADMIN,
    CAP_MAC_OVERRIDE,

       CAP_MKNOD, without user namespacing, allows programs to create device files corresponding to real-world devices. This includes creating new device files for existing hardware. If this capability were not dropped, a contained process could re-create the hard disk device, remount it, and read or write to it. (47)

    Listing:

    CAP_MKNOD,

      I was worried that CAP_SETFCAP could be used to add a capability to an executable and execve it, but it's not actually possible for a process to set capabilities it does not have (32). But! An executable altered this way could be executed by any unsandboxed user, so I think it unacceptably undermines the security of the system.

    Listing:

    CAP_SETFCAP,

       CAP_SYSLOG lets users perform destructive actions against the syslog. Importantly, it does not prevent contained processes from reading the syslog, which could be risky. It also exposes kernel addresses, which could be used to circumvent kernel address layout randomization.

    Listing:

    CAP_SYSLOG,

       CAP_SYS_ADMIN allows many behaviors! We don't want most of them (mount, vm86, etc). Some would be nice to have (sethostname, mount for bind mounts…) but the extra complexity doesn't seem worth it.

    Listing:

    CAP_SYS_ADMIN,

       CAP_SYS_BOOT allows programs to restart the system (the reboot syscall) and load new kernels (the kexec_load and kexec_file_load syscalls). We absolutely don't want this. reboot is user-namespaced, and the kexec functions only work in the root user namespace, but neither of those help us.

    Listing:

    CAP_SYS_BOOT,

       CAP_SYS_MODULE is used by the syscalls delete_module, init_module, and finit_module, by the code for kmod (37), and by the code for loading device modules with ioctl (43).

    Listing:

    CAP_SYS_MODULE,

       CAP_SYS_NICE allows processes to set higher priority on given pids than the default (38). The default kernel scheduler doesn't know anything about pid namespaces, so it's possible for a contained process to deny service to the rest of the system (40).

    Listing:

    CAP_SYS_NICE,

       CAP_SYS_RAWIO allows full access to the host system's memory with /proc/kcore, /dev/mem, and /dev/kmem (41), but a contained process would need mknod to access these within the namespace. But it also allows things like iopl and ioperm, which give raw access to the IO ports (43).

    Listing:

    CAP_SYS_RAWIO,

       CAP_SYS_RESOURCE specifically allows circumventing kernel-wide limits, so we probably should drop it (45). But I don't think this can do more than DOS the kernel, in general (45).

    Listing:

    CAP_SYS_RESOURCE,

       CAP_SYS_TIME: setting the time isn't namespaced, so we should prevent contained processes from altering the system-wide time.

    Listing:

    CAP_SYS_TIME,

       CAP_WAKE_ALARM, like CAP_BLOCK_SUSPEND, lets the contained process interfere with suspend (48), and we'd like to prevent that.

    Listing:

    CAP_WAKE_ALARM
    };
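Each entry in a list like this is a capability number, and the kernel can be asked whether a given number is still in the calling thread's bounding set. As a sketch of how one might check the effect of dropping them (not something contained.c does), the hypothetical helper `count_in_bounding_set` below uses prctl with PR_CAPBSET_READ, which needs no privileges:

```c
#include <linux/capability.h>
#include <sys/prctl.h>

/* Hypothetical helper: counts how many of `caps` are still present in
 * the calling thread's bounding set. Returns -1 if the kernel rejects
 * one of the capability numbers. */
int count_in_bounding_set(const int *caps, int num_caps)
{
    int present = 0;
    for (int i = 0; i < num_caps; i++) {
        /* PR_CAPBSET_READ returns 1 if the capability is in the
         * bounding set, 0 if it is not, and -1 on error. */
        int r = prctl(PR_CAPBSET_READ, (unsigned long) caps[i], 0UL, 0UL, 0UL);
        if (r == -1)
            return -1;
        present += r;
    }
    return present;
}
```

After a successful drop of every listed capability, calling this on the same list should report zero remaining.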

     

    Listing:

    size_t num_caps = sizeof(drop_caps) / sizeof(*drop_caps);
    fprintf(stderr, "bounding...");
    for (size_t i = 0; i

    Retained Capabilities

    It's important to keep track of the capabilities I'm not dropping, too.

    I've heard multiple places (46) that CAP_DAC_OVERRIDE might expose the same functionality as CAP_DAC_READ_SEARCH (i.e. open_by_handle_at), but as far as I can tell that isn't true. shocker.c doesn't get anywhere with only CAP_DAC_OVERRIDE (50), and the only usage in the kernel is in the Unix permission-checking code (49). So my understanding is that CAP_DAC_OVERRIDE on its own does not allow processes to read outside of their mount namespaces ("DAC" or "Discretionary Access Control" refers here to ordinary Unix permissions).

    CAP_FOWNER, CAP_LEASE, and CAP_LINUX_IMMUTABLE all operate on files inside of the mount namespace.

    Likewise, CAP_SYS_PACCT allows a process to switch accounting on and off for itself. The acct system call takes a path to log to (which must be within the mount namespace), and only operates on the calling process. We're not using process accounting in our containerization, so turning it off should be harmless as well.

    CAP_IPC_OWNER is only used by functions that respect IPC namespaces (51); since we're in a separate IPC namespace from the host, we can allow this.

    CAP_NET_ADMIN lets processes create network devices; CAP_NET_BIND_SERVICE lets processes bind to low ports on those devices; CAP_NET_RAW lets processes send raw packets on those devices. Since we're going to isolate the networking with a virtual bridge, and the contained process is inside of a network namespace, these shouldn't be an issue (51). I was wondering whether we could recreate an existing device like mknod does, but I don't think it's possible (52).

    CAP_SYS_PTRACE does not allow ptrace across pid namespaces (53). CAP_KILL does not allow signals across pid namespaces.

    CAP_SETUID and CAP_SETGID have similar behaviors (56):

  • make arbitrary manipulations of process UIDs and GIDs and supplementary GID list, which will only apply to pids in the namespace.
  • forge UID (GID) when passing socket credentials via UNIX domain sockets: the mount namespace should prevent us from reading the host system's unix domain sockets.
  • write a user (group ID) mapping in a user namespace (see user_namespaces(7)): this is /proc/self/uid_map, which will be hidden inside the container.

    CAP_SETPCAP only lets processes add or drop capabilities they already effectively have; man 7 capabilities says:

    "If file capabilities are supported: add any capability from the calling thread's bounding set to its inheritable set; drop capabilities from the bounding set (via prctl(2) PR_CAPBSET_DROP); make changes to the securebits flags."

    We've dropped everything relevant from the bounding set, and dropping further capabilities should be harmless.

    CAP_SYS_CHROOT is traditionally abused by changing root to a directory with a setuid root binary and tampered-with dynamic libraries (56). Additionally, it can be used to escape a chroot "jail" (56). Neither of those should be relevant in our setup so this should be harmless.

    Brad Spengler, in "False Boundaries and Arbitrary Code Execution", says that CAP_SYS_TTY_CONFIG can be used to "temporarily change the keyboard mapping of an administrator's tty via the KDSETKEYCODE ioctl to cause a different command to be executed than intended", but again this is an ioctl against a device that should be impossible to access within the mount namespace.

    Mounts

    The child process is in its own mount namespace, so we can unmount things that it specifically shouldn't have access to. Here's how:

  • Create a temporary directory, and one inside of it.
  • Bind mount of the user argument onto the temporary directory