I’ve used Linux containers directly and indirectly for years, but I wanted to become more familiar with them. So I wrote some code. This used to be lines of code, I swear, but I’ve revised it some since publishing; I’ve ended up with about lines more.

I wanted specifically to find a minimal set of restrictions to run untrusted code. This isn’t how you should approach containers on anything with any exposure: you should restrict everything you can. But I think it’s important to know which permissions are categorically unsafe! I’ve tried to back up things I’m saying with links to code or people I trust, but I’d love to know if I missed anything.

This is a noweb-style piece of literate code. References named <<x>> will be expanded to the code block named x.
You can find the tangled source
There are several complementary and overlapping mechanisms that make up modern Linux containers. Roughly,

- namespaces are used to group kernel objects into different sets that can be accessed by specific process trees. For example, pid namespaces limit the view of the process list to the processes within the namespace. There are a couple of different kinds of namespaces. I'll go into this more later.
- capabilities are used here to set some coarse limits on what uid 0 can do.
- cgroups is a mechanism to limit use of resources like memory, disk I/O, and CPU time.
- setrlimit is another mechanism for limiting resource usage. It's older than cgroups, but can do some things cgroups can't.

These are all Linux kernel mechanisms. Seccomp, capabilities, and setrlimit are all done with system calls. cgroups is accessed through a filesystem.

There's a lot here, and the scope of each mechanism is pretty unclear. They overlap a lot and it's tricky to find the best way to limit things. User namespaces are somewhat new, and promise to unify a lot of this behavior. But unfortunately compiling the kernel with user namespaces enabled complicates things. Compiling with user namespaces changes the semantics of system-wide capabilities, which could cause more problems or at least confusion (1). There have been a large number of privilege-escalation bugs exposed by user namespaces. “Understanding and Hardening Linux Containers” explains:

"Despite the large upsides the user namespace provides in terms of security, due to the sensitive nature of the user namespace, somewhat conflicting security models and large amount of new code, several serious vulnerabilities have been discovered and new vulnerabilities have unfortunately continued to be discovered. These deal with both the implementation of user namespaces itself or allow the illegitimate or unintended use of the user namespace to perform a privilege escalation. Often these issues present themselves on systems where containers are not being used, and where the kernel version is recent enough to support user namespaces."

It's turned off by default in Linux at the time of this writing (2), but many distributions apply patches to turn it on in a limited way (3).

But all of these issues apply to hosts with user namespaces compiled in; it doesn't really matter whether we use user namespaces or not, especially since I'll be preventing nested user namespaces. So I'll only use a user namespace if they're available.

(The user-namespace handling in this code was originally pretty broken. Jann Horn in particular gave great feedback. Thanks!)

contained.c
This program can be used like this, to run /misc/img/bin/sh in /misc/img as root:

    [lizzie@empress l-c-i-500-l]$ sudo ./contained -m ~/misc/busybox-img/ -u 0 -c /bin/sh
    => validating Linux version...4.7. - 1-grsec on x86_64.
    => setting cgroups...memory...cpu...pids...blkio...done.
    => setting rlimit...done.
    => remounting everything with MS_PRIVATE...remounted.
    => making a temp directory and a bind mount there...done.
    => pivoting root...done.
    => unmounting /oldroot.oQ5jOY...done.
    => trying a user namespace...writing /proc//uid_map...writing /proc//gid_map...done.
    => switching to uid 0 / gid 0...done.
    => dropping capabilities...bounding...inheritable...done.
    => filtering syscalls...done.
    / # whoami
    root
    / # hostname
    19 fe5c-three-of-pentacles
    / # exit
    => cleaning cgroups...done.
So, a skeleton for it:
Listing 7: <<contained.c>>=

    /* -*- compile-command: "gcc -Wall -Werror -lcap -lseccomp contained.c -o contained" -*- */
    /* This code is licensed under the GPLv3. You can find its text here:
       https://www.gnu.org/licenses/gpl-3.0.en.html */

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <grp.h>
    #include <pwd.h>
    #include <sched.h>
    #include <seccomp.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/capability.h>
    #include <sys/mount.h>
    #include <sys/prctl.h>
    #include <sys/resource.h>
    #include <sys/socket.h>
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <sys/utsname.h>
    #include <sys/wait.h>
    #include <linux/capability.h>
    #include <linux/limits.h>

    struct child_config {
        int argc;
        uid_t uid;
        int fd;
        char *hostname;
        char **argv;
        char *mount_dir;
    };

    int main(int argc, char **argv)
    {
        struct child_config config = {0};
        int err = 0;
        int option = 0;
        int sockets[2] = {0};
        pid_t child_pid = 0;
        int last_optind = 0;
        while ((option = getopt(argc, argv, "c:m:u:"))) {
            switch (option) {
            case 'c':
                /* everything after -c is the command to run */
                config.argc = argc - last_optind - 1;
                config.argv = &argv[argc - config.argc];
                goto finish_options;
            case 'm':
                config.mount_dir = optarg;
                break;
            case 'u':
                if (sscanf(optarg, "%d", &config.uid) != 1) {
                    fprintf(stderr, "badly-formatted uid: %s\n", optarg);
                    goto usage;
                }
                break;
            default:
                goto usage;
            }
            last_optind = optind;
        }
    finish_options:
        if (!config.argc) goto usage;
        if (!config.mount_dir) goto usage;

        char hostname[256] = {0};
        if (choose_hostname(hostname, sizeof(hostname)))
            goto error;
        config.hostname = hostname;

        goto cleanup;
    usage:
        fprintf(stderr, "Usage: %s -u -1 -m . -c /bin/sh ~\n", argv[0]);
    error:
        err = 1;
    cleanup:
        if (sockets[0]) close(sockets[0]);
        if (sockets[1]) close(sockets[1]);
        return err;
    }
Since I'll be blacklisting system calls and capabilities, it's important to make sure there aren't any new ones.
    fprintf(stderr, "=> validating Linux version...");
    struct utsname host = {0};
    if (uname(&host)) {
        fprintf(stderr, "failed: %m\n");
        goto cleanup;
    }
    int major = -1;
    int minor = -1;
    if (sscanf(host.release, "%u.%u.", &major, &minor) != 2) {
        fprintf(stderr, "weird release format: %s\n", host.release);
        goto cleanup;
    }
    if (major != 4 || (minor != 7 && minor != 8)) {
        fprintf(stderr, "expected 4.7.x or 4.8.x: %s\n", host.release);
        goto cleanup;
    }
    if (strcmp("x86_64", host.machine)) {
        fprintf(stderr, "expected x86_64: %s\n", host.machine);
        goto cleanup;
    }
    fprintf(stderr, "%s on %s.\n", host.release, host.machine);
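The version gate is easier to exercise if the parsing is pulled out of main and fed a string instead of calling uname. Here is a minimal sketch of that idea; the helper names parse_release and release_is_supported are hypothetical and are not part of contained.c.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: parse "MAJOR.MINOR..." from a uname-style release
 * string. Returns 0 on success, -1 if the string does not start with two
 * dot-separated numbers. */
int parse_release(const char *release, unsigned *major, unsigned *minor)
{
    assert(release && major && minor);
    if (sscanf(release, "%u.%u", major, minor) != 2)
        return -1;
    return 0;
}

/* Hypothetical helper mirroring the 4.7.x / 4.8.x check in the text.
 * Returns 1 if the release is one of the accepted series, 0 otherwise. */
int release_is_supported(const char *release)
{
    unsigned major = 0, minor = 0;
    if (parse_release(release, &major, &minor))
        return 0;
    return major == 4 && (minor == 7 || minor == 8);
}
```

With this split, release_is_supported("4.9.0") is rejected without needing a 4.9 kernel to test against.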
(This had a bug. captainjey on reddit let me know. Thanks!)

And I wasn't quite at 868 lines of code, so I thought I had some space to build nice hostnames.
Listing 9:

    int choose_hostname(char *buff, size_t len)
    {
        static const char *suits[] = { "swords", "wands", "pentacles", "cups" };
        static const char *minor[] = {
            "ace", "two", "three", "four", "five", "six", "seven", "eight",
            "nine", "ten", "page", "knight", "queen", "king"
        };
        static const char *major[] = {
            "fool", "magician", "high-priestess", "empress", "emperor",
            "hierophant", "lovers", "chariot", "strength", "hermit",
            "wheel", "justice", "hanged-man", "death", "temperance",
            "devil", "tower", "star", "moon", "sun", "judgment", "world"
        };
        struct timespec now = {0};
        clock_gettime(CLOCK_MONOTONIC, &now);
        /* 22 major cards + 4 suits * 14 minor cards = 78 names */
        size_t ix = now.tv_nsec % 78;
        if (ix < sizeof(major) / sizeof(*major)) {
            snprintf(buff, len, "%05lx-%s", now.tv_sec, major[ix]);
        } else {
            ix -= sizeof(major) / sizeof(*major);
            snprintf(buff, len,
                     "%05lxc-%s-of-%s",
                     now.tv_sec,
                     minor[ix % (sizeof(minor) / sizeof(*minor))],
                     suits[ix / (sizeof(minor) / sizeof(*minor))]);
        }
        return 0;
    }

Namespaces

clone is the system call behind fork() et al. It's also the key to all of this. Conceptually we want to create a process with different properties than its parent: it should be able to mount a different /, set its own hostname, and do other things. We'll specify all of this by passing flags to clone (4).

The child needs to send some messages to the parent, so we'll initialize a socketpair, and then make sure the child only receives access to one.
    if (socketpair(AF_LOCAL, SOCK_SEQPACKET, 0, sockets)) {
        fprintf(stderr, "socketpair failed: %m\n");
        goto error;
    }
    if (fcntl(sockets[0], F_SETFD, FD_CLOEXEC)) {
        fprintf(stderr, "fcntl failed: %m\n");
        goto error;
    }
    config.fd = sockets[1];
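To see the parent/child messaging pattern on its own, here is a small sketch that uses the same socketpair call but a plain fork instead of clone, so it runs without privileges. The function name roundtrip_over_socketpair is hypothetical; the point is only that each side closes the end it doesn't own and exchanges one fixed-size message.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: the child sends one int, the parent reads it back.
 * Returns the value received, or -1 on any failure. */
int roundtrip_over_socketpair(int value)
{
    int sockets[2];
    if (socketpair(AF_LOCAL, SOCK_SEQPACKET, 0, sockets))
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(sockets[0]);
        close(sockets[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: keep only its own end. */
        close(sockets[0]);
        ssize_t n = write(sockets[1], &value, sizeof(value));
        _exit(n == (ssize_t)sizeof(value) ? 0 : 1);
    }

    /* Parent: close the child's end so a dead child shows up as EOF
     * instead of a read that blocks forever. */
    close(sockets[1]);
    int received = -1;
    ssize_t n = read(sockets[0], &received, sizeof(received));
    close(sockets[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (n != (ssize_t)sizeof(received)
        || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return received;
}
```

SOCK_SEQPACKET keeps message boundaries, which is why a single read of sizeof(int) is enough here.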
But first we need to set up room for a stack. We'll execve later, which will actually set up the stack again, so this is only temporary. (5)

    #define STACK_SIZE (1024 * 1024)

    char *stack = 0;
    if (!(stack = malloc(STACK_SIZE))) {
        fprintf(stderr, "=> malloc failed, out of memory?\n");
        goto error;
    }

We'll also prepare the cgroup for this process tree. More on this later.

    if (resources(&config)) {
        err = 1;
        goto clear_resources;
    }
We'll namespace the mounts, pids, IPC data structures, network devices, and hostname/domain name. I'll go into these more in the code for capabilities, cgroups, and syscalls.

    int flags = CLONE_NEWNS
        | CLONE_NEWCGROUP
        | CLONE_NEWPID
        | CLONE_NEWIPC
        | CLONE_NEWNET
        | CLONE_NEWUTS;
Stacks on x86, and almost everything else Linux runs on, grow downwards, so we'll add STACK_SIZE to get a pointer just below the end. (6) We also OR the flags with SIGCHLD so that we can wait on it.

    if ((child_pid = clone(child, stack + STACK_SIZE, flags | SIGCHLD, &config)) == -1) {
        fprintf(stderr, "=> clone failed! %m\n");
        err = 1;
        goto clear_resources;
    }
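The stack arithmetic is easy to try in isolation. The sketch below calls clone with no namespace flags at all, so it needs no privileges; demo_child, clone_and_wait, and DEMO_STACK_SIZE are hypothetical names used only for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define DEMO_STACK_SIZE (1024 * 1024)

/* Hypothetical child entry point: exits with the int that arg points at. */
static int demo_child(void *arg)
{
    return *(int *)arg;
}

/* Runs demo_child via clone() on a heap-allocated stack and returns its
 * exit status, or -1 on failure. */
int clone_and_wait(int exit_code)
{
    char *stack = malloc(DEMO_STACK_SIZE);
    if (!stack)
        return -1;

    /* The stack grows down, so pass the address just past the top of the
     * allocation. SIGCHLD makes the child waitable with waitpid(). */
    pid_t pid = clone(demo_child, stack + DEMO_STACK_SIZE, SIGCHLD, &exit_code);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    free(stack);
    if (waited == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Because CLONE_VM is not set, the child runs on its own copy of the address space, so freeing the stack in the parent after waitpid is safe.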
Close and zero the child's socket, so that if something breaks then we don't leave an open fd, possibly causing the child or the parent to hang.

    close(sockets[1]);
    sockets[1] = 0;
The parent process will configure the child's user namespace and then pause until the child process tree exits (7).

    #define USERNS_OFFSET
    #define USERNS_COUNT

    int handle_child_uid_map(pid_t child_pid, int fd)
    {
        int uid_map = 0;
        int has_userns = -1;
        /* the child reports whether unshare(CLONE_NEWUSER) worked */
        if (read(fd, &has_userns, sizeof(has_userns)) != sizeof(has_userns)) {
            fprintf(stderr, "couldn't read from child!\n");
            return -1;
        }
        if (has_userns) {
            char path[PATH_MAX] = {0};
            for (char **file = (char *[]) { "uid_map", "gid_map", 0 }; *file; file++) {
                if (snprintf(path, sizeof(path), "/proc/%d/%s", child_pid, *file)
                    > sizeof(path)) {
                    fprintf(stderr, "snprintf too big? %m\n");
                    return -1;
                }
                fprintf(stderr, "writing %s...", path);
                if ((uid_map = open(path, O_WRONLY)) == -1) {
                    fprintf(stderr, "open failed: %m\n");
                    return -1;
                }
                if (dprintf(uid_map, "0 %d %d\n", USERNS_OFFSET, USERNS_COUNT) == -1) {
                    fprintf(stderr, "dprintf failed: %m\n");
                    close(uid_map);
                    return -1;
                }
                close(uid_map);
            }
        }
        /* tell the child it can continue */
        if (write(fd, &(int) { 0 }, sizeof(int)) != sizeof(int)) {
            fprintf(stderr, "couldn't write: %m\n");
            return -1;
        }
        return 0;
    }
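Each line written to uid_map or gid_map has three fields: the first id inside the namespace, the first id outside it, and the length of the range. A small sketch of building such a line follows; format_id_map is a hypothetical helper, and any numbers passed to it in a test are made-up illustrative values, not the offsets used by contained.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format one uid_map/gid_map line,
 * "<first id inside> <first id outside> <count>\n".
 * Returns 0 on success, -1 if buf is too small. */
int format_id_map(char *buf, size_t len,
                  unsigned inside, unsigned outside, unsigned count)
{
    assert(buf);
    int n = snprintf(buf, len, "%u %u %u\n", inside, outside, count);
    /* snprintf returns the length it wanted to write, not counting the
     * terminating NUL, so n == len also means the output was truncated. */
    if (n < 0 || (size_t)n >= len)
        return -1;
    return 0;
}
```

Mapping id 0 inside the namespace to an unprivileged range outside is what lets the contained process see itself as root without being root on the host.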
The child process will send a message to the parent process about whether it should set uid and gid mappings. If that works, it will call setgroups, setresgid, and setresuid. Both setgroups and setresgid are necessary here since there are two separate group mechanisms on Linux (9). I'm also assuming here that every uid has a corresponding gid, which is common but not necessarily universal.

    int userns(struct child_config *config)
    {
        fprintf(stderr, "=> trying a user namespace...");
        int has_userns = !unshare(CLONE_NEWUSER);
        if (write(config->fd, &has_userns, sizeof(has_userns)) != sizeof(has_userns)) {
            fprintf(stderr, "couldn't write: %m\n");
            return -1;
        }
        int result = 0;
        if (read(config->fd, &result, sizeof(result)) != sizeof(result)) {
            fprintf(stderr, "couldn't read: %m\n");
            return -1;
        }
        if (result) return -1;
        if (has_userns) {
            fprintf(stderr, "done.\n");
        } else {
            fprintf(stderr, "unsupported? continuing.\n");
        }
        fprintf(stderr, "=> switching to uid %d / gid %d...", config->uid, config->uid);
        if (setgroups(1, &(gid_t) { config->uid }) ||
            setresgid(config->uid, config->uid, config->uid) ||
            setresuid(config->uid, config->uid, config->uid)) {
            fprintf(stderr, "%m\n");
            return -1;
        }
        fprintf(stderr, "done.\n");
        return 0;
    }
And this is where the child process from clone will end up. We'll perform all of our setup, switch users and groups, and then load the executable. The order is important here: we can't change mounts without certain capabilities, we can't unshare after we limit the syscalls, etc.

    int child(void *arg)
    {
        struct child_config *config = arg;
        if (sethostname(config->hostname, strlen(config->hostname))
            || mounts(config)
            || userns(config)
            || capabilities()
            || syscalls()) {
            close(config->fd);
            return -1;
        }
        if (close(config->fd)) {
            fprintf(stderr, "close failed: %m\n");
            return -1;
        }
        if (execve(config->argv[0], config->argv, NULL)) {
            fprintf(stderr, "execve failed! %m.\n");
            return -1;
        }
        return 0;
    }
Capabilities

Capabilities subdivide the property of "being root" on Linux. It's useful to compartmentalize privileges so that, for example, a process can allocate network devices (CAP_NET_ADMIN) but not read all files (CAP_DAC_OVERRIDE). I'll use them here to drop the ones we don't want.

But not all of "being root" is subdivided into capabilities. For example, writing to parts of procfs is allowed by root even after having dropped capabilities. There are a lot of things like this: this is part of why we need other restrictions besides capabilities.

It's also important to think about how we're dropping capabilities. man 7 capabilities has an algorithm for us:
    During an execve(2), the kernel calculates the new capabilities of the process using the following algorithm:

        P'(ambient) = (file is privileged) ? 0 : P(ambient)

        P'(permitted) = (P(inheritable) & F(inheritable)) |
                        (F(permitted) & cap_bset) | P'(ambient)

        P'(effective) = F(effective) ? P'(permitted) : P'(ambient)

        P'(inheritable) = P(inheritable)    [i.e., unchanged]

    where:

        P denotes the value of a thread capability set before the execve(2)
        P' denotes the value of a thread capability set after the execve(2)
        F denotes a file capability set
        cap_bset is the value of the capability bounding set (described below).
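The quoted rules translate almost directly into bit operations, which makes it easy to check what a given combination of sets produces. This is only a sketch of the formula above: the struct and function names are hypothetical, each capability is modeled as one bit of a 64-bit mask, and the man page's separate special handling for programs run by root is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Thread capability sets as bitmasks, one bit per capability. */
struct thread_caps {
    uint64_t permitted, inheritable, effective, ambient;
};

/* File capability sets. The file "effective" set is a single flag. */
struct file_caps {
    uint64_t permitted, inheritable;
    bool effective;
    bool privileged; /* "file is privileged" in the quoted rule */
};

/* Applies the quoted execve(2) rules to p and returns the new sets. */
struct thread_caps caps_after_execve(struct thread_caps p, struct file_caps f,
                                     uint64_t cap_bset)
{
    struct thread_caps n;
    n.ambient = f.privileged ? 0 : p.ambient;
    n.permitted = (p.inheritable & f.inheritable)
                | (f.permitted & cap_bset)
                | n.ambient;
    n.effective = f.effective ? n.permitted : n.ambient;
    n.inheritable = p.inheritable;
    return n;
}
```

Plugging in an empty inheritable set, an empty ambient set, and an empty bounding set gives empty permitted and effective sets no matter what the file carries, since every term in the permitted formula is then zero.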
We'd like P'(ambient) and P(inheritable) to be empty, and P'(permitted) and P(effective) to only include the capabilities above. This is achievable by doing the following:

- Clearing our own inheritable set. This clears the ambient set; man 7 capabilities says "The ambient capability set obeys the invariant that no capability can ever be ambient if it is not both permitted and inheritable." This also clears the child's inheritable set.
- Clearing the bounding set. This limits the file capabilities we'll gain when we execve, and the rest are limited by clearing the inheritable and ambient sets.

If we were to only drop our own effective, permitted and inheritable sets, we'd regain the permissions in the child file's capabilities. This is how bash can call ping, for example. (24)

Dropped capabilities

    int capabilities()
    {
        fprintf(stderr, "=> dropping capabilities...");

CAP_AUDIT_CONTROL, _READ, and _WRITE allow access to the audit system of the kernel (i.e. functions like audit_set_enabled, usually used with auditctl). The kernel prevents messages that normally require CAP_AUDIT_CONTROL outside of the first pid namespace, but it does allow messages that would require CAP_AUDIT_READ and CAP_AUDIT_WRITE from any namespace. (25)

So let's drop them all. We especially want to drop CAP_AUDIT_READ, since it isn't namespaced (27) and may contain important information, but CAP_AUDIT_WRITE may also allow the contained process to falsify logs or DOS the audit system.

        int drop_caps[] = {
            CAP_AUDIT_CONTROL,
            CAP_AUDIT_READ,
            CAP_AUDIT_WRITE,

CAP_BLOCK_SUSPEND lets programs prevent the system from suspending, either with EPOLLWAKEUP or /proc/sys/wake_lock. (28) Suspend isn't namespaced, so we'd like to prevent this.

            CAP_BLOCK_SUSPEND,
CAP_DAC_READ_SEARCH lets programs call open_by_handle_at with an arbitrary struct file_handle. struct file_handle is in theory an opaque type, but in practice it corresponds to inode numbers. So it's easy to brute-force them, and read arbitrary files. This was used by Sebastian Krahmer to write a program to read arbitrary system files from within Docker.

            CAP_DAC_READ_SEARCH,
CAP_FSETID, without user namespacing, allows the process to modify a setuid executable without removing the setuid bit. This is pretty dangerous! It means that if we include a setuid binary in a container, it's easy for us to accidentally leave a dangerous setuid root binary on our disk, which any user can use to escalate privileges.

            CAP_FSETID,
CAP_IPC_LOCK can be used to lock more of a process' own memory than would normally be allowed (31), which could be a way to deny service.
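One way to sanity-check a drop list like this is to ask the kernel which of the capabilities are still in the bounding set. The sketch below uses prctl with PR_CAPBSET_READ, which needs no privileges; cap_in_bounding_set and count_bounded are hypothetical helpers and are not part of contained.c.

```c
#include <assert.h>
#include <linux/capability.h>
#include <sys/prctl.h>

/* Returns 1 if cap is in the calling thread's bounding set, 0 if it is
 * not, and -1 if the kernel rejects the capability number. */
int cap_in_bounding_set(int cap)
{
    return prctl(PR_CAPBSET_READ, (unsigned long)cap, 0UL, 0UL, 0UL);
}

/* Hypothetical helper: counts how many of the listed capabilities are
 * still in the bounding set. */
int count_bounded(const int *caps, int ncaps)
{
    assert(caps || ncaps == 0);
    int count = 0;
    for (int i = 0; i < ncaps; i++)
        if (cap_in_bounding_set(caps[i]) == 1)
            count++;
    return count;
}
```

Run inside the container after the capabilities have been dropped, count_bounded over the drop list would be expected to return 0.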