Linux containers in 500 lines of code, Hacker News


  There are several complementary and overlapping mechanisms that make up modern Linux containers. Roughly,

  namespaces are used to group kernel objects into different sets that can be accessed by specific process trees. For example, pid namespaces limit the view of the process list to the processes within the namespace. There are a couple of different kinds of namespaces. I'll go into this more later.

  capabilities are used here to set some coarse limits on what uid 0 can do.









  cgroups is a mechanism to limit use of resources like memory, disk io, and cpu-time.

  setrlimit is another mechanism for limiting resource usage. It's older than cgroups, but can do some things cgroups can't.

  These are all Linux kernel mechanisms. Seccomp, capabilities, and setrlimit are all done with system calls. cgroups is accessed through a filesystem.

  There's a lot here, and the scope of each mechanism is pretty unclear. They overlap a lot and it's tricky to find the best way to limit things. User namespaces are somewhat new, and promise to unify a lot of this behavior. But unfortunately compiling the kernel with user namespaces enabled complicates things. Compiling with user namespaces changes the semantics of system-wide capabilities, which could cause more problems or at least confusion (1). There have been a large number of privilege-escalation bugs exposed by user namespaces. "Understanding and Hardening Linux Containers" explains:

    Despite the large upsides the user namespace provides in terms of security, due to the sensitive nature of the user namespace, somewhat conflicting security models and large amount of new code, several serious vulnerabilities have been discovered and new vulnerabilities have unfortunately continued to be discovered. These deal with both the implementation of user namespaces itself or allow the illegitimate or unintended use of the user namespace to perform a privilege escalation. Often these issues present themselves on systems where containers are not being used, and where the kernel version is recent enough to support user namespaces.

  It's turned off by default in Linux at the time of this writing (2), but many distributions apply patches to turn it on in a limited way (3).

  But all of these issues apply to hosts with user namespaces compiled in; it doesn't really matter whether we use user namespaces or not, especially since I'll be preventing nested user namespaces. So I'll only use a user namespace if they're available.

  (The user-namespace handling in this code was originally pretty broken. Jann Horn in particular gave great feedback. Thanks!)

  contained.c

  This program can be used like this, to run /misc/img/bin/sh in /misc/img as root:









    [lizzie@empress l-c-i-500-l]$ sudo ./contained -m ~/misc/busybox-img/ -u 0 -c /bin/sh
    => validating Linux version...4.7.[...]-1-grsec on x86_64.
    => setting cgroups...memory...cpu...pids...blkio...done.
    => setting rlimit...done.
    => remounting everything with MS_PRIVATE...remounted.
    => making a temp directory and a bind mount there...done.
    => pivoting root...done.
    => unmounting /oldroot.oQ5jOY...done.
    => trying a user namespace...writing /proc/[...]/uid_map...writing /proc/[...]/gid_map...done.
    => switching to uid 0 / gid 0...done.
    => dropping capabilities...bounding...inheritable...done.
    => filtering syscalls...done.
    / # whoami
    root
    / # hostname
    19fe5c-three-of-pentacles
    / # exit
    => cleaning cgroups...done.
   So, a skeleton for it:   
  Listing 7: contained.c

    /* -*- compile-command: "gcc -Wall -Werror -lcap -lseccomp contained.c -o contained" -*- */
    /* This code is licensed under the GPLv3. You can find its text here:
       https://www.gnu.org/licenses/gpl-3.0.en.html */

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <grp.h>
    #include <pwd.h>
    #include <sched.h>
    #include <seccomp.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/capability.h>
    #include <sys/mount.h>
    #include <sys/prctl.h>
    #include <sys/resource.h>
    #include <sys/socket.h>
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <sys/utsname.h>
    #include <sys/wait.h>
    #include <linux/capability.h>
    #include <linux/limits.h>

    /* Everything the child needs to know, passed through clone. */
    struct child_config {
        int argc;
        uid_t uid;
        int fd;
        char *hostname;
        char **argv;
        char *mount_dir;
    };

    /* The functions from the listings in the following sections go here. */

    int main (int argc, char **argv)
    {
        struct child_config config = {0};
        int err = 0;
        int option = 0;
        int sockets[2] = {0};
        pid_t child_pid = 0;
        int last_optind = 0;
        while ((option = getopt(argc, argv, "c:m:u:"))) {
            switch (option) {
            case 'c':
                /* Everything after -c is the command to run. */
                config.argc = argc - last_optind - 1;
                config.argv = &argv[argc - config.argc];
                goto finish_options;
            case 'm':
                config.mount_dir = optarg;
                break;
            case 'u':
                if (sscanf(optarg, "%d", &config.uid) != 1) {
                    fprintf(stderr, "badly-formatted uid: %s\n", optarg);
                    goto usage;
                }
                break;
            default:
                goto usage;
            }
            last_optind = optind;
        }
    finish_options:
        if (!config.argc) goto usage;
        if (!config.mount_dir) goto usage;

        char hostname[256] = {0};
        if (choose_hostname(hostname, sizeof(hostname)))
            goto error;
        config.hostname = hostname;

        /* The Linux version check and the namespace setup from the
           listings in the following sections go here. */

        goto cleanup;
    usage:
        fprintf(stderr, "Usage: %s -u -1 -m . -c /bin/sh ~\n", argv[0]);
    error:
        err = 1;
    cleanup:
        if (sockets[0]) close(sockets[0]);
        if (sockets[1]) close(sockets[1]);
        return err;
    }









  Since I'll be blacklisting system calls and capabilities, it's important to make sure there aren't any new ones.   
  Listing 8:

    fprintf(stderr, "=> validating Linux version...");
    struct utsname host = {0};
    if (uname(&host)) {
        fprintf(stderr, "failed: %m\n");
        goto cleanup;
    }
    int major = -1;
    int minor = -1;
    if (sscanf(host.release, "%u.%u.", &major, &minor) != 2) {
        fprintf(stderr, "weird release format: %s\n", host.release);
        goto cleanup;
    }
    /* Only kernels whose syscalls and capabilities were reviewed are accepted. */
    if (major != 4 || (minor != 7 && minor != 8)) {
        fprintf(stderr, "expected 4.7.x or 4.8.x: %s\n", host.release);
        goto cleanup;
    }
    if (strcmp("x86_64", host.machine)) {
        fprintf(stderr, "expected x86_64: %s\n", host.machine);
        goto cleanup;
    }
    fprintf(stderr, "%s on %s.\n", host.release, host.machine);

  (This had a bug. captainjey on reddit let me know. Thanks!)

  And I wasn't quite at 500 lines of code, so I thought I had some space to build nice hostnames.
  Listing 9:

    int choose_hostname(char *buff, size_t len)
    {
        static const char *suits[] = { "swords", "wands", "pentacles", "cups" };
        static const char *minor[] = {
            "ace", "two", "three", "four", "five", "six", "seven", "eight",
            "nine", "ten", "page", "knight", "queen", "king"
        };
        static const char *major[] = {
            "fool", "magician", "high-priestess", "empress", "emperor",
            "hierophant", "lovers", "chariot", "strength", "hermit",
            "wheel", "justice", "hanged-man", "death", "temperance",
            "devil", "tower", "star", "moon", "sun", "judgment", "world"
        };
        struct timespec now = {0};
        clock_gettime(CLOCK_MONOTONIC, &now);
        /* 22 major cards plus 4 suits of 14 minor cards: 78 in total. */
        size_t ix = now.tv_nsec % 78;
        if (ix < sizeof(major) / sizeof(*major)) {
            snprintf(buff, len, "%05lx-%s", now.tv_sec, major[ix]);
        } else {
            ix -= sizeof(major) / sizeof(*major);
            snprintf(buff, len,
                     "%05lxc-%s-of-%s",
                     now.tv_sec,
                     minor[ix % (sizeof(minor) / sizeof(*minor))],
                     suits[ix / (sizeof(minor) / sizeof(*minor))]);
        }
        return 0;
    }

  Namespaces

  clone is the system call behind fork() et al. It's also the key to all of this. Conceptually we want to create a process with different properties than its parent: it should be able to mount a different /, set its own hostname, and do other things. We'll specify all of this by passing flags to clone (4).

  The child needs to send some messages to the parent, so we'll initialize a socketpair, and then make sure the child only receives access to one.

  Listing:

    if (socketpair(AF_LOCAL, SOCK_SEQPACKET, 0, sockets)) {
        fprintf(stderr, "socketpair failed: %m\n");
        goto error;
    }
    /* The parent's end must not leak into the program the child execs. */
    if (fcntl(sockets[0], F_SETFD, FD_CLOEXEC)) {
        fprintf(stderr, "fcntl failed: %m\n");
        goto error;
    }
    config.fd = sockets[1];
  But first we need to set up room for a stack. We'll execve later, which will actually set up the stack again, so this is only temporary. (5)

  Listing:

    #define STACK_SIZE (1024 * 1024)

    char *stack = 0;
    if (!(stack = malloc(STACK_SIZE))) {
        fprintf(stderr, "=> malloc failed, out of memory?\n");
        goto error;
    }

  We'll also prepare the cgroup for this process tree. More on this later.

  Listing:

    if (resources(&config)) {
        err = 1;
        goto clear_resources;
    }
  We'll namespace the mounts, pids, IPC data structures, network devices, and hostname/domain name. I'll go into these more in the code for capabilities, cgroups, and syscalls.

  Listing:

    int flags = CLONE_NEWNS
        | CLONE_NEWCGROUP
        | CLONE_NEWPID
        | CLONE_NEWIPC
        | CLONE_NEWNET
        | CLONE_NEWUTS;
  Stacks on x86, and almost everything else Linux runs on, grow downwards, so we'll add STACK_SIZE to get a pointer just below the end. (6) We also | the flags with SIGCHLD so that we can wait on it.

  Listing:

    if ((child_pid = clone(child, stack + STACK_SIZE, flags | SIGCHLD, &config)) == -1) {
        fprintf(stderr, "=> clone failed! %m\n");
        err = 1;
        goto clear_resources;
    }
  Close and zero the child's socket, so that if something breaks then we don't leave an open fd, possibly causing the child or the parent to hang.

  Listing:

    close(sockets[1]);
    sockets[1] = 0;

  The parent process will configure the child's user namespace and then pause until the child process tree exits (7).

  Listing:

    #define USERNS_OFFSET 10000
    #define USERNS_COUNT 2000

    int handle_child_uid_map (pid_t child_pid, int fd)
    {
        int uid_map = 0;
        int has_userns = -1;
        /* The child tells us whether unshare(CLONE_NEWUSER) worked. */
        if (read(fd, &has_userns, sizeof(has_userns)) != sizeof(has_userns)) {
            fprintf(stderr, "couldn't read from child!\n");
            return -1;
        }
        if (has_userns) {
            char path[PATH_MAX] = {0};
            for (char **file = (char *[]) { "uid_map", "gid_map", 0 }; *file; file++) {
                if (snprintf(path, sizeof(path), "/proc/%d/%s", child_pid, *file)
                    > sizeof(path)) {
                    fprintf(stderr, "snprintf too big? %m\n");
                    return -1;
                }
                fprintf(stderr, "writing %s...", path);
                if ((uid_map = open(path, O_WRONLY)) == -1) {
                    fprintf(stderr, "open failed: %m\n");
                    return -1;
                }
                /* Map ids starting at 0 in the child onto USERNS_OFFSET on the host. */
                if (dprintf(uid_map, "0 %d %d\n", USERNS_OFFSET, USERNS_COUNT) == -1) {
                    fprintf(stderr, "dprintf failed: %m\n");
                    close(uid_map);
                    return -1;
                }
                close(uid_map);
            }
        }
        /* Tell the child it can carry on. */
        if (write(fd, & (int) { 0 }, sizeof(int)) != sizeof(int)) {
            fprintf(stderr, "couldn't write: %m\n");
            return -1;
        }
        return 0;
    }

  The child process will send a message to the parent process about whether it should set uid and gid mappings. If that works, it will setgroups, setresgid, and setresuid. Both setgroups and setresgid are necessary here since there are two separate group mechanisms on Linux (9). I'm also assuming here that every uid has a corresponding gid, which is common but not necessarily universal.

  Listing:

    int userns(struct child_config *config)
    {
        fprintf(stderr, "=> trying a user namespace...");
        int has_userns = !unshare(CLONE_NEWUSER);
        if (write(config->fd, &has_userns, sizeof(has_userns)) != sizeof(has_userns)) {
            fprintf(stderr, "couldn't write: %m\n");
            return -1;
        }
        /* Wait for the parent to finish writing the uid and gid maps. */
        int result = 0;
        if (read(config->fd, &result, sizeof(result)) != sizeof(result)) {
            fprintf(stderr, "couldn't read: %m\n");
            return -1;
        }
        if (result) return -1;
        if (has_userns) {
            fprintf(stderr, "done.\n");
        } else {
            fprintf(stderr, "unsupported? continuing.\n");
        }
        fprintf(stderr, "=> switching to uid %d / gid %d...", config->uid, config->uid);
        if (setgroups(1, & (gid_t) { config->uid }) ||
            setresgid(config->uid, config->uid, config->uid) ||
            setresuid(config->uid, config->uid, config->uid)) {
            fprintf(stderr, "%m\n");
            return -1;
        }
        fprintf(stderr, "done.\n");
        return 0;
    }

  And this is where the child process from clone will end up. We'll perform all of our setup, switch users and groups, and then load the executable. The order is important here: we can't change mounts without certain capabilities, we can't unshare after we limit the syscalls, etc.

  Listing:

    int child(void *arg)
    {
        struct child_config *config = arg;
        if (sethostname(config->hostname, strlen(config->hostname))
            || mounts(config)
            || userns(config)
            || capabilities()
            || syscalls()) {
            close(config->fd);
            return -1;
        }
        if (close(config->fd)) {
            fprintf(stderr, "close failed: %m\n");
            return -1;
        }
        if (execve(config->argv[0], config->argv, NULL)) {
            fprintf(stderr, "execve failed! %m.\n");
            return -1;
        }
        return 0;
    }
  Capabilities

  capabilities subdivide the property of "being root" on Linux. It's useful to compartmentalize privileges so that, for example, a process can allocate network devices (CAP_NET_ADMIN) but not read all files (CAP_DAC_OVERRIDE). I'll use them here to drop the ones we don't want.

  But not all of "being root" is subdivided into capabilities. For example, writing to parts of procfs is allowed by root even after having dropped capabilities. There are a lot of things like this: this is part of why we need other restrictions besides capabilities.

  It's also important to think about how we're dropping capabilities. man 7 capabilities has an algorithm for us:

    During an execve(2), the kernel calculates the new capabilities of the process using the following algorithm:

        P'(ambient)     = (file is privileged) ? 0 : P(ambient)

        P'(permitted)   = (P(inheritable) & F(inheritable)) |
                          (F(permitted) & cap_bset) | P'(ambient)

        P'(effective)   = F(effective) ? P'(permitted) : P'(ambient)

        P'(inheritable) = P(inheritable)    [i.e., unchanged]

    where:

        P         denotes the value of a thread capability set before the execve(2)

        P'        denotes the value of a thread capability set after the execve(2)

        F         denotes a file capability set

        cap_bset  is the value of the capability bounding set (described below).
  We'd like P'(ambient) and P(inheritable) to be empty, and P'(permitted) and P(effective) to only include the capabilities above. This is achievable by doing the following:

  Clearing our own inheritable set. This clears the ambient set; man 7 capabilities says "The ambient capability set obeys the invariant that no capability can ever be ambient if it is not both permitted and inheritable." This also clears the child's inheritable set.

  Clearing the bounding set. This limits the file capabilities we'll gain when we execve, and the rest are limited by clearing the inheritable and ambient sets.

  If we were to only drop our own effective, permitted and inheritable sets, we'd regain the permissions in the child file's capabilities. This is how bash can call ping, for example. (24)

  Dropped capabilities

  Listing:

    int capabilities()
    {
        fprintf(stderr, "=> dropping capabilities...");

  CAP_AUDIT_CONTROL, _READ, and _WRITE allow access to the audit system of the kernel (i.e. functions like audit_set_enabled, usually used with auditctl). The kernel prevents messages that normally require CAP_AUDIT_CONTROL outside of the first pid namespace, but it does allow messages that would require CAP_AUDIT_READ and CAP_AUDIT_WRITE from any namespace. (25) So let's drop them all. We especially want to drop CAP_AUDIT_READ, since it isn't namespaced (27) and may contain important information, but CAP_AUDIT_WRITE may also allow the contained process to falsify logs or DOS the audit system.

  Listing:

        int drop_caps[] = {
            CAP_AUDIT_CONTROL,
            CAP_AUDIT_READ,
            CAP_AUDIT_WRITE,

  CAP_BLOCK_SUSPEND lets programs prevent the system from suspending, either with EPOLLWAKEUP or /proc/sys/wake_lock. (28) Suspend isn't namespaced, so we'd like to prevent this.

  Listing:

            CAP_BLOCK_SUSPEND,

  CAP_DAC_READ_SEARCH lets programs call open_by_handle_at with an arbitrary struct file_handle. struct file_handle is in theory an opaque type, but in practice it corresponds to inode numbers. So it's easy to brute-force them, and read arbitrary files. This was used by Sebastian Krahmer to write a program to read arbitrary system files from within Docker.

  Listing:

            CAP_DAC_READ_SEARCH,

  CAP_FSETID, without user namespacing, allows the process to modify a setuid executable without removing the setuid bit. This is pretty dangerous! It means that if we include a setuid binary in a container, it's easy for us to accidentally leave a dangerous setuid root binary on our disk, which any user can use to escalate privileges.

  Listing:

            CAP_FSETID,

  CAP_IPC_LOCK can be used to lock more of a process' own memory than would normally be allowed (31), which could be a way to deny service.

  Listing:

            CAP_IPC_LOCK,

  CAP_MAC_ADMIN and CAP_MAC_OVERRIDE are used by the mandatory access control systems Apparmor, SELinux, and SMACK to restrict access to their settings. These aren't namespaced, so they could be used by the contained programs to circumvent system-wide access control.

  Listing:

            CAP_MAC_ADMIN,
            CAP_MAC_OVERRIDE,

  CAP_MKNOD, without user namespacing, allows programs to create device files corresponding to real-world devices. This includes creating new device files for existing hardware. If this capability were not dropped, a contained process could re-create the hard disk device, remount it, and read or write to it. (47)

  Listing:

            CAP_MKNOD,
  I was worried that CAP_SETFCAP could be used to add a capability to an executable and execve it, but it's not actually possible for a process to set capabilities it does not have (32). But! An executable altered this way could be executed by any unsandboxed user, so I think it unacceptably undermines the security of the system.

  Listing:

            CAP_SETFCAP,

  CAP_SYSLOG lets users perform destructive actions against the syslog. Importantly, it does not prevent contained processes from reading the syslog, which could be risky. It also exposes kernel addresses, which could be used to circumvent kernel address layout randomization.

  Listing:

            CAP_SYSLOG,

  CAP_SYS_ADMIN allows many behaviors! We don't want most of them (mount, vm86, etc). Some would be nice to have (sethostname, mount for bind mounts…) but the extra complexity doesn't seem worth it.

  Listing:

            CAP_SYS_ADMIN,
  CAP_SYS_BOOT allows programs to restart the system (the reboot syscall) and load new kernels (the kexec_load and kexec_file_load syscalls). We absolutely don't want this. reboot is user-namespaced, and the kexec functions only work in the root user namespace, but neither of those help us.

  Listing:

            CAP_SYS_BOOT,

  CAP_SYS_MODULE is used by the syscalls delete_module, init_module, and finit_module, by the code for kmod (37), and by the code for loading device modules with ioctl (43).

  Listing:

            CAP_SYS_MODULE,

  CAP_SYS_NICE allows processes to set higher priority on given pids than the default (38). The default kernel scheduler doesn't know anything about pid namespaces, so it's possible for a contained process to deny service to the rest of the system (40).

  Listing:

            CAP_SYS_NICE,

  CAP_SYS_RAWIO allows full access to the host system's memory with /proc/kcore, /dev/mem, and /dev/kmem (41), but a contained process would need mknod to access these within the namespace. But it also allows things like iopl and ioperm, which give raw access to the IO ports (43).

  Listing:

            CAP_SYS_RAWIO,

  CAP_SYS_RESOURCE specifically allows circumventing kernel-wide limits, so we probably should drop it (45). But I don't think this can do more than DOS the kernel, in general (45).

  Listing:

            CAP_SYS_RESOURCE,

  CAP_SYS_TIME: setting the time isn't namespaced, so we should prevent contained processes from altering the system-wide time.

  Listing:

            CAP_SYS_TIME,

  CAP_WAKE_ALARM, like CAP_BLOCK_SUSPEND, lets the contained process interfere with suspend (48), and we'd like to prevent that.

  Listing:

            CAP_WAKE_ALARM
        };

 
  Listing:

        size_t num_caps = sizeof(drop_caps) / sizeof(*drop_caps);
        fprintf(stderr, "bounding...");
        for (size_t i = 0; i < num_caps; i++) {
            if (prctl(PR_CAPBSET_DROP, drop_caps[i], 0, 0, 0)) {
                fprintf(stderr, "prctl failed: %m\n");
                return 1;
            }
        }
        fprintf(stderr, "inheritable...");
        cap_t caps = NULL;
        if (!(caps = cap_get_proc())
            || cap_set_flag(caps, CAP_INHERITABLE, num_caps, drop_caps, CAP_CLEAR)
            || cap_set_proc(caps)) {
            fprintf(stderr, "failed: %m\n");
            if (caps) cap_free(caps);
            return 1;
        }
        cap_free(caps);
        fprintf(stderr, "done.\n");
        return 0;
    }

  Retained Capabilities

  It's important to keep track of the capabilities I'm not dropping, too.

  I've heard in multiple places (46) that CAP_DAC_OVERRIDE might expose the same functionality as CAP_DAC_READ_SEARCH (i.e. open_by_handle_at), but as far as I can tell that isn't true. shocker.c doesn't get anywhere with only CAP_DAC_OVERRIDE (50), and the only usage in the kernel is in the Unix permission-checking code (49). So my understanding is that CAP_DAC_OVERRIDE on its own does not allow processes to read outside of their mount namespaces ("DAC" or "Discretionary Access Control" refers here to ordinary unix permissions).

  CAP_FOWNER, CAP_LEASE, and CAP_LINUX_IMMUTABLE all operate on files inside of the mount namespace.

  Likewise, CAP_SYS_PACCT allows a process to switch accounting on and off for itself. The acct system call takes a path to log to (which must be within the mount namespace), and only operates on the calling process. We're not using process accounting in our containerization, so turning it off should be harmless as well.

  CAP_IPC_OWNER is only used by functions that respect IPC namespaces (51); since we're in a separate IPC namespace from the host, we can allow this.

  CAP_NET_ADMIN lets processes create network devices; CAP_NET_BIND_SERVICE lets processes bind to low ports on those devices; CAP_NET_RAW lets processes send raw packets on those devices. Since we're going to isolate the networking with a virtual bridge, and the contained process is inside of a network namespace, these shouldn't be an issue (51). I was wondering whether we could recreate an existing device like mknod does, but I don't think it's possible (52).

  CAP_SYS_PTRACE does not allow ptrace across pid namespaces (53). CAP_KILL does not allow signals across pid namespaces.

  CAP_SETUID and CAP_SETGID have similar behaviors (56):

  "make arbitrary manipulations of process UIDs and GIDs and supplementary GID list", which will only apply to pids in the namespace.

  "forge UID (GID) when passing socket credentials via UNIX domain sockets": the mount namespace should prevent us from reading the host system's unix domain sockets.

  "write a user (group) ID mapping in a user namespace (see user_namespaces(7))": this is /proc/self/uid_map, which will be hidden inside the container.

  CAP_SETPCAP only lets processes add or drop capabilities they already effectively have; man 7 capabilities says

    If file capabilities are supported: add any capability from the calling thread's bounding set to its inheritable set; drop capabilities from the bounding set (via prctl(2) PR_CAPBSET_DROP); make changes to the securebits flags.

  We've dropped everything relevant from the bounding set, and dropping further capabilities should be harmless.

  CAP_SYS_CHROOT is traditionally abused by changing root to a directory with a setuid root binary and tampered-with dynamic libraries (56). Additionally, it can be used to escape a chroot "jail" (56). Neither of those should be relevant in our setup so this should be harmless.

  Brad Spengler, in "False Boundaries and Arbitrary Code Execution", says that CAP_SYS_TTYCONFIG can be used to "temporarily change the keyboard mapping of an administrator's tty via the KDSETKEYCODE ioctl to cause a different command to be executed than intended", but again this is an ioctl against a device that should be impossible to access within the mount namespace.

  Mounts

  The child process is in its own mount namespace, so we can unmount things that it specifically shouldn't have access to. Here's how:

  1. Create a temporary directory, and one inside of it.
  2. Bind mount the user argument onto the temporary directory.
  3. pivot_root, making the bind mount our root and mounting the old root onto the inner temporary directory.
  4. umount the old root, and remove the inner temporary directory.

  But first we'll remount everything with MS_PRIVATE. This is mostly a convenience, so that the bind mount is invisible outside of our namespace.

  Listing:

    int mounts(struct child_config *config)
    {
        fprintf(stderr, "=> remounting everything with MS_PRIVATE...");
        if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL)) {
            fprintf(stderr, "failed! %m\n");
            return -1;
        }
        fprintf(stderr, "remounted.\n");

        fprintf(stderr, "=> making a temp directory and a bind mount there...");
        char mount_dir[] = "/tmp/tmp.XXXXXX";
        if (!mkdtemp(mount_dir)) {
            fprintf(stderr, "failed making a directory!\n");
            return -1;
        }

        if (mount(config->mount_dir, mount_dir, NULL, MS_BIND | MS_PRIVATE, NULL)) {
            fprintf(stderr, "bind mount failed!\n");
            return -1;
        }

        /* Reuse the generated outer name as the prefix of the inner template. */
        char inner_mount_dir[] = "/tmp/tmp.XXXXXX/oldroot.XXXXXX";
        memcpy(inner_mount_dir, mount_dir, sizeof(mount_dir) - 1);
        if (!mkdtemp(inner_mount_dir)) {
            fprintf(stderr, "failed making the inner directory!\n");
            return -1;
        }
        fprintf(stderr, "done.\n");

        fprintf(stderr, "=> pivoting root...");
        if (pivot_root(mount_dir, inner_mount_dir)) {
            fprintf(stderr, "failed!\n");
            return -1;
        }
        fprintf(stderr, "done.\n");

        /* After the pivot, the old root is visible at /oldroot.XXXXXX. */
        char *old_root_dir = basename(inner_mount_dir);
        char old_root[sizeof(inner_mount_dir) + 1] = { "/" };
        strcpy(&old_root[1], old_root_dir);

        fprintf(stderr, "=> unmounting %s...", old_root);
        if (chdir("/")) {
            fprintf(stderr, "chdir failed! %m\n");
            return -1;
        }
        if (umount2(old_root, MNT_DETACH)) {
            fprintf(stderr, "umount failed! %m\n");
            return -1;
        }
        if (rmdir(old_root)) {
            fprintf(stderr, "rmdir failed! %m\n");
            return -1;
        }
        fprintf(stderr, "done.\n");
        return 0;
    }
  pivot_root is a system call that lets us swap the mount at / with another. Glibc doesn't provide a wrapper for it, but includes a prototype in the man page. I don't really understand, but OK, we'll include our own.

  Listing:

    int pivot_root(const char *new_root, const char *put_old)
    {
        return syscall(SYS_pivot_root, new_root, put_old);
    }

  It's worth noting that I'm avoiding packing and unpackaging containers. This is fertile ground for vulnerabilities (57); I'll count on the user to ensure that the mounted directory does not contain trusted or sensitive files or hard links.

  System Calls
