[personal profile] mjg59
First, read these slides. Done? Good.

Hypervisors present a smaller attack surface than containers. This is somewhat mitigated in containers by using seccomp, selinux and restricting capabilities in order to reduce the number of kernel entry points that untrusted code can touch, but even so there is simply a greater quantity of privileged code available to untrusted apps in a container environment when compared to a hypervisor environment[1].
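To make that mitigation concrete: a container manager can drop capabilities and install a seccomp syscall policy at launch. A minimal sketch using LXC configuration keys (the policy file path and the exact capability list are illustrative, not a recommendation for any particular workload):

```
# LXC container config excerpt: install a seccomp syscall policy and
# drop capabilities the workload doesn't need (lists are illustrative).
lxc.seccomp = /usr/share/lxc/config/common.seccomp
lxc.cap.drop = sys_module sys_rawio sys_time mac_admin mac_override
```

Every syscall outside the policy, and every operation gated on a dropped capability, is a kernel entry point untrusted code in the container can no longer reach.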

Does this mean containers provide reduced security? That's an arguable point. In the event of a new kernel vulnerability, container-based deployments merely need to upgrade the kernel on the host and restart all the containers. Full VMs need to upgrade the kernel in each individual image, which takes longer and may be delayed due to the additional disruption. In the event of a flaw in some remotely accessible code running in your image, an attacker's ability to cause further damage may be restricted by the existing seccomp and capabilities configuration in a container. They may be able to escalate to a more privileged user in a full VM.

I'm not really compelled by either of these arguments. Both argue that the security of your container is improved, but in almost all cases exploiting these vulnerabilities would require that an attacker already be able to run arbitrary code in your container. Many container deployments are task-specific rather than running a full system, and in that case your attacker is already able to compromise pretty much everything within the container. The argument's stronger in the Virtual Private Server case, but there you're trading that off against losing some other security features - sure, you're deploying seccomp, but you can't use selinux inside your container, because the policy isn't per-namespace[2].

So that seems like kind of a wash - there's maybe marginal increases in practical security for certain kinds of deployment, and perhaps marginal decreases for others. We end up coming back to the attack surface, and it seems inevitable that that's always going to be larger in container environments. The question is, does it matter? If the larger attack surface still only results in one more vulnerability per thousand years, you probably don't care. The aim isn't to get containers to the same level of security as hypervisors, it's to get them close enough that the difference doesn't matter.

I don't think we're there yet. Searching the kernel for bugs triggered by Trinity shows plenty of cases where the kernel screws up from unprivileged input[3]. A sufficiently strong seccomp policy plus tight restrictions on the ability of a container to touch /proc, /sys and /dev helps a lot here, but it's not full coverage. The presentation I linked to at the top of this post suggests using the grsec patches - these will tend to mitigate several (but not all) kernel vulnerabilities, but there's tradeoffs in (a) ease of management (having to build your own kernels) and (b) performance (several of the grsec options reduce performance).
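Those /proc, /sys and /dev restrictions can again be expressed at the container-manager level. A hedged LXC sketch (the device nodes shown are the standard memory devices; everything else is denied via the devices cgroup):

```
# LXC config excerpt: mount /proc and /sys restrictively, populate a
# minimal /dev, and deny all other device access via the devices cgroup.
lxc.mount.auto = proc:mixed sys:ro
lxc.autodev = 1
lxc.cgroup.devices.deny = a
lxc.cgroup.devices.allow = c 1:3 rwm    # /dev/null
lxc.cgroup.devices.allow = c 1:5 rwm    # /dev/zero
lxc.cgroup.devices.allow = c 1:9 rwm    # /dev/urandom
```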

But this isn't intended as a complaint. Or, rather, it is, just not about security. I suspect containers can be made sufficiently secure that the attack surface size doesn't matter. But who's going to do that work? As mentioned, modern container deployment tools make use of a number of kernel security features. But there's been something of a dearth of contributions from the companies who sell container-based services. Meaningful work here would include things like:

  • Strong auditing and aggressive fuzzing of containers under realistic configurations
  • Support for meaningful nesting of Linux Security Modules in namespaces
  • Introspection of container state and (more difficult) the host OS itself in order to identify compromises

These aren't easy jobs, but they're important, and I'm hoping that the lack of obvious development in areas like this is merely a symptom of the youth of the technology rather than a lack of meaningful desire to make things better. But until things improve, it's going to be far too easy to write containers off as a "convenient, cheap, secure: choose two" tradeoff. That's not a winning strategy.

[1] Companies using hypervisors! Audit your qemu setup to ensure that you're not providing more emulated hardware than necessary to your guests. If you're using KVM, ensure that you're using sVirt (either selinux or apparmor backed) in order to restrict qemu's privileges.
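For libvirt-managed guests, a starting point for that audit might look like this ("guest1" is a placeholder domain name, and the device list you grep for will depend on your workload):

```shell
# List the emulated devices a guest has been given; anything the guest
# doesn't actually need is attack surface you can remove from the
# domain XML.
virsh dumpxml guest1 | grep -E '<(sound|video|serial|parallel|channel|redirdev)'

# Check that each qemu process is running confined by sVirt
# (SELinux example: look for a per-guest svirt_t context).
ps -eZ | grep qemu
```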
[2] There's apparently some support for loading per-namespace Apparmor policies, but that means that the process is no longer confined by the sVirt policy.
[3] To be fair, last time I ran Trinity under Docker under a VM, it ended up killing my host. Glass houses, etc.

Date: 2014-10-23 08:04 am (UTC)
From: [identity profile] m50d.wordpress.com
The same reasoning about attack surfaces applies to other OSes' container systems, like BSD jails or Solaris zones, right? So have any of them undergone this kind of rigorous security analysis? I'd expect Sun/Oracle or maybe even Joyent to have put some effort in, but maybe I'm being overly optimistic. So are there any audited container-like systems out there, or is every option in the same boat?

Date: 2014-10-23 02:47 pm (UTC)
From: (Anonymous)
You're not being overly optimistic: Sun did extensive analysis when the zones work was being done -- to the point of (somewhat famously) having a company-wide contest (with significant cash prizes) for finding an exploit in zones. Faults aside, Sun had many creative engineers, and many tried to find exploits. In the end, a single exploit was found that was somewhat dubious (it allowed for denial-of-service, but not necessarily privilege escalation), but it was fixed nonetheless -- and this was over a decade ago. In the years since, there has never been a privilege escalation discovered with Solaris zones (or with its descendant technologies in the open source illumos). At Joyent (where I am the CTO) we have run SmartOS containers in multi-tenant production for 8 years; we have built our business on it, and we take its security very seriously!

Use-cases for lightweight containers

Date: 2014-10-23 12:51 pm (UTC)
From: [personal profile] pvanhoof
Right now it's still hard for application developers to start using nspawn and the likes.

Firstly, systemd is not universal yet. With Debian having adopted it as its init system, I have really good hopes this will happen soon.

Secondly, its lightweight containers are at the moment not yet completely fit for delegating a desktop service to. For example, I want to run org.gnome.evolution.dataserver.Calendar4 in a container that is completely separate from the host, yet the host's calendar applet in gnome-shell needs to show the calendar's contents.

That's because sd-bus isn't a public library yet, and kdbus isn't universal yet either. You'd need both, I learned yesterday, for applications to start using sd_bus_open_system_remote and bus_set_address_system_remote.

I have not figured out how to properly configure D-Bus service activation so that a service request on the host activates (via nspawn) the container providing the service. According to Lennart that's already possible through container socket activation; I just have not figured it out yet.

Re: Use-cases for lightweight containers

Date: 2014-10-23 04:33 pm (UTC)
From: (Anonymous)
Actually it's possible to do basic containerization without relying upon work being delegated to a privileged component such as systemd.

See https://gitorious.org/linted/linted/source/0178ba7e01bbfcae993394af8965a5365ec3816b:src/spawn/spawn.c

Re: Use-cases for lightweight containers

Date: 2014-10-23 11:14 pm (UTC)
From: (Anonymous)
Why not just bind mount the dbus socket into the container?
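With systemd-nspawn that could be as simple as the following (the container name, directory and service binary are placeholders; note that anything inside the container can then talk to the host bus, subject only to the bus policy):

```shell
# Bind the host's system bus socket into the container's filesystem;
# clients inside the container can then connect to it as if it were local.
systemd-nspawn -D /var/lib/machines/mycont \
    --bind=/var/run/dbus/system_bus_socket \
    /usr/bin/my-calendar-service
```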

Re: Use-cases for lightweight containers

Date: 2014-10-23 11:22 pm (UTC)
From: (Anonymous)
Oh, and maybe for the activation bit you can make a .service that is aliased to dbus-yadayada then Wants=systemd-nspawn@mycont.service.
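A sketch of that aliasing idea (every name here is hypothetical, and this follows the comment's suggestion rather than a tested recipe): install a unit that pulls in the container, aliased to the D-Bus activation name.

```
# /etc/systemd/system/start-calendar-container.service (hypothetical)
[Unit]
Description=Pull in the container that provides the calendar service
Wants=systemd-nspawn@mycont.service
After=systemd-nspawn@mycont.service

[Install]
# Alias the unit to the bus-activation name, so activating the
# well-known name starts the container ("dbus-org.example.Calendar"
# stands in for the real name, the "dbus-yadayada" of the comment above).
Alias=dbus-org.example.Calendar.service
```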

LSMs and containers

Date: 2014-10-23 01:50 pm (UTC)
From: (Anonymous)
I suspect that lack of per-namespace policy isn't the only problem with LSMs and containers. I'm very slowly working on improving this, but I'm not going to do the heavy lifting.

--Andy, who breaks these things for amusement

Namespaced LSM's

Date: 2014-10-23 11:27 pm (UTC)
From: (Anonymous)
> There's apparently some support for loading per-namespace Apparmor policies,
> but that means that the process is no longer confined by the sVirt policy

Would it not be possible for the namespace handling to tell when a namespaced policy tries to expand beyond the original (in this case, sVirt) policy, and then just silently deny that expansion (and report it on the host)?

Profile

Matthew Garrett

About Matthew

Power management, mobile and firmware developer on Linux. Security developer at Nebula. Member of the Linux Foundation Technical Advisory Board. Ex-biologist. @mjg59 on Twitter. Content here should not be interpreted as the opinion of my employer.
