How to stop Ubuntu Xenial from randomly killing your big processes
For the past month, the Out of Memory killer has been out of control.
[Update on February 20th: As predicted below, the fix has now been released to Xenial; upgrade to it with apt-get update && apt-get upgrade.]
If you’re using Ubuntu Xenial and uname -r prints out something between 4.4.0-58 and 4.4.0-62 (inclusive), your kernel may incorrectly kill big processes even when there’s plenty of memory available. Read on to learn more about it and how to fix it.
Here at Meteor Development Group, we do our best to keep our systems up to date with the latest security patches. Galaxy, our hosting service for Meteor apps, runs user apps on Ubuntu Linux. We use the most recent “Long Term Support” distribution: 16.04 LTS, also known as Xenial Xerus. And we keep up to date with security patches, especially for exploitable bugs in the Linux kernel.
With any software, updating frequently runs the risk of bringing in new bugs along with the desired bug fixes. The kernel team at Ubuntu does a great job of keeping the kernel versions they release stable by only backporting important bug fixes, but just as with any release engineering process, once in a blue moon a new bug shows up with a bug fix. One of these rare problems happened in early January.
An out of control OOM killer
Right now, the default Linux kernel on a Xenial system contains a nasty bug.
When a Linux system runs out of usable memory, a system called the “OOM killer” (for “Out Of Memory”) kills big processes until there’s enough memory to keep going. A healthy Linux system tends to have very little completely free memory: leaving RAM chips completely unused is wasteful, and the kernel will happily fill it with caches, especially of filesystem data. When you run low on memory, the kernel first frees memory in its caches; the OOM killer only kicks in if this isn’t good enough.
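You can see this on almost any Linux box: “free” memory is often tiny while caches are huge. One quick way to look is below; the field names come from /proc/meminfo, where SReclaimable is the slab (kernel) memory that can be given back under pressure:

    # Total, free, and buffer/cache memory at a glance.
    free -h

    # Reclaimable kernel memory shows up here too.
    grep -E 'MemFree|^Cached|SReclaimable' /proc/meminfo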
Unfortunately, a bug was recently introduced into the allocator which made it sometimes not try hard enough to free kernel cache memory before giving up and invoking the OOM killer. In practice, this means that at random times, the OOM killer would strike at big processes when the kernel tries to allocate, say, 16 kilobytes of memory for a new process’s thread stack — even when there are many gigabytes of memory in reclaimable kernel caches!
In particular, the bug occurs when the kernel attempts a so-called “higher order” allocation, meaning an allocation of more than one contiguous page (a “page” of memory on x86 is 4 KB). Most allocations are for a single page at a time, so the bug manifests rarely, but kernel stacks require a contiguous 16 KB allocation, so the bug occurs most frequently when fork()ing a new process.
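If you want to confirm that a kill was one of these, the OOM killer’s message in the kernel log includes the order of the allocation that failed; order=2 means 2^2 contiguous pages, i.e. a 16 KB request like a new kernel stack. Something along these lines will show recent invocations (the exact log wording varies a bit between kernel versions):

    # Find recent OOM killer invocations; look for order=2 in the matching lines.
    dmesg -T | grep -i 'invoked oom-killer'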
Monitoring and debugging
This hit us hard at Galaxy. We started noticing large hosted apps being killed even though the machines seemed to have plenty of memory. Fortunately, our overall system functioned and restarted the apps automatically, but our users weren’t happy to see unexpected and unexplained process deaths. I wrote a custom check for the Datadog monitoring service so I could see how often the OOM killer was striking and so I could tell if my attempted fixes were working.
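The check itself is Datadog-specific, but the heart of it is just counting OOM kills in the kernel log over time. A minimal sketch of that idea (not our production check, just the shape of it):

    # Count OOM killer invocations in the last hour, via the kernel messages journald keeps.
    journalctl -k --since '1 hour ago' | grep -c 'invoked oom-killer'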
At first, I operated under the assumption that we were doing something wrong. Maybe we just needed to run fewer apps on each machine, even though the rest of our monitoring showed plenty of room? Maybe some app was doing something very strange to fill up lots of kernel caches with unfreeable data? With some tips from my friend Nelson, I figured out exactly what kernel allocation was failing, and eventually found the right search terms to find an Ubuntu bug report.
I immediately added a comment with some of the search terms I had tried days before (like copy_process, alloc_kmem_page_node, and “slab reclaimable”) that failed to find the bug. I recommend this practice for helping other people who are in the same boat as you. (Or you can write a blog post!)
Xenial users are still affected!
On Ubuntu Xenial, a kernel containing this bug was published to the main “updates” repository on January 10th in kernel version 4.4.0-59.80, in order to fix a different bug. Within a few days, a user reported the bug. That day, a member of the Ubuntu kernel team tracked down the buggy kernel commit. The bug had already been fixed in the upstream kernel in a newer branch, as another user noted the next day. It took a little bit of work to backport the fix to 4.4, but the fix was ready to go by the end of January.
The Ubuntu kernel team generally doesn’t rush out new kernels without testing, so as of today, this fix is still being validated. The fixed kernel won’t be the default kernel until at least February 20th. That means if you upgrade a Xenial machine today, it’ll get an affected kernel!
Checking and fixing your system
You can tell if you’re running a bad version by running uname -r, which will print something like 4.4.0-57-generic. In this version number, 4.4.0 is the “upstream” version number, chosen by the kernel project itself. -57 is a revision specific to Ubuntu’s packaging. generic is a particular “flavor” of how Ubuntu builds kernels; you might also see server or virtual or lowlatency.
Any 4.4.0 kernel up to -57 is OK; the bug was added in -58. When this blog post was originally written, -62 was the newest published version, and it still had the bug; the fix was later published in -63. As far as I know, the only Ubuntu distribution that ever had a version with this bug is Xenial.
If you’re running a bad version and you want to avoid this bug, you should just upgrade to the newest version by running apt-get update && apt-get upgrade.
(Before February 20th, you had two options. You could run the “proposed” version by enabling access to the proposed repository; this got you the latest version with all known security fixes, but one that hadn’t had nearly as much testing as older kernel versions. Alternatively, you could roll back to the last good version before the bug was introduced, though that re-introduces the older bug that the buggy commit was intended to fix. The rollback is pretty straightforward: check uname -r to see what flavor of kernel you’re using, run sudo apt-get install linux-image-4.4.0-57-generic, replacing generic with your kernel’s flavor, and then restart your machine.)
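For the record, enabling the proposed pocket looked roughly like this. Treat it as a sketch rather than a recommendation, and check Ubuntu’s own documentation before enabling -proposed on a machine you care about, since it contains packages that haven’t finished validation:

    # Enable the xenial-proposed pocket, then install the newer kernel metapackage from it.
    echo 'deb http://archive.ubuntu.com/ubuntu/ xenial-proposed restricted main universe' | sudo tee /etc/apt/sources.list.d/proposed.list
    sudo apt-get update
    sudo apt-get install linux-image-generic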
Do you enjoy investigating strange problems, whether as low-level as Linux kernel virtual memory bugs or as high-level as distributed application scheduler optimization? Join us at MDG as a Cloud Systems Engineer, or one of many other roles.
Do you hate the idea of having to worry about kernel bugs? So do the rest of our Galaxy app hosting customers. We take care of all of this so you don’t have to.