Speed up your Haskell programs with one weird trick


GHC developers hate me.

Here’s a quick quiz: do you have a Haskell program lying around – a compiler, a server, some infrastructure tool? Do you build it with the threaded runtime and run it with -N on a multicore machine? Is the work it actually does mostly sequential? Did you answer “yes”?

Then you’re in luck, because there is one small trick you can do to improve the performance of your application in most cases: run your application with the extra arguments +RTS -qg. This flag turns off GHC’s parallel garbage collector, which turns out to be a surprising source of inefficiency for a lot of applications.
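For the flag to do anything, you need the threaded runtime in the first place (it’s the only one with a parallel GC), and to pass RTS options on the command line the binary has to be linked with -rtsopts. A minimal sketch of both routes, with made-up names:

    # One-off, at run time (binary must be linked with -rtsopts):
    $ ./my-tool +RTS -qg -RTS <normal arguments>

    # Or baked in at build time, so every run gets it:
    $ ghc -O2 -threaded -rtsopts "-with-rtsopts=-N -qg" Main.hs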

This is a common issue I find, and one I’ve fixed quite a few times for my clients. Not everyone uses Haskell for cool concurrent webservers – many people use it for their basic infrastructure tooling as well as their applications (think “the tool that flips traffic from a staging environment to a production environment”). But because “one-off” tools are often run only at non-critical times, or in some kind of batch mode, huge, lingering inefficiencies can hang around unnoticed.

Before taking a second to explain why this helps, in case you’re wondering what kind of impact this can have: here are the runtime statistics of a compiler I’m working on for a client, which suffered from this issue.
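(These summaries come straight from the runtime: if the binary is linked with -rtsopts, running it with +RTS -s makes the RTS dump them on exit. A hypothetical invocation, with a made-up executable name:)

    $ my-compiler +RTS -N4 -s -RTS input-files...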

Before, on my 4-core virtual machine (summarized +RTS -s output):

1,052,653,961,424 bytes allocated in the heap
 170,146,435,072 bytes copied during GC
     728,934,472 bytes maximum residency (1073 sample(s))
      78,111,168 bytes maximum slop
            1871 MB total memory in use (0 MB lost due to fragmentation)

  ...

  Parallel GC work balance: 13.79% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  ...

  INIT    time    0.002s   (  0.001s elapsed)
  MUT     time  785.992s   (430.742s elapsed)
  GC      time  894.988s   (254.146s elapsed)
  EXIT    time    0.009s   (  0.008s elapsed)
  Total   time  1680.993s  (684.899s elapsed)

  Alloc rate    1,339,268,076 bytes per MUT second

  Productivity  46.8% of total user, 62.9% of total elapsed

You can just look at the numbers: less than 50% productivity in user time, and of the ~680s of wall-clock time, roughly 250 seconds were spent doing GC rather than actual work.

The key thing to notice is the Parallel GC work balance line: a poor 13.79% scaling efficiency (where 100% would be perfectly balanced work), which hints at the problem.
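For the curious, the productivity figure is (give or take the tiny INIT time) just mutator time divided by total time, so you can recompute it from the lines above:

    user:    785.992 / 1680.993 ≈ 0.468  →  46.8%
    elapsed: 430.742 /  684.899 ≈ 0.629  →  62.9%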

And now, after this change:

 654,600,619,448 bytes allocated in the heap
 908,010,576,224 bytes copied during GC
     772,992,696 bytes maximum residency (84 sample(s))
      22,753,608 bytes maximum slop
            1990 MB total memory in use (0 MB lost due to fragmentation)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  ...

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time  132.592s  (133.190s elapsed)
  GC      time   36.696s  ( 36.560s elapsed)
  EXIT    time    0.006s  (  0.006s elapsed)
  Total   time  169.297s  (169.757s elapsed)

  Alloc rate    4,936,953,192 bytes per MUT second

  Productivity  78.3% of total user, 78.5% of total elapsed

This is an incredible efficiency boost, for free: productivity up by over 30 percentage points, nearly 40% fewer bytes allocated, an almost 4x improved allocation rate, and a runtime reduction of 75 percent right off the bat, for no work. To be unambiguous: it can now do the same amount of work as before in about a quarter of the wall-clock time, while burning only about a tenth of the CPU time.
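Both of those claims fall straight out of the two Total lines:

    wall-clock speedup:  684.899s / 169.757s ≈ 4.0x
    CPU-time  speedup:  1680.993s / 169.297s ≈ 9.9x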

To put this in perspective: a run of this compiler took about 12 minutes before. With this change, it takes under 3 minutes now.

So… why does this happen? Here is a short, hand-wavy answer: when you enable GHC’s parallel garbage collector, every “capability” or “cap” (the runtime’s view of an Operating System thread) has a private nursery, which it can garbage collect independently of the other capabilities. A nursery is a small, capability-private heap, owned by the garbage collector, where new allocations land.

It’s very important to realize that the GC is parallel, not concurrent. It does collect multiple nurseries at the same time, but it can’t do so concurrently with your program running: the entire program must be stopped for this parallel collection to occur. To do this, every thread must synchronize before GC (so the mutator is stopped), collect garbage in parallel, then synchronize again (so we know every capability is done collecting, since one might take more time than the others), and only then continue. I’m not even going into collection of the old generation, which is an added cost on top of all this.
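Spelled out as a rough timeline of one such collection (a sketch of the idea, not the runtime’s actual internal phases):

    mutator running on all N capabilities
        |
        |  GC requested
        v
    sync:    stop the world – every capability must reach a safe point
    collect: N GC threads scavenge the nurseries in parallel
    sync:    wait for the slowest GC thread to finish
        |
        v
    mutator resumes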

When you run your application with -N, you get N capabilities, one for each core GHC detects, each with its own nursery. With the parallel garbage collector, GHC will always collect these nurseries in parallel, regardless of whether it is actually profitable to do so, or whether you even have threads allocating on those capabilities.
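If you want to check how many capabilities your program actually got, base exposes it directly:

    import Control.Concurrent (getNumCapabilities)

    main :: IO ()
    main = do
      -- Prints 4 under `+RTS -N4`; under plain `-N` it matches the
      -- number of cores the runtime detected.
      n <- getNumCapabilities
      putStrLn ("running on " ++ show n ++ " capabilities")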

That means every time your application does a GC, it must synchronize those N threads and wait for all of them before continuing. That’s expensive, and it’s part of the massive allocation-rate increase we saw: with the synchronization removed (it was unnecessary and was eating time), the mutator spends its time actually allocating, and the allocation rate (bytes per second) goes up.

The easiest way to see this is to watch your CPU load: if your Haskell application is logically sequential but you see it eating massive parallel resources (e.g. every CPU core you have, as if you had many threads), it is likely the parallel GC kicking in. Even though your Haskell code is single-threaded, the parallel GC only thinks at the level of capabilities: it sees a capability is alive, and so it collects it in parallel.[1]
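If you want to watch this happen on something self-contained, a toy like the following works as a starting point (a made-up microbenchmark, not the client compiler from above; the size of the gap will vary a lot by machine and workload):

    -- Garbage.hs: allocation-heavy but entirely sequential work.
    --
    -- Build:    ghc -O2 -threaded -rtsopts Garbage.hs
    -- Compare:  ./Garbage +RTS -N4 -s       (parallel GC on)
    --      vs.  ./Garbage +RTS -N4 -qg -s   (parallel GC off)
    import Data.List (foldl')

    -- Sums a freshly built list per element. The reverse forces a real
    -- list into the heap (it can't be fused away), so the nursery churns
    -- with short-lived garbage while only one Haskell thread ever runs.
    expensive :: Int -> Int
    expensive n = foldl' (+) 0 (reverse [1 .. (n `mod` 1000) + 1000])

    main :: IO ()
    main = print (foldl' (+) 0 (map expensive [1 .. 200000]))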

In a future post I’d like to dig into this more and poke at the runtime system code, but that’ll have to wait.

In any case, if you’ve got some Haskell programs lying around – give this a shot. It’s an ~8-character change that can bring huge improvements for some workloads.


  1. In theory this shouldn’t be hard to fix. The GHC scheduler only has a few static code paths that pick between the parallel and sequential GC, so you could, in theory, toggle between them at will. In practice, the hard part, of course, is coming up with a solid cost model under which switching actually improves things.

