21 Aug 2014
As most of us know, performance isn't a one-dimensional spectrum. There are in fact multiple different ways to judge performance of a program. A commonly recognized tradeoff is that between CPU and memory usage. Often times, a program can be sped up by caching more data, for example.
conduit is a streaming data library. In that sense, it has two very specific performance criteria it aims for:
- Constant memory usage.
- Efficient usage of scarce resources, such as closing file descriptors as early as possible.
While CPU performance is always a nice goal, it has never been my top priority in the library's design, especially given that in the main use case for conduit (streaming data in an I/O context), the I/O cost almost always far outweighs any CPU overhead from conduit.
However, for our upcoming Integrated Analysis Platform (IAP) release, this is no longer the case. conduit will be used in tight loops, where we do need to optimize for the lowest CPU overhead possible.
This blog post covers the first set of optimizations I've applied to conduit. There is still more work to be done, and throughout this blogpost I'll be describing some of the upcoming changes I am attempting.
I'll give a brief summary up front:
- Applying the codensity transform results in much better complexity of monadic bind.
- We're also less reliant on rewrite rules firing, which has always been unreliable (and now I know why).
- This change does represent a breaking API change. However, it only affects users of the Data.Conduit.Internal module. If you've just been using the public API, your code will be unaffected, besides getting an automatic speedup.
- These changes will soon be released as conduit 1.2.0, after a period for community feedback.
Note that this blog post follows the actual steps I went through (more or less) in identifying the performance issues I wanted to solve. If you want to skip ahead to the solution itself, you may want to skip to the discussion on difference lists, or even straight to continuation passing style, church-encoding, codensity.
By the way, after I originally wrote this blog post, I continued working on the optimizations I describe as possible future enhancements. Those are actually working out far better than I expected, and it looks like conduit 1.2.0 will be able to ship with them. I'll be writing a separate blog post detailing those changes. A bit of a teaser is: for vector-equivalent code, conduit now generates identical core as vector itself.
The benchmarks
Before embarking on any kind of serious optimizations, it's important to have some benchmarks. I defined three benchmarks for the work I was going to be doing:
A simple sum: adding up the numbers from 1 to 10000. This is to get a baseline of the overhead coming from conduit.
A monte carlo analysis: This was based on a previous IAP blog post. I noticed when working on that benchmark that, while the conduit solution was highly memory efficient, there was still room to speed up the benchmark.
Sliding vectors: Naren Sundar recently sent a sliding windows pull request, which allows us to get a fixed-size view onto a stream of values. This feature is very useful for a number of financial analyses, especially regarding time series.
Naren's pull request was based on immutable data structures, and for those cases it is highly efficient. However, it's possible to be far more memory efficient by writing to a mutable vector instead, and then taking immutable slices of that vector. Mihaly Barasz sent a pull request for this feature, and much to our disappointment, for small window sizes it performed worse than the sliding window approach. We wanted to understand why.
You can see the benchmark code, which stays mostly unchanged for the rest of this blog post (a few new cases are added to demonstrate extra points). The benchmarks always contain a low-level base case representing the optimal performance we can expect from hand-written Haskell (without resorting to any kind of FFI tricks or the like).
You can see the first run results which reflect conduit 1.1.7, plus inlining of a few functions. Some initial analysis:
- Control.Monad.foldM is surprisingly slow.
- Data.Conduit.List.foldM has a rather steep performance hit versus Data.Conduit.List.fold.
- There's a very high overhead in the monte carlo analysis.
- For sliding vector, the conduit overhead is more pronounced at smaller window sizes.
- But even with large window sizes, mutable vector conduits still have a large overhead. The sliding window/immutable approach, however, shows almost no overhead.
That hopefully sets the scene enough for us to begin to dive in.
Rewrite rules: lift
GHC offers a very powerful optimization technique: rewrite rules. This allows you to tell the compiler that a certain expression can be rewritten to a more efficient one. A common example of a rewrite rule would be to state that map f . map g is the same as map (f . g). This can be expressed as:
{-# RULES "map f . map g" forall f g. map f . map g = map (f . g) #-}
Note that GHC's list rewrite rules are actually more complicated than this, and revolve around a concept called build/foldr fusion.
Let's look at the implementation of the yield function in conduit (with some newtypes stripped away):

```haskell
yield :: Monad m => o -> ConduitM i o m ()
yield o = HaveOutput (Done ()) (return ()) o
{-# INLINE [1] yield #-}

{-# RULES "yield o >> p" forall o (p :: ConduitM i o m r).
    yield o >> p = HaveOutput p (return ()) o
  #-}
```
The core datatype of conduit is recursive. The HaveOutput constructor contains a field for "what to do next." In the case of yield, there isn't anything to do next, so we fill that with Done (). However, creating that Done () value just to throw it away after a monadic bind is wasteful. So we have a rewrite rule to fuse those two steps together.
But no such rewrite rule exists for lift! My first step was to add such a rule, and check the results.
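For reference, the rule I added was along these lines (a sketch only, written in the same newtype-stripped style as the yield example above; the real rule has to account for the ConduitM wrapper):

```haskell
-- lift normally wraps its result in Done:
--   lift mr = PipeM (liftM Done mr)
-- The rule instead feeds the next computation straight into PipeM,
-- so the Done value is never allocated just to be bound away:
{-# RULES "lift mr >>= f" forall mr f.
    lift mr >>= f = PipeM (liftM f mr)
  #-}
```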
Unfortunately, the rule didn't have any real impact, because it wasn't firing.
Let's put that issue to the side; we'll come back to it later.
Cleanup, inlining
One of the nice features introduced in (I believe) GHC 7.8 is that the compiler will now warn you when a rewrite rule may not fire. When compiling conduit, I saw messages like:
```
Data/Conduit/List.hs:274:11: Warning:
    Rule "source/map fusion $=" may never fire
      because ‘$=’ might inline first
    Probable fix: add an INLINE[n] or NOINLINE[n] pragma on ‘$=’

Data/Conduit/List.hs:275:11: Warning:
    Rule "source/map fusion =$=" may never fire
      because ‘=$=’ might inline first
    Probable fix: add an INLINE[n] or NOINLINE[n] pragma on ‘=$=’

Data/Conduit/List.hs:542:11: Warning:
    Rule "source/filter fusion $=" may never fire
      because ‘$=’ might inline first
    Probable fix: add an INLINE[n] or NOINLINE[n] pragma on ‘$=’

Data/Conduit/List.hs:543:11: Warning:
    Rule "source/filter fusion =$=" may never fire
      because ‘=$=’ might inline first
    Probable fix: add an INLINE[n] or NOINLINE[n] pragma on ‘=$=’

Data/Conduit/List.hs:552:11: Warning:
    Rule "connect to sinkNull" may never fire
      because ‘$$’ might inline first
    Probable fix: add an INLINE[n] or NOINLINE[n] pragma on ‘$$’
```
This demonstrates an important interaction between inlining and rewrite rules. We need to make sure that expressions that need to be rewritten are not inlined first. If they are first inlined, then GHC won't be able to rewrite them to our more optimized version.
A common approach to this is to delay inlining of functions until a later simplification phase. The GHC simplification process runs in multiple steps, and we can state that rules and inlining should only happen before or after a certain phase. The phases count down from 2 to 0, so we commonly want to delay inlining of functions until phase 0, if they may be subject to rewriting.
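In pragma form, that is exactly the "probable fix" GHC suggests above. For example:

```haskell
-- Do not inline ($=) until simplifier phase 0, so that rules mentioning
-- ($=) on their left-hand side still have phases 2 and 1 to fire in:
{-# INLINE [0] ($=) #-}
```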
Conversely, some functions need to be inlined before a rewrite rule can fire. In stream fusion, for example, the fusion framework depends on the following sequencing to get good performance:
```haskell
map f . map g

-- inline map
unstream . mapS f . stream . unstream . mapS g . stream

-- rewrite stream . unstream
unstream . mapS f . mapS g . stream

-- rewrite mapS . mapS
unstream . mapS (f . g) . stream
```
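The two rewrites in that sequence correspond to rules along these lines (using the same illustrative stream/unstream/mapS names as above; this sketches stream fusion in general, not conduit's code):

```haskell
{-# RULES "stream/unstream" forall s.
    stream (unstream s) = s
  #-}

{-# RULES "mapS/mapS" forall f g s.
    mapS f (mapS g s) = mapS (f . g) s
  #-}
```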
In conduit, we need to make sure that all of this is happening in the correct order. There was one particular complexity that made it difficult to ensure this happened. conduit in fact has two core datatypes: Pipe and ConduitM, with the latter being a more friendly newtype wrapper around the former. Up until this point, the code for the two was jumbled into a single internal module, making it difficult to track which things were being written in which version of the API.
My next step was to split things into .Pipe and .Conduit internal modules, and then clean up GHC's warnings to get rules to fire more reliably. This gave a modest performance boost to the sliding vector benchmarks, but not much else. But it does pave the way for future improvements.
Getting serious about sum, by cheating
The results so far have been uninspiring. We've identified a core problem (too many of those Done data constructors being used), and noticed that the rewrite rules that should fix that don't seem to be doing their job. Now let's take our first stab at really improving performance: with aggressive rewrite rules.
Our sum benchmark is really simple: use enumFromTo to create a stream of values, and fold (or foldM) to consume that. The thing that slows us down is that, in between these two simple functions, we end up allocating a bunch of temporary data structures. Let's get rid of them with rewrite rules!
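In effect, such a rule replaces the whole enumFromTo-plus-fold pipeline with the kind of hand-written loop the benchmarks use as their low-level base case. A minimal sketch of what the fused result looks like (illustrative only, not the library's actual code):

```haskell
{-# LANGUAGE BangPatterns #-}

-- What the fused pipeline effectively becomes: a tight counting loop,
-- with no intermediate Pipe constructors allocated at all.
sumLowLevel :: Monad m => Int -> m Int
sumLowLevel hi =
    loop 0 1
  where
    loop !acc i
        | i > hi    = return acc
        | otherwise = loop (acc + i) (i + 1)
```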
This certainly did the trick. The conduit implementation jumped from 185us to just 8.63us. For comparison, the low level approach (or vector's stream fusion) clocks in at 5.77us, whereas foldl' on a list is 80.6us. This is a huge win!
But it's also misleading. All we've done here is sneakily rewritten our conduit algorithm into a low-level format. This solves the specific problem on the table (connecting enumFromTo with fold), but won't fully generalize to other cases. A more representative demonstration of this improvement is the speedup for foldM, which went from 1180us to 81us. The reason this is more realistic is that the rewrite rule is not specialized to enumFromTo, but rather works on any Source.
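To make "works on any Source" concrete, here is a rough sketch of the kind of loop such a rule can rewrite a connect-to-fold pipeline into, walking the Pipe constructors directly. This assumes the pre-1.2 representation (ConduitM as a plain newtype over Pipe) and is an illustration, not the exact code in the library:

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Conduit.Internal (ConduitM (..), Pipe (..), Source)

-- Strict left fold over any Source, consuming the Pipe constructors
-- directly instead of going through the generic connect machinery.
foldSource :: Monad m => (b -> a -> b) -> b -> Source m a -> m b
foldSource f b0 (ConduitM p0) =
    go b0 p0
  where
    go !b (HaveOutput p _ o) = go (f b o) p
    go !b (NeedInput _ c)    = go b (c ())
    go !b (Done ())          = return b
    go !b (PipeM mp)         = mp >>= go b
    go !b (Leftover p ())    = go b p
```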
I took a big detour at this point, and ended up writing an initial implementation of stream fusion in conduit. Unfortunately, I ran into a dead end on that branch, and had to put that work to the side temporarily. However, the improvements discussed in the rest of this blog post will hopefully reopen the door to stream fusion, which I hope to investigate next.
Monte carlo, and associativity
Now that I'd made the results of the sum benchmark thoroughly useless, I decided to focus on the results of monte carlo, where the low level implementation still won by a considerable margin (3.42ms vs 10.6ms). The question was: why was this happening? To understand, let's start by looking at the code:
```haskell
analysis = do
    successes <- sourceRandomN count $$ CL.fold
        (\t (x, y) ->
            if (x*x + y*(y :: Double) < 1)
                then t + 1
                else t)
        (0 :: Int)
    return $ fromIntegral successes / fromIntegral count * 4

sourceRandomN :: (MWC.Variate a, MonadIO m) => Int -> Source m a
sourceRandomN cnt0 = do
    gen <- liftIO MWC.createSystemRandom
    let loop 0 = return ()
        loop cnt = do
            liftIO (MWC.uniform gen) >>= yield >> loop (cnt - 1)
    loop cnt0
```
The analysis function is not very interesting: it simply connects sourceRandomN with a fold. Given that we now have a well behaved and consistently-firing rewrite rule for connecting to folds, it's safe to say that was not the source of our slowdown. So our slowdown must be coming from:
liftIO (MWC.uniform gen) >>= yield >> loop (cnt - 1)
This should in theory generate really efficient code. yield >> loop (cnt - 1) should be rewritten to \x -> HaveOutput (loop (cnt - 1)) (return ()) x, and then liftIO should get rewritten to generate:
```haskell
PipeM $ do
    x <- MWC.uniform gen
    return $ HaveOutput (loop $ cnt - 1) (return ()) x
```
I added another commit to include a few more versions of the monte carlo benchmark (results here). The two most interesting are:
Explicit usage of the Pipe constructors:

```haskell
sourceRandomNConstr :: (MWC.Variate a, MonadIO m) => Int -> Source m a
sourceRandomNConstr cnt0 = ConduitM $ PipeM $ do
    gen <- liftIO MWC.createSystemRandom
    let loop 0 = return $ Done ()
        loop cnt = do
            x <- liftIO (MWC.uniform gen)
            return $ HaveOutput (PipeM $ loop (cnt - 1)) (return ()) x
    loop cnt0
```
This version ran in 4.84ms, vs the original conduit version which ran in 15.8ms. So this is definitely the problem!
Explicitly force right-associated binding order:
```haskell
sourceRandomNBind :: (MWC.Variate a, MonadIO m) => Int -> Source m a
sourceRandomNBind cnt0 =
    lift (liftIO MWC.createSystemRandom) >>= \gen ->
        let loop 0 = return ()
            loop cnt = do
                lift (liftIO $ MWC.uniform gen) >>= (\o -> yield o >> loop (cnt - 1))
         in loop cnt0
```
Or to zoom in on the important bit:
lift (liftIO $ MWC.uniform gen) >>= (\o -> yield o >> loop (cnt - 1))
By the monad laws, this code is identical to the original. However, instead of standard left associativity, we have right associativity of monadic bind. This code ran in 5.19ms, an approximate threefold speedup vs the left associative code!
This issue of associativity was something Roman Cheplyaka told me about back in April, so I wasn't surprised to see it here. Back then, I'd looked into using Codensity together with ConduitM, but didn't get immediate results, and therefore postponed further research until I had more time.
OK, so why exactly does left-associativity hurt us so much? There are two reasons actually:
- Generally speaking, many monads perform better when they are right associated. This is especially true for free monads, of which conduit is just a special case. Janis Voigtländer's paper Asymptotic Improvement of Computations over Free Monads and Edward Kmett's blog post series free monads for less do a far better job of explaining the issue than I could.
- In the case of conduit, left associativity prevented the lift and yield rewrite rules from firing, which introduced extra, unnecessary monadic bind operations. Forcing right associativity allows these rules to fire, avoiding a lot of unnecessary data constructor allocation and analysis.
At this point, it became obvious that the main slowdown I was seeing was driven by this problem. The question is: how should we solve it?
Difference lists
To pave the way for the next step, I want to take a quick detour and talk about something simpler: difference lists. Consider the following code:
(((w ++ x) ++ y) ++ z)
Most experienced Haskellers will cringe upon reading that. The append operation for a list needs to traverse every cons cell in its left value. When we left-associate append operations like this, we will need to traverse every cell in w, then every cell in w ++ x, then every cell in w ++ x ++ y. This is highly inefficient, and would clearly be better done in a right-associated style (sound familiar?).
But forcing programmers to ensure that their code is always right-associated isn't always practical. So instead, we have two common alternatives. The first is: use a better datastructure. In particular, Data.Sequence has far cheaper append operations than lists.
The other approach is to use difference lists. Difference lists are functions instead of actual list values. They are instructions for adding values to the beginning of the list. In order to append, you use normal function composition. And to convert them to a list, you apply the resulting function to an empty list. As an example:
```haskell
type DList a = [a] -> [a]

dlist1 :: DList Int
dlist1 rest = 1 : 2 : rest

dlist2 :: DList Int
dlist2 rest = 3 : 4 : rest

final :: [Int]
final = dlist1 . dlist2 $ []

main :: IO ()
main = print final
```
Both difference lists and sequences have advantages. Probably the simplest summary is:
- Difference lists have smaller constant factors for appending.
- Sequences allow you to analyze them directly, without having to convert them to a different data type first.
That second point is important. If you need to regularly analyze your list and then continue to append, the performance of a difference list will be abysmal. You will constantly be swapping representations, and converting from a list to a difference list is an O(n) operation. But if you will simply be constructing a list once without any analysis, odds are difference lists will be faster.
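As a tiny illustration, reusing the DList type from above: even asking for the first element forces you to materialize the real list first.

```haskell
-- Analyzing a difference list means converting it back to a plain list:
headDList :: DList a -> Maybe a
headDList dl =
    case dl [] of
        []      -> Nothing
        (x : _) -> Just x
```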
This situation is almost identical to our problems with conduit. Our monadic composition operator, like list's append operator, needs to traverse the entire left hand side. This connection is more clearly spelled out in Reflection without Remorse by Atze van der Ploeg and Oleg Kiselyov (and was brought to my attention by Roman).
Alright, with that out of the way, let's finally fix conduit!
Continuation passing style, church-encoding, codensity
There are essentially two things we need to do with conduits:
- Monadically compose them to sequence two streams into a larger stream.
- Categorically compose them to connect one stream to the next in a pipeline.
The latter requires that we be able to case analyze our datatypes, while theoretically the former does not: something like difference lists for simple appending would be ideal. In the past, I've tried out a number of different alternative implementations of conduit, none of which worked well enough. The problem I always ran into was that either monadic bind became too expensive, or categorical composition became too expensive.
Roman, Mihaly, Edward and I discussed these issues a bit on Github, and based on Roman's advice, I went ahead with writing a benchmark of different conduit implementations. I currently have four implementations in this benchmark (and hope to add more):
- Standard, which looks very much like conduit 1.1, just a bit simplified (no rewrite rules, no finalizers, no leftovers).
- Free, which is conduit rewritten to explicitly use the free monad transformer.
- Church, which modifies Free to instead use the Church-encoded free monad transformer.
- Codensity, which is a Codensity-transform-inspired version of conduit.
You can see the benchmark results, which clearly show the codensity version to be the winner. Though it would be interesting, I think I'll avoid going into depth on the other three implementations for now (this blog post is long enough already).
What is Codensity?
Implementing Codensity in conduit just means changing the ConduitM newtype wrapper to look like this:

```haskell
newtype ConduitM i o m r = ConduitM
    { unConduitM :: forall b. (r -> Pipe i i o () m b) -> Pipe i i o () m b
    }
```
What this says is "I'm going to provide an r value. If you give me a function that needs an r value, I'll give it that r value and then continue with the resulting Pipe." Notice how similar this looks to the type signature of monadic bind itself:
(>>=) :: Pipe i i o () m r -> (r -> Pipe i i o () m b) -> Pipe i i o () m b
This isn't by chance, it's by construction. More information is available in the Haddocks of the kan-extensions package, or in the above-linked paper and blog posts by Janis and Edward. To see why this change is important, let's look at the new implementations of some of the core conduit functions and type classes:
```haskell
yield o = ConduitM $ \rest -> HaveOutput (rest ()) (return ()) o

await = ConduitM $ \f -> NeedInput (f . Just) (const $ f Nothing)

instance Monad (ConduitM i o m) where
    return x = ConduitM ($ x)
    ConduitM f >>= g = ConduitM $ \h -> f $ \a -> unConduitM (g a) h

instance MonadTrans (ConduitM i o) where
    lift mr = ConduitM $ \rest -> PipeM (liftM rest mr)
```
Instead of having explicit Done constructors in yield, await, and lift, we use the continuation rest. This is the exact same transformation we were previously relying on rewrite rules to provide. However, our rewrite rules couldn't fire properly in a left-associated monadic binding. Now we've avoided the whole problem!
Our Monad instance also became much smaller. Notice that in order to monadically compose, there is no longer any need to case-analyze the left hand side, which avoids the high penalty of left association.
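To see concretely why association no longer matters, here is a rough unfolding of a left-nested bind using the instance above (writing m = ConduitM mf):

```haskell
--   (m >>= f) >>= g
-- unfolds, by the definition of (>>=) above, to
--   ConduitM $ \h ->
--       mf (\a -> unConduitM (f a) (\b -> unConduitM (g b) h))
--
-- The continuations nest to the right no matter how the source code
-- associated its binds, and no intermediate Pipe value is ever built
-- and then case-analyzed again.
```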
Another interesting quirk is that our Monad instance on ConduitM no longer requires that the base m type constructor itself be a Monad. This is a nice feature of Codensity.
So that's half the story. What about categorical composition? That certainly does require analyzing both the left and right hand structures. So don't we lose all of our speed gains of Codensity with this? Actually, I think not. Let's look at the code for categorical composition:
```haskell
ConduitM left0 =$= ConduitM right0 = ConduitM $ \rest ->
    let goRight final left right =
            case right of
                HaveOutput p c o  -> HaveOutput (recurse p) (c >> final) o
                NeedInput rp rc   -> goLeft rp rc final left
                Done r2           -> PipeM (final >> return (rest r2))
                PipeM mp          -> PipeM (liftM recurse mp)
                Leftover right' i -> goRight final (HaveOutput left final i) right'
          where
            recurse = goRight final left

        goLeft rp rc final left =
            case left of
                HaveOutput left' final' o -> goRight final' left' (rp o)
                NeedInput left' lc        -> NeedInput (recurse . left') (recurse . lc)
                Done r1                   -> goRight (return ()) (Done r1) (rc r1)
                PipeM mp                  -> PipeM (liftM recurse mp)
                Leftover left' i          -> Leftover (recurse left') i
          where
            recurse = goLeft rp rc final

     in goRight (return ()) (left0 Done) (right0 Done)
```
In the last line, we apply left0 and right0 to Done, which is how we convert our Codensity version into something we can actually analyze. (This is equivalent to applying a difference list to an empty list.) We then traverse these values in the same way that we did in conduit 1.1 and earlier.
The important difference is how we ultimately finish. The code in question is the Done clause of goRight's case analysis, namely:
Done r2 -> PipeM (final >> return (rest r2))
Notice the usage of rest, instead of what we would have previously done: used the Done constructor. By doing this, we're immediately recreating a Codensity version of our resulting Pipe, which allows us to only traverse our incoming Pipe values once each, and not need to retraverse the outgoing Pipe for future monadic binding.
This trick doesn't just work for composition. There are a large number of functions in conduit that need to analyze a Pipe, such as addCleanup and catchC. All of them are now implemented in this same style.
After implementing this change, the resulting benchmarks look much better. The naive implementation of monte carlo is now quite close to the low-level version (5.28ms vs 3.44ms, as opposed to the original 15ms). Sliding vector is also much better: the unboxed, 1000-size window benchmark went from 7.96ms to 4.05ms, vs a low-level implementation at 1.87ms.
Type-indexed sequences
One approach that I haven't tried yet is the type-indexed sequence approach from Reflection without Remorse. I still intend to add it to my conduit benchmark, but I'm not optimistic about it beating out Codensity. My guess is that a sequence data type will have a higher constant factor overhead, and based on the way composition is implemented in conduit, we won't get any benefit from avoiding the need to transition between two representations.
Edward said he's hoping to get an implementation of such a data structure into the free package, at which point I'll update my benchmark to see how it performs.
To pursue next: streamProducer, streamConsumer, and more
While this round of benchmarking produced some very nice results, we're clearly not yet at the same level as low-level code. My goal is to focus on that next. I have some experiments going already relating to getting conduit to expose stream fusion rules. In simple cases, I've generated a conduit-compatible API with the same performance as vector.
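To give a flavor of those experiments, the core idea is a vector-style stream representation that stream-aware producers and consumers can fuse through. A rough sketch of the shape (the names here are illustrative, not the shipped conduit API):

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- One step of a stream: emit a value with a new state, skip, or stop.
data Step s o
    = Emit s o
    | Skip s
    | Stop

-- A stream is a monadic step function over some hidden state type s.
data Stream m o = forall s. Stream
    (s -> m (Step s o))  -- step
    (m s)                -- initial state
```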
The sticking point is getting something which is efficient not just for functions explicitly written in stream style, but also provides decent performance when composed with the await/yield approach. While the latter approach will almost certainly be slower than stream fusion, I'm hoping we can get it to degrade to current-conduit performance levels, and allow stream fusion to provide a significant speedup when categorically composing two Conduits written in that style.
The code discussed in this post is now available on the next-cps branch of conduit. conduit-extra, conduit-combinators, and a number of other packages either compile out-of-the-box with these changes, or require minor tweaks (already implemented), so I'm hoping that this API change does not affect too many people.
As I mentioned initially, I'd like to have some time for community discussion on this before I make this next release.