27 May 2016
While working for various clients that needed fast binary
serialization, we discovered that the binary and cereal packages are
both inefficient, so we created the store package.
In the high-frequency trading sector, we had to decode and encode a
binary protocol into Haskell data structures for analysis. During this
process it became apparent that while we had been attentive to
micro-benchmarking with the venerable
criterion
package, we
hadn't put a lot of work into ensuring that memory usage was well
studied. Bringing down allocations (and thus work, and garbage
collection) was key to achieving reasonable speed.
Let's measure space
In response, let's measure space more, in an automatic way.
The currently available way to do this is to compile with profiling
enabled, add cost centres, and then run the program with RTS
options. For example, we write a program with an SCC cost centre,
like this:
main :: IO ()
main = do
  let !_ = {-# SCC myfunction_10 #-} myfunction 10
  return ()
Then compile with profiling enabled (the -prof flag), run with
+RTS -P, and we get an output like this:
COST CENTRE MODULE no. entries ... bytes
MAIN MAIN 43 0 ... 760
CAF:main1 Main 85 0 ... 0
main Main 86 1 ... 0
myfunction_10 Main 87 1 ... 160
(Information omitted with ... to save space.)
That's great: exactly the kind of information we'd like to get. But we want it in a more concise, programmatic fashion, at the test-suite level.
Announcing weigh
To serve this purpose, I've written the
weigh
package, which seeks to
automate the measuring of memory usage of programs, in the same way
that criterion
does for timing of programs.
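Under the hood, this kind of number can be obtained from GHC's per-thread allocation counter. The following is a toy sketch of the mechanism only, not weigh's actual implementation, and measureAllocs is an illustrative name:

```haskell
import Control.Exception (evaluate)
import GHC.Conc (getAllocationCounter, setAllocationCounter)

-- GHC keeps a per-thread allocation counter that counts *down* as
-- the thread allocates, so the difference before and after running
-- an action approximates the bytes it allocated.
measureAllocs :: IO a -> IO (a, Int)
measureAllocs action = do
  setAllocationCounter maxBound
  result <- action
  remaining <- getAllocationCounter
  return (result, fromIntegral (maxBound - remaining))
```

For example, `measureAllocs (evaluate (sum [1 .. 1000 :: Integer]))` reports the bytes allocated while summing the intermediate Integers.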
It doesn't promise perfect measurement, and results should be taken
with a grain of salt, but it's reproducible. Unlike timing, allocation is generally reliable
provided you use something like stack
to
pin the GHC version and packages, so you can also make a test suite
out of it.
How it works
There is a simple DSL, like hspec's, for writing out your tests. It
looks like this:
import Weigh

main =
  mainWith
    (do func "integers count 0" count 0
        func "integers count 1" count 1
        func "integers count 2" count 2
        func "integers count 3" count 3
        func "integers count 10" count 10
        func "integers count 100" count 100)
  where
    count :: Integer -> ()
    count 0 = ()
    count a = count (a - 1)
This example weighs the function count, which counts down to
zero. We want to measure the bytes allocated to perform the
action. The output is:
Case Bytes GCs Check
integers count 0 0 0 OK
integers count 1 32 0 OK
integers count 2 64 0 OK
integers count 3 96 0 OK
integers count 10 320 0 OK
integers count 100 3,200 0 OK
Weee! We can now go around weighing everything! I encourage you to do that. Even Haskell newbies can make use of this to get a vague idea of how costly their code (or libraries they're using) is.
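Because these numbers are reproducible, they can also back a regression test. The sketch below assumes weigh provides validateFunc and maxAllocs combinators for setting allocation budgets; check the package documentation for the exact current API:

```haskell
import Weigh

count :: Integer -> ()
count 0 = ()
count a = count (a - 1)

-- Hypothetical regression suite: validateFunc and maxAllocs are
-- assumed from weigh's API (verify the names against the package
-- docs). A case is reported as failing if it allocates more than
-- its byte budget.
allocationBudgets :: Weigh ()
allocationBudgets = do
  validateFunc "integers count 10" count 10 (maxAllocs 320)
  validateFunc "integers count 100" count 100 (maxAllocs 3200)
```

Wire it up with `main = mainWith allocationBudgets` and run it as an ordinary test-suite executable.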
Real-world use-case: store
I wrote a few tests, while developing weigh
, for the store
package: encoding of lists, vectors and storable vectors. Here's the
criterion
result for encoding a regular Vector
type:
benchmarking encode/1kb normal (Vector Int32)/store
time 3.454 μs (3.418 μs .. 3.484 μs)
benchmarking encode/1kb normal (Vector Int32)/cereal
time 19.56 μs (19.34 μs .. 19.79 μs)
benchmarking encode/10kb normal (Vector Int32)/store
time 33.09 μs (32.73 μs .. 33.57 μs)
benchmarking encode/10kb normal (Vector Int32)/cereal
time 202.7 μs (201.2 μs .. 204.6 μs)
store is 6x faster than cereal at encoding Int32 vectors. Great! Our
job is done, we've overcome previous limitations of binary encoding
speed. Let's take a look at how heavy this process is. Weighing the
program on 1 million and 10 million elements yields:
1,000,000 Boxed Vector Int Encode: Store 88,008,584 140 OK
1,000,000 Boxed Vector Int Encode: Cereal 600,238,200 1,118 OK
10,000,000 Boxed Vector Int Encode: Store 880,078,896 1,384 OK
10,000,000 Boxed Vector Int Encode: Cereal 6,002,099,680 11,168 OK
store is 6.8x more memory efficient than cereal. Excellent. But is our
job really finished? Take a look at those allocations. To simply
allocate a vector of that size, it's:
1,000,000 Boxed Vector Int Allocate 8,007,936 1 OK
10,000,000 Boxed Vector Int Allocate 80,078,248 1 OK
While store is more efficient than cereal, how are we allocating 11x
the amount of space necessary? We looked into this in the codebase; it
turned out more inlining was needed. After comprehensively applying
the INLINE
pragma to key methods and functions, the memory was
brought down to:
1,000,000 Boxed Vector Int Allocate 8,007,936 1 OK
1,000,000 Boxed Vector Int Encode: Store 16,008,568 2 OK
1,000,000 Boxed Vector Int Encode: Cereal 600,238,200 1,118 OK
10,000,000 Boxed Vector Int Allocate 80,078,248 1 OK
10,000,000 Boxed Vector Int Encode: Store 160,078,880 2 OK
10,000,000 Boxed Vector Int Encode: Cereal 6,002,099,680 11,168 OK
Now, store
takes an additional 8MB to encode an 8MB vector, 80MB for
an 80MB buffer. That's perfect 1:1 memory usage! Let's check out the
new speed without these allocations:
benchmarking encode/1kb normal (Vector Int32)/store
time 848.4 ns (831.6 ns .. 868.6 ns)
benchmarking encode/1kb normal (Vector Int32)/cereal
time 20.80 μs (20.33 μs .. 21.20 μs)
benchmarking encode/10kb normal (Vector Int32)/store
time 7.708 μs (7.606 μs .. 7.822 μs)
benchmarking encode/10kb normal (Vector Int32)/cereal
time 207.4 μs (204.9 μs .. 210.3 μs)
store is 4x faster than previously! store is also now about 25x faster
than cereal at encoding a vector of ints.
Containers vs unordered-containers
Another quick example: the map structures from the containers and
unordered-containers packages. Let's weigh how heavy fromList is on
1 million elements. For fun, the keys are randomly generated rather
than ordered. We force the list completely ahead of time, because we
just want to see the allocations by the library, not by our input
list.
fromlists :: Weigh ()
fromlists = do
  let !elems =
        force (zip (randoms (mkStdGen 0) :: [Int])
                   [1 :: Int .. 1000000])
  func "Data.Map.Strict.fromList (1 million)" Data.Map.Strict.fromList elems
  func "Data.Map.Lazy.fromList (1 million)" Data.Map.Lazy.fromList elems
  func "Data.IntMap.Strict.fromList (1 million)" Data.IntMap.Strict.fromList elems
  func "Data.IntMap.Lazy.fromList (1 million)" Data.IntMap.Lazy.fromList elems
  func "Data.HashMap.Strict.fromList (1 million)" Data.HashMap.Strict.fromList elems
  func "Data.HashMap.Lazy.fromList (1 million)" Data.HashMap.Lazy.fromList elems
We clearly see that IntMap from containers is about 1.3x more
memory efficient than the generic Ord-based Map. However, HashMap
wipes the floor with both of them (for Int keys, at least), using
6.3x less memory than Map and 4.8x less memory than IntMap:
Data.Map.Strict.fromList (1 million) 1,016,187,152 1,942 OK
Data.Map.Lazy.fromList (1 million) 1,016,187,152 1,942 OK
Data.IntMap.Strict.fromList (1 million) 776,852,648 1,489 OK
Data.IntMap.Lazy.fromList (1 million) 776,852,648 1,489 OK
Data.HashMap.Strict.fromList (1 million) 161,155,384 314 OK
Data.HashMap.Lazy.fromList (1 million) 161,155,384 314 OK
As you can see above, it takes just a few trivial lines of code to generate this result.
Caveat
But beware: it's not going to be obvious exactly where allocations are coming from in the computation (if you need to know that, use the profiler). It's better to consider a computation holistically: this is how much was allocated to produce this result.
Analysis at finer granularity is likely to be guess-work (beyond even what's available in profiling). For the brave, let's study some examples of that.
Interpreting the results: Integer
Notice that in the table we generated, there is a rather odd increase of allocations:
Case Bytes GCs Check
integers count 0 0 0 OK
integers count 1 32 0 OK
integers count 2 64 0 OK
integers count 3 96 0 OK
integers count 10 320 0 OK
integers count 100 3,200 0 OK
What's the explanation for those bytes in each iteration?
Refreshing our memory: The space taken up by a "small" Integer is
two machine words. On
64-bit that's 16 bytes. Integer
is defined like this:
data Integer
  = S# Int#            -- small integers
  | J# Int# ByteArray# -- large integers
For the rest, we'd expect only 16 bytes per iteration, but we're
seeing more than that. Why? Let's look at
the Core
for count
:
Main.main48 = __integer 0
Main.main41 = __integer 1
Rec {
Main.main_count [Occ=LoopBreaker] :: Integer -> ()
[GblId, Arity=1, Str=DmdType <S,U>]
Main.main_count =
  \ (ds_d4Am :: Integer) ->
    case eqInteger# ds_d4Am Main.main48 of wild_a4Fq { __DEFAULT ->
    case ghc-prim-0.4.0.0:GHC.Prim.tagToEnum# @ Bool wild_a4Fq
    of _ [Occ=Dead] {
      False -> Main.main_count (minusInteger ds_d4Am Main.main41);
      True -> ghc-prim-0.4.0.0:GHC.Tuple.()
    }
    }
end Rec }
The eqInteger#
function is
a pretend-primop,
which apparently combines with tagToEnum#
and is
optimized away at
the code generation phase.
This should lead to an unboxed comparison of something like Int#
,
which should not allocate. This leaves only the addition operation,
which should allocate one new 16-byte Integer
.
So where are those additional 16 bytes from?
It turns out that minusInteger for Integer types is actually
implemented as x + (-y):
-- TODO
-- | Subtract two 'Integer's from each other.
minusInteger :: Integer -> Integer -> Integer
minusInteger x y = inline plusInteger x (inline negateInteger y)
This means we're allocating one more Integer. That explains the additional 16 bytes!
There's a TODO
there. I guess someone implemented negateInteger
and plusInteger
(which is
non-trivial)
and had enough.
If we implement a second function count'
that takes this into
account,
import Weigh

main :: IO ()
main =
  mainWith
    (do func "integers count 0" count 0
        func "integers count 1" count 1
        func "integers count 2" count 2
        func "integers count 3" count 3
        func "integers count' 0" count' 0
        func "integers count' 1" count' 1
        func "integers count' 2" count' 2
        func "integers count' 3" count' 3)
  where
    count :: Integer -> ()
    count 0 = ()
    count a = count (a - 1)
    count' :: Integer -> ()
    count' 0 = ()
    count' a = count' (a + (-1))
we get more reasonable allocations:
Case Bytes GCs Check
integers count 0 0 0 OK
integers count 1 32 0 OK
integers count 2 64 0 OK
integers count 3 96 0 OK
integers count' 0 0 0 OK
integers count' 1 16 0 OK
integers count' 2 32 0 OK
integers count' 3 48 0 OK
It turns out that count'
is 20% faster (from criterion
benchmarks), but realistically, if speed matters, we'd be using Int
,
which is practically 1000x faster.
What did we learn? Even something as simple as Integer
subtraction
doesn't behave as you would naively expect.
Considering a different type: Int
Comparatively, let's look at Int
:
import Weigh

main =
  mainWith
    (do func "int count 0" count 0
        func "int count 1" count 1
        func "int count 10" count 10
        func "int count 100" count 100)
  where
    count :: Int -> ()
    count 0 = ()
    count a = count (a - 1)
The output is:
Case Bytes GCs Check
int count 0 0 0 OK
int count 1 0 0 OK
int count 10 0 0 OK
int count 100 0 0 OK
It allocates zero bytes. Why? Let's take a look at the Core:
Rec { Main.$wcount1 [InlPrag=[0], Occ=LoopBreaker] :: ghc-prim-0.4.0.0:GHC.Prim.Int# -> () [GblId, Arity=1, Caf=NoCafRefs, Str=DmdType <S,1*U>] Main.$wcount1 = \ (ww_s57C :: ghc-prim-0.4.0.0:GHC.Prim.Int#) -> case ww_s57C of ds_X4Gu { __DEFAULT -> Main.$wcount1 (ghc-prim-0.4.0.0:GHC.Prim.-# ds_X4Gu 1); 0 -> ghc-prim-0.4.0.0:GHC.Tuple.() } end Rec }
It's clear that GHC is able to optimize this tight loop, and unbox the
Int
into an Int#
, which can be put into a register rather than
being allocated by the GHC runtime allocator to be freed later.
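The same zero-allocation result extends to any tight loop whose accumulator is strict, assuming compilation with -O; a small sketch (sumTo is an illustrative name, not from the article):

```haskell
{-# LANGUAGE BangPatterns #-}

-- With a strict accumulator, GHC's worker/wrapper transformation
-- can keep both Ints unboxed in registers, so the loop allocates
-- nothing, just like count above. A lazy accumulator would instead
-- build up a chain of thunks on the heap.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go !acc i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)
```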
The lesson is not to take for granted that everything has a 1:1 memory mapping at runtime with your source, and to take each case in context.
Data structures
Finally, from our contrived examples we can take a look at user-defined data types and observe some of the optimizations that GHC does for memory.
Let's demonstrate that unpacking a data structure uses less
memory. Here is a data type that contains an Int:
data HasInt = HasInt !Int
  deriving (Generic)
instance NFData HasInt
Here are two data types which both wrap HasInt: the first stores it
as a regular field, and the second unpacks it.
data HasPacked = HasPacked HasInt
  deriving (Generic)
instance NFData HasPacked

data HasUnpacked = HasUnpacked {-# UNPACK #-} !HasInt
  deriving (Generic)
instance NFData HasUnpacked
We can measure the difference by weighing them like this:
-- | Weigh: packing vs no packing.
packing :: Weigh ()
packing = do
  func "\\x -> HasInt x" (\x -> HasInt x) 5
  func "\\x -> HasPacked (HasInt x)" (\x -> HasPacked (HasInt x)) 5
  func "\\x -> HasUnpacked (HasInt x)" (\x -> HasUnpacked (HasInt x)) 5
The output is:
\x -> HasInt x                  16  0  OK
\x -> HasPacked (HasInt x)      32  0  OK
\x -> HasUnpacked (HasInt x)    16  0  OK
Voilà! Here we've demonstrated that:
- HasInt x consists of the 8-byte header for the constructor, and 8 bytes for the Int.
- HasPacked has 8 bytes for the constructor, 8 bytes for the first slot (a pointer), then another 8 bytes for the HasInt constructor, and finally 8 bytes for the Int itself.
- HasUnpacked only allocates 8 bytes for the header, and 8 bytes for the Int.
GHC did what we wanted!
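As a sanity check, the layout arithmetic behind those numbers can be written out directly (a sketch assuming an 8-byte machine word; the names are illustrative):

```haskell
-- Back-of-the-envelope heap arithmetic, assuming a 64-bit word:
-- every heap object has a one-word header, plus one word per
-- constructor field. UNPACK merges the inner constructor's payload
-- into the parent object, eliminating its header and the pointer.
wordSize :: Int
wordSize = 8

hasIntBytes, hasPackedBytes, hasUnpackedBytes :: Int
hasIntBytes      = wordSize + wordSize               -- header + Int field          = 16
hasPackedBytes   = wordSize + wordSize + hasIntBytes -- header + pointer + HasInt   = 32
hasUnpackedBytes = wordSize + wordSize               -- header + unpacked Int#      = 16
```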
Summary
We've looked at:
- What led to this package.
- A proposal that we start measuring our functions more, especially in libraries.
- How to use this package.
- Some of our use-cases at FP Complete.
- Caveats.
- Some contrived examples that do not lead to obvious explanations.
Now I encourage you to try it out!