27 May 2016
While working for various clients that needed fast binary
serialization, we discovered that the binary and cereal packages are
both inefficient, so we created the store package.
In the high-frequency trading sector, we had to decode and encode a
binary protocol into Haskell data structures for analysis. During this
process it became apparent that while we had been attentive to
micro-benchmarking with the venerable
criterion
package, we
hadn't put a lot of work into ensuring that memory usage was well
studied. Bringing down allocations (and thus work, and garbage
collection) was key to achieving reasonable speed.
Let's measure space
In response, let's measure space more, in an automatic way.
The currently available way to do this is to compile with profiling
enabled, add cost centres, and then run the program with RTS
options. For example, we write a program with an SCC cost centre,
like this:
main :: IO ()
main = do
  let !_ = {-# SCC myfunction_10 #-} myfunction 10
  return ()
Then compile with profiling enabled (the -prof flag), run with
+RTS -P, and we get an output like this:
COST CENTRE MODULE no. entries ... bytes
MAIN MAIN 43 0 ... 760
CAF:main1 Main 85 0 ... 0
main Main 86 1 ... 0
myfunction_10 Main 87 1 ... 160
(Information omitted with ... to save space.)
That's great: exactly the kind of information we'd like to get. But we want it in a more concise, programmatic fashion, at the test-suite level.
Announcing weigh
To serve this purpose, I've written the
weigh
package, which seeks to
automate the measuring of memory usage of programs, in the same way
that criterion
does for timing of programs.
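Under the hood, this kind of number can be obtained from GHC's per-thread allocation counter. The following is a toy sketch of the mechanism only, not weigh's actual implementation, and measureAllocs is an illustrative name:

```haskell
import Control.Exception (evaluate)
import GHC.Conc (getAllocationCounter, setAllocationCounter)

-- GHC keeps a per-thread allocation counter that counts *down* as
-- the thread allocates, so the difference before and after running
-- an action approximates the bytes it allocated.
measureAllocs :: IO a -> IO (a, Int)
measureAllocs action = do
  setAllocationCounter maxBound
  result <- action
  remaining <- getAllocationCounter
  return (result, fromIntegral (maxBound - remaining))
```

For example, `measureAllocs (evaluate (sum [1 .. 1000 :: Integer]))` reports the bytes allocated while summing the intermediate Integers.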
It doesn't promise perfect measurement, and results should be taken
with a grain of salt, but it's reproducible. Unlike timing, allocation is generally reliable
provided you use something like stack
to
pin the GHC version and packages, so you can also make a test suite
out of it.
How it works
There is a simple DSL, like hspec's, for writing out your tests. It
looks like this:
import Weigh

main =
  mainWith
    (do func "integers count 0" count 0
        func "integers count 1" count 1
        func "integers count 2" count 2
        func "integers count 3" count 3
        func "integers count 10" count 10
        func "integers count 100" count 100)
  where
    count :: Integer -> ()
    count 0 = ()
    count a = count (a - 1)
This example weighs the function count, which counts down to
zero. We want to measure the bytes allocated to perform the
action. The output is:
Case Bytes GCs Check
integers count 0 0 0 OK
integers count 1 32 0 OK
integers count 2 64 0 OK
integers count 3 96 0 OK
integers count 10 320 0 OK
integers count 100 3,200 0 OK
Weee! We can now go around weighing everything! I encourage you to do that. Even Haskell newbies can make use of this to get a vague idea of how costly their code (or libraries they're using) is.
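Because these numbers are reproducible, they can also back a regression test. The sketch below assumes weigh provides validateFunc and maxAllocs combinators for setting allocation budgets; check the package documentation for the exact current API:

```haskell
import Weigh

count :: Integer -> ()
count 0 = ()
count a = count (a - 1)

-- Hypothetical regression suite: validateFunc and maxAllocs are
-- assumed from weigh's API (verify the names against the package
-- docs). A case is reported as failing if it allocates more than
-- its byte budget.
allocationBudgets :: Weigh ()
allocationBudgets = do
  validateFunc "integers count 10" count 10 (maxAllocs 320)
  validateFunc "integers count 100" count 100 (maxAllocs 3200)
```

Wire it up with `main = mainWith allocationBudgets` and run it as an ordinary test-suite executable.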
Real-world use-case: store
I wrote a few tests, while developing weigh
, for the store
package: encoding of lists, vectors and storable vectors. Here's the
criterion
result for encoding a regular Vector
type:
benchmarking encode/1kb normal (Vector Int32)/store
time 3.454 μs (3.418 μs .. 3.484 μs)
benchmarking encode/1kb normal (Vector Int32)/cereal
time 19.56 μs (19.34 μs .. 19.79 μs)
benchmarking encode/10kb normal (Vector Int32)/store
time 33.09 μs (32.73 μs .. 33.57 μs)
benchmarking encode/10kb normal (Vector Int32)/cereal
time 202.7 μs (201.2 μs .. 204.6 μs)
store is 6x faster than cereal at encoding Int32 vectors. Great! Our
job is done, we've overcome previous limitations of binary encoding
speed. Let's take a look at how heavy this process is. Weighing the
program on 1 million and 10 million elements yields:
1,000,000 Boxed Vector Int Encode: Store 88,008,584 140 OK
1,000,000 Boxed Vector Int Encode: Cereal 600,238,200 1,118 OK
10,000,000 Boxed Vector Int Encode: Store 880,078,896 1,384 OK
10,000,000 Boxed Vector Int Encode: Cereal 6,002,099,680 11,168 OK
store is 6.8x more memory efficient than cereal. Excellent. But is our
job really finished? Take a look at those allocations. To simply
allocate a vector of that size, it's:
1,000,000 Boxed Vector Int Allocate 8,007,936 1 OK
10,000,000 Boxed Vector Int Allocate 80,078,248 1 OK
While store is more efficient than cereal, how are we allocating 11x
the amount of space necessary? We looked into this in the codebase; it
turned out more inlining was needed. After comprehensively applying
the INLINE
pragma to key methods and functions, the memory was
brought down to:
1,000,000 Boxed Vector Int Allocate 8,007,936 1 OK
1,000,000 Boxed Vector Int Encode: Store 16,008,568 2 OK
1,000,000 Boxed Vector Int Encode: Cereal 600,238,200 1,118 OK
10,000,000 Boxed Vector Int Allocate 80,078,248 1 OK
10,000,000 Boxed Vector Int Encode: Store 160,078,880 2 OK
10,000,000 Boxed Vector Int Encode: Cereal 6,002,099,680 11,168 OK
Now, store
takes an additional 8MB to encode an 8MB vector, 80MB for
an 80MB buffer. That's perfect 1:1 memory usage! Let's check out the
new speed without these allocations:
benchmarking encode/1kb normal (Vector Int32)/store
time 848.4 ns (831.6 ns .. 868.6 ns)
benchmarking encode/1kb normal (Vector Int32)/cereal
time 20.80 μs (20.33 μs .. 21.20 μs)
benchmarking encode/10kb normal (Vector Int32)/store
time 7.708 μs (7.606 μs .. 7.822 μs)
benchmarking encode/10kb normal (Vector Int32)/cereal
time 207.4 μs (204.9 μs .. 210.3 μs)
store is 4x faster than previously! store is also now about 25x faster
than cereal at encoding a vector of ints.
Containers vs unordered-containers
Another quick example: the map structures from the containers and
unordered-containers packages. Let's weigh how heavy fromList is on
1 million elements. For fun, the keys are randomly generated rather
than ordered. We force the list completely ahead of time, because we
just want to see the allocations by the library, not by our input
list.
fromlists :: Weigh ()
fromlists = do
  let !elems =
        force (zip (randoms (mkStdGen 0) :: [Int])
                   [1 :: Int .. 1000000])
  func "Data.Map.Strict.fromList (1 million)" Data.Map.Strict.fromList elems
  func "Data.Map.Lazy.fromList (1 million)" Data.Map.Lazy.fromList elems
  func "Data.IntMap.Strict.fromList (1 million)" Data.IntMap.Strict.fromList elems
  func "Data.IntMap.Lazy.fromList (1 million)" Data.IntMap.Lazy.fromList elems
  func "Data.HashMap.Strict.fromList (1 million)" Data.HashMap.Strict.fromList elems
  func "Data.HashMap.Lazy.fromList (1 million)" Data.HashMap.Lazy.fromList elems
We clearly see that IntMap from containers is about 1.3x more
memory efficient than the generic Ord-based Map. However, HashMap
wipes the floor with both of them (for Int keys, at least), using
6.3x less memory than Map and 4.8x less memory than IntMap:
Data.Map.Strict.fromList (1 million) 1,016,187,152 1,942 OK
Data.Map.Lazy.fromList (1 million) 1,016,187,152 1,942 OK
Data.IntMap.Strict.fromList (1 million) 776,852,648 1,489 OK
Data.IntMap.Lazy.fromList (1 million) 776,852,648 1,489 OK
Data.HashMap.Strict.fromList (1 million) 161,155,384 314 OK
Data.HashMap.Lazy.fromList (1 million) 161,155,384 314 OK
As you can see above, it takes just a few trivial lines of code to generate this result.
Caveat
But beware: it's not going to be obvious exactly where allocations are coming from in the computation (if you need to know that, use the profiler). It's better to consider a computation holistically: this is how much was allocated to produce this result.
Analysis at finer granularity is likely to be guess-work (beyond even what's available in profiling). For the brave, let's study some examples of that.
Interpreting the results: Integer
Notice that in the table we generated, there is a rather odd increase of allocations:
Case Bytes GCs Check
integers count 0 0 0 OK
integers count 1 32 0 OK
integers count 2 64 0 OK
integers count 3 96 0 OK
integers count 10 320 0 OK
integers count 100 3,200 0 OK
What's the explanation for those bytes in each iteration?
Refreshing our memory: The space taken up by a "small" Integer is
two machine words. On
64-bit that's 16 bytes. Integer
is defined like this:
data Integer
  = S# Int#            -- small integers
  | J# Int# ByteArray# -- large integers
For the rest, we'd expect only 16 bytes per iteration, but we're
seeing more than that. Why? Let's look at
the Core
for count
:
Main.main48 = __integer 0
Main.main41 = __integer 1
Rec {
Main.main_count [Occ=LoopBreaker] :: Integer -> ()
[GblId, Arity=1, Str=DmdType <S,U>]
Main.main_count =
  \ (ds_d4Am :: Integer) ->
    case eqInteger# ds_d4Am Main.main48 of wild_a4Fq { __DEFAULT ->
    case ghc-prim-0.4.0.0:GHC.Prim.tagToEnum# @ Bool wild_a4Fq
    of _ [Occ=Dead] {
      False -> Main.main_count (minusInteger ds_d4Am Main.main41);
      True -> ghc-prim-0.4.0.0:GHC.Tuple.()
    }
    }
end Rec }
The eqInteger#
function is
a pretend-primop,
which apparently combines with tagToEnum#
and is
optimized away at
the code generation phase.
This should lead to an unboxed comparison of something like Int#
,
which should not allocate. This leaves only the addition operation,
which should allocate one new 16-byte Integer
.
So where are those additional 16 bytes from?
It turns out that minusInteger for Integer types is actually
implemented as x + (-y):
-- TODO
-- | Subtract two 'Integer's from each other.
minusInteger :: Integer -> Integer -> Integer
minusInteger x y = inline plusInteger x (inline negateInteger y)
This means we're allocating one more Integer. That explains the additional 16 bytes!
There's a TODO
there. I guess someone implemented negateInteger
and plusInteger
(which is
non-trivial)
and had enough.
If we implement a second function count'
that takes this into
account,
import Weigh

main :: IO ()
main =
  mainWith
    (do func "integers count 0" count 0
        func "integers count 1" count 1
        func "integers count 2" count 2
        func "integers count 3" count 3
        func "integers count' 0" count' 0
        func "integers count' 1" count' 1
        func "integers count' 2" count' 2
        func "integers count' 3" count' 3)
  where
    count :: Integer -> ()
    count 0 = ()
    count a = count (a - 1)
    count' :: Integer -> ()
    count' 0 = ()
    count' a = count' (a + (-1))
we get more reasonable allocations:
Case Bytes GCs Check
integers count 0 0 0 OK
integers count 1 32 0 OK
integers count 2 64 0 OK
integers count 3 96 0 OK
integers count' 0 0 0 OK
integers count' 1 16 0 OK
integers count' 2 32 0 OK
integers count' 3 48 0 OK
It turns out that count'
is 20% faster (from criterion
benchmarks), but realistically, if speed matters, we'd be using Int
,
which is practically 1000x faster.
What did we learn? Even something as simple as Integer
subtraction
doesn't behave as you would naively expect.
Considering a different type: Int
Comparatively, let's look at Int
:
import Weigh

main =
  mainWith
    (do func "int count 0" count 0
        func "int count 1" count 1
        func "int count 10" count 10
        func "int count 100" count 100)
  where
    count :: Int -> ()
    count 0 = ()
    count a = count (a - 1)
The output is:
Case Bytes GCs Check
int count 0 0 0 OK
int count 1 0 0 OK
int count 10 0 0 OK
int count 100 0 0 OK
It allocates zero bytes. Why? Let's take a look at the Core:
Rec { Main.$wcount1 [InlPrag=[0], Occ=LoopBreaker] :: ghc-prim-0.4.0.0:GHC.Prim.Int# -> () [GblId, Arity=1, Caf=NoCafRefs, Str=DmdType <S,1*U>] Main.$wcount1 = \ (ww_s57C :: ghc-prim-0.4.0.0:GHC.Prim.Int#) -> case ww_s57C of ds_X4Gu { __DEFAULT -> Main.$wcount1 (ghc-prim-0.4.0.0:GHC.Prim.-# ds_X4Gu 1); 0 -> ghc-prim-0.4.0.0:GHC.Tuple.() } end Rec }
It's clear that GHC is able to optimize this tight loop, and unbox the
Int
into an Int#
, which can be put into a register rather than
being allocated by the GHC runtime allocator to be freed later.
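The same zero-allocation result extends to any tight loop whose accumulator is strict, assuming compilation with -O; a small sketch (sumTo is an illustrative name, not from the article):

```haskell
{-# LANGUAGE BangPatterns #-}

-- With a strict accumulator, GHC's worker/wrapper transformation
-- can keep both Ints unboxed in registers, so the loop allocates
-- nothing, just like count above. A lazy accumulator would instead
-- build up a chain of thunks on the heap.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go !acc i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)
```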
The lesson is not to take for granted that everything has a 1:1 memory mapping at runtime with your source, and to take each case in context.
Data structures
Finally, from our contrived examples we can take a look at user-defined data types and observe some of the optimizations that GHC does for memory.
Let's demonstrate that unpacking a data structure uses less
memory. Here is a data type that contains an Int:
data HasInt = HasInt !Int
  deriving (Generic)
instance NFData HasInt
Here are two data types which both wrap HasInt: the first stores it
as a regular field, and the second unpacks it.
data HasPacked = HasPacked HasInt
  deriving (Generic)
instance NFData HasPacked

data HasUnpacked = HasUnpacked {-# UNPACK #-} !HasInt
  deriving (Generic)
instance NFData HasUnpacked
We can measure the difference by weighing them like this:
-- | Weigh: packing vs no packing.
packing :: Weigh ()
packing = do
  func "\\x -> HasInt x" (\x -> HasInt x) 5
  func "\\x -> HasPacked (HasInt x)" (\x -> HasPacked (HasInt x)) 5
  func "\\x -> HasUnpacked (HasInt x)" (\x -> HasUnpacked (HasInt x)) 5
The output is:
\x -> HasInt x                  16  0  OK
\x -> HasPacked (HasInt x)      32  0  OK
\x -> HasUnpacked (HasInt x)    16  0  OK
Voilà! Here we've demonstrated that:
- HasInt x consists of the 8-byte header for the constructor, and 8 bytes for the Int.
- HasPacked has 8 bytes for the constructor, 8 bytes for the first slot (a pointer), then another 8 bytes for the HasInt constructor, and finally 8 bytes for the Int itself.
- HasUnpacked only allocates 8 bytes for the header, and 8 bytes for the Int.
GHC did what we wanted!
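As a sanity check, the layout arithmetic behind those numbers can be written out directly (a sketch assuming an 8-byte machine word; the names are illustrative):

```haskell
-- Back-of-the-envelope heap arithmetic, assuming a 64-bit word:
-- every heap object has a one-word header, plus one word per
-- constructor field. UNPACK merges the inner constructor's payload
-- into the parent object, eliminating its header and the pointer.
wordSize :: Int
wordSize = 8

hasIntBytes, hasPackedBytes, hasUnpackedBytes :: Int
hasIntBytes      = wordSize + wordSize               -- header + Int field          = 16
hasPackedBytes   = wordSize + wordSize + hasIntBytes -- header + pointer + HasInt   = 32
hasUnpackedBytes = wordSize + wordSize               -- header + unpacked Int#      = 16
```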
Summary
We've looked at:
- What led to this package.
- A proposal that we start measuring our functions more, especially in libraries.
- How to use this package.
- Some of our use-cases at FP Complete.
- Caveats.
- Some contrived examples that do not lead to obvious explanations.
Now I encourage you to try it out!