By Beyang Liu for the GopherCon Liveblog on August 28, 2018
Presenter: Eben Freeman
Liveblogger: Beyang Liu
A whirlwind tour of the Go memory allocator and garbage collector, with tools and tips on how to optimize.
Eben Freeman (@_emfree_), an engineer at Honeycomb, talks about pitfalls and optimizations related to the memory allocator in Go.
Go is a managed-memory language, which means most of the time you don't have to worry about manual memory management, because the runtime does a lot of work for you. However, dynamic memory allocation is not free, and a program's allocation patterns can substantially affect its performance. The performance implications, however, are not obvious from language syntax. This talk covers techniques and tools that you can use to detect and address allocator bottlenecks.
The architecture of the storage / query service at Honeycomb:
The design/performance goals are:
Using the techniques described in this talk, they were able to obtain speedups of up to 2.5x on some key benchmarks:
Depending on their lifetimes, objects are allocated either on the goroutine's stack or on the heap. The Go compiler decides which objects go where, but the general rule is that objects that outlive their containing function calls need to go on the heap.
So what actually is in the heap? How did it get there? And who's keeping track of it all?
To answer these questions, we should think about what design goals the authors of the heap memory allocator probably had in mind:
Goal: Efficiently satisfy allocations of a given size, but avoid fragmentation
Goal: Avoid locking in the common case
Goal: Efficiently reclaim freeable memory
The heap is organized into arenas and spans; a global mheap struct keeps track of both:
type mheap struct {
	arenas  [1 << 22]*heapArena // Covering map of arena frames
	free    []mSpanList         // Lists of fully free spans
	central []mcentral          // Lists of in-use spans
	// plus much more
}
Arenas are coarse sections of the available address space (64MB on 64-bit archs). We allocate OS memory in units of arenas. For each arena, there's metadata in a heapArena struct:
type mheap struct {
	arenas [1 << 22]*heapArena // Covering map of arena frames
	// ...
}

type heapArena struct {
	// page-granularity map to spans
	spans [pagesPerArena]*mspan
	// pointer/scalar bitmap (2 bits/word)
	bitmap [heapArenaBitmapBytes]byte
}
Most heap objects live in a span: a run of pages containing objects of a fixed size class:
type span struct {
	startAddr uintptr
	npages    uintptr
	spanclass spanClass
	// allocated/free bitmap
	allocBits *gcBits
	// ...
}
There are ~70 different span size classes.
Each P has an mcache holding a span of each size class. Ideally, allocations can be satisfied directly out of the mcache (so they're fast). In Go, a P is a scheduling context; generally there's one P per core and at most one goroutine running per P at a time:
To allocate, we find the first free object in our cached mspan, then return its address:
Let's say we need 96 bytes. First we'd look in the mcache for the mspan with 96-byte objects. After allocation, the picture would look like this:
This design means that "most" memory allocations are fast and require no locking! There are just 3 quick steps:
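Roughly, the fast path looks like the sketch below. The types, fields, and the bump-the-index logic are simplified stand-ins, not the real runtime API (the runtime finds the next free slot by scanning the span's allocation bitmap), but the shape is the same: no locks involved.

type mspan struct {
	startAddr uintptr // address of the span's first byte
	elemSize  uintptr // size of each object slot in this span
	freeIndex uintptr // next slot to hand out
	nelems    uintptr // total slots in the span
}

type mcache struct {
	spans [70]*mspan // roughly one cached span per size class
}

func (c *mcache) alloc(sizeClass int) uintptr {
	s := c.spans[sizeClass]
	if s.freeIndex == s.nelems {
		// Slow path: refill this size class from the central lists (takes a lock); omitted here.
	}
	addr := s.startAddr + s.freeIndex*s.elemSize
	s.freeIndex++
	return addr
}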
So now we have covered 2 of our 3 design goals for the allocator: (1) Efficiently satisfy allocations of a given size, but avoid fragmentation, (2) Avoid locking in the common case. But what about (3) Efficiently reclaim memory? In other words, garbage collection?
We have to find and reclaim objects once they're no longer referenced.
Go uses a tricolor concurrent mark-sweep garbage collector, which sounds intimidating, but isn't. "Mark-sweep" means the garbage collector is divided into a mark phase, in which reachable objects are marked as live, and a sweep phase, in which unreachable memory is actually freed.
In a tricolor garbage collector, objects are marked white, grey, or black. Initially, all objects are white. We start by marking goroutine stacks and globals. When we reach an object, we mark it grey:
When an object's referents are all marked, we mark it black:
At the end, objects are either white or black:
White objects can then be swept and freed.
Simple, right? Well, it leaves some questions open:
Go uses bitmaps for such metadata.
Say we have a struct type like:
type Row struct {
	index int
	data  []interface{}
}
In memory, that looks like:
How does the garbage collector know what other objects it points to? I.e., which of its fields are pointers?
Remember that this heap object is actually inside an arena. The arena's bitmap tells us which of its words are pointers:
Similarly, mark state is kept in a span's gcMark bits:
Once we're done marking, unmarked bits correspond to free slots, so we can just swap that mark state into the alloc bits:
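Conceptually, per span, that swap is just the following sketch (not the actual runtime code; gcmarkBits and newMarkBits stand in for the span's mark bitmap and for however a fresh, zeroed bitmap gets allocated):

func sweepSpan(s *span) {
	s.allocBits = s.gcmarkBits           // whatever was marked is still allocated
	s.gcmarkBits = newMarkBits(s.npages) // fresh zeroed bitmap for the next cycle
}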
This means sweeping in Go is generally super fast (while marking adds more overhead and is what we have to be concerned about).
The garbage collector is concurrent... with a twist. If we're not careful with the implementation, the application can do sneaky stuff to thwart the garbage collector. For example, take the following code:
type S struct {
	p *int
}

func f(s *S) *int {
	r := s.p
	s.p = nil
	return r
}
After the first line of the function executes, we could end up with a state like this:
And at the return statement, the GC state could look like this:
And then the return value would be a live pointer to memory that the garbage collector could free!
To prevent this from occurring, the compiler translates pointer writes into potential calls into the write barrier. Very roughly:
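Here's a minimal illustrative sketch, with stand-in names rather than the actual runtime functions:

// When the barrier is enabled, a pointer write like `s.p = r` is compiled
// into something like a call to writePointer instead of a plain store.
var writeBarrierEnabled bool // stand-in for the runtime's writeBarrier.enabled flag

func shade(p *int) {
	// Stand-in: the real barrier marks the object grey so the GC will
	// (re)visit it instead of losing track of it mid-mark.
}

func writePointer(slot **int, ptr *int) {
	if writeBarrierEnabled {
		shade(*slot) // keep the old referent visible to the marker
		shade(ptr)   // conservatively mark the new referent too
	}
	*slot = ptr
}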
There are many ingenious optimizations in the Go compiler that let this process happen concurrently, but we won't get into those here.
We can, however, begin to understand how marking might be a problem as far as processor overhead is concerned.
When GC is marking, the write barrier is turned on.
During marking, 25% of GOMAXPROCS are dedicated to background marking. But additional goroutines can be forced into mark assist:
Why the need for mark assist? A rapidly allocating goroutine can outrun the background markers. So when allocating during marking, a goroutine gets charged for each allocation. If it's in debt, it has to do mark work before continuing:
func mallocgc(size uintptr, ...) unsafe.Pointer {
	// ...
	assistG.gcAssistBytes -= int64(size)
	if assistG.gcAssistBytes < 0 {
		// This goroutine is in debt. Assist the GC to
		// correct this before allocating. This must
		// happen before disabling preemption.
		gcAssistAlloc(assistG)
	}
	// ...
}
We've learned quite a bit about the runtime:
It seems like dynamic memory allocation has some cost. Does this mean that reducing allocations will always improve performance? Well, it depends. The builtin memory profiler can tell us where we're allocating, but can't answer the causal question, "Will reducing allocations make a difference?" To do that, we can use three other tools:
If you want to see how much overhead the allocator and garbage collector add, you can turn them off with the following runtime flags:
- GOGC=off: disables the garbage collector
- GODEBUG=sbrk=1: replaces the entire allocator with a simple persistent allocator that gets big blocks from the OS and hands out successive slices of them as you allocate

This might seem kind of stupid, but it's a cheap way to establish a rough estimate of possible speedup potential. Maybe a 30% speedup isn't worthwhile -- or maybe it's a compelling opportunity!
Problems with this approach:
However, it is a worthwhile first check.
A pprof CPU profile can often show time spent in runtime.mallocgc.
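To collect one from a running service, here's a minimal setup sketch (assuming it's acceptable to serve pprof on localhost:6060, the same address the trace command below uses):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers, including profile and trace
)

func main() {
	go func() {
		// Serve the pprof endpoints on a side port.
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... the rest of the service ...
	select {} // placeholder to keep this example alive
}

With that in place, go tool pprof http://localhost:6060/debug/pprof/profile fetches a CPU profile.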
Tips:
- perf works too
This gives us line-level attribution of where we're doing CPU-expensive allocations, and can help clue us into the underlying cause of GC/allocator overhead.
Problems with this approach:
The execution tracer might be the best tool at our disposal to understand the impact of allocating in granular detail.
The execution tracer captures very granular runtime events over a short time window:
curl localhost:6060/debug/pprof/trace?seconds=5 > trace.out
You can visualize the output in a web UI:
go tool trace trace.out
...though it can be a little dense:
The top-level blue bar shows when GC is running, and the lower bars show what's happening internally.
Remember, top-level GC doesn't mean the program is blocked, but what happens within GC is interesting!
The minimum mutator utilization curve can also show you if a service is GC-bound at different quartiles. ("Mutator" here means "not GC".)
Together, benchmarks with the allocator off, CPU profiles, and execution traces give us a sense of:
If we've concluded that allocations are a source of inefficiency, what can we do? A few high-level strategies: avoid spurious allocations, keep hot structs free of inner pointers, batch small allocations together, and recycle memory (e.g., with sync.Pool).

Okay, now onto things we can change in code to improve allocation/GC behavior...
The Go compiler can be enticed to tell you why a variable is heap-allocated:
go build -gcflags="-m -m"
but its output is a bit unwieldy.
https://github.com/loov/view-annotated-file helps digest it:
Sometimes, spurious heap allocations are easily avoided!
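For instance (a hypothetical example, not one from the talk), -m -m will report a value that escapes only because its address is returned; returning the value instead avoids the heap allocation:

// The compiler reports that s is moved to the heap here, because its
// address outlives the call.
func sum(nums []int) *int {
	s := 0
	for _, n := range nums {
		s += n
	}
	return &s // s escapes to the heap
}

// Returning the value instead keeps s on the stack and avoids the allocation.
func sumValue(nums []int) int {
	s := 0
	for _, n := range nums {
		s += n
	}
	return s
}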
In addition to avoiding spurious allocations, avoiding structs with inner pointers helps the garbage collector:
^ Why this discrepancy? Because the sneaky time.Time struct conceals a nefarious pointer:
type Time struct {
	wall uint64
	ext  int64
	loc  *Location
}
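One way around this, assuming Unix-nanosecond timestamps are precise enough for the hot path (an assumption of this sketch, with an illustrative Record type, not something prescribed by the talk), is to store an int64 instead of a time.Time so the struct stays pointer-free and the GC never has to scan it:

import "time"

// Record keeps its timestamp as Unix nanoseconds rather than a time.Time,
// so the struct contains no pointers for the garbage collector to chase.
type Record struct {
	TimestampNanos int64
	Value          float64
}

// Timestamp converts back to a time.Time at the edges, where convenience
// matters more than scan cost.
func (r Record) Timestamp() time.Time {
	return time.Unix(0, r.TimestampNanos)
}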
Even though the fast path in the allocator is very optimized, we still need to do some work on every allocation:
That ends up being ~20ns per allocation, if you need a back-of-the-envelope. In some cases, we can amortize that overhead by doing fewer, bigger allocs:
// Allocate individual interface{} slices out of one big buffer, amortizing
// allocator overhead across many small requests (assumes size <= blockSize).
const blockSize = 4096 // slab size in elements; tune to the workload

type SlicePool struct {
	bigBuf []interface{}
}

func (s *SlicePool) GetSlice(size int) []interface{} {
	if size >= len(s.bigBuf) {
		// The current slab is exhausted: grab a fresh one.
		s.bigBuf = make([]interface{}, blockSize)
	}
	res := s.bigBuf[:size]
	s.bigBuf = s.bigBuf[size:]
	return res
}
The danger is that any live reference will keep the whole slab alive:
Also, these aren't safe for concurrent use. They're best for a few heavily-allocating goroutines.
Consider the following storage engine architecture:
This architecture
Optimization: explicitly recycle allocated blocks of memory:
var buf RowBuffer
select {
case buf = <-recycled: // reuse a previously allocated buffer if one is waiting
default:
	buf = NewRowBuffer() // otherwise pay for a fresh allocation
}
A more sophisticated version would use sync.Pool, which maintains per-P caches of recycled objects to avoid lock contention (and which is emptied on every garbage collection).
Danger: must be very careful to zero or overwrite recycled memory.
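A minimal sketch of that approach (RowBuffer is the type from the example above; Reset is a hypothetical method that clears a buffer for reuse):

import "sync"

// rowBufferPool hands out recycled buffers; New is only called when the pool is empty.
var rowBufferPool = sync.Pool{
	// Pool pointers, not values: putting a value into the interface would itself allocate.
	New: func() interface{} { return new(RowBuffer) },
}

func getRowBuffer() *RowBuffer {
	buf := rowBufferPool.Get().(*RowBuffer)
	buf.Reset() // zero/overwrite before reuse so stale data can't leak through
	return buf
}

func putRowBuffer(buf *RowBuffer) {
	rowBufferPool.Put(buf)
}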
Below is a chart of different benchmarks after successive rounds of optimization (your mileage may vary):
In summary,
Special credit to Ian Wilkes and many others at Honeycomb: the true masterminds
Suggested further reading:
"Allocation efficiency in high-performance Go services" by Achille Roussel and Rick Branson: segment.com/blog/allocation-efficiency-in-high-performance-go-services/
"Go 1.5 concurrent garbage collector pacing" by Austin Clements: golang.org/s/go15gcpacing
"So You Wanna Go Fast" by Tyler Treat: bravenewgeek.com/so-you-wanna-go-fast/