|
What Challenges and Trade-Offs do Optimising Compilers Face?
June 22 2017
After immersing oneself in a research area for several years, it's easy to
feel like the heir to a way of looking at the world that is criminally underrated: why
don't more people want what I want? This can be a useful motivation, but
it can also lead to ignoring the changing context in which most research sits.
So, from time to time, I like to ask myself the
question: “what if everything I believe about this area is wrong?” It’s a simple
question, but hard to answer honestly. Mostly, of course, the only answer my ego
will allow me is “I can’t see anything I could be wrong about”. Sometimes I spot
a small problem that can be easily fixed. Much more rarely, I've been
forced to acknowledge that there’s a deep flaw in my thinking, and that the
direction I’m heading in is a dead-end. I’ve twice had to face this realisation,
each time after a gradual accumulation of unpleasant evidence that I've tried to
ignore. While abandoning several years work wasn’t easy, it felt a lot easier than trying
to flog a horse I now knew to be dead.
All of this means that I’m probably a little better disposed to contrarian
thinking than I was when I started as a researcher. And so it is that I’m going to write
about some thoughts that were initially inspired by a (somewhat controversial) talk by Daniel Bernstein. The talk
itself, entitled “Death of Optimizing Compilers” (slides
and audio), was presented at ETAPS 2015 in
London. I happened to be present for the talk and have engaged in mild advocacy
both immediately afterwards and in the two years since. The problem is that the case against
is quite easy to state but the case for rather hard — or, at least, I
haven’t managed to frame an argument good enough to convince anyone. But I
hope I might now have sharpened my argument enough to make another attempt
worthwhile.
Suitably condensed, the talk’s thesis is fairly simple. As software systems become
bigger and bigger, an ever smaller percentage of code dominates execution time ,
especially as computers are asked to crunch ever bigger
chunks of data. However, optimising compilers (e.g. gcc -O2 or even a
VM like HotSpot) are outclassed by humans on the most performance
critical code. Because of that, optimising compilers are operating
in a region between a problem that’s not worth solving (most parts of most programs)
and a problem they can’t solve (the really performance critical parts). It
doesn’t take much thought to realise that this last point has been the source of
controversy: how can someone belittle the huge amounts of effort put into
optimising compilers?
Daniel used his own (very cunning)
ChaCha20
algorithm as an example of where a human can significantly outdo anything a
compiler can. ChaCha20 is effective because the
algorithm was invented with the strengths of modern processors in mind:
its design and implementation are intricately intertwined
in a way that no compiler could ever expect to match.
It may not surprise you that I don’t agree with Daniel’s conclusion. However,
sometimes looking at things from an extreme angle can help bring forgotten
aspects into focus. So, while I’m not claiming to have any novel insights, here
is my attempt to enumerate some often-forgotten (at least by me) issues which
limit the effectiveness of optimising compilers.
Balancing program performance and compilation time
We all know that there's a trade-off between optimisation and
compilation time: the more optimised you want your program to be, the longer it
takes to compile. At one extreme, language implementations like CPython and CRuby
have low compilation costs and little
or no optimisation costs, but have
consequently poor performance. At the other extreme, superoptimisation
tries vast numbers of machine-code combinations before picking the fastest. This means
that a tiny change to even the tiniest function can take many, many hours to optimise:
scaling this up to even “small” functions could be a challenge.
Slightly less extreme are C compilers like gcc and clang. These put considerable
effort into optimising programs, but generally compile even medium sized files in at most a
second or two.
Would developers be happy if we suddenly increased compilation times by, say,
a factor of 10 if it lead to a 5% performance increase? Probably not. One possible
solution is to add a “really good optimisation but slow to compile” mode to
compilers. However, precedent suggests that such a mode will be little used. For
example, gcc’s highest optimisation level -O3 is often treated with
a mix of fear and disdain: partly because it often doesn't improve performance
, but also because it's thought to more often produce
buggy code. Whether this is still a justifiable position, whether it's
because of compiler bugs, or whether it's due to user bugs
that only happen to show up when highly optimised is almost irrelevant: if no-one is
prepared to use such a mode, it might as well not be there. OpenBSD, long my operating
system of choice, uses -O2 exclusively for both the base operating
system and packaged third party software. Another possibility is profile-guided
optimisation, which I'll touch on later.
We don't know which parts of a program to optimise
In my experience, one of the easiest ways of distinguishing a seasoned
programmer from those early in their career is their approach to “manual”
optimisation. When presented with a slow running program, junior programmers
tend to quickly identify what they think is the bottleneck, and start hacking away at it.
Seasoned developers, in contrast, tend to have realised from bitter experience
that their initial guesses about bottlenecks are nearly always wrong, and first
profile performance.
Sometimes, even standard profiling doesn’t help. My standard approach is to
profile a program, identify the longest running function(s), and then stare at it
to see if I can improve it. This works in many cases, but fundamentally profiling can only
highlight symptoms, not prove causes. For example, if I look at the longest running
function and it’s a bubble sort, I can quickly change it to a faster sorting
algorithm. But if it’s a Timsort
function, there probably isn’t much I can
obviously do to fix it. The problem could be elsewhere (e.g. more than once I’ve
seen sort functions used as part of a de-duplicator function, which is a slow
way of solving that particular problem) and may be very subtle (e.g. bad data structures can cause much
more memory to be allocated than necessary). Occasionally, though rarely,
I can work out a likely culprit, but judge that the necessary rewrite is too
big to justify the gain. But, in most cases, I simply can’t work out if there is
anything which would benefit from further optimisation or not.
We can now ask a simple question: if it’s difficult for a human to look at
the runtime performance of a specific program and work out what should be
optimised, how difficult is it for a compiler? The answer, I fear, is
that the compiler has, by far, the more difficult job. Static
compilers implicitly make a guess about how a program will execute – without having
the luxury of actually executing it – and optimise based on that guess. In some cases, we can make
accurate guesses (e.g. we can work out statically how long the machine code sequence
for 3 * 4 will take versus 12 ) but in most we can’t
(e.g. will a branch that represents a loop never be taken, taken occasionally,
or taken often?). The latter quickly overwhelm the former, meaning that the
compiler can have no real idea of which parts of a program are likely to be
bottlenecks.
The only safe choice the compiler can thus make is to try
and optimise all parts of a program, though this has two issues. First, the trade-off between
compilation time and optimisation rears its head again. Since some optimisations
are either expensive or rarely applicable, we can’t afford to apply them
everywhere, especially as they would mostly have no useful runtime effect.
Second, optimisations are heavily order dependent. Imagine I have three
optimisations A, B, and C: applying A
then B might produce code that C can't do anything with;
but applying B then A might produce code that C
can optimise. In this simple case we could try both possibilities and,
if we're lucky, work out which outcome is better and use that. In a real
compiler, with a very large number of optimisations, combinatorial
explosion is real and this is impractical. This means that even though
our compilers sometimes contain all the optimisations they need to produce
high-performing code, they don't run them in the right order (or can't run
them for long enough) to optimise all parts of a program.
In practice, therefore, trying to optimise all parts of a program means
doing a so-so job on all parts of a program. There are two techniques
which try and identify the sub-parts of a program that would most benefit
from heavy optimisation.
The first is
profile-guided optimisation: in essence, one statically compiles a program,
profiles an execution or two, then feeds that back to another run of the static
optimiser, which uses that information to hone in on the parts of a program
which will most benefit from optimisation. Doing this can allow a compiler
to run its optimisers for longer on smaller parts of a program, rather than
wasting its effort trying to optimise the entire program. This can yield useful
performance benefits (see e.g. the 10% improvements that Chrome
and Firefox
reported) but it’s difficult to set-up, compilation becomes even slower, and it only helps if
all subsequent runs follow the same general pattern as the profiled
run(s). I suspect the latter point is the most critical: if people use
your program in different ways from run to run, profile-guided optimisation can even
degrade performance.
The second is dynamic compilation (i.e. JIT compilation), which can
be thought of as live, continuous profile-guided optimisation: every time a
program runs, dynamic compilation observes its behaviour, and compiles based on
that behaviour. Thus dynamic compilation has the advantage over static compilation that it can much
more accurately identify the genuine bottlenecks in a program. However, dynamic compilation has to be even more
mindful of the costs of optimisation, since time spent dynamically compiling is
bad for the user (draining their battery or simply taking potential CPU time
away from the user). Thus while dynamic compilation is the most effective way of
determining which parts of a program will benefit most from optimisation, it has
much less leeway to use expensive optimisations than static compilation.
Finally, whether static or dynamic, all compilers contain a finite number of
optimisations: they may not contain the optimisation that our program needs to
run fast. There is a small-ish set of optimisations which it's safe to assume
that every optimising compiler (static or dynamic) contains. After that, things
become, in my experience, a lot more reactive. If a user shows a compiler
developer a program which runs unexpectedly slow, the compiler developer will
probably add, or tweak, an optimisation to make it perform better. Over time,
therefore, different compilers tend to do well on an unpredictable subset of all
possible programs. You can often see this in the real world where different
optimising compilers for the same language perform much better or worse on a
given program. Is there anything we can do about this? I don't think so. After
all, we can't (or, at least, probably shouldn't) optimise what we haven't seen:
benchmarks are, to a large degree, destiny.
We don't know where the performance ceiling is
One of the problems that optimising compiler authors face is that we don’t
really know if we’re doing a good job or not. If optimisation makes a given program 50%
faster that might be really good, or it might be somewhat mediocre: since we generally
don’t know what the best possible performance (what I call the “performance ceiling”) of
a program is, we have no real idea how far away we are from it. The closest we
come to a “performance oracle” is when there are multiple compilers for a given
language: if another compiler makes a program run faster than our compiler, we
know we have work to do.
Given that our current static compilers seem to be well within the
“diminishing returns on further optimisations” category, it might appear that
we're close to the performance ceiling. But, for all we know, we could be stuck
in a local maxima. Are there any ways that we might be able to tell?
Superoptimisation, mentioned above, suggests that there are potentially many
more performance gains to be had, although it's only been applied on small
functions. Dynamic compilation is the other obvious possibility: since much more
research has been put into static compilers, we don't really know if current
comparisons of static and dynamic compilation are the end of the story or not.
Maintaining implicit properties
As Daniel pointed out, compilers
have a narrow view of optimisation: they’ll optimise x
+ 3 * 4 into x + 12 , but I don’t know of any that will turn a bubble sort
function into a merge sort. In part this is because recognising that a particular
piece of code implements bubble sort is hard (think of all the minor, obfuscating,
variations). But, more significantly, it’s
because compilers are generally unable to prove that such a major change
won’t violate properties of the program explicitly intended by the user.
For example, what happens if the user’s
bubble sort is so coded because they know that the program needs to run in constant
memory and they’re prepared to trade-off execution speed for memory? That property
is implicitly encoded in the bubble sort, and the compiler has no way of
knowing if it was explicitly intended or not.
Real programs contain countless such examples of implicit properties: indeed,
most of us, when programming, don’t even consider how much we expect the optimised
version of our program to reflect the program as we wrote it.
Partly in recognition of these implicit properties, compilers
are often rather conservative in their optimisations: for example, with the
notable exception of inlining, they are generally reluctant to perform
optimisations that silently increase memory usage. Because of this, a human
who can reason about the importance of such properties, and disregard them
if they are not important, always has a potential edge over a compiler.
Over-prioritising compilation has consequences
Experience suggests that over-prioritising optimisation can lead to
problems in other areas, notably correctness and security. The most obvious,
though by no means the only, example of this is C. Most widely known are the
problems with undefined behaviour. To cut a long story short, the C
specification says, more often than most other languages, “if you do X, the
behaviour is undefined”, at which point the program can do absolutely anything,
whether you intended it to or not.
The problem is that the vast majority of
such cases cannot be detected statically: that is, it’s often impossible to know
if your program could exhibit undefined behaviour or not. Have you remembered,
for example, to add a runtime check that the divisor of every non-constant division is
non-zero? That one’s well known enough that people will tend to check it. But
what happens when signed integer addition overflows? C gives no guarantees about the
outcome, and most programs have so many additions that we struggle to reason
about which ones we might need to be careful about. You can read more examples
here.
It goes without saying that a large number of unfortunate
bugs (including many security alerts) have occurred because of undefined
behaviour. However, the problem has become much worse in recent years as C
compilers optimise programs based on the assumption that no undefined behaviour
exists. In other words, compilers are now more strictly interpreting the standards
which, ironically, means that they have become less conservative in their
optimisations. No doubt a few benchmarks increase in speed because of this, but
my guess is that it makes only a very small difference to most program's performance.
I used to
think that I could reliably program in C, but the exploitation of “minor”
undefined behaviour has undermined my confidence. To add insult to injury, not
only does C have far more undefined behaviour than other languages, making it
more likely that one will forget some of the less common cases, but the C
standard isn't even freely available (though one can get drafts, which are
“close” to the final standard). This has led to deep unhappiness on both sides:
programmers tend to say “please don’t violate widely held expectations even if
the standard doesn’t mention those expectations”; compiler writers have a
tendency to say “we’re following the standard, and are making your programs faster: if
there’s a problem in your program, you should fix it.” If there’s any
single factor which will hasten the
decline of C as a widely used language, I suspect this is it.
C compilers are the most obvious example of this problem, but not the only ones.
There are at least a couple of cases where,
depending on the day of the week, I can convince myself either way. A good
example of this is Python's memory semantics. In theory, Python long specified
that programmers should not rely on the deterministic destruction of objects.
However, since the only implementation of Python was CPython, which is reference
counted – and since most people don't read, or remember, the language
specification – many programs came to rely in subtle ways on deterministic
destruction (e.g. implicitly expecting that file descriptors would be freed at
the end of a function). When PyPy came along, it
soon came to favour non-deterministic destruction. Doing so allowed PyPy to have
a high performing state-of-the-art garbage collector, but it did break a number
of programs (e.g. many ran out of file descriptors). Is that a good trade-off?
Could, or should, PyPy have used some mix of reference counting and “proper”
garbage collection? Since this is pretty much the only intentional
“incompatibility” between CPython and PyPy it feels just about reasonable to me,
although I'm sure some people feel differently.
Summary
It's important to be realistic: most people don't care about program performance most of the
time. Modern computers are so fast that most programs run fast enough even with
very slow language implementations. In that sense, I agree with Daniel's
premise: optimising compilers are often unimportant. But “often” is
often unsatisfying, as it is here. Users find themselves transitioning from not caring at all
about performance to suddenly really caring, often in the space of a
single day.
This, to me, is where optimising compilers come into their own: they
mean that even fewer people need care about program performance. And I don't
mean that they get us from, say, 98 to 99 people out of 100 not needing to care:
it's probably more like going from 80 to 99 people out of 100 not needing to
care. This is, I suspect, more significant than it seems: it means that many
people can go through an entire career without worrying about performance.
Martin Berger reminded me of A
N Whitehead’s wonderful line that “civilization advances by extending the number
of important operations which we can perform without thinking about them” and
this seems a classic example of that at work. Even better, optimising
compilers are widely tested and thus generally much more reliable than
the equivalent optimisations performed manually.
But I think that those of us who work on optimising compilers need to be honest with
ourselves, and with users, about what performance improvement one can expect to
see on a typical program. We have a tendency to pick the maximum possible
improvement and talk about it as if it's the mean, when there's often a
huge difference between the two.
There are many good reasons for that gap, and I hope in this blog post I've at least
made you think about some of the challenges and trade-offs that optimising
compilers are subject to.
Acknowledgements: My thanks to Edd Barrett, Martin Berger, Carl
Friedrich Bolz, Lukas Diekmann, and Sarah Mount for substantial and insightful
comments on early drafts. Any errors, omissions, or infelicities, are my own.
Footnotes
[1] |
Most readers will be familiar with Knuth’s quip that “premature optimisation is
the root of all evil.” However, I doubt that any of us have any real idea what
proportion of time is spent in the average part of the average program. In
such cases, I tend to assume that Pareto’s principle
won't be far too wrong (i.e. that 80% of execution time is spent in 20% of code).
In
1971 a study by Knuth and others of Fortran programs, found that 50% of execution time was spent in 4% of code. I don't know
of modern equivalents of this study, and for them to be truly useful, they'd
have to be rather big. If anyone knows of something along these lines,
please let me know!
|
[2] |
Daniel finished with an idea as to how this limitation could be overcome:
instead of compilers being a black box, programs should be able to interact with
them to help the compiler optimise them. While there is some work in this
general direction (see e.g. Haskell’s
approach), I think that pushing it as far as Daniel would like is hard.
Certainly, I’ve done enough work in macros and compile-time meta-programming to
have some idea of how difficult it is to find a good design for interactions
between a program and compiler. |
[3] |
Though it is important to note that both do compile to bytecode
internally. The oft-stated distinction between “interpreted” and “compiled”
languages is a red-herring: any language can be interpreted or compiled (and
some language implementations interpret compiled code).
|
[4] |
Interestingly, modern languages seem much happier with lengthier compile times.
rustc, which I use a lot, is not fast, even with optimisations turned off; and
scalac, when I last used it, was even slower for small files (though
since, I think, it doesn't do full program optimisation statically, I
imagine it's faster in some senses than rustc at compile-time). I can't
say why this is for sure, although I do have a potentially
plausible explanation.
Take gcc as a concrete example: its first release was in 1987. At that point,
I believe that the Cray-2
was the second fastest machine on the planet operating at 2 GFLOPS (at something
like £25,000,000 in today's money); for reference, my 2016 i7-6700K desktop
processor (£301) is something like 80-115 GFLOPS. Using GFLOPS as a measurement
of speed of anything is dangerous, particularly when it comes to compilers
(which do few, if any, floating point operations). But the general trend is clear:
a modern commodity processor is a couple of orders magnitude faster than a 30
year old supercomputer. It's safe to assume that compiling programs took quite a
bit longer in wall-clock time in 1987 than today, and that therefore people would have put quite
a lot of effort (consciously and unconsciously) into making compilers run acceptably
fast.
Fast forward to today, and the optimisations that made a compiler run not-too-slow
in the late 1980s make it blazingly fast today. In contrast, modern language designers and
compiler implementers are happy with compile times that are, relatively, much
slower but still, in absolute terms, just about acceptable (perhaps around a couple of
seconds for a medium-large file, as a rough rule of thumb; which, I would
guess, is probably faster in wall-clock time than a C compiler in 1987).
As a ridiculously simple comparison, compiling "hello world" on my laptop
(median value over 10 runs – I'm not pretending this is a paper-strength
methodology) takes 0.037s for gcc, 0.173s for Go, and 0.287s for Rust.
|
[5] |
In the context of LLVM/clang it appears that -O3
optimises no better than -O2 (at least in 2013 when that paper was
written).
|
[6] |
Abstract
interpretation is an attempt to do this. It's got real uses in bug
finding, but I'm not sure that it would do very well at performance prediction.
|
[7] |
Jamey Sharp has written an
approachable overview of this problem and some of the work in the area.
|
[8] |
My experience suggests that this is a more common issue in dynamic optimising
compilers, which in some sense have more opportunities – or, at least,
more information – to optimise, but also more constraints (chiefly time).
However, this may partly be a function of the richness of modern languages,
which are more likely to be dynamically compiled and optimised.
|
[9] |
My gut instinct is that many programs in modern languages such as Haskell would
run faster if they were dynamically (rather than statically) compiled. The basic
reason is simple. From a runtime perspective, the pattern matching that is
common in such languages is just dynamic typing in disguise (the fact that the
pattern matching statically enforces exhaustiveness is of no use beyond the type
checker). Thus the static compiler is either going to decline to inline the
branches of pattern matches, or it'll guess (often wrongly) that one branch is
more commonly taken, or (less likely) it'll try and inline all branches. Since
inlining makes pretty much every other optimisation much more effective, this
inability to statically optimise must cause performance issues. However, I
should be clear: I currently have no evidence for this other than my gut instinct,
which I suspect might not stand up in a court of justice.
|
[10] |
There have been several attempts to define a “safe” subset of C (e.g. this) that is immune to the
most egregious examples of undefined behaviour, but it's not clear that any of
these will gain sufficient momentum to become widely accepted. That's a pity,
but such is life.
|
Follow me on Twitter
|
|
|