"BOLT" post-link optimizer gives 15% speed boost to Clang
Johan
j at j.nl
Wed Oct 24 10:26:48 UTC 2018
On Wednesday, 24 October 2018 at 01:57:38 UTC, Walter Bright
wrote:
> On 10/23/2018 3:07 PM, Johan Engelen wrote:
>> Hi all,
>> "Post-link optimization", what? Yes, indeed, optimization
>> _after_ the _linker_ has generated a _binary_.
>> Read about this interesting work and the discussion here:
>> https://lists.llvm.org/pipermail/llvm-dev/2018-October/126985.html
>>
>> When applied to clang, the performance gain is 15% faster
>> execution (note, that's the result of applying BOLT on a clang
>> binary that was already built with PGO and LTO!)
>>
>> Cheers,
>> Johan
>>
>
> Digital Mars C++ had this in the 1990s.
>
> https://www.digitalmars.com/ctg/trace.html
>
> How it works is the -gt switch is thrown to generate a runtime
> profile of the graph of how functions call each other. Then, a
> module definition file
> https://www.digitalmars.com/ctg/ctgDefFiles.html is generated
> that directs the linker to order the layout of functions in the
> executable to minimize cache misses.
>
> A 15% speedup is about right for this sort of optimization.
>
> Note that Digital Mars didn't invent this; it was invented by
> (I forgot who), who productized it as "The Segmentor".
>
> From
> https://github.com/facebookincubator/BOLT/blob/master/docs/OptimizingClang.md :
>
> "BOLT (Binary Optimization and Layout Tool) is designed to
> improve the application performance by laying out code in a
> manner that helps CPU better utilize its caching and branch
> predicting resources.
>
> "Before we can run BOLT optimizations, we need to collect the
> profile for Clang,
>
> Yup, it's reinvention of the same thing.
Sorry, but such statements show a large lack of insight into what
is going on here. This isn't about who invented something first;
the basic idea is of course very old. DMC++ most certainly didn't,
and doesn't, do what is being discussed.
I suggest you reread it without your preconceptions.
BOLT does not use compiler-inserted instrumentation for execution
tracing (which, by the way, is more than just function call
tracing); that is what PGO and LTO (usually) use. Instead, the
profile that BOLT uses is a sampling-based profile, obtained e.g.
with `perf` and hardware counters. The code layout optimization
performed is not just putting functions in a better order, but,
more importantly, the reordering of basic blocks inside
functions. PGO and LTO already do that too (and can also work
with sampling-based profiles, but have trouble using all of the
profile information inside functions). Because the sampling
profile is more accurate than data from compiler-inserted
instrumentation, which is added before many optimization
transformations, and because the optimization happens at the
machine-code level (meaning more of the sampling profile data can
actually be put to use), post-link optimization yields
substantial additional gains on top of PGO+LTO.
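
To make that concrete, the workflow sketched in the BOLT docs
looks roughly like this (commands and flag names from memory, so
treat it as a sketch and see the OptimizingClang.md page linked
above for the exact invocations):

  # 1. Run the already-linked binary under a representative
  #    workload, sampling with hardware counters and LBR records.
  perf record -e cycles:u -j any,u -o perf.data -- ./clang <workload>

  # 2. Convert the raw perf samples into BOLT's profile format.
  perf2bolt ./clang -p perf.data -o clang.fdata

  # 3. Rewrite the binary: reorder basic blocks and functions
  #    based on the sampled profile, producing a new executable.
  llvm-bolt ./clang -o clang.bolt -data=clang.fdata \
    -reorder-blocks=cache+ -reorder-functions=hfsort+

Note that no recompilation is involved: BOLT works directly on
the linked ELF binary, which is exactly why it can be stacked on
top of a PGO+LTO build.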
- Johan