"BOLT" post-link optimizer gives 15% speed boost to Clang

Johan j at j.nl
Wed Oct 24 10:26:48 UTC 2018


On Wednesday, 24 October 2018 at 01:57:38 UTC, Walter Bright 
wrote:
> On 10/23/2018 3:07 PM, Johan Engelen wrote:
>> Hi all,
>>    "Post-link optimization", what? Yes, indeed, optimization 
>> _after_ the _linker_ has generated a _binary_.
>> Read about this interesting work and the discussion here: 
>> https://lists.llvm.org/pipermail/llvm-dev/2018-October/126985.html
>> 
>> When applied to clang, the performance gain is 15% faster 
>> execution (note, that's the result of applying BOLT on a clang 
>> binary that was already built with PGO and LTO !)
>> 
>> Cheers,
>>    Johan
>> 
>
> Digital Mars C++ had this in the 1990s.
>
> https://www.digitalmars.com/ctg/trace.html
>
> How it works is -gt switch is thrown to generate a runtime 
> profile of the graph of how functions call each other. Then, a 
> module definition file 
> https://www.digitalmars.com/ctg/ctgDefFiles.html is generated 
> that directs the linker to order the layout of functions in the 
> executable to minimize cache misses.
>
> A 15% speedup is about right for this sort of optimization.
>
> Note that Digital Mars didn't invent this, it was invented by 
> (I forgot who) who productized it as "The Segmentor".
>
> From 
> https://github.com/facebookincubator/BOLT/blob/master/docs/OptimizingClang.md :
>
> "BOLT (Binary Optimization and Layout Tool) is designed to 
> improve the application performance by laying out code in a 
> manner that helps CPU better utilize its caching and branch 
> predicting resources.
>
> "Before we can run BOLT optimizations, we need to collect the 
> profile for Clang,
>
> Yup, it's reinvention of the same thing.

Sorry, but such statements show a large lack of insight in what 
is going on here. This isn't about who invented something first, 
the basic idea is of course very old. DMC++ most certainly didn't 
and doesn't do what is being discussed.
I suggest you reread about it without your preconceptions.

BOLT does not use compiler-inserted instrumentation for execution 
tracing (which btw is more than just function call tracing), 
that's what (usually) PGO and LTO use. Instead, the profile that 
BOLT uses is a sampling-based profile obtained e.g. using `perf` 
and hardware counters. The code layout optimization performed is 
not just putting functions in a better order, but moreso the 
reordering of basic blocks inside functions. PGO and LTO already 
do that too (and can also work with sampling based profiling but 
have trouble using all profile data information inside 
functions). Because the sampling profile data is more accurate 
than compiler-inserted instrumentation added before many 
optimization transformations, and because the optimization 
happens on machinecode level (meaning that more of the sampling 
profile data can be used well) substantial additional gains are 
obtained by post-link optimization on top of PGO+LTO.

- Johan



More information about the Digitalmars-d mailing list