Standard D, Mir D benchmarks against Numpy (BLAS)

Jon Degenhardt jond at noreply.com
Sun Mar 15 20:15:07 UTC 2020


On Sunday, 15 March 2020 at 12:13:39 UTC, Pavel Shkadzko wrote:
> On Saturday, 14 March 2020 at 08:01:33 UTC, Jacob Carlborg 
> wrote:
>> On 2020-03-12 13:59, Pavel Shkadzko wrote:
>>> I have done several benchmarks against Numpy for various 2D 
>>> matrix operations. The purpose was mere curiosity and spread 
>>> the word about Mir D library among the office data engineers.
>>> Since I am not a D expert, I would be happy if someone could 
>>> take a second look and double check.
>>> 
>>> https://github.com/tastyminerals/mir_benchmarks
>>> 
>>> Compile and run the project via: dub run --compiler=ldc 
>>> --build=release
>>
>> Have you tried to compile with LTO (Link Time Optimization) 
>> and PGO (Profile Guided Optimization) enabled? You should also 
>> link with the versions of Phobos and druntime that has been 
>> compiled with LTO.
>
> If for LTO the dub.json dflags-ldc: ["-flto=full"] is enough 
> then it doesn't improve anything.

Try:
     "dflags-ldc" : ["-flto=thin", 
"-defaultlib=phobos2-ldc-lto,druntime-ldc-lto", "-singleobj" ]

The "-defaultlib=..." parameter engages LTO for phobos and 
druntime. You can also use "-flto=full" rather than "thin". I've 
had good results with "thin". Not sure if the "-singleobj" 
parameter helps.
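
For reference, dropped into a minimal dub.json those flags sit at 
the top level, roughly like this (the package name is just a 
placeholder):

    {
        "name": "myapp",
        "dflags-ldc": [
            "-flto=thin",
            "-defaultlib=phobos2-ldc-lto,druntime-ldc-lto",
            "-singleobj"
        ]
    }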

> For PGO, I am a bit confused how to use it with dub -- 
> dflags-ldc: ["-O3"]? It compiles but I see no difference. By 
> default, ldc2 should be using O2 -- good optimizations.

PGO (profile guided optimization) is a multi-step process. The 
first step is to create an instrumented build 
(-fprofile-instr-generate). The second step is to run the 
instrumented binary on a representative workload. The last step is 
to use the resulting profile data in the final build 
(-fprofile-instr-use).
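
As a rough sketch with ldc2 directly (file and program names here 
are just placeholders), the sequence looks something like:

    # Step 1: instrumented build
    ldc2 -O2 -release -fprofile-instr-generate=profile.raw -of=app-instr app.d

    # Step 2: run it on a representative workload, then merge the raw profile
    ./app-instr typical-input.txt
    ldc-profdata merge -output=app.profdata profile.raw

    # Step 3: final build using the collected profile
    ldc2 -O2 -release -fprofile-instr-use=app.profdata -of=app app.d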

For information on PGO see Johan Engelen's blog page: 
https://johanengelen.github.io/ldc/2016/07/15/Profile-Guided-Optimization-with-LDC.html

I have done studies on LTO and PGO and found both beneficial, 
often significantly. The largest gains came in tight loops that 
included code pulled in from libraries (e.g. phobos, druntime). It 
was hard to predict which code was going to benefit from LTO/PGO.

I've found it tricky to use dub for the full PGO process 
(creating the instrumented build, generating the profile data, 
and using it in the final build). Mostly I've used make for this. 
I did get it to work in a simple performance test app: 
https://github.com/jondegenhardt/dcat-perf. It doesn't document 
how the PGO steps work, but its dub.json file is relatively short 
and the repository README.md contains the build instructions for 
both LTO and LTO plus PGO.
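
If you want to try it with dub anyway, the rough shape is a pair 
of custom build types selected with --build= (an untested sketch, 
not necessarily how dcat-perf does it; the profile file names are 
placeholders):

    "buildTypes": {
        "pgo-gen": {
            "buildOptions": ["releaseMode", "optimize", "inline"],
            "dflags-ldc": ["-fprofile-instr-generate=profile.raw"]
        },
        "pgo-use": {
            "buildOptions": ["releaseMode", "optimize", "inline"],
            "dflags-ldc": ["-fprofile-instr-use=profile.data"]
        }
    }

Then dub build --build=pgo-gen, run the instrumented binary and 
merge the raw profile with ldc-profdata, and finish with dub build 
--build=pgo-use.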

--Jon

