Benchmarking mir.ndslice + lubeck against numpy and Julia
Dennis
dkorpel at gmail.com
Sun Jan 12 12:46:44 UTC 2020
On Sunday, 12 January 2020 at 09:43:46 UTC, p.shkadzko wrote:
> Yes, technically we are benchmarking against C numpy and
> C/Fortran scipy through Python syntax layer. It should have
> been "C/Fortran".
No, I think the names in your comparison are fair.
You are not comparing the languages themselves, but the available
solutions for scientific computing in each language.
Let me focus on your 5000x1000 SVD example. The thing is, all
these languages / packages just use LAPACK for that (see
[1], [2], and [3]), so what you are actually comparing is:
- The default settings
- The LAPACK implementation that is used
- A bit of luck (your matrices are randomly created)
- The wrapping code (this is where the programming language
actually matters)
The first important default setting is the algorithm. LAPACK
offers 'gesdd', a 'more efficient divide-and-conquer approach',
and 'gesvd', a 'general rectangular approach' [4]. Python and
Julia default to gesdd, while D's Lubeck defaults to gesvd.
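For illustration, here is a minimal sketch of selecting the driver
with Lubeck (a toy matrix, not the benchmark; the call pattern
follows Lubeck's signature [3]):
```
import std, mir.ndslice, lubeck;

void main() {
    // Small 4x2 example matrix
    auto a = [3.0, 1, 1, 3, 2, 2, 0, 4].sliced(4, 2);
    // The LAPACK driver is the second template parameter
    // and defaults to "gesvd":
    auto general = a.svd!(No.allowDestroy, "gesvd")();
    auto divideAndConquer = a.svd!(No.allowDestroy, "gesdd")();
}
```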
I factored the matrix generation code out of the benchmark and
switched the D algorithm to 'gesdd'; the time went from ~17
seconds to ~6 on my PC. Comparing that with Python, I got these
times (in seconds) for some subsequent runs:
Python: 5.7, 6.3, 6.4, 5.3, 6.3
D (LDC release): 5.9, 5.8, 5.8, 5.9, 7.0, 5.7
D (DMD debug): 6.0, 6.5, 6.9, 6.7, 6.0
It can be seen that:
- The times vary between runs (possibly because of the randomized
matrices and other applications running on my PC)
- The scipy and Lubeck times are comparable
- Compiler settings are not that important here: the extra
compilation time costs more than the speed it gains (and when
doing scientific computing you recompile a lot).
The second important setting is whether to compute the full
matrices U and Vt.
When computing the SVD of a matrix A of size 5000x1000, the
factors U, S, and Vt have sizes 5000x5000, 5000x1000, and
1000x1000 respectively.
However, since S is just a diagonal matrix with 1000 entries and
the rest all zeros, you can choose to compute only the first 1000
columns of U, giving a 5000x1000 matrix. This obviously saves a
lot of work.
Julia does this by default, while Python and D compute the full U
by default.
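To illustrate on a toy matrix (a sketch; I am assuming here that
Lubeck's SvdResult exposes the factors as u, sigma and vt):
```
import std, mir.ndslice, lubeck;

void main() {
    auto a = [1.0, 2, 3, 4, 5, 6, 7, 8].sliced(4, 2);
    auto full = a.svd!(No.allowDestroy, "gesdd")(No.slim);
    auto slim = a.svd!(No.allowDestroy, "gesdd")(Yes.slim);
    // Full SVD: U is 4x4, even though sigma has only min(4, 2) = 2 entries
    writeln(full.u.shape); // [4, 4]
    // Slim SVD: only the 2 useful columns of U are computed
    writeln(slim.u.shape); // [4, 2]
}
```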
When not computing the full matrices in Python and D, I get:
D: roughly 1.7 s
Python: roughly 1.2 s
I don't have Julia, but I bet that it will be in the same order
of magnitude on my PC as well. The new code can be found below.
So in conclusion, Lubeck has some unfortunate default settings
for benchmarking, and even with comparable settings D can be a
little slower. This is either because of inferior wrapping code,
or because I am using a different LAPACK implementation. For D I
use the LAPACK implementation from OpenBLAS, which I compiled
myself a while ago; I don't know which one Python ships with.
In any case, while your benchmark is not representative of the
actual languages, it is indeed unfortunate when people try
something in D, find it significantly slower than other
languages, and someone on the forum has to point out all the
steps needed to get D's performance on par.
---
[1] https://github.com/scipy/scipy/blob/adc4f4f7bab120ccfab9383aba272954a0a12fb0/scipy/linalg/decomp_svd.py#L16-L139
[2] https://github.com/JuliaLang/julia/blob/aa35ee2d30065803410718378e5673c7f845da62/stdlib/LinearAlgebra/src/svd.jl#L93
[3] https://github.com/kaleidicassociates/lubeck/blob/72091ecdd545f9524a1d80e5880cda19845143d0/source/lubeck.d#L356
[4] https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.svd.html
New Python code:
```
import numpy as np
from scipy.linalg import svd
import timeit

h = 5000
w = 1000
# Random integer matrix, same shape as in the original benchmark
data = np.random.randint(0, 1024, w * h).reshape((h, w))

def svd_fun():
    # full_matrices=False computes the slim SVD,
    # 'gesdd' is the divide-and-conquer driver
    u, s, v = svd(data, overwrite_a=False, full_matrices=False,
                  lapack_driver='gesdd')

print(timeit.timeit(svd_fun, number=1))
```
New D code:
```
import std, mir.ndslice, lubeck;
import std.datetime.stopwatch : benchmark;

enum h = 5000;
enum w = 1000;

auto getMtx() {
    // Random integer matrix, same shape as in the original benchmark
    Xorshift rnd;
    rnd.seed(unpredictableSeed);
    auto matrix = generate(() => uniform(0, 1024, rnd))
        .take(w * h).array.sliced(h, w);
    return matrix;
}

auto svdFun(T)(T mtx) {
    // The algorithm can be "gesvd" (the default) or "gesdd"
    // (divide-and-conquer); Yes.slim computes U as h x w instead of h x h
    auto result = mtx.svd!(No.allowDestroy, "gesdd")(Yes.slim);
}

void main() {
    auto svdTime = benchmark!(() => getMtx.svdFun())(1);
    writeln(svdTime);
}
```
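For anyone who wants to reproduce this: the D version needs the
lubeck package from dub (which pulls in mir). A minimal dub.sdl
might look like this (the project name is made up, and any recent
lubeck version should do; you also need a LAPACK/BLAS
implementation such as OpenBLAS to link against):
```
name "svdbench"
dependency "lubeck" version="*"
```
Then 'dub run --build=release --compiler=ldc2' gives the LDC
release build, and a plain 'dub run' the DMD debug one.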