std.math API rework

Ilya Yaroshenko via Digitalmars-d digitalmars-d at puremagic.com
Sun Oct 9 03:50:21 PDT 2016


On Friday, 7 October 2016 at 17:02:02 UTC, Andrei Alexandrescu 
wrote:
> On 10/07/2016 03:42 AM, Ilya Yaroshenko wrote:
>> For example, SUM_i of sqrt(fabs(a[i])) can be vectorised using
>> mir.ndslice.algorithm.
>> vxorps instruction can be used for fabs.
>> vsqrtps instruction can be used for sqrt.
>> LDC's @fastmath allows re-associating the summation elements.
>>
>> Depending on the data cache level, this allows speeding up 
>> iteration 8 times for single-precision floating point numbers 
>> with AVX (16 times for AVX512?).
>
> Yah, 8 times is large enough to justify an important change.
>
>> Current std.math has following problems:
>>
>> 1. Math functions are not templates -> Phobos should be linked.
>
> This is also the case for C++ - most math functions are linked 
> from the C standard library. How do typical linear algebra 
> libraries similar in functionality with Mir (such as Eigen) 
> deal with this situation?


1) A BLAS-like API requires only `sqrt` and `fabs`. The solutions 
used in Eigen depend on the compiler. For example, the following 
code can be found:

```c++
template<> EIGEN_DEVICE_FUNC inline float4 pabs<float4>(const float4& a) {
   return make_float4(fabsf(a.x), fabsf(a.y), fabsf(a.z), fabsf(a.w));
}
template<> EIGEN_DEVICE_FUNC inline double2 pabs<double2>(const double2& a) {
   return make_double2(fabs(a.x), fabs(a.y));
}
```

2) Eigen, uBLAS, and others use Expression Templates [1], which 
compose a few multiplications, additions/subtractions, and perhaps 
some per-element operations on matrices and vectors. At the same 
time, I have never seen a lambda being passed. C/C++ 
high-performance libraries use macros/templates for type 
specification, but lambdas are not used.

This makes the upcoming ndslice.algorithm a unique solution that is 
more flexible, fast, and universal compared with C++ Expression 
Templates. It still requires some rework, and an LDC based on DMD 
2.072 for further optimization.

> Also, one question is how does the existence of unused 
> functions impede the working of faster functions provided 
> separately? Is it a sticky point that std.math is the exact 
> module used?

Of course, a separate module or DUB package can be provided 
instead. In addition, std.math should be split into a package and 
reworked. So, instead of modifying std.math, we can start a new 
math package.

> Trying to get a good grip on the matter. Generally you'd have a 
> very easy time convincing me that templates are a better way to 
> go :o). But we need to have a good motivation. Do you have a 
> brief example illustrating one proposed template and how it is 
> better than the old ways?

Yes, the example can be found at [2].

First, templates are better for BetterC mode. The example contains 
a C program; the last paragraph of this post contains the second 
part of the example. The first part:

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

float mir_alg_bar(float, float, float);

int main(int argc, char const *argv[])
{
	if(argc < 4)
	{
		puts("Usage: app number_a number_b number_c");
		return 1;
	}

	float a = atof(argv[1]);
	float b = atof(argv[2]);
	float c = atof(argv[3]);

	float d = mir_alg_bar(a, b, c);
	printf("%f\n", d);
	return 0;
}
```

This program should be linked with the BetterC library:

```sh
clang app.c alg/libmir-alg.a
```

`mir-alg` is a small BetterC library, which uses a generic `mir` 
dummy (not the real Mir, for the sake of a simple example). It can 
be linked like a common C library and has an extern(C) nothrow 
@nogc interface.

```d
module alg_bar;

pragma(LDC_no_moduleinfo);

import ldc.attributes : fastmath;
import mir.alg;

extern(C) nothrow @nogc @fastmath:

float mir_alg_bar(float a, float b, float c) { return alg1!bar(a, b, c); }
```

Mir dummy contains 3 implementations `alg1`, `alg2`, `alg3`.

```d
module mir.alg;

import ldc.intrinsics : llvm_fabs;
import ldc.attributes : fastmath;

pragma(LDC_no_moduleinfo);

@fastmath
{
	auto alg1(alias f)(float a, float b, float c)
	{
		return f(a, llvm_fabs(b), c);
	}

	auto alg2(alias f)(float a, float b, float c)
	{
		return f(a, fabs(b), c);
	}

	auto alg3(alias f)(float a, float b, float c)
	{
		import std.math;
		return f(a, std.math.fabs(b), c);
	}
}

@fastmath
auto bar()(float a, float b, float c)
{
	return a * b + c;
}

float fabs(float x) @safe pure nothrow @nogc { return llvm_fabs(x); }
```

The `fabs` function declaration is the same as in LDC's Phobos fork.

`alg1` can be linked with a C program in any optimization mode. 
`alg2` and `alg3` use function declarations and require linking the 
`libmir` dummy or `libphobos2` respectively. Making `fabs` a 
template solves this problem.
LDC can inline `fabs` for `alg2` and `alg3`, but the `-O2` flag is 
required.

>>    1.a I strongly decided to move forward without DRuntime. 
>> Phobos as a source library is partially OK, but there should be 
>> no linking dependencies. BetterC mode is what is required for 
>> Mir to replace OpenBLAS and Eigen. A new cpuid, threads, and 
>> mutexes should be provided too. The new cpuid [1] is already 
>> implemented (I just need to replace the module constructor with 
>> an explicit initialization function).
>
> Do you think you can integrate the new cpuid implementation 
> with the existing interface (most likely greatly enhancing it) 
> without breaking the existing clients?

The new cpuid has a low-level and a high-level API. The high-level 
API will be reworked into an intermediate-level API without the 
module constructor. This is required for BetterC mode. The current 
DRuntime cpuid API can be implemented on top of the new cpuid's 
low-level interface. However, the current DRuntime API can not be 
used for Mir. The reasons are:
   1. It is not compatible with BetterC mode.
   2. It performs additional weird computations for cache-level 
sizes, which makes it very hard to predict what a returned value 
means. If an engineer asks for the Level 3 cache size, the Level 3 
cache size should be returned, instead of the current hell. See 
also Issue 16028 [3].
   3. It can not represent a complex CPU topology, which is required 
for ARM (especially for server ARM CPUs). CPU information is 
protected on ARM CPUs, but it can be predefined by a user or 
fetched from the OS.

> Same question for threads.
> Same question for mutexes.

The current DRuntime mutexes and threads can be implemented on top 
of their nothrow @nogc successors.

>> My strong opinion is that a D library only for D is the wrong 
>> direction. A numeric D library should be a product for other 
>> languages too, like many C libraries are. One of my clients is 
>> thinking of investing in nothrow @nogc async I/O for 
>> production, so it may help to move in the BetterC direction too.
>
> Sure. A different way to frame this is to make D friendlier 
> toward linking with other languages. The way I see it, if we 
> get alternatives for cpuid, threads, and mutexes in Mir, that 
> would benefit clients interested in linear algebra. If we get 
> them in DRuntime, that would benefit clients interested in 
> linear algebra and everything else. Clearly the impact would be 
> much larger.

We need DRuntime not for the future, but for existing users. A fat 
runtime (generic algorithms aside) is a red flag for software 
developers who need to create something like Eigen or a 
high-performance web server. The number of such libraries is always 
small. At the same time, these libraries make the weather, and a 
lot of packages will be built on top of them after a while. DUB 
allows overriding dependency versions in dub.selections.json. This 
is what is required for continuous development.

Assume you manage a set of integrated infrastructure projects, 
which use a set of third-party DUB projects, which depend on 
DRuntime. Part of this infrastructure is open source, and 
consultancy for clients is the main income. Now you want to add 
support for a modern CPU, or a new system API for yet another 
appleOS. A release has time constraints, so you can not wait for a 
new compiler release. Clients also want new features and backward 
compatibility with older compilers at the same time. Plus, testing 
a complex infrastructure with a compiler fork requires additional 
effort and time, and clients would not be happy to deploy your 
compiler fork into their infrastructure and for their clients. In 
addition, you would need to update the forked DRuntime API usage in 
the third-party projects. This is a stalemate situation and a red 
flag for business.

cpuid, threads, mutexes, an event loop, async I/O, and numeric 
software as low-level DUB packages with good community support and 
a short release cycle are what we really need. I am not against a 
high-level API. Furthermore, bindings to other languages are an 
option to provide a simple and familiar API for users. But a 
low-level API is required.

Users do not care about the `std`/`core` or any other prefix; they 
want good support. Business requires reliability and flexibility. 
Bugs are not a huge problem if the architecture allows finding and 
fixing them. The really huge problem is a high-level, 
object-oriented, GC-oriented, x86-oriented DRuntime that is a 
dependency almost everywhere. I would like to see `std.glas` 
instead of `mir.glas`, but it should be provided as a common DUB 
project.

>>   2.b In the context of 1.a, linking multiple binaries compiled 
>> with different DRuntime/Phobos versions may cause significant 
>> problems. DRuntime is not as stable as the C standard library. 
>> One may say that I am doing something wrong if I need to link 
>> libraries compiled with different DRuntimes. But this is what 
>> will happen often with D in the real world if D starts to 
>> replace C libraries (1.a). So, BetterC without DRuntime/Phobos 
>> linking dependencies is the direction to move forward. nothrow 
>> @nogc generic Phobos code seems to be OK.
>
> Hmmm... well I seem to recall the C std lib in gcc has large 
> interoperability issues with its own previous versions, even 
> across minor releases. This has caused numerous headaches at 
> Facebook because the breakages always come without warning and 
> manifest themselves in obscure ways. On the Microsoft side 
> things are even worse, because they virtually guarantee that a 
> version of VS is not binary compatible with the previous ones 
> (I'm not kidding; it's deliberate).
>
> That sets a rather low baseline for us :o). Clearly we'd want 
> to do better, and we probably can. But I think it would be an 
> exaggeration to worry too much about such scenarios.
>
>> 2. Math functions are not templates -> They are not inlined -> 
>> No vectorization + function calls in a loop body. One day this 
>> may be fixed, but (1.a, 1.b).
>
> How do the likes of Eigen do it? Do they provide their own 
> templated implementation of <math.h>?

It seems the recent LDC fixes this problem. Many thanks to our LDC 
team!
Eigen's code is very weird: it uses templates and macros at the 
same time, with specializations for different compilers and C 
libraries, including Intel MKL.

> Have you investigated the much hailed link-time inlining?

This probably would not work for loop vectorization.

>> 3. Math functions are not aliases for LDC -> LDC's @fastmath 
>> would not work for them. To enable @fastmath for these 
>> functions, they should be annotated with @fastmath, which is 
>> not acceptable. If a function is an alias for an LLVM 
>> intrinsic, then the @fastmath flag can be applied to a function 
>> which calls it.
>
> Not sure I understand this, but it seems to me making the math 
> functions templates would solve it?

Yes. For `version(LDC)`, the templates can be replaced with aliases 
to the intrinsics.
For the example above, with all optimizations turned on, only 
`alg1`, which calls `llvm_fabs` directly, gets fused operations. 
The reason is that fma composition happens at the end of the LLVM 
optimization pipeline. If an inlined function (`bar`) and the root 
function (`alg2`) have `@fastmath` but another inlined function 
(`fabs`) does not, the code for the root will have fma, but the 
inlined code for _both_ functions will not. To perform other 
optimizations like vectorization, LLVM needs to decompose an fma 
and recompose it later.

Best regards,
Ilya

[1] https://en.wikipedia.org/wiki/Expression_templates
[2] 
https://github.com/libmir/temporary_experiments/tree/master/alias_vs_fun
[3] https://issues.dlang.org/show_bug.cgi?id=16028
