Differences in results when using the same function in CTFE and Runtime

Timon Gehr timon.gehr at gmx.ch
Sun Aug 18 12:57:41 UTC 2024


On 8/17/24 18:33, Quirin Schroll wrote:
> The normal use case for floating-point isn't perfectly reproducible
> results between different optimization levels.

I would imagine the vast majority of FLOPs nowadays are used in HPC and 
AI workloads. Reproducibility is at least a plus, particularly in a 
research context.

> However, differences between CTFE and RT are indeed unacceptable for core-language operations. Those are bugs.

No, they are not bugs; it's just the same kind of badly designed 
specification. According to the specification, you can get differences 
between RT and RT when running the exact same function, so of course 
you will also get differences between CTFE and RT.
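To make that concrete, here is a minimal sketch (the kernel and the 
constants are arbitrary, picked purely for illustration): `enum` forces 
CTFE, so the two results below come from compile-time and run-time 
evaluation of the same function. Whether they actually differ depends 
on the compiler, flags, and target, which is exactly the complaint.

```d
import std.stdio;

// Arbitrary pure floating-point computation, for illustration only.
float kernel(float x)
{
    return x * (1.0f / 3.0f) + 0.1f;
}

void main()
{
    enum float ctfeResult = kernel(1.5f); // forced compile-time (CTFE) evaluation
    float rtResult = kernel(1.5f);        // ordinary run-time evaluation
    writefln("CTFE: %.9g  RT: %.9g  equal: %s",
             ctfeResult, rtResult, ctfeResult == rtResult);
}
```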

> The reason for that is probably because Walter didn't like that other 
> languages nailed down floating-point operations

Probably. C famously nails down floating-point operations, just like it 
nails down all the other types. D is really well-known for all of its 
unportable built-in data types, because Walter really does not like 
nailing things down and this is not one of D's selling points. /s

Anyway, at least LDC is sane on this at runtime by default. Otherwise I 
would have to switch language for use cases involving floating point, 
which would probably just make me abandon D in the long run.

> so that you'd get both less precise results *and* worse performance.

Imagine just manually using the data type that is most suitable for your 
use case.

> That would for example be 
> the case on an 80387 coprocessor, and (here's where my knowledge ends) 

Then your knowledge may be rather out of date. I get the x87 
shenanigans, but that's just not very relevant anymore. I am not 
targeting 32-bit x86 with anything nowadays.

> probably also true for basically all hardware today if you consider 
> `float` specifically. I know of no hardware that supports single 
> precision but not double precision. Giving you double precision instead 
> of single is at least basically free and possibly even a performance 
> boost, while also giving you more precision.

It's nonsense. If I want double, I ask for double. Also, it's definitely 
not true that going to double instead of single precision will boost 
your performance on a modern machine. If you are lucky it will not slow 
you down, but if the code can be auto-vectorized (or you are vectorizing 
manually), you are looking at a slowdown of at least 2x, because each 
SIMD register holds half as many doubles as floats.
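A sketch of the kind of micro-benchmark that shows this (names and 
sizes made up; the point is the lane width, not the exact numbers). 
Note the irony that vectorizing the reduction itself already requires 
giving the compiler reassociation licenses, i.e. fast-math-style flags:

```d
import std.datetime.stopwatch : benchmark;
import std.stdio;

// Naive reduction. A vectorized version with 256-bit SIMD registers
// processes 8 floats but only 4 doubles per instruction.
F sum(F)(const F[] data)
{
    F acc = 0;
    foreach (x; data)
        acc += x;
    return acc;
}

void main()
{
    auto f = new float[](1 << 24);
    auto d = new double[](1 << 24);
    f[] = 1.0f;
    d[] = 1.0;

    // Store the results so the compiler cannot discard the work.
    float fr;
    double dr;
    auto times = benchmark!(() { fr = sum(f); }, () { dr = sum(d); })(10);
    writeln("float:  ", times[0], " (", fr, ")");
    writeln("double: ", times[1], " (", dr, ")");
}
```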

> 
> An algorithm like Kahan summation must be implemented in a way that takes those optimizations into account.

I.e., do not try to implement this at all with the built-in 
floating-point types. It's impossible: an optimizer that is allowed to 
reassociate your floating-point operations will simply cancel the 
compensation term away.
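For reference, textbook Kahan summation looks like this; the comments 
mark exactly the step that value-changing optimization is licensed to 
destroy:

```d
// Textbook compensated (Kahan) summation. An optimizer that may treat
// floating-point arithmetic as associative can simplify
// c = ((sum + y) - sum) - y to c = 0 and delete the compensation,
// silently degrading this to naive summation.
double kahanSum(const double[] data)
{
    double sum = 0.0;
    double c = 0.0;              // running compensation for lost low-order bits
    foreach (x; data)
    {
        immutable y = x - c;     // apply the correction to the next term
        immutable t = sum + y;   // low-order bits of y are lost here...
        c = (t - sum) - y;       // ...and recovered here; algebraically zero
        sum = t;
    }
    return sum;
}
```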

> This is exactly like in C++, signed integer overflow is undefined, not because it's undefined on the hardware, but because it allows for optimizations.

If you have to resort to invoking insane C++ precedent in order to 
defend a point, you have lost the debate. Anyway, it is not at all the 
same thing (triggered only on actual overflow vs. triggered by default; 
undefined behavior vs. a wrong result), and besides, in D signed 
overflow is actually defined behavior.
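(A minimal check of that last claim; assuming a current D compiler, 
this prints `true`:)

```d
import std.stdio;

void main()
{
    int x = int.max;
    x += 1;                  // defined in D: two's-complement wraparound
    writeln(x == int.min);   // prints true; the same operation is UB in C++
}
```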

> D could easily add specific functions to `core.math` that specify operations as specifically IEEE-754 conforming. Using those, Phobos could give you types that are specified to produce results as specified by IEEE-754, with no interference by the optimizer.

It does not do that, though. Anyway, I would expect such types to go 
into std.numeric.

> You can't actually do the reverse, i.e. provide a type in Phobos that allows for optimizations of that sort but the core-language types are guaranteed to be unoptimized.

You say "unoptimized", I hear "not broken".

Anyway, clearly the default should be the variant with fewer pitfalls. If 
you really want to add some sort of flexible-precision data type, why 
not, but there should be a compiler flag to disable it.

> Such a type would have to be compiler-recognized, i.e. it would end up being a built-in type. 

I have no desire at all to suffer from irreproducible behavior because 
some dependency tried to max out on some benchmark that is irrelevant to 
me. I also have no desire to suffer an unnecessary performance penalty 
just to recover reproducible behavior that the hardware exposes 
directly.

Of course, there's still the issue that libc math functions are not 
correctly rounded and differ between implementations, but at least there 
seems to be some movement on that front, and it is easy to work around 
given that the built-in operations are sane.

