[GSoC] 'Independency of D from the C Standard Library' progress and update thread

Sat Jul 6 16:10:28 UTC 2019

On Saturday, 6 July 2019 at 15:33:44 UTC, Piotrek wrote:
>
>
> I used the old repo for Dmemset. With Dmemutils it works now. I 
> removed static foreach from benchmark.d in order to run gdc.
> Text results:
> https://github.com/PiotrekDlang/Dmemutils/tree/master/Dmemset/output
>

Great. Earlier today I realized that there were problems with 
static foreach, so the main repo now uses only mixin.

Basically, I should have been able to do:
version (GNU)
{
     // mixin
}
else
{
     static foreach
}

but that didn't work: GDC still tried to compile the static 
foreach.
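
One possible explanation (an assumption, not something verified in this thread): D requires the code inside an unselected version block to still be syntactically valid, so a front end that does not know `static foreach` rejects it at parse time even under `else`. A minimal sketch of a workaround is to hide the construct inside a string mixin, which is only evaluated when its branch is selected. The names `runTest`, `genCalls`, `runAll` and `total` are hypothetical and not taken from the Dmemutils repo.

```d
int total; // sum of the sizes "tested"; only here so the sketch is checkable

// Hypothetical per-size test; a real benchmark would time Dmemset here.
void runTest(int n)()
{
    total += n;
}

// Builds "runTest!1(); runTest!2(); ..." at compile time (CTFE).
string genCalls(const int[] sizes)
{
    import std.conv : to;
    string code;
    foreach (n; sizes)
        code ~= "runTest!" ~ n.to!string ~ "();\n";
    return code;
}

version (GNU)
{
    // Plain string mixin: no static foreach anywhere in the source text.
    void runAll()
    {
        mixin(genCalls([1, 2, 4]));
    }
}
else
{
    // The token string is only turned into code when this branch is
    // selected, so the other compiler's parser never meets
    // `static foreach`.
    mixin(q{
        void runAll()
        {
            static foreach (n; [1, 2, 4])
                runTest!n();
        }
    });
}
```

Either branch produces the same three calls, so the string mixin alone would also do for every compiler.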

Anyway, the benchmarks look good. With DMD, small sizes are not so 
good but the big ones are better. DMD is no longer the focus, 
though, since it has shifted to GDC and LDC.

If you're interested, there are a lot of things to say regarding 
optimization for DMD. Some have been said in this thread as 
initially the project was focused on DMD. I'm actually thinking 
of writing an article so that maybe I can help the next guy that 
tries to optimize for DMD. I don't think it's a good decision to 
care at all about optimization in DMD, but one might. And it's 
a hard road.
The tl;dr is that, for me at least, the only way to reach parity 
with libc is to use (inline) ASM.

But the important benchmarks are the GDC and LDC ones. They agree 
with my benchmarks on AMD, and the result is that Dmemset reaches 
total parity with libc memset().
It's great to have that confirmed by an Intel user as well, thanks 
for your time!

>
> It seems it wasn't related to this change. Looks like heisen 
> optimization.
>

Again, DMD. Quite an unpredictable compiler.

>
> Funnily enough, DMD (with Dmemset) holds the speed record, over 
> 50 GB/s, copying some big block sizes.
>

DMD might have been able to get these results
due to inlining that was unrelated to the actual function (i.e. 
the benchmark code got inlined).
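
As a rough sketch of how a benchmark can guard against this, the function under test can be marked `pragma(inline, false)` so the timing loop has to make a real call. `fillBytes` and `measure` are hypothetical names for illustration, not the code in benchmark.d, and how strictly a given compiler honors the pragma is not something checked here.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;

// Hypothetical stand-in for the function under test.
// The pragma asks the compiler not to inline it into the timing loop.
pragma(inline, false)
void fillBytes(ubyte* dst, ubyte value, size_t n)
{
    foreach (i; 0 .. n)
        dst[i] = value;
}

// Returns the throughput in GB/s for `iterations` fills of `buf`.
double measure(ubyte[] buf, size_t iterations)
{
    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. iterations)
        fillBytes(buf.ptr, cast(ubyte) i, buf.length);
    sw.stop();
    immutable secs = sw.peek.total!"nsecs" / 1e9;
    return buf.length * iterations / secs / 1e9;
}
```

Reading the buffer after the loop (for example, checking its first byte) also gives the stores an observable effect, which makes it harder for the optimizer to drop them.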

>
> However, aren't smaller sizes more important?
>

Again, fortunately DMD is not the focus. But I guess one way to 
somewhat answer this question is to report the sizes actually 
used in the D runtime, since this work targets the D runtime.
Something like this: 
https://forum.dlang.org/post/jdfiqpronazgglrkmwfq@forum.dlang.org

But this is not enough. A big part of optimization is knowing the 
most common cases (in terms of data format, size, hardware, etc.) 
and optimizing for those first. And such a report is not adequate 
to show us the most common cases:

- For one, eventually different sizes might be added or removed 
and so the
common cases might change.
- Someone might want to use this function outside of the D 
runtime.

So, Dmemset() should be on par with or better than libc, which is 
(currently) achieved.

Note something interesting: GDC gets these results with the naive 
version. This version is literally an 8-line for loop.
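
For readers who have not looked at the repo, a naive version of this kind might look roughly like the sketch below. It is not the actual Dmemset source; the name `naiveMemset` and the signature are assumptions. An optimizing back end may vectorize a loop like this, or even recognize it as a memset pattern, which could explain why it does so well.

```d
// Hypothetical naive byte-fill; not the project's implementation.
void naiveMemset(void* dst, ubyte value, size_t n) @nogc nothrow
{
    auto p = cast(ubyte*) dst;
    // One store per byte; any speedup is left to the optimizer.
    foreach (i; 0 .. n)
        p[i] = value;
}
```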

>
> One issue is it should be tested on all variation of HW and OS.
> At least it can be placed in experimental module.

Right, it's currently PR'd to the D runtime: 
https://github.com/dlang/druntime/pull/2662
Just like you said, in an experimental module. :P

Best regards,
Stefanos