Reimplementing software building blocks like malloc and free in D

Sun Aug 12 08:34:46 UTC 2018

On Sunday, 12 August 2018 at 07:00:30 UTC, Eugene Wissner wrote:

>> Also what about other implementations like memset or memcmp. 
>> Have they been implemented or require work from scratch?
>
> These functions are still mostly implemented in asm, so I'm not 
> sure there is an "idiomatic D" way. I would only wrap them into 
> a function working with slices and checking the length. Mike?

Inline ASM is a feature of D, so "idiomatic D" includes assembly 
implementations.  Where D shines here is with it's 
metaprogramming capabilities and the ability to select an 
implementation at compile-time or compose implementations.  See 
https://github.com/JinShil/memcpyD/blob/master/memcpyd.d  There 
you will that the compiler selects an implementation based on 
size in powers of 2.  Sizes that aren't powers of 2 can be 
composed of the powers of 2 implementations.  For example a 6 
byte implementation would be comprised of a 4-byte implementation 
plus a the 2 byte implementation.

I have more code than what's currently check into that 
repository, but I stalled because I need AVX512 to do an 
optimized implementation, and that doesn't seem to be a feature 
of DMD at the moment.  That was one reason I posted 
https://wiki.dlang.org/SAOC_2018_ideas#Implement_AVX2_and_AVX-512_in_DMD.2FDRuntime on the wiki.

Agner Fog had this to say in 
https://agner.org/optimize/optimizing_assembly.pdf

---
There are several ways of moving large blocks of data. The most 
common methods are:
1. REP MOVS instruction.
2. If data are aligned: Read and write in a loop with the largest 
available register size.
3. If size is constant: inline move instructions.
165
4. If data are misaligned: First move as many bytes as required 
to make the destination aligned. Then read unaligned and write 
aligned in a loop with the largest available register size.
5. If data are misaligned: Read aligned, shift to compensate for 
misalignment and write aligned.
6. If the data size is too big for caching, use non-temporal 
writes to bypass the cache. Shift to compensate for misalignment, 
if necessary.

As you can see, it can be very difficult to choose the optimal 
method in a given situation. The best advice I can give for a 
universal memcpy function, based on my testing, is as follows:
* On Intel Wolfdale and earlier, Intel Atom, AMD K8 and earlier, 
and VIA Nano, use the aligned read - shift - aligned write method 
(5).
* On Intel Nehalem and later, method (4) is up to 33% faster than 
method (5).
166
* On AMD K10 and later and Bobcat, use the unaligned read - 
aligned write method (4).
* The non-temporal write method (6) can be used for data blocks 
bigger than half the size of the largest-level cache.
---

I think D is well suited for an implementation that takes all 
these things into consideration, selecting an appropriate 
implementation, and composing complex implementations of the 
simpler implementations.

Mike