Reimplementing software building blocks like malloc and free in D
Mike Franklin
slavo5150 at yahoo.com
Sun Aug 12 08:34:46 UTC 2018
On Sunday, 12 August 2018 at 07:00:30 UTC, Eugene Wissner wrote:
>> Also what about other implementations like memset or memcmp.
>> Have they been implemented or require work from scratch?
>
> These functions are still mostly implemented in asm, so I'm not
> sure there is an "idiomatic D" way. I would only wrap them into
> a function working with slices and checking the length. Mike?
Inline ASM is a feature of D, so "idiomatic D" includes assembly
implementations. Where D shines here is with it's
metaprogramming capabilities and the ability to select an
implementation at compile-time or compose implementations. See
https://github.com/JinShil/memcpyD/blob/master/memcpyd.d There
you will that the compiler selects an implementation based on
size in powers of 2. Sizes that aren't powers of 2 can be
composed of the powers of 2 implementations. For example a 6
byte implementation would be comprised of a 4-byte implementation
plus a the 2 byte implementation.
I have more code than what's currently check into that
repository, but I stalled because I need AVX512 to do an
optimized implementation, and that doesn't seem to be a feature
of DMD at the moment. That was one reason I posted
https://wiki.dlang.org/SAOC_2018_ideas#Implement_AVX2_and_AVX-512_in_DMD.2FDRuntime on the wiki.
Agner Fog had this to say in
https://agner.org/optimize/optimizing_assembly.pdf
---
There are several ways of moving large blocks of data. The most
common methods are:
1. REP MOVS instruction.
2. If data are aligned: Read and write in a loop with the largest
available register size.
3. If size is constant: inline move instructions.
165
4. If data are misaligned: First move as many bytes as required
to make the destination aligned. Then read unaligned and write
aligned in a loop with the largest available register size.
5. If data are misaligned: Read aligned, shift to compensate for
misalignment and write aligned.
6. If the data size is too big for caching, use non-temporal
writes to bypass the cache. Shift to compensate for misalignment,
if necessary.
As you can see, it can be very difficult to choose the optimal
method in a given situation. The best advice I can give for a
universal memcpy function, based on my testing, is as follows:
* On Intel Wolfdale and earlier, Intel Atom, AMD K8 and earlier,
and VIA Nano, use the aligned read - shift - aligned write method
(5).
* On Intel Nehalem and later, method (4) is up to 33% faster than
method (5).
166
* On AMD K10 and later and Bobcat, use the unaligned read -
aligned write method (4).
* The non-temporal write method (6) can be used for data blocks
bigger than half the size of the largest-level cache.
---
I think D is well suited for an implementation that takes all
these things into consideration, selecting an appropriate
implementation, and composing complex implementations of the
simpler implementations.
Mike
More information about the Digitalmars-d
mailing list