Array fill performance differences between for, foreach, slice

Wed Apr 1 16:33:11 UTC 2020

On 4/1/20 11:23 AM, data pulverizer wrote:
> Thanks for all the suggestions made so far. I am still interested in 
> looking at the implementation details of the slice assign `arr[] = x` 
> which I can't seem to find. Before my initial post, I tried `memcpy` and 
> `memmove` under a `for` loop, but neither matched the performance of the 
> slice assignment, so I didn't bother to mention them. I haven't tried 
> them with the suggested compiler flags, though.

Looking at the disassembly on run.dlang.io, the slice assignment calls 
__memsetDouble.

> 
> @StevenSchveighoffer also suggested using `memset` (as well as `memcpy`). 
> Please correct me if I am wrong, but it looks as if `memset` can only 
> write from an `int` sized source, and I need the source size to be any 
> potential size (T).

Again, the compiler uses whatever tools are available. It might be 
memset, it might be something else.

In the case of your code, it's using __memsetDouble. I haven't tracked 
down where that is defined (probably druntime's rt/memset.d rather than 
libc).
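
For what it's worth, slice assignment is not tied to any particular 
element size, so you don't need to call memset yourself to get a fill for 
an arbitrary T. A minimal sketch (the `Pair` struct and the `fillSlice` 
helper are made up for illustration; what `arr[] = x` is lowered to 
depends on the compiler, the flags, and T):

```d
import std.stdio : writeln;

// Hypothetical element type, just to show a T that is not int-sized.
struct Pair
{
    double a;
    int b;
}

// Hypothetical helper: fill any slice with one value via slice assignment.
void fillSlice(T)(T[] arr, T x)
{
    arr[] = x; // the compiler picks the implementation for each T
}

void main()
{
    auto d = new double[4];
    fillSlice(d, 1.5);

    auto p = new Pair[3];
    fillSlice(p, Pair(2.0, 7));

    writeln(d);
    writeln(p);
}
```

Compiling something this small and looking at the disassembly is the 
easiest way to see which helper, if any, gets called for a given T.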

> ----------------------------------------------------------------------
> 
> On a related aside I noticed that the timing was reduced across the 
> board so much so that the initial slice time halved when initialising with:
> 
> ```
> auto arr = (cast(T*)GC.malloc(T.sizeof*n, GC.BlkAttr.NO_SCAN | GC.BlkAttr.APPENDABLE))[0..n];
> ```
> 
> Instead of:
> 
> ```
> auto arr = new T[n];
> ```

What NO_SCAN means is: don't scan the block for pointers during a GC 
collect cycle. If you have pointers in your T, this is a very bad idea. 
Not only that, but GC.malloc does not initialize the appendable data at 
the end of the block, so the APPENDABLE bit is set on a block that lacks 
the bookkeeping it promises.
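
If you do want to allocate this way, one option is to let the type decide 
whether NO_SCAN is safe. A minimal sketch, assuming a made-up helper name 
`allocRaw`; it leans on `std.traits.hasIndirections` and deliberately 
leaves out APPENDABLE:

```d
import core.memory : GC;
import std.traits : hasIndirections;

// Hypothetical helper: allocate room for n T's from the GC without
// running any initializer. NO_SCAN is set only when T holds no pointers.
T[] allocRaw(T)(size_t n)
{
    static if (hasIndirections!T)
        enum uint attr = 0;                  // block must be scanned
    else
        enum uint attr = GC.BlkAttr.NO_SCAN; // safe to skip scanning

    return (cast(T*) GC.malloc(T.sizeof * n, attr))[0 .. n];
}

void main()
{
    auto a = allocRaw!double(1000); // no pointers: NO_SCAN
    a[] = 0.0;                      // the contents are yours to initialize

    auto b = allocRaw!(int*)(1000); // elements are pointers: scanned
    b[] = null;
}
```

The returned slice has no valid append bookkeeping, so treat it as a 
fixed-size buffer rather than something to grow with `~=`.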

If you are going to use this, remove the GC.BlkAttr.APPENDABLE.

In addition, GC.malloc does not run T's initializer over the data (it 
does not even promise zeroed memory; GC.calloc is the zeroing variant). 
If you do new T[n], every element is set to T.init, and when T.init is 
not all zero bits that is going to be a lot more expensive. In the case 
of double, each element is initialized to NaN.

This could explain the difference in timing.
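
To make the initialization difference concrete, here is a small sketch 
(the sizes and fill values are arbitrary) contrasting the two allocation 
paths for double:

```d
import core.memory : GC;
import std.math : isNaN;

void main()
{
    enum n = 8;

    // new T[n] writes T.init into every element; double.init is NaN.
    auto a = new double[n];
    assert(isNaN(a[0]) && isNaN(a[n - 1]));

    // GC.malloc only hands back a block; no T.init pass happens here.
    auto b = (cast(double*) GC.malloc(double.sizeof * n,
                                      GC.BlkAttr.NO_SCAN))[0 .. n];
    b[] = 1.0; // first and only write to the elements
    assert(b[0] == 1.0 && b[n - 1] == 1.0);
}
```

In a fill benchmark the `new` version therefore touches the memory twice 
(once for NaN, once for the fill), while the GC.malloc version touches it 
once.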

> 
> I noticed that `GC.malloc()` is based on `gc_malloc()` which gives the 
> bit mask option that makes it faster than `core.stdc.stdlib: malloc`. 
> Is `gc_malloc` OS dependent? I can't find it in the standard C library, 
> the only reference I found for it is 
> [here](https://linux.die.net/man/3/gc) and it is named slightly 
> differently but appears to be the same function. In `core.memory`, it is 
> specified by the `extern (C)` declaration 
> (https://github.com/dlang/druntime/blob/master/src/core/memory.d) so I 
> guess it must be somewhere on my system?

It's in the D garbage collector, here: 
https://github.com/dlang/druntime/blob/2eec30b35bab308a37298331353bdce5fee1b657/src/gc/proxy.d#L166

extern(C) functions can be implemented in D. The major difference 
between standard D functions and extern(C) functions is that the latter 
are not name-mangled, so the symbol the linker sees is just the plain 
function name.
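
A small sketch of what that looks like (the function names `cAdd` and 
`dAdd` are made up for illustration); `.mangleof` shows the symbol name 
each one gets:

```d
import std.stdio : writeln;

// Implemented in D, but exported under its plain C name.
extern (C) int cAdd(int a, int b)
{
    return a + b;
}

// Ordinary D function: the symbol encodes module, name and signature.
int dAdd(int a, int b)
{
    return a + b;
}

void main()
{
    writeln(cAdd.mangleof); // the unmangled name
    writeln(dAdd.mangleof); // a _D... name that depends on the module

    assert(cAdd.mangleof == "cAdd");
    assert(dAdd.mangleof != "dAdd");
}
```

That is why core.memory can declare gc_malloc as extern (C) in one module 
while the GC defines it in another: both sides agree on the unmangled 
symbol.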

-Steve