Array fill performance differences between for, foreach, slice
Steven Schveighoffer
schveiguy at gmail.com
Wed Apr 1 16:33:11 UTC 2020
On 4/1/20 11:23 AM, data pulverizer wrote:
> Thanks for all the suggestions made so far. I am still interested in
> looking at the implementation details of the slice assign `arr[] = x`
> which I can't seem to find. Before I made my initial post, I tried doing
> a `memcpy` and `memmove` under a `for` loop but it did not change the
> performance or get the same kind of performance as the initial slice
> performance so I didn't bother to mention them, I haven't tried it with
> the suggested compiler flags though.
The disassembly on run.dlang.io shows that it calls __memsetDouble.
>
> @StevenSchveighoffer also suggested using `memset` (as well as `memcpy`)
> please correct me if I am wrong but it looks as if `memset` can only
> write from an `int` sized source and I need the source size to be any
> potential size (T).
Again, the compiler uses whatever tools are available. It might be
memset, it might be something else.
In the case of your code, it's using __memsetDouble. I'm not sure where
that is defined (probably druntime's rt/memset.d rather than libc).
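
To illustrate the point that slice assignment works for any T and leaves
the choice of fill routine to the compiler, here is a minimal sketch.
The helper names fillSlice and fillLoop and the Pair struct are made up
for this example; they are not from the original code.

```d
import std.stdio : writeln;

// Hypothetical element type, used to show that T need not be a
// byte- or int-sized value.
struct Pair
{
    int id;
    double weight;
}

// Slice assignment: the compiler picks the fill routine for T.
void fillSlice(T)(T[] arr, T x)
{
    arr[] = x;
}

// Element-by-element loop, for comparison.
void fillLoop(T)(T[] arr, T x)
{
    foreach (ref e; arr)
        e = x;
}

void main()
{
    auto a = new double[8];
    auto b = new double[8];
    fillSlice(a, 42.0);
    fillLoop(b, 42.0);
    assert(a == b); // both approaches give the same contents

    auto p = new Pair[4];
    fillSlice(p, Pair(7, 0.5));
    assert(p[3] == Pair(7, 0.5));
    writeln("fills agree");
}
```

Both helpers produce the same array; the difference you are measuring
is only in the code the compiler emits for each.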
> ----------------------------------------------------------------------
>
> On a related aside I noticed that the timing was reduced across the
> board so much so that the initial slice time halved when initialising with:
>
> ```
> auto arr = (cast(T*)GC.malloc(T.sizeof*n, GC.BlkAttr.NO_SCAN |
> GC.BlkAttr.APPENDABLE))[0..n];
> ```
>
> Instead of:
>
> ```
> auto arr = new T[n];
> ```
NO_SCAN means the GC does not scan the block for pointers during a
collection cycle. If your T contains pointers, this is a very bad idea.
APPENDABLE is also a problem here: GC.malloc does not initialize the
appendable data at the end of the block. If you are going to use this,
remove the GC.BlkAttr.APPENDABLE.
In addition, GC.malloc does not run T's initializer (at most the data
is zeroed). If you do new T[n], and T has an initializer, it's going to
be a lot more expensive. In the case of double, every element is
initialized to NaN.
This could explain the difference in timing.
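
A small sketch of the two allocation paths, assuming T is double (so
NO_SCAN is safe) and with APPENDABLE dropped as suggested. The helper
names allocDefault and allocNoInit are invented for this example.

```d
import core.memory : GC;
import std.math : isNaN;

// new T[n] writes T.init into every element; for double that is NaN.
double[] allocDefault(size_t n)
{
    return new double[n];
}

// Raw GC block without per-element initialization.
// NO_SCAN is only acceptable because double holds no pointers.
// APPENDABLE is deliberately left out.
double[] allocNoInit(size_t n)
{
    auto p = cast(double*) GC.malloc(double.sizeof * n, GC.BlkAttr.NO_SCAN);
    return p[0 .. n];
}

void main()
{
    auto a = allocDefault(4);
    assert(a[0].isNaN); // default-initialized doubles are NaN

    auto b = allocNoInit(4);
    b[] = 1.5; // fill the block before reading from it
    assert(b[3] == 1.5);
}
```

When comparing timings, the first path pays for writing NaN into every
element before your own fill runs; the second does not.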
>
> I noticed that `GC.malloc()` is based on `gc_malloc()` which gives the
> bit mask option that makes it faster than `core.stdc.stdlib: malloc`.
> Is `gc_malloc` OS dependent? I can't find it in the standard C library,
> the only reference I found for it is
> [here](https://linux.die.net/man/3/gc) and it is named slightly
> differently but appears to be the same function. In `core.memory`, it is
> specified by the `extern (C)` declaration
> (https://github.com/dlang/druntime/blob/master/src/core/memory.d) so I
> guess it must be somewhere on my system?
It's in the D garbage collector, here:
https://github.com/dlang/druntime/blob/2eec30b35bab308a37298331353bdce5fee1b657/src/gc/proxy.d#L166
extern(C) functions can be implemented in D. The major difference
between standard D functions and extern(C) functions is that the latter
are not name-mangled, so the linker sees them under their plain C name.
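
As a rough illustration (the function names addOneC and addOneD are
invented for this example), the same body can be compiled with either
linkage, and .mangleof shows the symbol name each one gets:

```d
import std.stdio : writeln;

// Implemented in D, but exported with C linkage: no D name mangling.
extern(C) int addOneC(int x)
{
    return x + 1;
}

// Ordinary D function: the symbol encodes module, name and signature.
int addOneD(int x)
{
    return x + 1;
}

void main()
{
    writeln(addOneC.mangleof); // the plain C name
    writeln(addOneD.mangleof); // a D-mangled name starting with _D

    assert(addOneC.mangleof == "addOneC");
    assert(addOneD.mangleof[0 .. 2] == "_D");
    assert(addOneC(1) == addOneD(1));
}
```

This is how core.memory can declare gc_malloc as extern (C) while the
body lives in the GC sources.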
-Steve