D performance

Steven Schveighoffer schveiguy at gmail.com
Mon Apr 27 13:19:45 UTC 2020


On 4/27/20 1:04 AM, Mathias LANG wrote:
> On Sunday, 26 April 2020 at 16:20:19 UTC, Steven Schveighoffer wrote:
>>
>> In terms of performance, depending on the task at hand, D1 code is 
>> slower than D2 appending, because D2 has a thread-local cache for 
>> appending while D1 only has a global one-array cache for the same. 
>> However, I'm assuming that since you were focused on D1, your usage 
>> was naturally written to take advantage of what D1 has to offer.
>>
>> The assumeSafeAppend call also uses this cache, and so it should be 
>> quite fast. But setting length to 0 is a ton faster, because you 
>> aren't calling an opaque function.
>>
>> So depending on the usage pattern, D2 with assumeSafeAppend can be 
>> faster, or it could be slower.
> 
> Well, Sociomantic didn't use any kind of multi-threading in "user code".
> We had single-threaded fibers for concurrency, and process-level scaling 
> for parallelism.
> Some corner cases were using threads, but it was for low level things 
> (e.g. low latency file IO on Linux), which were highly scrutinized and 
> stayed clear of the GC AFAIK.
> 
> Note that accessing TLS *does* have a cost which is higher than 
> accessing a global. 

That is a minor cost compared to the actual appending.

> By this reasoning, I would assume that D2 appending 
> would definitely be slower, although I never profiled it.

I tested the performance when I added the feature. D2 was significantly 
and measurably faster (at least when appending to 2 or more arrays). 
Searching through my old email: for appending 5M bytes to 2 arrays, the 
original code took 13.99 seconds (on whatever system I was using in 
2009), versus 1.53 seconds with the cache.

According to that email, I had similar results even with a 1-element 
cache, so somehow my code was faster, though I didn't know why at the 
time. Quite possibly it's because D1's cache for looking up block info 
is behind the GC lock.

Literally the only thing that is more expensive in D2 vs. D1 is the 
truncation of arrays. In D1 that was just setting the length to 0; in 
D2 you also need to call assumeSafeAppend. This is why I suggested a 
flag that lets you enable the original behavior.
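
To make that concrete, here's a minimal sketch of the two truncation 
patterns (D2 syntax for both):

void main()
{
    char[] buf;
    buf ~= "some data";

    // D1 pattern: truncation was just a store to the length field,
    // and the next append could reuse the block.
    buf.length = 0;

    // D2: after shrinking, the runtime must assume other slices may
    // still reference the old data, so the next append would
    // reallocate unless you promise there are no such slices:
    buf.assumeSafeAppend();

    buf ~= "reused block"; // now appends in place
}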

> What I did 
> profile, though, is `assumeSafeAppend`. The fact that it looks up GC 
> metadata (taking the GC lock in the process) made it quite expensive 
> given how often it was called (in D1 it was simply a no-op, and called 
> defensively).

The cache I referred to is to look up the GC metadata. In essence, when 
you append, you will look it up anyway. Either assumeSafeAppend or 
append will get the GC metadata into the cache, and then it is a 
straight lookup in the cache and this doesn't take a lock or do any 
expensive searches. The cache is designed to favor the most recent 
arrays first.

This is an 8-element cache, so there are still cases where you will run 
into issues (like if you round-robin append to 9 arrays). I believe 8 
elements was a sweet spot for performance that allowed reasonably fast 
appending with a reasonable number of concurrent arrays.
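
As a rough illustration, a hypothetical micro-benchmark (just a 
sketch, not measurements I have) for round-robin appends just under 
and just over the cache size:

import std.datetime.stopwatch : StopWatch;
import std.stdio : writeln;

// Round-robin appends to n arrays; each append looks up the block
// metadata, so n <= 8 should stay within druntime's cache while
// n == 9 keeps evicting entries.
void roundRobin(size_t n, size_t total)
{
    auto arrays = new ubyte[][](n);
    foreach (i; 0 .. total)
        arrays[i % n] ~= cast(ubyte) i;
}

void main()
{
    foreach (n; [8, 9])
    {
        StopWatch sw;
        sw.start();
        roundRobin(n, 10_000_000);
        sw.stop();
        writeln(n, " arrays: ", sw.peek());
    }
}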

Where D1 falls down is when you switch between more than one array, 
because its cache holds only one element. And even with a single array, 
the cache is not in the array runtime but in the GC, and it is keyed on 
the pointer queried, not the block data. A GC collection, for instance, 
will invalidate the cache.

> 
>>> IIRC Mathias has suggested that it should be possible to tag arrays 
>>> as intended for this kind of re-use, so that stomping prevention will 
>>> never trigger, and you don't have to `assumeSafeAppend` each time you 
>>> reduce the length.
>>
>> I spoke for a while with Dicebot at Dconf 2016 or 17 about this issue. 
>> IIRC, I suggested either using a custom type or custom runtime. He was 
>> not interested in either of these ideas, and it makes sense (large 
>> existing code base, didn't want to stray from mainline D).
>>
>> By far, the best mechanism to use is a custom type. Not only will that 
>> fix this problem as you can implement whatever behavior you want, but 
>> you also do not need to call opaque functions for appending either. It 
>> should outperform everything you could do in a generic runtime.
> 
> Well... Here's something I never really quite understood, actually: Mihails 
> *did* introduce a buffer type. See 
> https://github.com/sociomantic-tsunami/ocean/blob/36c9fda09544ee5a0695a74186b06b32feda82d4/src/ocean/core/Buffer.d#L116-L130 
> 
> And we also had a (very old) similar utility here: 
> https://github.com/sociomantic-tsunami/ocean/blob/36c9fda09544ee5a0695a74186b06b32feda82d4/src/ocean/util/container/ConcatBuffer.d 
> 
> I always wanted to unify this, but never got to it. But if you look at 
> the first link, it calls `assumeSafeAppend` twice, before and after 
> setting the length. In practice it is only necessary *after* reducing 
> the length, but as I mentioned, this is defensive programming.

Yeah, that is unnecessary. It is not going to be that expensive, 
especially if you were just appending to that array, but it is still 
more expensive than setting a word to 0.
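
In other words, the only call that matters is the one after the shrink 
(hypothetical helper name):

// Only the call after the shrink matters; the defensive call before
// it isn't wrong, just a wasted cache lookup.
void truncateForReuse(ref ubyte[] buf)
{
    buf.length = 0;
    buf.assumeSafeAppend(); // re-enables in-place appends on this block
}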

> 
> For reference, most of our applications had a principled buffer use. The 
> buffers would rarely be appended to from more than one, perhaps two 
> places. However, slices to the buffer would be passed around quite 
> liberally. So a buffer type from which one could borrow would indeed 
> have been optimal.

This all actually works better with the new runtime. The old one would 
reallocate if you appended to a slice that didn't start at the block 
start. The new version can detect that the slice ends where the used 
data ends and allow appending in place.
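
A small sketch of what the current runtime handles (assuming the block 
has spare capacity):

void main()
{
    int[] a;
    a.reserve(16);            // make sure the block has spare capacity
    a ~= [0, 1, 2, 3, 4, 5, 6, 7];

    // A slice that starts mid-block but ends where the used data ends:
    auto tail = a[4 .. $];

    // D1-era runtime: this append reallocated, because only slices
    // starting at the block start could be grown in place. The current
    // runtime detects that `tail` ends at the end of the used data and
    // extends it in place (tail.ptr stays a.ptr + 4).
    tail ~= 42;
}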

> 
>> Note that this was before (I think) destructor calls were added. The 
>> destructor calls are something that assumeSafeAppend is going to do, 
>> and won't be done with just setting length to 0.
>>
>> However, there are other options. We could introduce a druntime 
>> configuration option so when this specific situation happens (slice 
>> points at start of block and has 0 length), assumeSafeAppend is called 
>> automatically on the first append. Jonathan is right that this is not 
>> @safe, but it could be an opt-in configuration option.
>>
>> I don't think configuring specific arrays makes a lot of sense, as 
>> this would require yet another optional bit that would have to be 
>> checked and allocated for all arrays.
>>
> 
> I don't even know if we had a single case where we had arrays of objects 
> with destructors. The vast majority of our buffers were `char[]` and 
> `ubyte[]`. We had some elaborate types, but I think destructors + buffer 
> would have been frowned upon in code review.

Of course! D1 didn't have destructors for structs ;)

> 
> Also the reason we didn't modify druntime to just have the D1 behavior 
> (that would have been a trivial change) was because of how dependent on the 
> new behavior druntime had become. It was also the motivation for the 
> suggestion Joe mentioned. AFAIR I mentioned it in an internal issue, did 
> a PoC implementation, but never got it to a state where it was mergeable.

Having a flag per array is going to be costly, but actually, there's a 
lot more junk in the block itself. Perhaps there's a spare bit somewhere 
that can be a flag for the append behavior.

> 
> Also, while a custom type might sound better, it doesn't really interact 
> well with the rest of the runtime, and it's an extra word to pass around 
> (if passed by value).

The "extra value" can be stored elsewhere -- just like the GC you could 
provide metadata for the capacity in a global AA or something.
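
Something like this sketch (all names hypothetical; single-threaded, 
grow logic omitted):

// Capacity for each buffer lives in a global AA keyed by the base
// pointer, so the buffer itself stays a plain slice and no extra
// word travels with it.
__gshared size_t[ubyte*] capacityOf;

ubyte[] makeBuffer(size_t cap)
{
    auto buf = new ubyte[](cap);
    capacityOf[buf.ptr] = cap;
    return buf[0 .. 0]; // empty slice over the reserved block
}

void appendByte(ref ubyte[] buf, ubyte b)
{
    const cap = capacityOf[buf.ptr];
    assert(buf.length < cap, "grow logic omitted in this sketch");
    buf = buf.ptr[0 .. buf.length + 1]; // extend in place, no runtime call
    buf[$ - 1] = b;
}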

In any case, there were options. The way druntime is written, it gets 
pretty good performance -- in most cases BETTER performance than D1 -- 
for idiomatic D code. In fact, the Tango folks asked me if I could add 
the feature to Tango's druntime, but I couldn't because it depends on 
TLS.

For code that was highly focused on optimizing D1 with its 
idiosyncrasies, it probably has worse performance.

The frustration is understandable, but without the possibility of 
adaptation, there's not much one can do.

-Steve

