rt_finalize WTFs?

Sun Dec 4 20:41:08 PST 2011

Thanks for the benchmark.  I ended up deciding to just create a second 
function, rt_finalize_gc, that gets rid of a whole bunch of cruft that 
isn't necessary in the GC case.  I think it's worth the small amount of 
code duplication it creates.  Here are the results of my efforts so far: 
  https://github.com/dsimcha/druntime/wiki/GC-Optimizations-Round-2 . 
I've got one other good idea that I think will shave a few seconds off 
the Tree1 benchmark if I don't run into any unforeseen obstacles in 
implementing it.
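
Here is a minimal, self-contained D model of what a GC-only finalizer could look like. It is a sketch, not the actual rt_finalize_gc. `FakeClassInfo`, `FakeObject` and `finalizeGc` are hypothetical names. The collectHandler check and monitor cleanup are left out. It catches Exception rather than Throwable, assuming (as Martin argues in the reply quoted below) that Errors are not recoverable. Following his suggestion, the vptr is cleared after the try block and inside the catch handler instead of in a finally block, and the init image is never copied back over the object.

```d
/// Hypothetical stand-in for ClassInfo: just the two fields rt_finalize walks.
struct FakeClassInfo
{
    void function(void* obj) destructor; // null when the class has no ~this()
    const(FakeClassInfo)* base;          // null above the root class
}

/// Hypothetical object layout: a vptr-like slot followed by the fields.
struct FakeObject
{
    const(FakeClassInfo)* vptr; // cleared once the object is finalized
    int payload;
}

/// Sketch of a GC-only finalizer. Compared with rt_finalize it assumes the
/// GC never passes null, does not copy the init image back over the object,
/// and clears the vptr without a finally block. Returns false if a
/// destructor threw an Exception.
bool finalizeGc(FakeObject* p)
{
    auto c = p.vptr;
    if (c is null)          // already finalized: nothing to do
        return true;

    try
    {
        // Most-derived destructor first, then each base class in turn.
        for (; c !is null; c = c.base)
            if (c.destructor !is null)
                c.destructor(p);
    }
    catch (Exception)
    {
        p.vptr = null;      // duplicated here instead of living in a finally
        return false;       // rt_finalize would report via onFinalizeError
    }
    p.vptr = null;          // normal path
    return true;
}
```

Calling `finalizeGc` twice on the same object is harmless: the second call sees a null vptr and returns immediately. An Error thrown by a destructor skips both assignments and leaves the vptr set, which is exactly the trade-off Martin accepts.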

On 12/4/2011 10:07 PM, Martin Nowak wrote:
> On Mon, 05 Dec 2011 02:46:27 +0100, dsimcha <dsimcha at yahoo.com> wrote:
>
>> I'm at my traditional pastime of trying to speed up D's garbage
>> collector again, and I've stumbled on the fact that rt_finalize is
>> taking up a ridiculous share of the time (~30% of total runtime) on a
>> benchmark where huge numbers of classes **that don't have
>> destructors** are being created and collected. Here's the code to this
>> function, from lifetime.d:
>>
>> extern (C) void rt_finalize(void* p, bool det = true)
>> {
>>     debug(PRINTF) printf("rt_finalize(p = %p)\n", p);
>>
>>     if (p) // not necessary if called from gc
>>     {
>>         ClassInfo** pc = cast(ClassInfo**)p;
>>
>>         if (*pc)
>>         {
>>             ClassInfo c = **pc;
>>             byte[] w = c.init;
>>
>>             try
>>             {
>>                 if (det || collectHandler is null || collectHandler(cast(Object)p))
>>                 {
>>                     do
>>                     {
>>                         if (c.destructor)
>>                         {
>>                             fp_t fp = cast(fp_t)c.destructor;
>>                             (*fp)(cast(Object)p); // call destructor
>>                         }
>>                         c = c.base;
>>                     } while (c);
>>                 }
>>                 if ((cast(void**)p)[1]) // if monitor is not null
>>                     _d_monitordelete(cast(Object)p, det);
>>                 (cast(byte*) p)[0 .. w.length] = w[]; // WTF?
>>             }
>>             catch (Throwable e)
>>             {
>>                 onFinalizeError(**pc, e);
>>             }
>>             finally // WTF?
>>             {
>>                 *pc = null; // zero vptr
>>             }
>>         }
>>     }
>> }
>>
>> Getting rid of the stuff I've marked with //WTF? comments (namely the
>> finally block and the re-initializing of the memory occupied by the
>> finalized object) speeds things up by ~15% on the benchmark in
>> question. Why do we care what state the blob of memory is left in
>> after we finalize it? I can kind of see that we want to clear things
>> if delete/clear was called manually and we want to leave the object in
>> a state that doesn't look valid. However, this has significant
>> performance costs and IIRC is already done in clear() and delete is
>> supposed to be deprecated. Furthermore, I'd like to get rid of the
>> finally block entirely, since I assume its presence and the effect on
>> the generated code is causing the slowdown, not the body, which just
>> assigns a pointer.
>>
>> Is there any good reason to keep this code around?
>
> Not for the try block. Since errors are not recoverable, you don't need to
> care about zeroing the vptr, or you could just copy that code into the
> catch handler. This seems to cause fewer spilled variables.
>
> The most expensive part is the call to memcpy through the PLT; replace it
> with something inlineable. Zeroing is not much faster than copying init[]
> for small classes.
>
> Zeroing should at least be worth it, unless the GC would not scan the
> memory otherwise.
>
> gcbench/tree1: 41.8s before, 33.4s with https://gist.github.com/1432117
> applied.
>
> Please add useful benchmarks to druntime.
>
> martin
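
Martin's suggestion to replace the memcpy call with something inlineable could look roughly like the D sketch below. `copyInit` and `zeroFill` are hypothetical helper names. The code assumes that both the destination and the init image are size_t-aligned. It copies whole words first, then the byte tail, and marks both helpers `pragma(inline, true)`. Whether this actually beats memcpy depends on the compiler, because an optimizing backend may recognize the loop and turn it back into a library call. It should be measured on the benchmark rather than assumed.

```d
/// Copies the init image over an object's memory, a word at a time,
/// then the remaining tail bytes. Assumes size_t-aligned buffers.
pragma(inline, true)
void copyInit(void[] mem, const(void)[] init) pure nothrow @nogc
{
    assert(mem.length >= init.length);
    assert(cast(size_t) mem.ptr % size_t.sizeof == 0);
    assert(cast(size_t) init.ptr % size_t.sizeof == 0);

    immutable words = init.length / size_t.sizeof;
    auto dw = cast(size_t*) mem.ptr;
    auto sw = cast(const(size_t)*) init.ptr;
    foreach (i; 0 .. words)
        dw[i] = sw[i];

    // Tail that does not fill a whole word.
    auto db = cast(ubyte*) mem.ptr;
    auto sb = cast(const(ubyte)*) init.ptr;
    foreach (i; words * size_t.sizeof .. init.length)
        db[i] = sb[i];
}

/// Zeroes an object's memory with a plain loop instead of a memset call.
pragma(inline, true)
void zeroFill(void[] mem) pure nothrow @nogc
{
    foreach (ref b; cast(ubyte[]) mem)
        b = 0;
}
```

For a typical small class, both loops run only a handful of iterations, which is why Martin expects zeroing and copying init[] to cost about the same once the out-of-line call is gone.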