Article: Increasing the D Compiler Speed by Over 75%

Fri Aug 2 09:47:25 PDT 2013

On 8/2/2013 6:16 AM, Dmitry Olshansky wrote:
> 31-Jul-2013 22:20, Walter Bright wrote:
>> On 7/31/2013 8:26 AM, Dmitry Olshansky wrote:
>>> Ouch... to boot it's always aligned by word size, so
>>> key % sizeof(size_t) == 0
>>> ...
>>> rendering lower 2-3 bits useless, that would make straight slice lower bits
>>> approach rather weak :)
>>
>> Yeah, I realized that, too. Gotta shift it right 3 or 4 bits.
>
> And that helped a bit... Anyhow, after applying a somewhat more thorough
> integer hash, power-of-2 tables live up to their promise.
>
> The pull that reaps the minor speed benefit over the original (~2% speed gain!):
> https://github.com/D-Programming-Language/dmd/pull/2436

2% is worth taking.
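
To make the point about the low bits concrete, here is a minimal sketch of indexing a power-of-2 table with a pointer key. It is not the code from the pull request: the names bucketIndex and naiveIndex and the shift amount of 4 are assumptions chosen for illustration, and the "more thorough integer hash" mentioned above is not reproduced.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not taken from dmd: map a pointer key to a bucket
// in a table whose size is a power of two.
static size_t bucketIndex(const void *key, size_t tableSize)
{
    // The mask below only works when tableSize is a power of two.
    assert(tableSize != 0 && (tableSize & (tableSize - 1)) == 0);

    uintptr_t k = reinterpret_cast<uintptr_t>(key);
    k >>= 4;  // assumed shift: drop the low bits that alignment keeps at zero
    return static_cast<size_t>(k) & (tableSize - 1);
}

// The "straight slice of the lower bits" approach, shown for comparison.
// With 16-byte-aligned keys every key lands in bucket 0.
static size_t naiveIndex(const void *key, size_t tableSize)
{
    assert(tableSize != 0 && (tableSize & (tableSize - 1)) == 0);
    return static_cast<size_t>(reinterpret_cast<uintptr_t>(key)) & (tableSize - 1);
}
```

With a 16-entry table, keys at addresses 0x1010 and 0x1020 both fall into bucket 0 under naiveIndex, while bucketIndex places them in buckets 1 and 2.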

> Not bad given that _aaGetRvalue takes only a fraction of time itself.
>
> I failed to see much of any improvement on Win32 though, allocations are
> dominating the picture.
>
> And sharing the joy of having a nice sampling profiler, here is what AMD
> CodeAnalyst has to say (top X functions by CPU clocks not halted).
>
> Original DMD:
>
> Function                    CPU clocks   DC accesses   DC misses
> RTLHeap::Alloc                   49410           520        3624
> Obj::ledata                      10300          1308        3166
> Obj::fltused                      6464          3218           6
> cgcs_term                         4018          1328         626
> TemplateInstance::semantic        3362          2396          26
> Obj::byte                         3212           506         692
> vsprintf                          3030          3060           2
> ScopeDsymbol::search              2780          1592         244
> _pformat                          2506          2772          16
> _aaGetRvalue                      2134           806         304
> memmove                           1904          1084          28
> strlen                            1804           486          36
> malloc                            1282           786          40
> Parameter::foreach                1240           778          34
> StringTable::search                952           220          42
> MD5Final                           918           318
>
> Variation of DMD with pow-2 tables:
>
> Function                    CPU clocks   DC accesses   DC misses
> RTLHeap::Alloc                   51638           552        3538
> Obj::ledata                       9936          1346        3290
> Obj::fltused                      7392          2948           6
> cgcs_term                         3892          1292         638
> TemplateInstance::semantic        3724          2346          20
> Obj::byte                         3280           548         676
> vsprintf                          3056          3006           4
> ScopeDsymbol::search              2648          1706         220
> _pformat                          2560          2718          26
> memcpy                            2014          1122          46
> strlen                            1694           494          32
> _aaGetRvalue                      1588           658         278
> Parameter::foreach                1266           658          38
> malloc                            1198           758          44
> StringTable::search                970           214          24
> MD5Final                           866           274           2
>
>
> This underlines the point that the DMC RTL allocator is the biggest speed detractor.
> It is "followed" by ledata (could it be due to linear search inside?) and
> surprisingly the tiny Obj::fltused is draining lots of cycles (is it called that
> often?).

It's not fltused() that is taking up time, it is the static function following 
it. The sampling profiler you're using is unaware of non-global function names.
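
A rough sketch of why that happens, assuming the profiler resolves each sampled address to the nearest global symbol at or below it. That lookup rule is an assumption about how such tools commonly work, not a description of CodeAnalyst's internals, and Symbol and attributeSample are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// One entry of the symbol table the profiler can see (global names only).
struct Symbol
{
    uintptr_t address;
    std::string name;
};

// Hypothetical lookup: attribute a sampled address to the last global
// symbol that starts at or before it. The table must be sorted by address.
static std::string attributeSample(const std::vector<Symbol> &globals,
                                   uintptr_t sample)
{
    assert(std::is_sorted(globals.begin(), globals.end(),
                          [](const Symbol &a, const Symbol &b) {
                              return a.address < b.address;
                          }));

    // First symbol whose start address is strictly greater than the sample.
    auto it = std::upper_bound(globals.begin(), globals.end(), sample,
                               [](uintptr_t addr, const Symbol &s) {
                                   return addr < s.address;
                               });
    if (it == globals.begin())
        return "<unknown>";     // sample lies below every known symbol
    return std::prev(it)->name; // nearest preceding global gets the blame
}
```

Under this rule, any sample that lands inside a static function placed right after Obj::fltused is reported as Obj::fltused, because the static function's name never appears in the table.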