Article: Increasing the D Compiler Speed by Over 75%

Fri Aug 2 06:16:38 PDT 2013

31-Jul-2013 22:20, Walter Bright wrote:
> On 7/31/2013 8:26 AM, Dmitry Olshansky wrote:
>> Ouch... to boot it's always aligned by word size, so
>> key % sizeof(size_t) == 0
>> ...
>> rendering lower 2-3 bits useless, that would make straight slice lower
>> bits approach rather weak :)
>
> Yeah, I realized that, too. Gotta shift it right 3 or 4 bits.

And that helped a bit... Anyhow, after applying a somewhat more pervasive 
integer hash, power-of-2 tables live up to their promise.
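
To make the alignment issue concrete, here is a minimal sketch in C++. It 
is not the code from the pull: the names sliceIndex, shiftedIndex and 
kAlignShift are made up for illustration, and the 4-bit shift assumes keys 
are pointers aligned to 16 bytes. It only shows why slicing the low bits of 
an aligned pointer wastes buckets, and how shifting first avoids that.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: keys are pointers aligned to 16 bytes, so the low 4 bits are
// always zero. Use 3 instead if the allocator only guarantees 8 bytes.
static const unsigned kAlignShift = 4;

// Hypothetical helper: true if n is a nonzero power of two.
inline bool isPow2(std::size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

// Naive approach: slice the low bits of the pointer directly.
// With aligned keys those bits are all zero, so small tables collapse
// into a single bucket.
inline std::size_t sliceIndex(const void *key, std::size_t tableSize)
{
    assert(isPow2(tableSize));
    std::uintptr_t k = reinterpret_cast<std::uintptr_t>(key);
    return static_cast<std::size_t>(k) & (tableSize - 1);
}

// Shift the always-zero alignment bits out first, then mask.
// The mask replaces the modulo a prime-sized table would need.
inline std::size_t shiftedIndex(const void *key, std::size_t tableSize)
{
    assert(isPow2(tableSize));
    std::uintptr_t k = reinterpret_cast<std::uintptr_t>(key);
    return static_cast<std::size_t>(k >> kAlignShift) & (tableSize - 1);
}
```

With an 8-entry table, keys 16 bytes apart all land in bucket 0 under 
sliceIndex, while shiftedIndex spreads them over consecutive buckets. A 
real table would likely mix in higher bits as well; that part is left out 
here.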

The pull request that reaps a minor speed benefit over the original (~2% 
speed gain!):
https://github.com/D-Programming-Language/dmd/pull/2436

Not bad, given that _aaGetRvalue itself takes only a fraction of the time.

I failed to see much of any improvement on Win32 though; allocations are 
dominating the picture.

And sharing the joy of having a nice sampling profiler, here is what AMD 
CodeAnalyst has to say (top X functions by CPU clocks not halted).

Original DMD:

Function	 CPU clocks	 DC accesses	 DC misses
RTLHeap::Alloc	 49410	 520	 3624
Obj::ledata	 10300	 1308	 3166
Obj::fltused	 6464	 3218	 6
cgcs_term	 4018	 1328	 626
TemplateInstance::semantic	 3362	 2396	 26
Obj::byte	 3212	 506	 692
vsprintf	 3030	 3060	 2
ScopeDsymbol::search	 2780	 1592	 244
_pformat	 2506	 2772	 16
_aaGetRvalue	 2134	 806	 304
memmove	 1904	 1084	 28
strlen	 1804	 486	 36
malloc	 1282	 786	 40
Parameter::foreach	 1240	 778	 34
StringTable::search	 952	 220	 42
MD5Final	 918	 318	

Variation of DMD with pow-2 tables:

Function	 CPU clocks	 DC accesses	 DC misses
RTLHeap::Alloc	 51638	 552	 3538
Obj::ledata	 9936	 1346	 3290
Obj::fltused	 7392	 2948	 6
cgcs_term	 3892	 1292	 638
TemplateInstance::semantic	 3724	 2346	 20
Obj::byte	 3280	 548	 676
vsprintf	 3056	 3006	 4
ScopeDsymbol::search	 2648	 1706	 220
_pformat	 2560	 2718	 26
memcpy	 2014	 1122	 46
strlen	 1694	 494	 32
_aaGetRvalue	 1588	 658	 278
Parameter::foreach	 1266	 658	 38
malloc	 1198	 758	 44
StringTable::search	 970	 214	 24
MD5Final	 866	 274	 2

This underlines the point that the DMC RTL allocator is the biggest speed 
detractor. It is "followed" by ledata (could it be due to a linear search 
inside?) and, surprisingly, the tiny Obj::fltused is draining lots of 
cycles (is it called that often?).

-- 
Dmitry Olshansky