Ready for review: new std.uni

Sat Jan 12 00:57:07 PST 2013

12-Jan-2013 09:17, David Nadlinger wrote:
> On Friday, 11 January 2013 at 20:57:57 UTC, Dmitry Olshansky wrote:
>> You can print total counts after each bench, there is a TLS variable
>> written at the end of it. But anyway I like your numbers! :)
>
> Okay, I couldn't resist having a short look at the results, specifically
> the benchmark of the new isSymbol implementation, where LDC beats DMD by
> roughly 10x. The reason for the nice performance results is mainly that
> LDC optimizes the classifyCall loop containing the trie lookup down to
> the following fairly optimal piece of code (eax is the overall counter
> that gets stored to lastCount):

So these are legit? Coooooool!

BTW, I'm getting about 2-3 times better numbers with 32-bit DMD on an
oldish AMD K10. Can you test the 32-bit versions as well? Could it be
some glitch in the 64-bit codegen?

>
> ---
>    40bc90:       8b 55 00                mov    edx,DWORD PTR [rbp+0x0]
>    40bc93:       89 d6                   mov    esi,edx
>    40bc95:       c1 ee 0d                shr    esi,0xd
>    40bc98:       40 0f b6 f6             movzx  esi,sil
>    40bc9c:       0f b6 34 31             movzx  esi,BYTE PTR [rcx+rsi*1]
>    40bca0:       48 83 c5 04             add    rbp,0x4
>    40bca4:       0f b6 da                movzx  ebx,dl
>    40bca7:       c1 e6 05                shl    esi,0x5
>    40bcaa:       c1 ea 08                shr    edx,0x8
>    40bcad:       83 e2 1f                and    edx,0x1f
>    40bcb0:       09 f2                   or     edx,esi
>    40bcb2:       41 0f b7 14 50          movzx  edx,WORD PTR [r8+rdx*2]
>    40bcb7:       c1 e2 08                shl    edx,0x8
>    40bcba:       09 da                   or     edx,ebx
>    40bcbc:       48 c1 ea 06             shr    rdx,0x6
>    40bcc0:       4c 01 ca                add    rdx,r9
>    40bcc3:       48 8b 14 d1             mov    rdx,QWORD PTR [rcx+rdx*8]
>    40bcc7:       48 0f a3 da             bt     rdx,rbx
>    40bccb:       83 d0 00                adc    eax,0x0
>    40bcce:       48 ff cf                dec    rdi
>    40bcd1:       75 bd                   jne    40bc90
> ---

This looks quite nice indeed.
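
For anyone who doesn't read x86 fluently, here is a rough C sketch of what that loop appears to compute, read straight off the disassembly: an 8/5/8-bit split of the code point, a byte table, then a 16-bit table, then a single bit test in a 64-bit word. All names and table sizes below are made up for illustration and are not the actual std.uni layout. In particular, the listing addresses the first and last levels off one base register (rcx, plus an offset in r9), while the sketch keeps three separate arrays for readability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table sizes for this sketch. Page 0 at each level is a
   shared all-zero page, so unseen code points resolve to "not a member". */
enum { MAX_L2_PAGES = 16, MAX_L3_PAGES = 64 };

static uint8_t  level1[256];                      /* bits 20..13 -> level-2 page */
static uint16_t level2[MAX_L2_PAGES * 32];        /* bits 12..8  -> level-3 page */
static uint64_t level3[MAX_L3_PAGES * 256 / 64];  /* bits 7..0   -> one bit each */
static unsigned used_l2 = 1, used_l3 = 1;         /* page 0 is reserved */

/* Add one code point to the set, allocating pages on first use. */
static void trie_insert(uint32_t ch)
{
    assert(ch <= 0x10FFFF);
    uint8_t *p2 = &level1[ch >> 13];
    if (*p2 == 0) {
        assert(used_l2 < MAX_L2_PAGES);
        *p2 = (uint8_t)used_l2++;
    }
    uint16_t *p3 = &level2[((uint32_t)*p2 << 5) | ((ch >> 8) & 0x1F)];
    if (*p3 == 0) {
        assert(used_l3 < MAX_L3_PAGES);
        *p3 = (uint16_t)used_l3++;
    }
    uint32_t bit = ((uint32_t)*p3 << 8) | (ch & 0xFF);
    level3[bit >> 6] |= (uint64_t)1 << (bit & 63);
}

/* Branch-free membership test: two table loads, one word load, one bit
   test (the bt instruction in the listing). Returns 0 or 1. */
static unsigned trie_contains(uint32_t ch)
{
    uint32_t p2  = level1[(ch >> 13) & 0xFF];
    uint32_t p3  = level2[(p2 << 5) | ((ch >> 8) & 0x1F)];
    uint32_t bit = (p3 << 8) | (ch & 0xFF);
    return (unsigned)((level3[bit >> 6] >> (bit & 63)) & 1);
}

/* The benchmark loop: add the 0/1 result to a counter (adc eax,0). */
static unsigned count_members(const uint32_t *text, size_t n)
{
    unsigned count = 0;
    for (size_t i = 0; i < n; ++i)
        count += trie_contains(text[i]);
    return count;
}

/* Tiny demo: three members, six input code points, four of them hits. */
static unsigned demo_count(void)
{
    static const uint32_t text[] = { 'a', '+', 0x20AC, 'b', 0x1F600, '+' };
    trie_insert('+');
    trie_insert(0x20AC);
    trie_insert(0x1F600);
    return count_members(text, sizeof text / sizeof text[0]);
}
```

The point is that the lookup has no data-dependent branches at all, so once it is inlined the whole loop body is a handful of shifts, masks and loads, which is what the listing shows.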

>
> The code DMD generates for the lookup, on the other hand, is pretty
> ugly, including several values being spilled to the stack, and also
> doesn't get inlined.

To be honest, one of the major problems I see with DMD is the lack of a
principled, reliable inliner. Currently it may inline one of two
equivalent pieces of code and not the other just because one of them
has an early return, a switch statement or whatever. And it's about time
to start inlining functions with loops, as it's not the 90s anymore.

> [1] The reasons for which I'm focusing on LLVM here are not so much its
> technical qualities as its liberal BSD-like license – if it is good
> enough for Apple, Intel (also a compiler vendor) and their lawyer teams,
> it is probably also for us. The code could even be integrated into
> commercial products such as DMC without problems.
>

I like LLVM, and nearly everybody in the industry likes it. Another
example is AMD: they are building their compiler infrastructure for GPUs
on top of LLVM.

> [2] And for any typos which might undermine my credibility – it is way
> too early in the morning here.

-- 
Dmitry Olshansky