D outperformed by C++, what am I doing wrong?
amfvcg via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Sun Aug 13 02:15:48 PDT 2017
On Sunday, 13 August 2017 at 09:08:14 UTC, Petar Kirov
[ZombineDev] wrote:
>
> There's one especially interesting result:
>
> This instantiation:
>
> sum_subranges(std.range.iota!(int, int).iota(int, int).Result, uint)
>
> of the following function:
>
> auto sum_subranges(T)(T input, uint range)
> {
>     import std.range : chunks;
>     import std.algorithm : map, sum;
>     return input.chunks(range).map!(sum);
> }
>
> gets optimized with LDC to:
> push rax
> test edi, edi
> je .LBB2_2
> mov edx, edi
> mov rax, rsi
> pop rcx
> ret
> .LBB2_2:
> lea rsi, [rip + .L.str.3]
> lea rcx, [rip + .L.str]
> mov edi, 45
> mov edx, 89
> mov r8d, 6779
> call _d_assert_msg at PLT
>
> I.e. the compiler turned an O(n) algorithm into an O(1) one, which
> is quite neat. It is also quite surprising to me that it looks
> like even dmd managed to do a similar optimization:
>
> sum_subranges(std.range.iota!(int, int).iota(int, int).Result, uint):
> push rbp
> mov rbp,rsp
> sub rsp,0x30
> mov DWORD PTR [rbp-0x8],edi
> mov r9d,DWORD PTR [rbp-0x8]
> test r9,r9
> jne 41
> mov r8d,0x1b67
> mov ecx,0x0
> mov eax,0x61
> mov rdx,rax
> mov QWORD PTR [rbp-0x28],rdx
> mov edx,0x0
> mov edi,0x2d
> mov rsi,rdx
> mov rdx,QWORD PTR [rbp-0x28]
> call 41
> 41: mov QWORD PTR [rbp-0x20],rsi
> mov QWORD PTR [rbp-0x18],r9
> mov rdx,QWORD PTR [rbp-0x18]
> mov rax,QWORD PTR [rbp-0x20]
> mov rsp,rbp
> pop rbp
> ret
>
> Moral of the story: templates + ranges are an awesome
> combination.
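
For reference, a minimal way to exercise the quoted function. The driver, the concrete `iota` bounds, and the `equal` check are my own sketch, not part of the original benchmark:

```d
import std.algorithm : equal, map, sum;
import std.range : chunks, iota;

// Same shape as the quoted function, with the `sum` import it needs.
auto sum_subranges(T)(T input, uint range)
{
    return input.chunks(range).map!(sum);
}

void main()
{
    // 0..5 chunked into pairs: [0,1], [2,3], [4,5] -> sums 1, 5, 9.
    assert(iota(0, 6).sum_subranges(2).equal([1, 5, 9]));
}
```

Note that `sum_subranges` returns a lazy range: no summing happens until the result is iterated, which is part of why the compilers can collapse the call itself so aggressively.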
Change the parameter for this range size so that it is read from stdin,
and I assume that these optimizations will go away.
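
A sketch of that change in D. The concrete `iota` bounds and the use of `readf` are my assumptions, just to make the chunk size a run-time value:

```d
import std.algorithm : map, sum;
import std.range : chunks, iota;
import std.stdio : readf, writeln;

void main()
{
    // The chunk size is only known at run time, so the compiler
    // cannot fold the whole computation into a constant.
    uint range;
    readf(" %d", &range);
    writeln(iota(0, 1000).chunks(range).map!(sum).sum);
}
```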