[Issue 12176] New: Possible std.algorithm.sum optimization for short fixed-size arrays

Sat Feb 15 03:11:43 PST 2014

https://d.puremagic.com/issues/show_bug.cgi?id=12176

           Summary: Possible std.algorithm.sum optimization for short
                    fixed-size arrays
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Phobos
        AssignedTo: nobody at puremagic.com
        ReportedBy: bearophile_hugs at eml.cc

--- Comment #0 from bearophile_hugs at eml.cc 2014-02-15 03:11:41 PST ---
I suggest to modify std.algorithm.sum (to add an overload) to let it accept
fixed-sized arrays too, and then add a specialization for short fixed-sized
arrays like:

int[3] a = [10, 20, 30];
immutable tot = a.sum;

This expands the places where you are willing to use sum() because with such
specialization a normal D compiler is able to produce inlined loop-free asm
code similar to writing down a manual sum of the array items:

immutable tot = a[0] + a[1] + a[2];

Generic library code should not be slower than built-in features, where
possible, and short fixed-sized arrays are not uncommon in high performance
code (and currently DMD doesn't inline code containing a loop. And sometimes
even with LDC2 loops are not unrolled if their size is a run-time value. To
unroll them it should inline the function and then see that the run-time length
is actually a compile-time one).

If you know at compile-time the length of an array, this is precious
information, and it's bad to just throw such information away, especially when
such length is small (when the length is large it makes no difference).

This specialization should be present only for small arrays (like Length <= 5).

For the other cases and to reduce template bloat the sum() specialized on
fixed-sized arrays should just call (with a "static if") the normal sum for
dynamic arrays with a slicing:

auto sum(size_t N)(T[N] arr) {
    static if (N < 6) {
       // short-case implementation here ...
    } else {
        return sum(arr[]);
    }
}

Using the sum() code from std.algorithm, this main() gets compiled by the
latest ldc2:

int main() {
    int[3] a = [10, 20, 30];
    return cast(int)a[].sum;
}

As (-O -release -inline -noboundscheck -output-s):

__Dmain:
    pushl   %ebp
    pushl   %ebx
    pushl   %edi
    pushl   %esi
    subl    $28, %esp
    movl    $10, 16(%esp)
    movl    $20, 20(%esp)
    movl    $30, 24(%esp)
    leal    16(%esp), %edi
    movl    %edi, 4(%esp)
    movl    $3, (%esp)
    xorl    %esi, %esi
    calll   __D3std5array12__T5emptyTiZ5emptyFNaNbNdNfxAiZb
    subl    $8, %esp
    testb   $1, %al
    jne LBB0_3
    movl    $3, %eax
    movl    %edi, %ecx
    leal    20(%esp), %ebx
    xorl    %esi, %esi
    movl    $2, %ebp
    xorl    %edi, %edi
    .align  16, 0x90
LBB0_2:
    movl    %ecx, 4(%esp)
    movl    %eax, (%esp)
    calll   __D3std5array12__T5frontTiZ5frontFNaNbNcNdNfAiZi
    subl    $8, %esp
    movl    (%eax), %eax
    movl    %eax, %ecx
    sarl    $31, %ecx
    addl    %eax, %esi
    adcl    %ecx, %edi
    movl    %ebx, 8(%esp)
    movl    %ebp, 12(%esp)
    movl    %ebx, 4(%esp)
    movl    %ebp, (%esp)
    calll   __D3std5array12__T5emptyTiZ5emptyFNaNbNdNfxAiZb
    subl    $8, %esp
    movl    8(%esp), %ecx
    decl    %ebp
    addl    $4, %ebx
    testb   $1, %al
    movl    12(%esp), %eax
    je  LBB0_2
LBB0_3:
    movl    %esi, %eax
    addl    $28, %esp
    popl    %esi
    popl    %edi
    popl    %ebx
    popl    %ebp
    ret

The idea proposed here is just an optimization, so in theory a sufficiently
smart compiler should not need it.

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------