My ACCU 2016 keynote video available online

Jens Müller via Digitalmars-d-announce digitalmars-d-announce at puremagic.com
Thu May 19 05:54:48 PDT 2016


On Thursday, 19 May 2016 at 12:04:31 UTC, Andrei Alexandrescu 
wrote:
> On 5/19/16 4:12 AM, Jens Müller wrote:

> What test data did you use?

An instance for benchmarking is generated as follows. Let nnz be 
the total number of non-zero entries in the input vectors a and b 
combined.

---
import std.random : uniform, randomSample;
import std.range : iota;
import std.algorithm : map;
import std.array : array;

auto lengthA = uniform!"[]"(0, nnz, gen);
auto lengthB = nnz - lengthA;

auto a = iota(0, nnz).randomSample(lengthA, gen)
    .map!(i => Pair(i, 10)).array();
auto b = iota(0, nnz).randomSample(lengthB, gen)
    .map!(i => Pair(i, 10)).array();
---

So I take a random sample of {0, ..., nnz - 1} for each input.
Any better ideas? I've seen that people generate vectors with 
larger gaps.
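One way to get such gaps (an assumption on my part, not something from the thread; Pair, the value 10, and gen are taken to mirror the benchmark's definitions, and maxGap is a hypothetical knob) is to draw the gap between consecutive indices from a uniform range, making sparsity tunable. A rough sketch:

```d
import std.random : Random, uniform;

struct Pair { size_t index; double value; } // mirrors the benchmark's type

// Generate n non-zero entries whose indices are separated by random
// gaps in [1, maxGap]; larger maxGap means sparser vectors.
Pair[] sparseVector(size_t n, size_t maxGap, ref Random gen)
{
    auto v = new Pair[](n);
    size_t idx = 0;
    foreach (k; 0 .. n)
    {
        idx += uniform!"[]"(size_t(1), maxGap, gen); // gap of at least 1
        v[k] = Pair(idx, 10);
    }
    return v;
}
```

With maxGap = 1 the indices are consecutive; raising maxGap produces the larger gaps mentioned above.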

> 10%-20% win on dot product is significant because for many 
> algorithms dot product is a kernel and dominates everything 
> else. For those any win goes straight to the bottom line.

Sure. Still, I wasn't sure whether I had taken the idea from your 
talk. Maybe there is more to it.

>> The base line (dot1 in the graphs) is the straightforward 
>> version
>>
>> ---
>> size_t i,j = 0;
>> double sum = 0;
>> while (i < a.length && j < b.length)
>> {
>>      if (a[i].index < b[j].index) i++;
>>      else if (a[i].index > b[j].index) j++;
>>      else
>>      {
>>          assert(a[i].index == b[j].index);
>>          sum += a[i].value * b[j].value;
>>          i++;
>>          j++;
>>      }
>> }
>> return sum;
>> ---
>
> That does redundant checking. There's a better baseline:
>
> ---
> if (a.length == 0 || b.length == 0)
>     return 0;
> const amax = a.length - 1, bmax = b.length - 1;
> size_t i,j = 0;
> double sum = 0;
> for (;;)
> {
>     if (a[i].index < b[j].index) {
>         if (i++ == amax) break;
>     }
>     else if (a[i].index > b[j].index) {
>         bumpJ: if (j++ == bmax) break;
>     }
>     else
>     {
>         assert(a[i].index == b[j].index);
>         sum += a[i].value * b[j].value;
>         if (i++ == amax) break;
>         goto bumpJ;
>     }
> }
> return sum;
> ---

I'll check that.

> Then if you add the sentinel you only need the bounds tests in 
> the third case.

I'll post the sentinel code later. There is probably something to 
improve there as well.
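For reference, a sentinel version might look roughly like this (my sketch of the idea Andrei describes, not the code I'll post; Pair mirrors the benchmark's element type). Appending an entry with index size_t.max to both arrays guarantees the loop can only terminate in the third branch, so the first two branches need no bounds checks:

```d
struct Pair { size_t index; double value; } // mirrors the benchmark's type

// Sketch of the sentinel idea: the caller must have appended
// Pair(size_t.max, 0) to both arrays, so i and j can be advanced
// without bounds checks; the scan can only stop in the "equal" branch.
double dotSentinel(Pair[] a, Pair[] b)
{
    size_t i = 0, j = 0;
    double sum = 0;
    for (;;)
    {
        if (a[i].index < b[j].index) i++;
        else if (a[i].index > b[j].index) j++;
        else
        {
            if (a[i].index == size_t.max) break; // both sentinels reached
            sum += a[i].value * b[j].value;
            i++;
            j++;
        }
    }
    return sum;
}
```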

>> BTW the effects vary greatly for different compilers.
>> For example with dmd the optimized version is slowest. The 
>> baseline is
>> best. Weird. With gdc the optimized is best and gdc's code is 
>> always
>> faster than dmd's code. With ldc it's really strange. Slower 
>> than dmd. I
>> assume I'm doing something wrong here.
>>
>> Used compiler flags
>> dmd v2.071.0
>> -wi -dw -g -O -inline -release -noboundscheck
>> gdc (crosstool-NG 203be35 - 20160205-2.066.1-e95a735b97) 5.2.0
>> -Wall  -g -O3 -fomit-frame-pointer -finline-functions -frelease
>> -fno-bounds-check -ffast-math
>> ldc (0.15.2-beta2) based on DMD v2.066.1 and LLVM 3.6.1
>> -wi -dw -g -O3 -enable-inlining -release -boundscheck=off
>>
>> Am I missing some flags?
>
> These look reasonable.

But ldc's numbers look surprisingly bad.
Any comments from ldc users or developers? I see this in many 
other measurements as well. I would love to have a second 
compiler that produces code as efficient as gdc's.
For example, what is the ldc equivalent of gdc's -ffast-math?

>> I uploaded my plots.
>> - running time 
>> https://www.scribd.com/doc/312951947/Running-Time
>> - speed up https://www.scribd.com/doc/312951964/Speedup
>
> What is dot2? Could you please redo the experiments with the 
> modified code as well?

dot2 is an optimization for jumping over gaps more quickly: it 
replaces the first two if statements with while loops.
My benchmark inputs have no large gaps, though, and interestingly 
it still makes things worse there.
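To make that concrete, dot2 would be roughly the following (my reconstruction from the description above, not the original code; Pair mirrors the benchmark's element type):

```d
struct Pair { size_t index; double value; } // mirrors the benchmark's type

// Sketch of dot2: the first two branches become while loops that
// skip a whole run of smaller indices before re-comparing.
double dot2(Pair[] a, Pair[] b)
{
    size_t i = 0, j = 0;
    double sum = 0;
    while (i < a.length && j < b.length)
    {
        while (i < a.length && a[i].index < b[j].index) i++;
        if (i >= a.length) break;
        while (j < b.length && b[j].index < a[i].index) j++;
        if (j >= b.length) break;
        if (a[i].index == b[j].index)
        {
            sum += a[i].value * b[j].value;
            i++;
            j++;
        }
    }
    return sum;
}
```

On inputs without large gaps each inner while loop usually runs one iteration, so the extra loop overhead can plausibly outweigh any skipping benefit.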

Jens


More information about the Digitalmars-d-announce mailing list