My ACCU 2016 keynote video available online

Thu May 19 05:04:31 PDT 2016

On 5/19/16 4:12 AM, Jens Müller wrote:
> The code applying the sentinel optimization assumes mutability of
> the input. That needs to be checked for.

Indeed. As I mentioned after discussing find, I didn't worry about those 
checks assuming they were obvious.

> That's fine for partition
> because that is assumed to be in-place. But for other algorithms it's
> not so obvious. It's sad that the optimization works only for
> non-const input.

Several optimizations only apply to mutable data. Others apply to 
immutable data. It's the way things go.

> I didn't get the idea behind sentinels for sparse dot product. I picked the
> smallest of the last elements (so you need bidirectional ranges) and fix
> up as
> needed. For gdc I get a speedup (baseline over new implementation) of
> 1.2 in
> best case and >1.0 in worst case. On average it's about 1.1 I would say. I
> expected more. How would you approach sentinels with the sparse dot
> product. Can
> you elaborate the idea from the video? I didn't get it.

That's the idea - to only need to bounds check on one of the three 
possibilities.

What test data did you use?

10%-20% win on dot product is significant because for many algorithms 
dot product is a kernel and dominates everything else. For those any win 
goes straight to the bottom line.

> The base line (dot1 in the graphs) is the straightforward version
>
> ---
> size_t i,j = 0;
> double sum = 0;
> while (i < a.length && j < b.length)
> {
>      if (a[i].index < b[j].index) i++;
>      else if (a[i].index > b[j].index) j++;
>      else
>      {
>          assert(a[i].index == b[j].index);
>          sum += a[i].value * b[j].value;
>          i++;
>          j++;
>      }
> }
> return sum;
> ---

That does redundant checking. There's a better baseline:

---
if (a.length == 0 || b.length == 0)
     return 0;
const amax = a.length - 1, bmax = b.length - 1;
size_t i,j = 0;
double sum = 0;
for (;;)
{
     if (a[i].index < b[j].index) {
         if (i++ == amax) break;
     }
     else if (a[i].index > b[j].index) {
         bumpJ: if (j++ == bmax) break;
     }
     else
     {
         assert(a[i].index == b[j].index);
         sum += a[i].value * b[j].value;
         if (i++ == amax) break;
         goto bumpJ;
     }
}
return sum;
---

Then if you add the sentinel you only need the bounds tests in the third 
case.

> BTW the effects vary greatly for different compilers.
> For example with dmd the optimized version is slowest. The baseline is
> best. Weird. With gdc the optimized is best and gdc's code is always
> faster than dmd's code. With ldc it's really strange. Slower than dmd. I
> assume I'm doing something wrong here.
>
> Used compiler flags
> dmd v2.071.0
> -wi -dw -g -O -inline -release -noboundscheck
> gdc (crosstool-NG 203be35 - 20160205-2.066.1-e95a735b97) 5.2.0
> -Wall  -g -O3 -fomit-frame-pointer -finline-functions -frelease
> -fno-bounds-check -ffast-math
> ldc (0.15.2-beta2) based on DMD v2.066.1 and LLVM 3.6.1
> -wi -dw -g -O3 -enable-inlining -release -boundscheck=off
>
> Am I missing some flags?

These look reasonable.

> I uploaded my plots.
> - running time https://www.scribd.com/doc/312951947/Running-Time
> - speed up https://www.scribd.com/doc/312951964/Speedup

What is dot2? Could you please redo the experiments with the modified 
code as well?

Thanks!

Andrei