Let's avoid range specializations - make ranges great again!

Sat Jun 4 10:35:23 PDT 2016

In Phobos (especially in std.algorithm) a lot of specialization 
between a RandomAccessRange and an InputRange are used even 
though only an InputRange is required.
The reason for this is solely performance - imho we should start 
to tackle this and remove such specializations. Let's create 
better code!

To quote David from the recent PR 
(https://github.com/dlang/phobos/pull/4265) where adding a 
identity specialization achieves a 5-10x speedup for 
{min,max}Element.

> I think that we have a larger issue at hand here. Ranges are of 
> prime strategic importance for D as an accessible way of 
> composing complex operations out of reusable primitives. A 
> large part of this comes from the fact that they are supposed 
> to be zero-cost abstractions – after the compiler optimiser has 
> had its way with a piece of code, we expect it to be equivalent 
> to a hand-written loop.

1) Should it matter whether generic range looping or 
"specialized" r[i] access is used?

Answer for dmd it does (7x slower), ldc does the job (nearly) 
correctly.

```
> dmd -release -O -boundscheck=off test_looping.d && 
> ./test_looping
random_access 685 ms, 351 μs, and 9 hnsecs
foreach       685 ms, 888 μs, and 4 hnsecs
foreach_ref   685 ms, 654 μs, and 4 hnsecs
for_range_api 7 secs, 211 ms, 530 μs, and 3 hnsecs

```
> ldc -release -O3 -boundscheck=off test_looping.d && 
> ./test_looping
random_access 142 ms and 88 μs
foreach       142 ms, 8 μs, and 3 hnsecs
foreach_ref   142 ms, 400 μs, and 9 hnsecs
for_range_api 143 ms, 394 μs, and 1 hnsec
```

Benchmarking code: 
https://github.com/wilzbach/perf-ranges/blob/master/test_looping.d
Bug report for DMD: https://issues.dlang.org/show_bug.cgi?id=16120

(Tested with both ldc 0.17 and 1.0)

2) Should it matter whether I use r.save or r[i..$]?

Answer: it does for dmd and ldc.

```
> dmd -release -O -boundscheck=off test_save.d && ./test_save
random_access  424 ms and 72 μs
foreach_random 420 ms, 754 μs, and 5 hnsecs
for_range_api  2 secs, 562 ms, 415 μs, and 5 hnsecs

> ldc -release -O3 -boundscheck=off test_save.d && ./test_save
random_access  245 ms, 713 μs, and 4 hnsecs
foreach_random 241 ms, 533 μs, and 1 hnsec
for_range_api  349 ms, 582 μs, and 9 hnsecs
```

Benchmarking code: 
https://github.com/wilzbach/perf-ranges/blob/master/test_save.d

Open question: how can we make DMD/LDC smarter, so that they 
inline range primitives and the promise about the zero-cost 
abstractions is full-filled?