C++Now! 2012 slides

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Fri Jun 8 15:34:40 PDT 2012


On 6/7/12 6:50 PM, Peter Alexander wrote:
> On Thursday, 7 June 2012 at 22:29:09 UTC, Andrei Alexandrescu wrote:
>> Great points, example could be a lot better. Maybe it's time you do
>> get started on find(). Destroy or be destroyed.
>
> Ok...
>
> This overload:
>
> R find(alias pred = "a == b", R, E)(R haystack, E needle)
> if (isInputRange!R &&
>     is(typeof(binaryFun!pred(haystack.front, needle)) : bool))
> {
>     for (; !haystack.empty; haystack.popFront())
>     {
>         if (binaryFun!pred(haystack.front, needle)) break;
>     }
>     return haystack;
> }
>
>
> Is fine. In fact it's practically perfect. It's the obvious solution to
> a simple problem. The biggest problem is the "is typeof" syntax, but
> that's not your fault.

So far so good.

> Second overload:
>
> R1 find(alias pred = "a == b", R1, R2)(R1 haystack, R2 needle)
> if (isForwardRange!R1 && isForwardRange!R2
>     && is(typeof(binaryFun!pred(haystack.front, needle.front)) : bool)
>     && !isRandomAccessRange!R1)
> {
>     static if (is(typeof(pred == "a == b")) && pred == "a == b" &&
>                isSomeString!R1 && isSomeString!R2
>                && haystack[0].sizeof == needle[0].sizeof)
>     {
>         //return cast(R1) find(representation(haystack), representation(needle));
>         // Specialization for simple string search
>         alias Select!(haystack[0].sizeof == 1, ubyte[],
>             Select!(haystack[0].sizeof == 2, ushort[], uint[]))
>             Representation;
>         // Will use the array specialization
>         return cast(R1) .find!(pred, Representation, Representation)
>             (cast(Representation) haystack, cast(Representation) needle);
>     }
>     else
>     {
>         return simpleMindedFind!pred(haystack, needle);
>     }
> }
>
> As far as I can tell, this is a proxy for various substring search
> implementations.
>
> Problem 1: Why is find overloaded to do substring search? Why does it
> do substring and not subset or subsequence? Substring search is a
> completely different algorithm from linear search and even has a
> different asymptotic running time. This is needlessly overloaded; it
> adds nothing.

This overload is an optimization that exploits the fact that comparing 
two strings for equality is the same as comparing their representations.
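The equivalence being exploited can be illustrated with `std.string.representation` (a minimal sketch of the idea, not the Phobos internals):

```d
import std.string : representation;

void main()
{
    string a = "hello";
    string b = "hello";
    // Two strings compare equal exactly when their raw code-unit
    // representations compare equal, so find() may delegate to a fast
    // array search over ubyte[]/ushort[]/uint[] without changing results.
    assert((a == b) == (a.representation == b.representation));
}
```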

> Problem 2: The attempted specialisation is atrocious. It compares the
> predicate string with "a == b".

Yah, that's to detect the defaulted parameter (not to detect all 
comparisons for equality). I should have defined a separate overload 
that took one less parameter, but back when I wrote find() there were 
some compiler bugs that prevented me from doing so.

> I just did a quick check, and this means
> that these two calls DO NOT use the same algorithm:
>
> string a = ..., b = ...;
> auto fast = find!"a == b"(a, b);
> auto slow = find!"a==b"(a, b);
>
> I have to be honest, I have only just noticed this, but that's really
> sloppy.

Again, the point there is to detect the default (most frequently used), 
not all equality comparisons (which would be a much harder task). 
Probably it's worth revisiting this implementation at some point and 
doing it the right way: an overload of find() for strings when a lambda 
is not specified.
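Such an overload might look roughly like this (a hypothetical sketch; `findString` and its body are illustrative, not Phobos code):

```d
import std.traits : isSomeString;

// Hypothetical: selected only when no predicate is supplied, so the
// "was the default predicate used?" question never has to be answered
// by comparing source-code strings like "a == b".
R1 findString(R1, R2)(R1 haystack, R2 needle)
if (isSomeString!R1 && isSomeString!R2)
{
    import std.algorithm : find;
    import std.string : representation;
    // Fast array search on the raw representations, then translate the
    // result back into a slice of the original haystack.
    auto tail = find(haystack.representation, needle.representation);
    return haystack[$ - tail.length .. $];
}
```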

> It's also a direct symptom of over-complicated code. As a user, I would
> 100% expect these calls to be the same.

Why? Would you expect that these calls are also the same?

find!((a, b) => a == b)(a, b);
find!((a, b) => b == a)(a, b);
find!((a, b) => !(b != a))(a, b);
auto lambda = (a, b) => a == b;
find!lambda(a, b);

I'm sure you can figure out how difficult the task of detecting all 
semantically equivalent lambdas is. The point there is, again, to 
optimize the frequent case where people just call find(a, b). That's it.

> If I accidentally used the
> second version, I would have no warning: my code would just run slower
> and I'd be none the wiser. Only upon careful inspection of the source
> could you discover what this "simple" code does, and you would be
> horrified like I am now.

I don't see any reason that justifies one being horrified.

> The next two overloads of find appear to implement a couple of
> reasonably fast substring searches. Again, it would take me a minute
> or so to figure out exactly what algorithm is being used.

This is actually not bad at all. Standard library implementations are 
very highly leveraged, and writing them in a "simple" manner that a 
beginner can understand immediately is optimizing along the wrong 
dimensions. No language that's actually used does that, including those 
for which performance is traditionally not a primary concern, such as 
Lisp or Haskell.

> After that there's a multi-range find. Seems simple enough, but it's yet
> another overload to consider and I'm not sure it's commonly used enough
> to even warrant existence. I'd hate to think what would happen if I
> wanted the single-range search but accidentally added an extra parameter.

Implementing multi-range find properly is difficult for an end user and 
a natural thing to add to the standard library.

> In summary: the specialisation detection is shocking, which leads me to
> question what other static ifs are accidentally firing or failing. If
> the code was more simple, and less overloaded it would be easy to reason
> about all this, but it's not. The various find implementations span a
> few hundred lines, and all have numerous compile time branches and
> checks. The cyclomatic complexity must be huge.

I don't think you are putting forth good arguments to simplify things. 
Essentially you're saying you want to read the source of the standard 
library but don't care to have it run fast.

> How I'd do things:
>
> - The first version of find would be the only one. No overloads, no
> specialisation.

So out the window go fast string searching and many other simple 
optimizations, such as looking for the last element of a bidirectional 
sequence in a random-access range first.
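That last optimization can be sketched as follows (a hypothetical helper, not Phobos code): in a random-access haystack, probing the candidate window's last element first rejects most positions after a single comparison.

```d
// Hypothetical helper: does needle match haystack at position i?
bool matchesAt(string haystack, string needle, size_t i)
{
    // Probe the last element of the window first; in typical text this
    // rejects the vast majority of candidate positions immediately.
    if (haystack[i + needle.length - 1] != needle[$ - 1])
        return false;
    // Only on a last-element hit do we pay for the full comparison.
    return haystack[i .. i + needle.length] == needle;
}
```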

> - substring would be a separate, non-predicated function. It would be
> specialised for strings. I'm too tired now to think about the variety of
> range specialisations, but I'm sure there's a more elegant way to handle
> to combinations.

So now when we have strings we can use the slow find and the fast 
substring for essentially the same operation? Why? "Well we thought it 
makes the standard library implementation one bit simpler." Ehm.

> - Drop the variadic find altogether.

I guess we could do that, but again, it's difficult for the user to 
implement correctly. Then there's the backwards compatibility to worry 
about.

> - Get rid of BoyerMooreFinder + find overload, replace with a boyerMoore
> function.

That's an experiment that I haven't decided whether to declare 
successful or not. I figured the fundamental optimizations of find(a, b) 
are effected as follows:

1. Add structure to a (e.g. a is sorted, has a trie associated with it, 
etc.);

2. Add structure to b (e.g. KMP, BM, RK);

3. Add structure to both (e.g. search a sorted range in another).

The decision on which algorithm to use is often dictated by this 
concern: where am I willing to focus the preprocessing effort? So I 
thought I'd clarify that in the way find(a, b) is invoked: depending on 
the structure of a and b, different search decisions are taken.
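Case 2 is already expressed at the call site in Phobos: wrapping the needle chooses the preprocessing. A sketch of that usage (the exact slice returned is my reading of the API, so treat the assertion as illustrative):

```d
import std.algorithm : find, boyerMooreFinder;

void main()
{
    auto haystack = "finding a needle in a haystack";
    // Structure is added to b: boyerMooreFinder precomputes the
    // Boyer-Moore tables, and find() dispatches on the wrapper type.
    auto r = find(haystack, boyerMooreFinder("needle"));
    assert(r == "needle in a haystack");
}
```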

I'd been planning to add several other find implementations along these 
lines, but haven't gotten around to it yet.


Andrei

