Improving std.algorithm.find

Jonathan M Davis jmdavisprog at gmail.com
Sat Jul 17 21:19:37 PDT 2010


On Saturday 17 July 2010 15:55:26 Andrei Alexandrescu wrote:
> I was thinking of improving std.find. We have this bug report:
> 
> http://d.puremagic.com/issues/show_bug.cgi?id=3923
> 
> which is pretty vague but does have a point.
> 
> For starters, it should be told that std.algorithm.find _does_ a lot,
> which at least partially justifies its complexity. One thing that I
> haven't seen other find()s doing is to be able to search in one pass a
> given range for multiple other ranges, like this:
> 
> int[] a = [ 1, 4, 2, 3 ];
> assert(find(a, 5, 4, 3) == tuple([ 4, 2, 3 ], 2));
> 
> When passed more than two arguments, find returns a tuple continuing the
> searched ranged positioned on the element found and the 1-based index of
> the parameter that was found. The trick is that find() makes exactly one
> pass through the searched range, which is often more efficient than
> searching the same range for each element in turn. Also the one-pass
> approach works with input ranges so it doesn't put pressure on the range
> capabilities.
> 
> However the simplified find() looks like, I'd like to keep this feature
> unless it brings serious aggravation. Right now it's the #1 factor that
> complicates find's signature and implementation.
> 
> Another aspect I'd like to discuss is use of Boyer-Moore and related
> fast finding techniques. Currently the use of B-M is explicit and a bit
> awkward. I was thinking instead to automatically detect its
> appropriatedness and use it transparently. Bearophile posted at a link
> to such a blend string search routine
> (http://effbot.org/zone/stringlib.htm) that I think can be generalized
> quite easily.
> 
> Any ideas - please chime in.
> 
> 
> Andrei

I think that the problem with find() is not so much find() but it's documentation. 
In all honesty, anything with a return type like FindResult!(Range, Ranges) is 
going to scare people off. The find() that takes only a predicate and the range to 
search in is much better:

Range find(alias pred, Range)(Range haystack);

Now, unfortunately, I'm not sure how to improve the main version of find other 
than the return type.

FindResult!(Range,Ranges) find(alias pred = "a == b", Range, Ranges...)(Range 
haystack, Ranges needles);

becomes

Range find(alias pred = "a == b", Range, Ranges...)(Range haystack, Ranges 
needles);

However, that (alias pred = "a == b", Range, Ranges...) is still pretty ugly. 
Unfortunately, I'm not sure how you would eliminate that. The only part that the 
caller would normally care about is alias pred = "a == b", but the rest is part 
of the template definition as well, and I don't think that we have a way to clean 
that up at the moment.

The other problem is that you're getting two totally different return types 
depending on whether you give it one needle or many (which may make Range not 
work as a return type anyway since what you get in the case of multiple needs 
isn't exactly a Range). So, I'd suggest either making it so that they return the 
same thing and you have to do a comparison to figure out which needle was found 
(in many cases, all you care about is that one of them was found anyway, but it 
would be less efficient in cases where you do care), or you should split find() 
into two functions - one which takes a single needle and one which takes many. 
That should make things a bit clearer. It doesn't really change the 
functionality, but the docs would be less cluttered. Another option would be to 
make find only return a range for multiple needles but have a separate function 
which returns a tuple if you really care, thereby simplifying find() a bit. In 
such a case, it would probably still be best to separate find() itself into two 
functions, but it wouldn't be as critical.

As for integrating BoyerMoore into find(), if you can do it cleanly and 
seemlessly, I'm all for it. It would be good if there were a note about it in 
the documentation so that those using it wouldn't feel the need to reimplement 
it themselves, but I'm all for having versions of find() be specialized for the 
type that they're searching on if it can be done seemlessly. I'd even suggest 
doing the same for the various container types which could benefit from a 
specialized find - though perhaps those would be better served by having a 
special version of find() as part of their definition which std.algorithm.find() 
then calls.

If anything, I've been more interested in canFind() and until() being made to 
match up with find(). I'd like to be able to give them both pretty much the same 
arguments and then get the bool from canFind() and the range that find() would 
have walked over in the case of until(). A function which gave you both the 
range that until() would have given you as well as the one which find() would 
have given you would be nice as well. I previously opened a bug with thoughts 
along those lines: http://d.puremagic.com/issues/show_bug.cgi?id=3888

In any case, I think that find() itself is more or less fine from a usage 
standpoint. It's the docs that need help. And splitting up the function and 
trying to simplify the signature in the docs (whether the signature itself is 
really simplified or not) is what really needs to happen. The other thing would 
be to add a lot more examples. That would be a big boon for a lot of 
std.algorithm functions. They're not necessarily hard to use, and the examples 
show it much more clearly than the signatures often do. The problem is that the 
signatures are so ugly. And the fact that they have to be (in terms of the 
actual function if not the documentation) is pretty much required given what 
they do, so I'm not sure how much that can really be fixed. Maybe some extensions 
to DDoc need to be done for it happen, but it would be good if we could find a 
way to give simplified function signatures for a lot of these functions (possibly 
with a way in the docs to see the full function signature) so that users of the 
docs can see what they need to to use the functions without having to wade 
through the nasty (albeit necessarily so) signatures.

- Jonathan M Davis


More information about the Digitalmars-d mailing list