Improving std.algorithm.find
Jonathan M Davis
jmdavisprog at gmail.com
Sat Jul 17 21:19:37 PDT 2010
On Saturday 17 July 2010 15:55:26 Andrei Alexandrescu wrote:
> I was thinking of improving std.find. We have this bug report:
>
> http://d.puremagic.com/issues/show_bug.cgi?id=3923
>
> which is pretty vague but does have a point.
>
> For starters, it should be told that std.algorithm.find _does_ a lot,
> which at least partially justifies its complexity. One thing that I
> haven't seen other find()s doing is to be able to search in one pass a
> given range for multiple other ranges, like this:
>
> int[] a = [ 1, 4, 2, 3 ];
> assert(find(a, 5, 4, 3) == tuple([ 4, 2, 3 ], 2));
>
> When passed more than two arguments, find returns a tuple continuing the
> searched ranged positioned on the element found and the 1-based index of
> the parameter that was found. The trick is that find() makes exactly one
> pass through the searched range, which is often more efficient than
> searching the same range for each element in turn. Also the one-pass
> approach works with input ranges so it doesn't put pressure on the range
> capabilities.
>
> However the simplified find() looks like, I'd like to keep this feature
> unless it brings serious aggravation. Right now it's the #1 factor that
> complicates find's signature and implementation.
>
> Another aspect I'd like to discuss is use of Boyer-Moore and related
> fast finding techniques. Currently the use of B-M is explicit and a bit
> awkward. I was thinking instead to automatically detect its
> appropriatedness and use it transparently. Bearophile posted at a link
> to such a blend string search routine
> (http://effbot.org/zone/stringlib.htm) that I think can be generalized
> quite easily.
>
> Any ideas - please chime in.
>
>
> Andrei
I think that the problem with find() is not so much find() but it's documentation.
In all honesty, anything with a return type like FindResult!(Range, Ranges) is
going to scare people off. The find() that takes only a predicate and the range to
search in is much better:
Range find(alias pred, Range)(Range haystack);
Now, unfortunately, I'm not sure how to improve the main version of find other
than the return type.
FindResult!(Range,Ranges) find(alias pred = "a == b", Range, Ranges...)(Range
haystack, Ranges needles);
becomes
Range find(alias pred = "a == b", Range, Ranges...)(Range haystack, Ranges
needles);
However, that (alias pred = "a == b", Range, Ranges...) is still pretty ugly.
Unfortunately, I'm not sure how you would eliminate that. The only part that the
caller would normally care about is alias pred = "a == b", but the rest is part
of the template definition as well, and I don't think that we have a way to clean
that up at the moment.
The other problem is that you're getting two totally different return types
depending on whether you give it one needle or many (which may make Range not
work as a return type anyway since what you get in the case of multiple needs
isn't exactly a Range). So, I'd suggest either making it so that they return the
same thing and you have to do a comparison to figure out which needle was found
(in many cases, all you care about is that one of them was found anyway, but it
would be less efficient in cases where you do care), or you should split find()
into two functions - one which takes a single needle and one which takes many.
That should make things a bit clearer. It doesn't really change the
functionality, but the docs would be less cluttered. Another option would be to
make find only return a range for multiple needles but have a separate function
which returns a tuple if you really care, thereby simplifying find() a bit. In
such a case, it would probably still be best to separate find() itself into two
functions, but it wouldn't be as critical.
As for integrating BoyerMoore into find(), if you can do it cleanly and
seemlessly, I'm all for it. It would be good if there were a note about it in
the documentation so that those using it wouldn't feel the need to reimplement
it themselves, but I'm all for having versions of find() be specialized for the
type that they're searching on if it can be done seemlessly. I'd even suggest
doing the same for the various container types which could benefit from a
specialized find - though perhaps those would be better served by having a
special version of find() as part of their definition which std.algorithm.find()
then calls.
If anything, I've been more interested in canFind() and until() being made to
match up with find(). I'd like to be able to give them both pretty much the same
arguments and then get the bool from canFind() and the range that find() would
have walked over in the case of until(). A function which gave you both the
range that until() would have given you as well as the one which find() would
have given you would be nice as well. I previously opened a bug with thoughts
along those lines: http://d.puremagic.com/issues/show_bug.cgi?id=3888
In any case, I think that find() itself is more or less fine from a usage
standpoint. It's the docs that need help. And splitting up the function and
trying to simplify the signature in the docs (whether the signature itself is
really simplified or not) is what really needs to happen. The other thing would
be to add a lot more examples. That would be a big boon for a lot of
std.algorithm functions. They're not necessarily hard to use, and the examples
show it much more clearly than the signatures often do. The problem is that the
signatures are so ugly. And the fact that they have to be (in terms of the
actual function if not the documentation) is pretty much required given what
they do, so I'm not sure how much that can really be fixed. Maybe some extensions
to DDoc need to be done for it happen, but it would be good if we could find a
way to give simplified function signatures for a lot of these functions (possibly
with a way in the docs to see the full function signature) so that users of the
docs can see what they need to to use the functions without having to wade
through the nasty (albeit necessarily so) signatures.
- Jonathan M Davis
More information about the Digitalmars-d
mailing list