Improving std.algorithm.find

Sun Jul 18 14:30:10 PDT 2010

On Sun, Jul 18, 2010 at 00:55, Andrei Alexandrescu  wrote:
*****
    I was thinking of improving std.find.

    For starters, it should be told that std.algorithm.find _does_ a lot,
which at least partially justifies its complexity. One thing that I haven't
seen other find()s doing is to be able to search in one pass a given range
for multiple other ranges, like this:

    int[] a = [ 1, 4, 2, 3 ];
    assert(find(a, 5, 4, 3) == tuple([ 4, 2, 3 ], 2));

    When passed more than two arguments, find returns a tuple continuing the
searched ranged positioned on the element found and the 1-based index of the
parameter that was found. The trick is that find() makes exactly one pass
through the searched range, which is often more efficient than searching the
same range for each element in turn. Also the one-pass approach works with
input ranges so it doesn't put pressure on the range capabilities.
*****

You could consider than find() is a multiple-returns function: it may return
0, 1 or many values (the successive finds). Which can, obviously, be easily
represented by a range. I see find() as bit like filter(): its return type
is a range determined by the predicate and whose elements are the rest of
the input range.
so:
find([ 1, 4, 2, 3, 4, 0 ], 4) returns a range whose elements are [4,2,3,4,0]
and [4,0].

If no needle is present in haystack, then the result range is empty. It's
just it's an empty range of ranges, not an empty version of the original
range.

A complement information can be how many steps were taken (that is, the
index of the found elements):  ([4,2,3,4,0], 1), ([4,0], 4). A policy or
another function could take care of that: findIndexed(haystack, needle) or
find!(withIndex)(haystack, needle).

The advantage of returning a range is the possibility to iterate on the
results. I, for one, like the idea of having find() in the same family that
workhorses like map() or filter()... The drawback is the necessity to test
the result with empty to know if you had results, but that's not different
from the current version.

So, you could have a findFirst() that does not return a range, but one
element, like the current find() does. Or call what I suggest findAll().

Now, for many needles... Currently, you pass the predicate "a==b" to find()
and use it on all needles. A possible generalization is to have a isOneOf()
predicate, called with isOneOf(a, b0, b1, ... bn) where a is the current
element being tested and the b's are the needles passed to find(). This
generic calling of pred with (a, needles) could be done for any predicate,
like for example:

find!(isBetween)(haystack, 0, 10)

where isBetween(a, left, right) returns true if (left <= a && a <= right).

The default would be "a==b" for one needle and isOneOf(a, b...) if
needles.length > 1. So you still get the current behavior, with a bit of
possible customization thrown in, like the isBetween example. I don't think
that's possible today?

The limit of this is you cannot easily get the index of the found needle.
btw, could you remind us why you used a 1-based index? My opinion on this is
that if you want to find any needle, then you mostly care for them being
present, and much less for wich one is the first. Also, the found element is
accessible by taking the front of the retuned range. Why provide it to the
caller?

Another problem I see is to get the nice current behavior of having a needle
being a range like haystack:

find([0,1,2,3,4,3,2,1], [4,3]) => [4,3,2,1]

And that's I particularly like.

*****
    Another aspect I'd like to discuss is use of Boyer-Moore and related
fast finding techniques. Currently the use of B-M is explicit and a bit
awkward. I was thinking instead to automatically detect its appropriatedness
and use it transparently. Bearophile posted at a link to such a blend string
search routine (http://effbot.org/zone/stringlib.htm) that I think can be
generalized quite easily.
*****

Indeed, reading this, it seems like it could be generalized easily. You need
a random-access range with a defined length. So the usage of such an algo
could indeed be done transparently. Hmm...

Philippe
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.puremagic.com/pipermail/digitalmars-d/attachments/20100718/35e7c66e/attachment.html>