Improving std.algorithm.find

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Sun Jul 18 18:19:35 PDT 2010


On 07/18/2010 04:30 PM, Philippe Sigaud wrote:
> On Sun, Jul 18, 2010 at 00:55, Andrei Alexandrescu  wrote:
> You could consider than find() is a multiple-returns function: it may
> return 0, 1 or many values (the successive finds). Which can, obviously,
> be easily represented by a range. I see find() as bit like filter(): its
> return type is a range determined by the predicate and whose elements
> are the rest of the input range.
> so:
> find([ 1, 4, 2, 3, 4, 0 ], 4) returns a range whose elements are
> [4,2,3,4,0] and [4,0].
>
> If no needle is present in haystack, then the result range is empty.
> It's just it's an empty range of ranges, not an empty version of the
> original range.
>
> A complement information can be how many steps were taken (that is, the
> index of the found elements):  ([4,2,3,4,0], 1), ([4,0], 4). A policy or
> another function could take care of that: findIndexed(haystack, needle)
> or find!(withIndex)(haystack, needle).

I agree that a findAll function yielding a range is nice, however it's 
not the simplest interface. I think there should be a function that just 
finds something somewhere.

> Now, for many needles... Currently, you pass the predicate "a==b" to
> find() and use it on all needles. A possible generalization is to have a
> isOneOf() predicate, called with isOneOf(a, b0, b1, ... bn) where a is
> the current element being tested and the b's are the needles passed to
> find(). This generic calling of pred with (a, needles) could be done for
> any predicate, like for example:
>
> find!(isBetween)(haystack, 0, 10)

Aren't such scenarios better served by putting the limits inside the 
predicate? Otherwise the list of what can be done could go on forever.

> The limit of this is you cannot easily get the index of the found
> needle. btw, could you remind us why you used a 1-based index? My
> opinion on this is that if you want to find any needle, then you mostly
> care for them being present, and much less for wich one is the first.

I used one-based index to make tests easier - as you said, most often 
you care for the presence so zero/nonzero is easiest to tell. Then I 
thought it wouldn't hurt to provide the index since it's not any extra work.

> Also, the found element is accessible by taking the front of the retuned
> range. Why provide it to the caller?

Because that doesn't generalize well to searching ranges (as oposed to 
individual elements).

> Another problem I see is to get the nice current behavior of having a
> needle being a range like haystack:
>
> find([0,1,2,3,4,3,2,1], [4,3]) => [4,3,2,1]
>
> And that's I particularly like.

Not sure I understand whether that's good or bad, but I also want to 
provide the likes of findSkip() (it's been discussed here under a 
different name, findConsume) that positions the searched range after the 
found range:

findSkip([0,1,2,3,4,3,2,1], [4,3]) => [2,1]

It's very useful in practice.

> *****
>      Another aspect I'd like to discuss is use of Boyer-Moore and
> related fast finding techniques. Currently the use of B-M is explicit
> and a bit awkward. I was thinking instead to automatically detect its
> appropriatedness and use it transparently. Bearophile posted at a link
> to such a blend string search routine
> (http://effbot.org/zone/stringlib.htm) that I think can be generalized
> quite easily.
> *****
>
> Indeed, reading this, it seems like it could be generalized easily. You
> need a random-access range with a defined length. So the usage of such
> an algo could indeed be done transparently. Hmm...

The current B-M implementation already works with any random-access 
range with length, but it's not integrated transparently with the 
regular find. I wonder what the threshold conditions might look like.

Andrei


More information about the Digitalmars-d mailing list