Improving std.algorithm.find

Mon Jul 19 12:12:42 PDT 2010

On Mon, Jul 19, 2010 at 03:19, Andrei Alexandrescu <
SeeWebsiteForEmail at erdani.org> wrote:

>
> I agree that a findAll function yielding a range is nice, however it's not
> the simplest interface. I think there should be a function that just finds
> something somewhere.
>
>
Yes, I agree that's the common case.
What I personally would like is a way to find an element in an ordered
container (binary tree...)

>
>  Now, for many needles... Currently, you pass the predicate "a==b" to
>> find() and use it on all needles. A possible generalization is to have a
>> isOneOf() predicate, called with isOneOf(a, b0, b1, ... bn) where a is
>> the current element being tested and the b's are the needles passed to
>> find(). This generic calling of pred with (a, needles) could be done for
>> any predicate, like for example:
>>
>> find!(isBetween)(haystack, 0, 10)
>>
>
> Aren't such scenarios better served by putting the limits inside the
> predicate? Otherwise the list of what can be done could go on forever.

The problem is that, as the predicate is passed at compile-time, the limits
must be CT-defined too. I'd like the limits to be defined at runtime. I run
into this regularly, while using predicates in std.algorithms.

A not-so-uncommon scenario for me is having one range delivering limits (or
zones to explore, or any other runtime arguments), and finding a value
inside those limits in another range. Ideally, I'd like to use find() on a
spatial structure like an R-tree. ( http://en.wikipedia.org/wiki/R-tree ).

But I admit it's not that common.

>
>  The limit of this is you cannot easily get the index of the found
>> needle. btw, could you remind us why you used a 1-based index? My
>> opinion on this is that if you want to find any needle, then you mostly
>> care for them being present, and much less for wich one is the first.
>>
>
> I used one-based index to make tests easier - as you said, most often you
> care for the presence so zero/nonzero is easiest to tell. Then I thought it
> wouldn't hurt to provide the index since it's not any extra work.

Ah, I thought the scenario was: 1) test for the returned range (as the
tuple._0 element) being empty or not. 2) if not empty, something was found,
get the index. I forgot getting 0 as index would mean 'no element found'.
If I understood you correctly, then I'd prefer a -1 index indicating
failure, and a 0-based indexing on the needles if on is found. But that's a
cosmetic issue.

>
>  Also, the found element is accessible by taking the front of the retuned
>> range. Why provide it to the caller?
>>
>
> Because that doesn't generalize well to searching ranges (as oposed to
> individual elements).

Good point.

>
>  Another problem I see is to get the nice current behavior of having a
>> needle being a range like haystack:
>>
>> find([0,1,2,3,4,3,2,1], [4,3]) => [4,3,2,1]
>>
>> And that's I particularly like.
>>
>
> Not sure I understand whether that's good or bad,

That's good. I mean, the current situation is good and that's a
functionality I like.

> but I also want to provide the likes of findSkip() (it's been discussed
> here under a different name, findConsume) that positions the searched range
> after the found range:
>
> findSkip([0,1,2,3,4,3,2,1], [4,3]) => [2,1]
>
> It's very useful in practice.
>

I see that. And and empty range if it found nothing?
Another useful one is split() to cut the range in two or three parts :
before the match, the match itself (which can be one of the needles, so we
do not know it in advance) and the part after that. Return that triple as a
tuple. Hey, that way you can even implement replace: return chain(firstpart,
replacedmatch, tailpart).

Anything that can mimic regex functions on generic ranges is good, IMO. I
regularly find myself wanting to do some regex-like matching. We do not need
real pattern matching but even with simplified predicates, we can do many
things.

>
>  *****
>>
>>     Another aspect I'd like to discuss is use of Boyer-Moore and
>> related fast finding techniques. Currently the use of B-M is explicit
>> and a bit awkward. I was thinking instead to automatically detect its
>> appropriatedness and use it transparently. Bearophile posted at a link
>> to such a blend string search routine
>> (http://effbot.org/zone/stringlib.htm) that I think can be generalized
>> quite easily.
>> *****
>>
>> Indeed, reading this, it seems like it could be generalized easily. You
>> need a random-access range with a defined length. So the usage of such
>> an algo could indeed be done transparently. Hmm...
>>
>
> The current B-M implementation already works with any random-access range
> with length, but it's not integrated transparently with the regular find. I
> wonder what the threshold conditions might look like.
>

I'm sorry I didn't look at your implementation, as I don't know Boyer-Moore
that much. Does it need to know the alphabet size or do you use another
variant?

Philippe
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.puremagic.com/pipermail/digitalmars-d/attachments/20100719/9943690f/attachment.html>