[up for grabs]Two-way string searching

Dmitry Olshansky dmitry.olsh at gmail.com
Mon Jun 17 10:13:37 PDT 2013


17-Jun-2013 15:36, monarch_dodra пишет:
> On Sunday, 16 June 2013 at 23:09:31 UTC, Andrei Alexandrescu wrote:
>> On 6/16/13 6:16 PM, Dmitry Olshansky wrote:
>>> It have long bugged me that Phobos std.algorithm.find is slow. Far
>>> slower then strstr (esp on *nix where it competes against GLIBC[1]).
>>>
>>> The current state of the art in searching substring
>>> with O(1) space requirement is Two-Way algorithm:
>>>
>>> http://www-igm.univ-mlv.fr/~lecroq/string/node26.html
>>>
>>> which is linear in time.
>>>
>>> I could imagine it would be interesting to implement a generic version
>>> as well. Any takers? I'd gladly go for it myself but unfortunately way
>>> too busy on other fronts (I planed to do it for a couple of months
>>> already).
>>>
>>> [1] See a relevant bug report and discussion e.g on glibc
>>> http://sourceware.org/bugzilla/show_bug.cgi?id=5514
>>
>> Awesome idea! Could you please submit an enhancement request for this?
>> I wonder what kind of ranges that could work on.
>>
>> Andrei
>
> One of the "problems" is that find is array agnostic, so doesn't know
> how to squeeze out all the juice out arrays, such as:
> * No bounds check on accesses
> * No bounds check on slicing
> * vectorized comparisons
>
> I took the existing find(RA, BD) code, and modified it to operate on
> find(ARR, ARR).
>
> On average, I'm getting roughly 20%-25% performance improvements
> (-release -O -inline), although the result is of course highly dependent
> on the tested input.
>
> Goes to say that by addapting the existing algorithm to simply better
> exploit arrays, there is already room for good improvements.
>

Agreed and there are many tricks beyond that. Like making a repeated bit 
pattern and searching one machine word at a time (if element size < 
size_t.sizeof).

And this is all nice and have its place too but I'm mostly talking about 
O(N*M) vs O(N).

Anyway, filed:
http://d.puremagic.com/issues/show_bug.cgi?id=10392


> Given that string-to-same-width-string boils back down integral array
> search, the gains would also be had for strings.
>
> --------
> I was also able to squeeze out similar performace boosts for find(R, E),
> with minimal code changes, exploiting better iteration semantics based
> on the type iterated (range, RA array, or narrow string).
>
> --------
> I can start by improving find(R, E), because it is a small but easy and
> effective change.
>
> For find(R, R), things are a bit more dicey to properly integrate, so I
> don't want to do anything right now.
>
> But the point is that there is still room for substantial improvements
> without modifying the algorithm too much...


-- 
Dmitry Olshansky


More information about the Digitalmars-d mailing list