[up for grabs]Two-way string searching

monarch_dodra monarchdodra at gmail.com
Mon Jun 17 04:36:10 PDT 2013


On Sunday, 16 June 2013 at 23:09:31 UTC, Andrei Alexandrescu 
wrote:
> On 6/16/13 6:16 PM, Dmitry Olshansky wrote:
>> It have long bugged me that Phobos std.algorithm.find is slow. 
>> Far
>> slower then strstr (esp on *nix where it competes against 
>> GLIBC[1]).
>>
>> The current state of the art in searching substring
>> with O(1) space requirement is Two-Way algorithm:
>>
>> http://www-igm.univ-mlv.fr/~lecroq/string/node26.html
>>
>> which is linear in time.
>>
>> I could imagine it would be interesting to implement a generic 
>> version
>> as well. Any takers? I'd gladly go for it myself but 
>> unfortunately way
>> too busy on other fronts (I planed to do it for a couple of 
>> months
>> already).
>>
>> [1] See a relevant bug report and discussion e.g on glibc
>> http://sourceware.org/bugzilla/show_bug.cgi?id=5514
>
> Awesome idea! Could you please submit an enhancement request 
> for this? I wonder what kind of ranges that could work on.
>
> Andrei

One of the "problems" is that find is array agnostic, so doesn't 
know how to squeeze out all the juice out arrays, such as:
* No bounds check on accesses
* No bounds check on slicing
* vectorized comparisons

I took the existing find(RA, BD) code, and modified it to operate 
on find(ARR, ARR).

On average, I'm getting roughly 20%-25% performance improvements 
(-release -O -inline), although the result is of course highly 
dependent on the tested input.

Goes to say that by addapting the existing algorithm to simply 
better exploit arrays, there is already room for good 
improvements.

Given that string-to-same-width-string boils back down integral 
array search, the gains would also be had for strings.

--------
I was also able to squeeze out similar performace boosts for 
find(R, E), with minimal code changes, exploiting better 
iteration semantics based on the type iterated (range, RA array, 
or narrow string).

--------
I can start by improving find(R, E), because it is a small but 
easy and effective change.

For find(R, R), things are a bit more dicey to properly 
integrate, so I don't want to do anything right now.

But the point is that there is still room for substantial 
improvements without modifying the algorithm too much...


More information about the Digitalmars-d mailing list