faster splitter

Tue May 24 05:51:23 PDT 2016

On 05/24/2016 03:54 AM, qznc wrote:
> On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote:
>> On 05/23/2016 03:11 PM, qznc wrote:
>>> Actually, std find should be faster, since it could use the Boyer Moore
>>> algorithm instead of naive string matching.
>>
>> Conventional wisdom has it that find() is brute force and that's that,
>> but probably it's time to destroy. Selectively using advanced
>> searching algorithms for the appropriate inputs is very DbI-ish.
>>
>> There are a few nice precedents of blend algorithms, see e.g.
>> http://effbot.org/zone/stringlib.htm.
>>
>> Writing a generic subsequence search blend algorithm, one that chooses
>> the right algorithm based on a combination of static and dynamic
>> decisions, is quite a fun and challenging project. Who wanna?
>
> I observed that Boyer-Moore from std is still slower. My guess is due to
> the relatively short strings in my benchmark and the need to allocate
> for the tables. Knuth-Morris-Pratt might be worthwhile, because only
> requires a smaller table, which could be allocated on the stack.

There's also Rabin-Karp that can be used when the sought range is 
relatively short.

> The overall sentiment seems to be that KMP vs BM depends on the input.
> This means an uninformed find function could only use heuristics.
> Anyway, Phobos could use a KnuthMorrisPrattFinder implementation next to
> the BoyerMooreFinder. I filed an issue [0].

Great. I keep on thinking how the advanced algorithms should "kick in" 
only after a partial match was long enough and frequent enough. As long 
as the partial match is short and infrequent, brute force will do best.

> Apart from advanced algorithms, find should not be slower than a naive
> nested-for-loop implementation.
>
> [0] https://issues.dlang.org/show_bug.cgi?id=16066

Nice, thanks.

Andrei