faster splitter

Andrei Alexandrescu via Digitalmars-d digitalmars-d at puremagic.com
Tue May 31 11:36:51 PDT 2016


On 5/31/16 1:54 PM, qznc wrote:
> On Tuesday, 31 May 2016 at 01:55:16 UTC, Andrei Alexandrescu wrote:
>> I agree it's difficult to characterize the behavior of substring
>> search with one number. There are many dimensions of variation. (But
>> there's no reason for an emotional response.) A few possible baselines
>> come to mind:
>>
>> * Search a long string for a one-character string, match and fail.
>
> There is a special version of find for searching a single char in a
> string. Using a one-letter needle string is more like a user mistake
> than something to optimize for.

Oh, I see what you mean - it's a string consisting only of one 
character, not one character 'x'. I agree, but I'm just saying it's a 
sound baseline.

>> * Take an English text string. Search for a substring consisting of
>> its last portion (e.g. 5% of the length).
>
> How long should the english text be? A Tweet? A book? A Gigabyte of log
> files?

I'm thinking of a few exponentially increasing sizes. e.g. 100, 1000, 
10_000, 100_000, 1_000_000.

> English text means basically ASCII and no Unicode?

My understanding (and correct me if I'm wrong) is that we're discussing 
search with the default predicate, ignoring equivalent grapheme 
representations etc. So there's no need to mind encoding at all, which 
means the text could be in any language. Foreign languages would further 
increase the vocabulary, but that's the only effect.

>> * Take an English text string. Search for a substring consisting of a
>> fraction of the text (e.g. 3%) with additional characters prepended.
>> Repeat for appended.
>
> Why the prepend/append? To force a mismatch?

Yah, and it's a good test for strings that have some equal portion yet 
they are different, which seems to me puts a greater strain on the 
algorithms. It is arguably a likely situation in the real world.


Thanks,

Andrei



More information about the Digitalmars-d mailing list