Major performance problem with std.array.front()

Mon Mar 10 14:44:22 PDT 2014

On 3/7/2014 8:40 AM, Michel Fortin wrote:
> On 2014-03-07 03:59:55 +0000, "bearophile" <bearophileHUGS at lycos.com> said:
>
>> Walter Bright:
>>
>>> I understand this all too well. (Note that we currently have a
>>> different silent problem: unnoticed large performance problems.)
>>
>> On the other hand your change could introduce Unicode-related bugs in
>> future code (that the current Phobos avoids) (and here I am not
>> talking about code breakage).
>
> The way Phobos works isn't any more correct than dealing with code
> units. Many graphemes span on multiple code points -- because of
> combined diacritics or character variant modifiers -- and decoding at
> the code-point level is thus often insufficient for correctness.
>

Well, it is *more* correct: with current Phobos, text in many western
languages is more likely to "just work" in most cases. It's just that
things still aren't completely correct overall.
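
To make the distinction concrete, here is a small illustration. It is
written in Python rather than D purely so that it is self-contained, and
`counts` is a hypothetical helper, not anything from Phobos. It compares
the precomposed and decomposed spellings of the same accented letter:

```python
import unicodedata

def counts(text):
    """Return (UTF-8 code units, code points) for text."""
    return len(text.encode("utf-8")), len(text)

precomposed = "\u00e9"  # 'é' stored as a single code point
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

print(counts(precomposed))  # (2, 1)
print(counts(decomposed))   # (3, 2)

# Both spellings are the same user-perceived character.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Code-point decoding handles the first spelling as one unit, which is the
"just works" case, but splits the second in two, which is the case Michel
describes.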

> From my experience, I'd suggest these basic operations for a "string
> range" instead of the regular range interface:
>
> .empty
> .frontCodeUnit
> .frontCodePoint
> .frontGrapheme
> .popFrontCodeUnit
> .popFrontCodePoint
> .popFrontGrapheme
> .codeUnitLength (aka length)
> .codePointLength (for dchar[] only)
> .codePointLengthLinear
> .graphemeLengthLinear
>
> Someone should be able to mix all the three 'front' and 'pop' function
> variants above in any code dealing with a string type. In my XML parser
> for instance I regularly use frontCodeUnit to avoid the decoding penalty
> when matching the next character with an ASCII one such as '<' or '&'.
> An API like the one above forces you to be aware of the level you're
> working on, making bugs and inefficiencies stand out (as long as you're
> familiar with each representation).
>
> If someone wants to use a generic array/range algorithm with a string,
> my opinion is that he should have to wrap it in a range type that maps
> front and popFront to one of the above variant. Having to do that should
> make it obvious that there's an inefficiency there, as you're using an
> algorithm that wasn't tailored to work with strings and that more
> decoding than strictly necessary is being done.
>

I actually like this suggestion quite a bit.
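
For anyone who wants to play with the idea, below is a rough sketch of what
such a "string range" could look like. Several things here are assumptions
of mine and not part of Michel's proposal or of Phobos: it is written in
Python rather than D so that it runs on its own, the method names are
snake_case renderings of the names in the list above, the storage is UTF-8,
and the grapheme step is only an approximation (a base code point plus any
following combining marks), not full Unicode grapheme segmentation. The
length operations are left out for brevity.

```python
import unicodedata

class StringRange:
    """Hypothetical string range over valid UTF-8 with three access levels."""

    def __init__(self, text):
        self._data = text.encode("utf-8")
        self._pos = 0

    @property
    def empty(self):
        return self._pos >= len(self._data)

    def _code_point_width(self, pos):
        # Width from the lead byte; assumes pos is on a sequence boundary.
        lead = self._data[pos]
        if lead < 0x80:
            return 1
        if lead >> 5 == 0b110:
            return 2
        if lead >> 4 == 0b1110:
            return 3
        return 4

    def _grapheme_width(self, pos):
        # Approximation: one code point plus trailing combining marks.
        end = pos + self._code_point_width(pos)
        while end < len(self._data):
            width = self._code_point_width(end)
            ch = self._data[end:end + width].decode("utf-8")
            if not unicodedata.category(ch).startswith("M"):
                break
            end += width
        return end - pos

    @property
    def front_code_unit(self):
        return self._data[self._pos]  # no decoding at all

    @property
    def front_code_point(self):
        width = self._code_point_width(self._pos)
        return self._data[self._pos:self._pos + width].decode("utf-8")

    @property
    def front_grapheme(self):
        width = self._grapheme_width(self._pos)
        return self._data[self._pos:self._pos + width].decode("utf-8")

    # The caller picks the level; popping a code unit in the middle of a
    # multi-byte sequence is the caller's mistake to avoid.
    def pop_front_code_unit(self):
        self._pos += 1

    def pop_front_code_point(self):
        self._pos += self._code_point_width(self._pos)

    def pop_front_grapheme(self):
        self._pos += self._grapheme_width(self._pos)


r = StringRange("<e\u0301>")
if r.front_code_unit == ord("<"):   # cheap ASCII match, as in the XML parser
    r.pop_front_code_unit()
print(len(r.front_grapheme))        # 2: 'e' plus its combining accent
r.pop_front_grapheme()
print(r.front_code_point)           # >
```

The point of the sketch is the one made in the quoted proposal: each call
names the level it works at, so the cheap code-unit path and the more
expensive decoding paths are both visible at the call site.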