Major performance problem with std.array.front()

Michel Fortin michel.fortin at michelf.ca
Fri Mar 7 05:40:31 PST 2014


On 2014-03-07 03:59:55 +0000, "bearophile" <bearophileHUGS at lycos.com> said:

> Walter Bright:
> 
>> I understand this all too well. (Note that we currently have a 
>> different silent problem: unnoticed large performance problems.)
> 
> On the other hand your change could introduce Unicode-related bugs in 
> future code (that the current Phobos avoids) (and here I am not talking 
> about code breakage).

The way Phobos works isn't any more correct than dealing with code 
units. Many graphemes span on multiple code points -- because of 
combined diacritics or character variant modifiers -- and decoding at 
the code-point level is thus often insufficient for correctness.

The problem with Unicode strings is that the representation you must 
work with depends on the things you want to do. If you want to count 
the characters then you need graphemes; if you want to parse XML then 
you'll need to work with code points (in theory, in practice you might 
still want direct access to code units for performance reasons); and if 
you want to slice or copy a string then you need to deal with code 
units. Because of this multiple-representation-for-different-purpose 
thing, generic algorithms for arrays don't map very well to string.

>From my experience, I'd suggest these basic operations for a "string 
range" instead of the regular range interface:

.empty
.frontCodeUnit
.frontCodePoint
.frontGrapheme
.popFrontCodeUnit
.popFrontCodePoint
.popFrontGrapheme
.codeUnitLength (aka length)
.codePointLength (for dchar[] only)
.codePointLengthLinear
.graphemeLengthLinear

Someone should be able to mix all the three 'front' and 'pop' function 
variants above in any code dealing with a string type. In my XML parser 
for instance I regularly use frontCodeUnit to avoid the decoding 
penalty when matching the next character with an ASCII one such as '<' 
or '&'. An API like the one above forces you to be aware of the level 
you're working on, making bugs and inefficiencies stand out (as long as 
you're familiar with each representation).

If someone wants to use a generic array/range algorithm with a string, 
my opinion is that he should have to wrap it in a range type that maps 
front and popFront to one of the above variant. Having to do that 
should make it obvious that there's an inefficiency there, as you're 
using an algorithm that wasn't tailored to work with strings and that 
more decoding than strictly necessary is being done.

-- 
Michel Fortin
michel.fortin at michelf.ca
http://michelf.ca



More information about the Digitalmars-d mailing list