Major performance problem with std.array.front()
Michel Fortin
michel.fortin at michelf.ca
Fri Mar 7 05:40:31 PST 2014
On 2014-03-07 03:59:55 +0000, "bearophile" <bearophileHUGS at lycos.com> said:
> Walter Bright:
>
>> I understand this all too well. (Note that we currently have a
>> different silent problem: unnoticed large performance problems.)
>
> On the other hand your change could introduce Unicode-related bugs in
> future code (that the current Phobos avoids) (and here I am not talking
> about code breakage).
The way Phobos works isn't any more correct than dealing with code
units. Many graphemes span on multiple code points -- because of
combined diacritics or character variant modifiers -- and decoding at
the code-point level is thus often insufficient for correctness.
The problem with Unicode strings is that the representation you must
work with depends on the things you want to do. If you want to count
the characters then you need graphemes; if you want to parse XML then
you'll need to work with code points (in theory, in practice you might
still want direct access to code units for performance reasons); and if
you want to slice or copy a string then you need to deal with code
units. Because of this multiple-representation-for-different-purpose
thing, generic algorithms for arrays don't map very well to string.
>From my experience, I'd suggest these basic operations for a "string
range" instead of the regular range interface:
.empty
.frontCodeUnit
.frontCodePoint
.frontGrapheme
.popFrontCodeUnit
.popFrontCodePoint
.popFrontGrapheme
.codeUnitLength (aka length)
.codePointLength (for dchar[] only)
.codePointLengthLinear
.graphemeLengthLinear
Someone should be able to mix all the three 'front' and 'pop' function
variants above in any code dealing with a string type. In my XML parser
for instance I regularly use frontCodeUnit to avoid the decoding
penalty when matching the next character with an ASCII one such as '<'
or '&'. An API like the one above forces you to be aware of the level
you're working on, making bugs and inefficiencies stand out (as long as
you're familiar with each representation).
If someone wants to use a generic array/range algorithm with a string,
my opinion is that he should have to wrap it in a range type that maps
front and popFront to one of the above variant. Having to do that
should make it obvious that there's an inefficiency there, as you're
using an algorithm that wasn't tailored to work with strings and that
more decoding than strictly necessary is being done.
--
Michel Fortin
michel.fortin at michelf.ca
http://michelf.ca
More information about the Digitalmars-d
mailing list