Major performance problem with std.array.front()
Vladimir Panteleev
vladimir at thecybershadow.net
Fri Mar 7 03:56:56 PST 2014
On Friday, 7 March 2014 at 02:37:11 UTC, Walter Bright wrote:
> In "Lots of low hanging fruit in Phobos" the issue came up
> about the automatic encoding and decoding of char ranges.
>
> Throughout D's history, there are regular and repeated
> proposals to redesign D's view of char[] to pretend it is not
> UTF-8, but UTF-32. I.e. so D will automatically generate code
> to decode and encode on every attempt to index char[].
I'm glad I'm not the only one who feels this way. Implicit
decoding must die.
I strongly believe that the implicit decoding of strings into
code points in std.range has been a mistake.
- Algorithms such as "countUntil" will count code points. These
counts are useless as slice indices into char[], and can
introduce hard-to-find bugs.
- In lots of places, I've discovered that Phobos did UTF decoding
(thus murdering performance) when it didn't need to. Such cases
included format (now fixed), appender (now fixed), startsWith
(now fixed - recently), skipOver (still unfixed). These have
caused latent bugs in my programs that happened to be fed non-UTF
data. There's no reason why D should fail on non-UTF data if
it has no reason to decode it in the first place! These failures
have only served to identify places in Phobos where redundant
decoding was occurring.
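To make these two points concrete, here is a minimal sketch. It is in Python only because the underlying Unicode mechanics are language-independent; it mimics the issues behind countUntil and startsWith, not the Phobos APIs themselves:

```python
s = "héllo"                      # 'é' is one code point, two UTF-8 bytes
utf8 = s.encode("utf-8")

# Counting in code points is what implicit decoding gives you:
cp_index = s.index("l")          # 2
# Using that count to slice the underlying UTF-8 bytes is wrong:
byte_index = utf8.index(b"l")    # 3
assert cp_index != byte_index    # code-point counts are not byte offsets

# Byte-level search needs no decoding and tolerates non-UTF data:
junk = b"abc\xff\xfedef"         # \xff\xfe is invalid UTF-8
assert junk.startswith(b"abc")   # works fine without decoding
try:
    junk.decode("utf-8")         # the forced decode is what fails
except UnicodeDecodeError:
    pass
```

The byte-level operations succeed on arbitrary data; it is only the redundant decoding step that turns valid byte-level questions into exceptions.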
Furthermore, implicit decoding doesn't actually solve anything
completely! The only thing it solves is a subset of cases for a
subset of languages!
People want to look at a string "character by character". If a
Unicode code point is a character in your language and alphabet,
I'm really happy for you, but that's not how it is for everyone.
Combining marks, complex scripts, etc. make this assumption a
fallacy, one that in the end will cause programmers to make
mistakes that affect users of certain languages.
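The combining-mark case can be shown in a few lines (sketched in Python, since the Unicode behavior is the same in any language):

```python
import unicodedata

# "é" written as a base letter plus a combining acute accent:
s = "e\u0301"                 # 2 code points, 1 user-perceived character
assert len(s) == 2            # code-point count, not character count

# NFC normalization folds it into one code point, but only because a
# precomposed form happens to exist; many scripts have no such form:
assert len(unicodedata.normalize("NFC", s)) == 1
```

Treating each of those two code points as "a character" splits the accent from its base letter, which is exactly the kind of mistake that only surfaces for users of certain scripts.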
Why do people want to look at individual characters? There are a
lot of misconceptions about Unicode, and I think some of that
applies here.
- Do you want to split a string by whitespace? Some languages
have no notion of whitespace. What do you need it for? Line
wrapping? Employ the Unicode line-breaking algorithm instead.
- Do you want to uppercase the first letter of a string? Some
languages have no notion of letter case, and some use it for
different reasons. Furthermore, even languages with a Latin-based
alphabet may not have 1:1 mapping for case, e.g. the German ß
letter.
- Do you want to count how wide a string will be in a fixed-width
font? Wrong... Combining and control characters, zero-width
whitespace, etc. will render this approach futile.
- Do you want to split or flush a stream to a character device at
a point so that there's no garbage? I believe, this is the case
in TDPL's mention of the subject. Again, combining characters or
complex scripts will still be broken by this approach.
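Each of the pitfalls above can be demonstrated concretely. A sketch in Python (again, only because the Unicode behavior is language-independent; none of this is specific to any library):

```python
import unicodedata

# 1. Whitespace splitting: a naive split breaks at NO-BREAK SPACE,
#    a position the Unicode line-breaking algorithm forbids.
assert "10\u00a0km".split() == ["10", "km"]   # wrong place to wrap

# 2. Case is not a 1:1 mapping: German ß uppercases to two letters,
#    and the round trip does not restore the original.
assert "straße".upper() == "STRASSE"
assert "STRASSE".lower() == "strasse"         # the ß is gone

# 3. Width counting: combining marks occupy no column of their own,
#    so code-point length overestimates display width.
s = "e\u0301"                                  # rendered as one glyph
width = sum(1 for c in s if unicodedata.combining(c) == 0)
assert len(s) == 2 and width == 1

# 4. Flushing a stream mid-character produces garbage: cutting a
#    UTF-8 sequence in half leaves undecodable bytes.
b = "é".encode("utf-8")                        # two bytes: 0xC3 0xA9
ok = True
try:
    b[:1].decode("utf-8")
except UnicodeDecodeError:
    ok = False
assert not ok
```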
You need to either go all-out and provide complete
implementations of the relevant Unicode algorithms to perform
tasks such as the above that will work in all locales, or you
need to draw a line somewhere as to which languages, alphabets,
and locales you want to support in your program. D draws its
line at the assumption that code points == characters; however,
this decision is stated nowhere in its documentation, and for
such a culturally arbitrary choice, it is embedded too deeply
into the language itself. With
std.ascii, at least, it's clear to the user that the functions
there will only work with English or languages using the same
alphabet.
This doesn't apply universally. There are still cases such as
regular expression ranges. [a-z] makes sense in English, and
[а-я] makes sense in Russian, but I don't think that makes sense
for all languages. However, for the most part, I think implicit
decoding must be axed, and instead we need implementations of
Unicode algorithms and the documentation to instruct users why
and how to use them.
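As a taste of what such an explicit Unicode algorithm looks like, here is a deliberately rough sketch of grapheme grouping in Python. The function name is made up for illustration; the real algorithm is Unicode UAX #29 segmentation, which additionally handles ZWJ sequences, Hangul jamo, regional indicators, and more:

```python
import unicodedata

def rough_graphemes(s):
    """Group each base code point with its trailing combining marks.

    A rough approximation only; the real answer is UAX #29 grapheme
    cluster segmentation, deliberately invoked by the programmer
    rather than applied implicitly.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch        # attach the mark to its base
        else:
            clusters.append(ch)
    return clusters

# 3 code points, but 2 user-perceived characters:
assert rough_graphemes("e\u0301x") == ["e\u0301", "x"]
```

The point is not this particular helper, but that the user explicitly chooses the level of iteration: bytes, code points, or clusters.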
More information about the Digitalmars-d mailing list