Major performance problem with std.array.front()

Fri Mar 7 03:56:56 PST 2014

On Friday, 7 March 2014 at 02:37:11 UTC, Walter Bright wrote:
> In "Lots of low hanging fruit in Phobos" the issue came up 
> about the automatic encoding and decoding of char ranges.
>
> Throughout D's history, there are regular and repeated 
> proposals to redesign D's view of char[] to pretend it is not 
> UTF-8, but UTF-32. I.e. so D will automatically generate code 
> to decode and encode on every attempt to index char[].

I'm glad I'm not the only one who feels this way. Implicit 
decoding must die.

I strongly believe that implicit decoding of code points in 
std.range has been a mistake.

- Algorithms such as "countUntil" will count code points. These 
numbers are useless for slicing, and can introduce hard-to-find 
bugs.

- In lots of places, I've discovered that Phobos did UTF decoding 
(thus murdering performance) when it didn't need to. Such cases 
included format (now fixed), appender (now fixed), startsWith 
(now fixed - recently), skipOver (still unfixed). These have 
caused latent bugs in my programs that happened to be fed non-UTF 
data. There's no reason why D should fail on non-UTF data if 
it has no reason to decode it in the first place! These failures 
have only served to identify places in Phobos where redundant 
decoding was occurring.
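
To make the two points above concrete, here is a minimal sketch. It is my own illustration, not code from Phobos or from anyone's program: the sample strings and the helper startsWithRaw are made up, and I'm assuming a Phobos where std.string.representation is available. It contrasts the code point count returned by countUntil with the code unit index returned by std.string.indexOf, then does a prefix check on non-UTF data by comparing raw code units, so nothing has to be decoded.

```d
import std.algorithm : countUntil, startsWith;
import std.exception : assertThrown;
import std.string : indexOf, representation;
import std.utf : UTFException, validate;

// Hypothetical helper: prefix test on raw code units, no decoding involved.
bool startsWithRaw(const(char)[] haystack, const(char)[] prefix)
{
    return haystack.representation.startsWith(prefix.representation);
}

void main()
{
    // "héllo wörld": U+00E9 and U+00F6 each take two UTF-8 code units.
    string s = "h\u00E9llo w\u00F6rld";

    auto byCodePoint = s.countUntil('w'); // counts decoded code points
    auto byCodeUnit = s.indexOf('w');     // index into the char array itself

    assert(byCodePoint == 6);
    assert(byCodeUnit == 7);

    // Only the code unit index is a correct slicing position.
    assert(s[byCodeUnit .. $] == "w\u00F6rld");
    // Slicing with the code point count is silently off by one here.
    assert(s[byCodePoint .. $] == " w\u00F6rld");

    // Latin-1 bytes for "café" stored in a string: not valid UTF-8.
    immutable(ubyte)[] raw = [0x63, 0x61, 0x66, 0xE9];
    auto latin1 = cast(string) raw;
    assertThrown!UTFException(validate(latin1));

    // The prefix check still works, because it never decodes.
    assert(startsWithRaw(latin1, "caf"));
}
```

The off-by-one slice above is the kind of bug that stays hidden until the input happens to contain a non-ASCII character.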

Furthermore, it doesn't actually solve anything completely! The 
only thing it solves is a subset of cases for a subset of 
languages!

People want to look at a string "character by character". If a 
Unicode code point is a character in your language and alphabet, 
I'm really happy for you, but that's not how it is for everyone. 
Combining marks, complex scripts, etc. make this assumption a 
fallacy that in the end will cause programmers to make mistakes 
that will affect certain users somewhere.
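
A small sketch of what I mean, assuming std.uni.byGrapheme is available in your Phobos (the helper names codePoints and graphemes are mine, purely for illustration). The same string has three different "lengths" depending on which unit you count, and only the last one matches what a user would call a character:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helpers, named here only for the example.
size_t codePoints(string s) { return s.walkLength; }            // implicit decoding
size_t graphemes(string s)  { return s.byGrapheme.walkLength; } // grapheme clusters

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT; renders as a single "é".
    string s = "e\u0301";

    assert(s.length == 3);      // UTF-8 code units
    assert(codePoints(s) == 2); // code points
    assert(graphemes(s) == 1);  // what the user perceives as one character
}
```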

Why do people want to look at individual characters? There are a 
lot of misconceptions about Unicode, and I think some of that 
applies here.

- Do you want to split a string by whitespace? Some languages 
have no notion of whitespace. What do you need it for? Line 
wrapping? Employ the Unicode line-breaking algorithm instead.

- Do you want to uppercase the first letter of a string? Some 
languages have no notion of letter case, and some use it for 
different reasons. Furthermore, even languages with a Latin-based 
alphabet may not have 1:1 mapping for case, e.g. the German ß 
letter.

- Do you want to count how wide a string will be in a fixed-width 
font? Wrong... Combining and control characters, zero-width 
whitespace, etc. will render this approach futile.

- Do you want to split or flush a stream to a character device at 
a point so that there's no garbage? I believe this is the case 
in TDPL's mention of the subject. Again, combining characters or 
complex scripts will still be broken by this approach.
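
As a rough illustration of the last point (my own sketch; the helper splitAfterFirstCodePoint is hypothetical), here is a split done at a code point boundary found with std.utf.stride. Both halves are valid UTF-8, yet the second half is a dependent vowel sign cut off from the consonant it belongs to:

```d
import std.utf : stride, validate;

// Hypothetical helper: split a string right after its first code point.
string[2] splitAfterFirstCodePoint(string s)
{
    auto cut = stride(s, 0); // code units in the first code point
    return [s[0 .. cut], s[cut .. $]];
}

void main()
{
    // DEVANAGARI LETTER NA (U+0928) + DEVANAGARI VOWEL SIGN I (U+093F).
    string s = "\u0928\u093F";
    assert(s.length == 6); // three UTF-8 code units each

    auto parts = splitAfterFirstCodePoint(s);

    // Both pieces are well-formed UTF-8; validate would throw otherwise.
    validate(parts[0]);
    validate(parts[1]);

    // But the second piece is a vowel sign with no base letter to attach to.
    assert(parts[0] == "\u0928");
    assert(parts[1] == "\u093F");
}
```

So splitting on code point boundaries avoids invalid UTF-8, but it does not avoid garbage on the screen.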

You need to either go all-out and provide complete 
implementations of the relevant Unicode algorithms to perform 
tasks such as the above that will work in all locales, or you 
need to draw a line somewhere for which languages, alphabets, and 
locales you want to support in your program. D's line is drawn 
at the point where it considers that code points == characters. 
However, the consequences of this are not made clear anywhere in 
its documentation, and for such an arbitrary decision (from a 
cultural point of view), it is embedded too deep into the 
language itself. With 
std.ascii, at least, it's clear to the user that the functions 
there will only work with English or languages using the same 
alphabet.
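
For comparison, a tiny sketch of how std.ascii makes its line explicit (my own example, using only isAlpha from each module): anything outside ASCII is simply not a letter as far as std.ascii is concerned, and the module name tells you so up front.

```d
static import std.ascii;
static import std.uni;

void main()
{
    dchar plain = 'e';
    dchar accented = '\u00E9'; // é, LATIN SMALL LETTER E WITH ACUTE

    assert(std.ascii.isAlpha(plain));
    assert(!std.ascii.isAlpha(accented)); // outside ASCII, so not a letter here
    assert(std.uni.isAlpha(accented));    // a letter by Unicode's definition
}
```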

This doesn't apply universally. There are still cases such as 
regular expression ranges. [a-z] makes sense in English, and 
[а-я] makes sense in Russian, but I don't think that makes sense 
for all languages. However, for the most part, I think implicit 
decoding must be axed, and instead we need implementations of 
Unicode algorithms and the documentation to instruct users why 
and how to use them.