Why not flag away the mistakes of the past?

Jonathan M Davis newsgroup.d at jmdavisprog.com
Wed Mar 7 14:05:43 UTC 2018


On Wednesday, March 07, 2018 13:40:20 Nick Treleaven via Digitalmars-d wrote:
> On Wednesday, 7 March 2018 at 13:24:25 UTC, Jonathan M Davis wrote:
> > I'd actually argue that that's the lesser of the problems with
> > auto-decoding. The big problem is that it's auto-decoding. Code
> > points are almost always the wrong level to be operating at.
>
> For me the fundamental problem is having char[] in the language
> at all, meaning a Unicode string. Arbitrary slicing and indexing
> are not Unicode compatible, if we revisit this we need a String
> type that doesn't support those operations. Plus the issue of
> string validation - a Unicode string type should be assumed to
> have valid contents - unsafe data should only be checked at
> string construction time, so iterating should always be nothrow.

In principle, char is supposed to be a UTF-8 code unit, and strings are
supposed to be validated up front rather than constantly validated, but it's
never been that way in practice.
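
As a rough sketch of what validating up front could look like (the
ValidatedUTF8 wrapper and its codeUnits member are hypothetical names made up
for illustration, not anything in Phobos; std.utf.validate is the only
library piece doing real work here):

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

// Hypothetical wrapper: the data is checked once, at construction time,
// so later code can assume that it holds well-formed UTF-8.
struct ValidatedUTF8
{
    private string data;

    this(string s)
    {
        validate(s); // throws UTFException if s is not valid UTF-8
        data = s;
    }

    // Access to the underlying code units; no re-validation needed.
    string codeUnits() const { return data; }
}

void main()
{
    auto ok = ValidatedUTF8("hello");
    assert(ok.codeUnits == "hello");

    // The byte 0xFF can never appear in well-formed UTF-8.
    immutable(ubyte)[] raw = [0x68, 0x69, 0xFF];
    assertThrown!UTFException(ValidatedUTF8(cast(string) raw));
}
```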

Regardless, having char[] be sliceable is actually perfectly fine and
desirable. That's exactly what you want whenever you operate on code units,
and it's frequently the case that you want to be operating at the code unit
level. But the programmer needs to be able to reasonably control when code
units, code points, or graphemes are used, because each has its time and
place. If we had a string type, it would need to provide access to each of
those levels and likely would not be directly sliceable at all. Slicing a
string is kind of meaningless, because in principle, a string is just an
opaque piece of character data. It's when you're dealing at the code
unit, code point, or grapheme level that you actually start operating on
pieces of a string, and that means that the level that you're operating at
needs to be defined.

Having char[] be an array of code units works quite well, because then you
have efficiency by default. You then need to wrap it in another range type
when appropriate to get a range of code points or graphemes, or you need to
explicitly decode when appropriate. Whereas right now, what we have is
Phobos being "helpful" and constantly decoding for us such that we get
needlessly inefficient code, and it's at the code point level, which is
usually not the level you want to operate at. So, you don't have efficiency
or correctness.
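
To make the three levels concrete, here is a minimal sketch using range
wrappers that exist in Phobos (std.utf.byCodeUnit, std.utf.byDchar, and
std.uni.byGrapheme). The count* helpers are hypothetical names used only for
this example:

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

// Hypothetical helpers; each one picks the level explicitly.
size_t countCodeUnits(string s)  { return s.byCodeUnit.walkLength; }
size_t countCodePoints(string s) { return s.byDchar.walkLength; }
size_t countGraphemes(string s)  { return s.byGrapheme.walkLength; }

void main()
{
    // 'e' followed by U+0301 (combining acute accent): one grapheme,
    // two code points, three UTF-8 code units.
    string s = "e\u0301";

    writeln(countCodeUnits(s));  // 3
    writeln(countCodePoints(s)); // 2
    writeln(countGraphemes(s));  // 1

    // Slicing the array works on code units and does no decoding.
    assert(s[0 .. 1] == "e");

    // Range functions applied directly to a string auto-decode, so this
    // silently counts code points rather than code units or graphemes.
    writeln(s.walkLength);       // 2
}
```

Note that the auto-decoded count at the end matches neither the array length
nor the number of characters that a user would see.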

Ultimately, it really doesn't work to hide the details of Unicode and not
have the programmer worry about code units, code points, and graphemes
unless you don't care about efficiency. As such, what we really need is to
cleanly give the programmer the tools to manage Unicode without the language
or library assuming what the programmer wants - especially assuming an
inefficient default. The language itself actually does a decent job of that.
It's Phobos that dropped the ball on that one, because Andrei didn't know
about graphemes and tried to make Phobos Unicode-correct by default.
Instead, we get inefficient and incorrect by default.

- Jonathan M Davis
