Dicebot on leaving D: It is anarchy driven development in all its glory.

Jonathan M Davis newsgroup.d at jmdavisprog.com
Mon Aug 27 01:28:25 UTC 2018


On Sunday, August 26, 2018 5:12:10 PM MDT FeepingCreature via Digitalmars-d 
wrote:
> On Sunday, 26 August 2018 at 22:44:05 UTC, Walter Bright wrote:
> > On 8/26/2018 8:43 AM, Chris wrote:
> >> I wanted to get rid of autodecode and I even offered to test
> >> it on my string heavy code to see what breaks (and maybe write
> >> guidelines for the transition), but somehow the whole idea of
> >> getting rid of autodecode was silently abandoned. What more
> >> could I do?
> >
> > It's not silently abandoned. It will break just about every D
> > program out there. I have a hard time with the idea that
> > breakage of old code is inexcusable, so let's break every old
> > program?
>
> Can I just throw in here that I like autodecoding and I think
> it's good?
> If you want ranges that iterate over bytes, then just use arrays
> of bytes. If you want Latin1 text, use Latin1 strings. If you
> want Unicode, you get Unicode iteration. This seems right and
> proper to me. Hell I'd love if the language was *more* aggressive
> about validating casts to strings.

The problem is that auto-decoding doesn't even give you correct Unicode
handling. At best, it's kind of like using UTF-16 instead of ASCII but
assuming that a UTF-16 code unit can always contain an entire character
(which is frequently what you get in programs written in languages like Java
or C#). A bunch more characters then work properly, but plenty of characters
still don't. It's just a lot harder to realize it, because it's far from
fail-fast. In general, doing everything at the code point level with Unicode
(as auto-decoding does) is very much broken. It's just that it's a lot less
obvious, because so much more works - and it comes with the bonus of being
far less efficient.
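
To make that concrete, here's a quick sketch (the combining-character string
is just an illustrative example; the counts are in the comments):

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    void main()
    {
        // "noël" written with a combining diaeresis: 'e' followed by U+0308.
        string s = "noe\u0308l";

        writeln(s.length);                // 6 - UTF-8 code units
        writeln(s.byCodeUnit.walkLength); // 6 - code units, explicitly
        writeln(s.walkLength);            // 5 - code points (auto-decoding)
        writeln(s.byGrapheme.walkLength); // 4 - graphemes (user-perceived)
    }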

If you wanted everything to "just work" out of the box without having to
worry about Unicode, you could probably do it if everything operated at the
grapheme cluster level, but that would be horribly inefficient. The sad
reality is that if you want your string-processing code to be at all fast
while still being correct, you have to have at least a basic understanding
of Unicode and use it correctly - and that rarely means doing much of
anything at the code point level. It's much more likely that it needs to be
at either the code unit or grapheme level. But either way, without a
programmer understanding the details and programming accordingly, the code
is just plain going to be wrong somewhere. The idea that we can have
string-processing "just work" without the programmer having to worry about
the details of Unicode is unfortunately largely a fallacy - at least if you
care about efficiency.
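
For instance, roughly along these lines - byCodeUnit when searching for an
ASCII delimiter is all you need, and byGrapheme when "character" has to mean
what the user sees (the string is again just an illustrative example):

    import std.algorithm.searching : canFind;
    import std.range : walkLength;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    void main()
    {
        // "résumé" with both accents written as combining characters.
        string s = "re\u0301sume\u0301";

        // Code unit level: no decoding at all, and perfectly correct
        // when you're looking for an ASCII character.
        assert(s.byCodeUnit.canFind('s'));

        // Grapheme level: what you need when "character" means what
        // the user sees on the screen.
        assert(s.byGrapheme.walkLength == 6);
    }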

By operating at the code point level, we're just generating code that looks
like it works when it doesn't really, and it's less efficient. It certainly
works in more cases than just using ASCII would, but it's still broken for
Unicode handling, just as if the code assumed that a char was always an
entire character. As such, I don't really see how there can be much defense
for auto-decoding. It was done on the incorrect assumption that code points
represented actual characters (for that, you actually need graphemes) and that
that the loss in speed was worth the correctness, with the idea that anyone
wanting the speed could work around the auto-decoding. We could get
something like that if we went to the grapheme level, but that would hurt
performance that much more. Either way, operating at the code point level
everywhere is just plain wrong. This isn't just a case of "it's annoying" or
"we're don't like it." It objectively results in incorrect code.

- Jonathan M Davis




