The Case Against Autodecode

Tue May 31 13:01:14 PDT 2016

On Tuesday, May 31, 2016 15:33:38 Andrei Alexandrescu via Digitalmars-d wrote:
> On 05/31/2016 02:53 PM, Jonathan M Davis via Digitalmars-d wrote:
> > walkLength treats a code point like it's a character.
>
> No, it treats a code point like it's a code point. -- Andrei

Wasn't the whole point of operating at the code point level by default to
make it so that code would be operating on full characters by default
instead of chopping them up as is so easy to do when operating at the code
unit level? Thanks to how Phobos treats strings as ranges of dchar, most D
code treats code points as if they were characters. So, whether it's correct
or not, a _lot_ of D code is treating walkLength like it returns the number
of characters in a string. And if walkLength doesn't provide the number of
characters in a string, why would I want to use it under normal
circumstances? Why would I want to be operating at the code point level in
my code? It's not necessarily a full character, since it's not necessarily a
grapheme. So, by using walkLength and front and popFront and whatnot with
strings, I'm not getting full characters. I'm still only getting pieces of
characters - just like would happen if strings were treated as ranges of
code units. I'm just getting bigger pieces of the characters out of the
deal. But if they're not full characters, what's the point?
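
To make the three levels concrete, here is a minimal sketch that counts
the same string at each of them. It assumes a reasonably recent Phobos
(std.utf.byCodeUnit and std.uni.byGrapheme); the helper names codeUnits,
codePoints and graphemes are made up for illustration and are not part
of any library.

```d
import std.range : front, walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

// Hypothetical helpers, named here only for illustration.
size_t codeUnits(string s)  { return s.byCodeUnit.walkLength; } // no decoding
size_t codePoints(string s) { return s.walkLength; }            // autodecoded
size_t graphemes(string s)  { return s.byGrapheme.walkLength; } // full characters

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one character on
    // screen, two code points, three UTF-8 code units.
    string s = "e\u0301";

    assert(codeUnits(s) == 3);
    assert(codePoints(s) == 2);
    assert(graphemes(s) == 1);

    // front autodecodes to a dchar, but that dchar is only the base
    // letter; the accent is a separate code point.
    assert(s.front == 'e');
}
```

Only the grapheme count matches what a user would call the number of
characters in this string.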

I am sure that there is code that is going to want to operate at the code
point level, but your average program is either operating on strings as a
whole or individual characters. As long as strings are being operated on as
a whole, code units are generally plenty, and if you carefully encode the
characters that you are comparing against into code units, then much of the
time that you want to operate on individual characters, you can still
operate at the code unit level. But if you can't, then you need the grapheme level, because a code
point is not necessarily a full character.
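
As a rough sketch of what operating at the code unit level can look
like, the following searches for a non-ASCII substring by comparing raw
UTF-8 code units, with no decoding at all. The helper indexOfCodeUnits
is a hypothetical name used only for this example, and the sketch
assumes that both strings are encoded the same way.

```d
import std.algorithm.searching : countUntil;
import std.string : representation;

// Hypothetical helper: offset of needle in haystack, measured in UTF-8
// code units, or -1 if it is not found. Nothing is decoded.
ptrdiff_t indexOfCodeUnits(string haystack, string needle)
{
    return haystack.representation.countUntil(needle.representation);
}

void main()
{
    // Both strings use the precomposed U+00EF and U+00E9.
    string haystack = "na\u00EFve caf\u00E9";
    string needle = "caf\u00E9";

    auto i = indexOfCodeUnits(haystack, needle);
    assert(i == 7);                  // U+00EF takes two code units
    assert(haystack[i .. $] == needle);

    // The same word spelled with 'e' + U+0301 is a different code unit
    // sequence, so it is not found. This is where the encoding has to
    // be done carefully (or the grapheme level is needed).
    assert(indexOfCodeUnits(haystack, "cafe\u0301") == -1);
}
```

Note that decoding to code points would not have helped with the last
case either: the two spellings differ at the code point level as well.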

So, what is the point of operating on code points in your average D program?
walkLength will not always tell me the number of characters in a string.
front risks giving me a partial character rather than a whole one. Slicing
dchar[] risks chopping up characters just like slicing char[] does.
Operating on code points by default does not result in correct string
processing.
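
A small sketch of the slicing problem, again assuming std.uni.byGrapheme
and std.uni.byCodePoint from Phobos. The helper firstGrapheme is a
hypothetical name introduced just for this example.

```d
import std.array : array;
import std.range : take, walkLength;
import std.uni : byCodePoint, byGrapheme;

// Hypothetical helper: the first full character of s, returned as the
// code points that make it up.
dstring firstGrapheme(dstring s)
{
    return s.byGrapheme.take(1).byCodePoint.array.idup;
}

void main()
{
    // Already decoded to code points: 'e', U+0301, 'x'.
    dstring s = "e\u0301x"d;

    assert(s.length == 3);                // three code points
    assert(s.byGrapheme.walkLength == 2); // but only two characters

    // Slicing off "the first character" by code point drops the accent.
    assert(s[0 .. 1] == "e"d);

    // Going through graphemes keeps the character whole.
    assert(firstGrapheme(s) == "e\u0301"d);
}
```

So a dstring avoids splitting a code point, but not splitting a
character.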

I honestly don't see how autodecoding is defensible. We may not be able to
get rid of it due to the breakage that doing that would cause, but I fail to
see how it is at all desirable that we have autodecoded strings. I can
understand how we got it if it's based on a misunderstanding on your part
about how Unicode works. We all make mistakes. But I fail to see how
autodecoding wasn't a mistake. It's the worst of both worlds - inefficient
while still incorrect. At least operating at the code unit level would be
fast while being incorrect, and it would be obviously incorrect once you did
anything with non-ASCII values, whereas it's easy to miss that ranges of
dchar are doing the wrong thing too.

- Jonathan M Davis