The Case Against Autodecode

Jonathan M Davis via Digitalmars-d digitalmars-d at puremagic.com
Fri May 13 03:38:09 PDT 2016


On Thursday, May 12, 2016 13:15:45 Walter Bright via Digitalmars-d wrote:
> On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
>  > I am as unclear about the problems of autodecoding as I am about the
>  > necessity to remove curl. Whenever I ask I hear some arguments that work
>  > well emotionally but are scant on reason and engineering. Maybe it's
>  > time to rehash them? I just did so about curl, no solid argument seemed
>  > to come together. I'd be curious of a crisp list of grievances about
>  > autodecoding. -- Andrei
>
> Here are some that are not matters of opinion.
>
> 1. Ranges of characters do not autodecode, but arrays of characters do. This
> is a glaring inconsistency.
>
> 2. Every time you want an algorithm to work with both strings and ranges,
> you wind up special casing the strings to defeat the autodecoding, or
> decoding the ranges. Having to constantly special case things makes for more
> special cases when plugging together components. These issues often escape
> detection when unittesting because it is convenient to unittest only with
> arrays.
>
> 3. Wrapping an array in a struct with an alias this to an array turns off
> autodecoding, another special case.
>
> 4. Autodecoding is slow and has no place in high speed string processing.
>
> 5. Very few algorithms require decoding.
>
> 6. Autodecoding has two choices when encountering invalid code units - throw
> or produce an error dchar. Currently, it throws, meaning no algorithms
> using autodecode can be made nothrow.
>
> 7. Autodecode cannot be used with Unicode path/filenames, because it is
> legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out
> in the wild that pure Unicode is not universal - there's lots of dirty
> Unicode that should remain unmolested, and autodecode does not play well
> with that.
>
> 8. In my work with UTF-8 streams, dealing with autodecode has caused me
> considerable extra work every time. A convenient timesaver it ain't.
>
> 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
> importing std.array one way or another, and then autodecode is there.
>
> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of
> being arrays in the first place.
>
> 11. Indexing an array produces different results than autodecoding, another
> glaring special case.
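
To put points 1, 10, and 11 in concrete terms, here's a minimal sketch
(checked against my understanding of a 2016-era Phobos; details may differ
slightly between releases):

import std.range;

void main()
{
    string s = "café";  // 5 UTF-8 code units; the 'é' takes 2 bytes

    // Indexing gives raw code units...
    static assert(is(typeof(s[0]) == immutable(char)));

    // ...but range iteration auto-decodes to code points (dchar).
    static assert(is(ElementType!string == dchar));
    assert(s.front == 'c');

    // And because of the decoding, a string is not a RandomAccessRange,
    // even though it's an array.
    static assert(!isRandomAccessRange!string);
}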

It also results in constantly special-casing algorithms for narrow strings
in order to avoid auto-decoding. Phobos does this all over the place. We
have a ridiculous amount of code in Phobos just to avoid auto-decoding, and
anyone who wants high performance will have to do the same.
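
The special-casing tends to look something like this - a simplified,
hypothetical example rather than actual Phobos code:

// Hypothetical helper: count an ASCII character in any range of
// characters, special-casing narrow strings so that the string path
// works on raw code units instead of auto-decoding.
import std.range.primitives;
import std.traits : isNarrowString, isSomeChar;

size_t countAscii(R)(R r, char c)
    if (isInputRange!R && isSomeChar!(ElementType!R))
{
    // Assumes c is ASCII, so a raw code-unit scan finds the same matches.
    static if (isNarrowString!R)
    {
        import std.string : representation;
        size_t n;
        foreach (u; r.representation)  // raw code units, no decoding
            if (u == c)
                ++n;
        return n;
    }
    else
    {
        // Generic range path; if strings took this path, front would
        // auto-decode every element, hence the special case above.
        size_t n;
        for (; !r.empty; r.popFront())
            if (r.front == c)
                ++n;
        return n;
    }
}

unittest
{
    assert(countAscii("héllo, wörld", 'l') == 3);
}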

And it's not like auto-decoding is even correct. It would be one thing if
auto-decoding were fully correct but slow, but to be fully correct, it would
need to operate at the grapheme level, not the code point level. So, by
default, we get slower code without actually getting fully correct code.
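
A combining sequence shows the gap between the levels (using
std.utf.byCodeUnit and std.uni.byGrapheme, both of which Phobos already
has):

import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // "é" spelled as 'e' plus U+0301 COMBINING ACUTE ACCENT
    string s = "e\u0301";

    assert(s.byCodeUnit.walkLength == 3);  // UTF-8 code units
    assert(s.walkLength == 2);             // code points - what auto-decoding gives
    assert(s.byGrapheme.walkLength == 1);  // what a user sees as one character
}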

So, we're neither fast nor correct. We _are_ correct in more cases than we'd
be if we simply acted as if ASCII were all there was, but what we end up with
is the illusion of correctness. IIRC, Andrei argued in TDPL that Java's
choice of UTF-16 was worse than choosing UTF-8, because treating a UTF-16
code unit as if it were a full character is correct in many more cases, which
makes it much harder to realize that what you're doing is wrong, whereas with
UTF-8 the breakage becomes obvious very quickly. Auto-decoding gives us that
same problem, except that it treats UTF-32 code units (code points) as if
they were full characters rather than treating UTF-16 code units that way.

Ideally, algorithms would be Unicode aware as appropriate, but the default
would be to operate on code units with wrappers to handle decoding by code
point or grapheme. Then it's easy to write fast code while still allowing
for full correctness. Granted, it's not necessarily easy to get correct code
that way, but anyone who wants full correctness without caring about
efficiency can just use ranges of graphemes. Ranges of code points are rarely
what's actually needed regardless.
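
The opt-in wrappers for that mostly already exist in std.utf and std.uni;
roughly, this is what explicit opt-in looks like today (assuming a
reasonably recent Phobos with byCodeUnit, byDchar, and byGrapheme):

import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    string s = "résumé";

    auto units     = s.byCodeUnit;   // immutable(char): no decoding
    auto points    = s.byDchar;      // dchar: decoding, but only where asked for
    auto graphemes = s.byGrapheme;   // Grapheme: user-perceived characters

    assert(units.length == 8);       // the two 'é's are 2 code units each
    assert(points.walkLength == 6);
    assert(graphemes.walkLength == 6);
}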

Based on what I've seen in previous conversations on auto-decoding over the
past few years (be it in the newsgroup, on github, or at dconf), most of the
core devs think that auto-decoding was a major blunder that we continue to
pay for. But unfortunately, even if we all agree that it was a huge mistake
and want to fix it, the question remains how to do that without breaking
tons of code - and since, AFAIK, Andrei is still in favor of auto-decoding,
we'd have a hard time moving forward with plans to get rid of it even if we
came up with a good way of doing so. But I would love it if we could get rid
of auto-decoding and clean up string handling in D.

- Jonathan M Davis


