The Case Against Autodecode

Marc Schütz via Digitalmars-d digitalmars-d at puremagic.com
Wed Jun 1 03:04:42 PDT 2016


On Tuesday, 31 May 2016 at 16:29:33 UTC, Joakim wrote:
> UTF-8 is an antiquated hack that needs to be eradicated.  It 
> forces all other languages than English to be twice as long, 
> for no good reason, have fun with that when you're downloading 
> text on a 2G connection in the developing world.

I assume you're talking about the web here. In that case, plain 
text makes up only a minor part of the total traffic; the 
majority is images (binary data), javascript and stylesheets 
(almost pure ASCII), and HTML markup (ditto). It's likely not 
significant even before taking compression into account, which 
is ubiquitous.
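The compression point is easy to check. A quick sketch (Python here for brevity; the sample sentence is made up) comparing the raw UTF-8 size of non-Latin text against its gzip-compressed size:

```python
import gzip

# Hypothetical sample: a short Russian sentence (Cyrillic letters
# take 2 bytes each in UTF-8), repeated to simulate a page of text.
text = "Привет, мир! Это пример текста на русском языке. " * 200
raw = text.encode("utf-8")
packed = gzip.compress(raw)

print(len(raw))                 # raw UTF-8 byte count
print(len(packed))              # compressed byte count, far smaller
print(len(packed) / len(raw))   # the 2-bytes-per-letter overhead mostly vanishes
```

On any realistic natural-language text, compression removes most of the redundancy that the "twice as long" complaint is about, which is why the encoding's raw byte count matters much less over HTTP than it first appears.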

> It is unnecessarily inefficient, which is precisely why 
> auto-decoding is a problem.

No, inefficiency is the least of the problems with auto-decoding.

> It is only a matter of time till UTF-8 is ditched.

This is ridiculous, even if your other claims were true.

>
> D devs should lead the way in getting rid of the UTF-8 
> encoding, not bickering about how to make it more palatable.  I 
> suggested a single-byte encoding for most languages, with 
> double-byte for the ones which wouldn't fit in a byte.  Use 
> some kind of header or other metadata to combine strings of 
> different languages, _rather than encoding the language into 
> every character!_

I think I remember that post, and - sorry to be so blunt - it was 
one of the worst things I've ever seen proposed regarding text 
encoding.

>
> The common string-handling use case, by far, is strings with 
> only one language, with a distant second some substrings in a 
> second language, yet here we are putting the overhead into 
> every character to allow inserting characters from an arbitrary 
> language!  This is madness.

No. The common string-handling use case is code that is unaware 
which script (not language, btw) your text is in.
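This is the property a header-based scheme gives up: UTF-8 is self-describing per character, so generic code can process mixed-script text without any out-of-band metadata. A minimal sketch (Python used for illustration):

```python
# One string mixing Latin, Cyrillic, and CJK scripts; UTF-8 needs
# no header or language tag to represent the combination.
s = "Hello Привет 你好"

for ch in s:
    if ch != " ":
        # Latin letters encode to 1 byte, Cyrillic to 2, CJK to 3;
        # the width is derivable from the bytes themselves.
        print(ch, len(ch.encode("utf-8")))
```

Code that slices, concatenates, or searches such a string never has to know which scripts it contains, whereas a per-string language header forces every string operation to track and merge that metadata.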

