The Case Against Autodecode

Joakim via Digitalmars-d digitalmars-d at puremagic.com
Tue May 31 09:29:33 PDT 2016


On Monday, 30 May 2016 at 17:35:36 UTC, Chris wrote:
> On Monday, 30 May 2016 at 16:03:03 UTC, Marco Leise wrote:
>
>> *** http://site.icu-project.org/home#TOC-What-is-ICU-
>
> I was actually talking about ICU with a colleague today. Could 
> it be that Unicode itself is broken? I've often heard criticism 
> of Unicode but never looked into it.

Part of it is the complexity of written language, part of it is 
bad technical decisions.  Building the default string type in D 
around the horrible UTF-8 encoding was a fundamental mistake, 
both in terms of efficiency and complexity.  I noted this in one 
of my first threads in this forum, and, as Andrei said at the 
time, nobody agreed with me; there was a lot of hand-waving 
about how efficiency wasn't an issue or how UTF-8 arrays were 
fine.  Fast-forward a few years and exactly the issues I raised 
are now causing pain.
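The pain in question is the mismatch between code units and code points in UTF-8: the byte length of a string is not its character count, and byte-wise indexing can split a character. The thread is about D's `string`, but the phenomenon is language-independent; a minimal Python sketch:

```python
# "é" is one code point but occupies two UTF-8 bytes.
s = "café"
encoded = s.encode("utf-8")

print(len(s))        # 4 code points
print(len(encoded))  # 5 bytes: byte length != character count

# Slicing the raw bytes can cut a character in half, which is why
# naive byte-wise indexing of UTF-8 text is unsafe.
broken = encoded[:4]  # ends in the middle of "é"
print(broken.decode("utf-8", errors="replace"))  # caf\ufffd
```

Auto-decoding is D's attempt to paper over exactly this gap by iterating code points instead of bytes.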

UTF-8 is an antiquated hack that needs to be eradicated.  For no 
good reason, it forces text in most languages other than English 
to be twice as long as it was in the legacy single-byte 
encodings; have fun with that when you're downloading text on a 
2G connection in the developing world.  It is unnecessarily 
inefficient, which is precisely why auto-decoding is a problem.  
It is only a matter of time until UTF-8 is ditched.
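The size claim is easy to check. In this Python sketch, the legacy codec `koi8_r` stands in for the single-byte encodings the comparison assumes:

```python
# Six Cyrillic characters: one byte each in a legacy single-byte
# encoding, two bytes each in UTF-8.
word = "привет"

print(len(word))                   # 6 characters
print(len(word.encode("koi8_r")))  # 6 bytes in the legacy encoding
print(len(word.encode("utf-8")))   # 12 bytes: twice as long

# East Asian scripts fare worse under UTF-8: three bytes per character.
print(len("日本語".encode("utf-8")))  # 9 bytes for 3 characters
```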

D devs should lead the way in getting rid of the UTF-8 encoding, 
not bicker about how to make it more palatable.  I suggested a 
single-byte encoding for most languages, with a double-byte 
encoding for those that wouldn't fit in a byte.  Use some kind 
of header or other metadata to combine strings of different 
languages, _rather than encoding the language into every 
character!_

By far the most common string-handling use case is a string in a 
single language, with a distant second being strings that mix in 
substrings from a second language, yet here we are putting the 
overhead into every character to allow inserting characters from 
an arbitrary language!  This is madness.
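A toy sketch of the header-plus-single-byte scheme suggested above, where each run of text carries one language tag instead of encoding the language into every character. The names, the 1-byte tag, and the 255-byte run limit are all illustrative inventions, not a worked-out proposal:

```python
# Hypothetical tagged-run encoding: [tag byte][length byte][payload].
# Each payload is a legacy single-byte encoding, one byte per character.
HEADERS = {"latin1": 0x01, "koi8_r": 0x02}

def pack_runs(runs):
    """Encode [(encoding_name, text), ...] as tagged single-byte runs."""
    out = bytearray()
    for enc, text in runs:
        payload = text.encode(enc)   # one byte per character
        out.append(HEADERS[enc])     # 1-byte language tag
        out.append(len(payload))     # 1-byte run length (toy limit: 255)
        out += payload
    return bytes(out)

def unpack_runs(data):
    """Decode the tagged runs back into a single string."""
    names = {v: k for k, v in HEADERS.items()}
    parts, i = [], 0
    while i < len(data):
        enc, n = names[data[i]], data[i + 1]
        parts.append(data[i + 2:i + 2 + n].decode(enc))
        i += 2 + n
    return "".join(parts)

packed = pack_runs([("latin1", "hello "), ("koi8_r", "привет")])
print(len(packed))          # 16 bytes: 12 of payload + 2 runs * 2 header bytes
print(unpack_runs(packed))  # hello привет
```

The per-run header cost is fixed, so single-language text (the common case above) pays almost nothing, while mixed-language text pays only per run rather than per character.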

Yes, the complexity of diacritics and combining characters will 
remain, but that is complexity that is inherent to the variety of 
written language.  UTF-8 is not: it is just a bad technical 
decision, likely chosen for ASCII compatibility and some 
misguided notion that being able to combine arbitrary language 
strings with no other metadata was worthwhile.  It is not.

