The Case Against Autodecode

Joakim via Digitalmars-d digitalmars-d at puremagic.com
Tue May 31 13:48:12 PDT 2016


On Tuesday, 31 May 2016 at 20:20:46 UTC, Marco Leise wrote:
> On Tue, 31 May 2016 16:29:33 +0000,
> Joakim <dlang at joakim.fea.st> wrote:
>
>> Part of it is the complexity of written language, part of it 
>> is bad technical decisions.  Building the default string type 
>> in D around the horrible UTF-8 encoding was a fundamental 
>> mistake, in terms of both efficiency and complexity.  I noted 
>> this in one of my first threads in this forum, and as Andrei 
>> said at the time, nobody agreed with me, with a lot of 
>> hand-waving about how efficiency wasn't an issue or how UTF-8 
>> arrays were fine.  Fast-forward a few years, and exactly the 
>> issues I raised are now causing pain.
>
> Maybe you can dig up your old post and we can look at each of 
> your complaints in detail.

Not interested.  I believe you were part of that thread then.  
Google it if you want to read it again.

>> UTF-8 is an antiquated hack that needs to be eradicated.  It 
>> forces all languages other than English to be twice as long, 
>> for no good reason; have fun with that when you're downloading 
>> text on a 2G connection in the developing world.  It is 
>> unnecessarily inefficient, which is precisely why 
>> auto-decoding is a problem.  It is only a matter of time till 
>> UTF-8 is ditched.
>
> You don't download twice the data.  First of all, some
> languages had two-byte encodings before UTF-8, and second,
> web content is full of HTML syntax and is gzip-compressed
> afterwards.

The vast majority of those languages' characters can be encoded 
in a single byte, and are unnecessarily forced to two or three 
bytes by the inefficient UTF-8/16 encodings.  HTML syntax is a 
non sequitur; compression helps, but it isn't as efficient as a 
proper encoding.
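
To put rough numbers on that, here's a quick D snippet; the Thai 
word and the counts are just an example I picked now, not 
anything from the earlier thread:

import std.stdio : writeln;
import std.range : walkLength;  // range primitives auto-decode narrow strings

void main()
{
    string english = "Thailand";  // ASCII subset: 1 byte per character in UTF-8
    string thai = "ประเทศไทย";     // the same word in Thai: 3 bytes per character
                                  // in UTF-8, 1 byte per character in TIS-620

    // .length counts UTF-8 code units (bytes); walkLength decodes
    // the string and counts code points.
    writeln(english.length, " bytes, ", english.walkLength, " code points");  // 8, 8
    writeln(thai.length, " bytes, ", thai.walkLength, " code points");        // 27, 9
}

That second number is produced by decoding the string one dchar 
at a time, which is exactly the per-character work auto-decoding 
imposes on every range operation over a string.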

> Take this Thai Wikipedia entry for example:
> https://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2
> The download of the gzipped html is 11% larger in UTF-8 than
> in Thai TIS-620 single-byte encoding. And that is dwarfed by
> the size of JS + images. (I don't have the numbers, but I
> expect the effective overhead to be ~2%).

Nobody on a 2G connection is waiting minutes to download such 
massive web pages.  They are mostly sending text to each other in 
their favorite chat apps, waiting longer and using up more of 
their mobile data quota whenever they're forced to use bad 
encodings.

> Ironically a lot of symbols we take for granted would then
> have to be implemented as HTML entities using their Unicode
> code points(sic!). Amongst them basic stuff like dashes, degree
> (°) and minute (′), accents in names, non-breaking space or
> footnotes (↑).

No, they just don't use HTML, opting for much superior mobile 
apps instead. :)

>> D devs should lead the way in getting rid of the UTF-8 
>> encoding, not bickering about how to make it more palatable.  
>> I suggested a single-byte encoding for most languages, with 
>> double-byte for the ones which wouldn't fit in a byte.  Use 
>> some kind of header or other metadata to combine strings of 
>> different languages, _rather than encoding the language into 
>> every character!_
>
> That would have put D on an island. "Some kind of header" would 
> be a horrible mess to have in strings, because you have to 
> account for it when concatenating strings and scan for it all 
> the time to see if there is some interspersed 2-byte encoding 
> in the stream. That's hardly better than UTF-8. And yes, a huge 
> number of websites mix scripts, and a lot of other text uses the 
> available extra symbols like ° or α, β, γ.

Let's see: a constant-time addition to a header, or decoding every 
character every time I want to manipulate the string... I wonder 
which is the better choice?!  You would not "intersperse" any 
other encodings unless you kept track of those substrings in the 
header.  My whole point is that such mixing of languages or "extra 
symbols" is an extreme minority use case: the vast majority of 
strings are in a single language.
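
Roughly what I have in mind, purely as a sketch; the names and 
layout below are made up on the spot and are not a concrete 
proposal:

// Hypothetical layout: a small header of language runs plus a payload of
// single-byte (or double-byte) encoded text.  Invented for illustration;
// this is not an actual D/Phobos type.
struct LangRun
{
    ushort langId;  // which codepage this run of bytes uses
    size_t offset;  // where the run starts in the payload
    size_t length;  // how many bytes the run covers
}

struct HeaderString
{
    LangRun[] runs;   // usually a single entry: one language per string
    ubyte[]   bytes;  // the raw encoded text, no per-character tagging
}

// Concatenation adjusts the header and copies the payload.  The header
// work is proportional to the number of language runs (normally one),
// not to the number of characters, and nothing is ever decoded.
HeaderString concat(HeaderString a, HeaderString b)
{
    HeaderString result;
    result.bytes = a.bytes ~ b.bytes;
    result.runs  = a.runs.dup;
    foreach (run; b.runs)  // run is a value copy, safe to adjust
    {
        run.offset += a.bytes.length;
        result.runs ~= run;
    }
    return result;
}

A mixed-language string just means more than one run in the 
header; a single-language string, the overwhelmingly common case, 
pays for exactly one.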

>> The common string-handling use case, by far, is strings in only 
>> one language, with strings containing some substrings in a 
>> second language a distant second, yet here we are putting the 
>> overhead into every character to allow inserting characters 
>> from an arbitrary language!  This is madness.
>
> No thx, madness was when we couldn't reliably open text files, 
> because the encoding was stored nowhere, and when you had to 
> compile programs for each of a dozen codepages so that localized 
> text would be rendered correctly. And your retro codepage 
> system won't convince the world to drop Unicode either.

Unicode _is_ a retro codepage system: they merely standardized a 
bunch of the most popular codepages.  So that's not going away no 
matter what system you use. :)
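
If you want to see how literally that is true, look at how the 
old single-byte tables sit inside Unicode.  The helper names 
below are mine, and I'm ignoring the handful of unassigned 
TIS-620 positions:

// ISO-8859-1 bytes are simply the first 256 code points, and the Thai
// range of TIS-620 maps onto the U+0E01..U+0E5B block by a fixed offset.
dchar latin1ToCodePoint(ubyte b)
{
    return cast(dchar) b;  // identity: 0x00..0xFF are U+0000..U+00FF
}

dchar tis620ToCodePoint(ubyte b)
{
    assert(b >= 0xA1, "Thai portion of TIS-620 starts at 0xA1");
    return cast(dchar)(b - 0xA1 + 0x0E01);  // e.g. 0xA1 -> U+0E01 (KO KAI)
}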

>> Yes, the complexity of diacritics and combining characters 
>> will remain, but that is complexity that is inherent to the 
>> variety of written language.  UTF-8 is not: it is just a bad 
>> technical decision, likely chosen for ASCII compatibility and 
>> some misguided notion that being able to combine arbitrary 
>> language strings with no other metadata was worthwhile.  It is 
>> not.
>
> The web proves you wrong. Scripts do get mixed often. Be it 
> Wikipedia, a foreign-language learning site, or mathematical 
> symbols.

Those are some of the least-trafficked parts of the web, which 
itself is dying off as the developing world comes online through 
mobile apps, not the bloated web stack.

Anyway, I'm not interested in rehashing this dumb argument again. 
The UTF-8/16 encodings are a horrible mess, and D made a big 
mistake by baking them in.

