The Case Against Autodecode

Marco Leise via Digitalmars-d digitalmars-d at puremagic.com
Tue May 31 13:20:46 PDT 2016


On Tue, 31 May 2016 16:29:33 +0000,
Joakim <dlang at joakim.fea.st> wrote:

> Part of it is the complexity of written language, part of it is 
> bad technical decisions.  Building the default string type in D 
> around the horrible UTF-8 encoding was a fundamental mistake, 
> both in terms of efficiency and complexity.  I noted this in one 
> of my first threads in this forum, and as Andrei said at the 
> time, nobody agreed with me, with a lot of hand-waving about how 
> efficiency wasn't an issue or that UTF-8 arrays were fine.  
> Fast-forward years later and exactly the issues I raised are now 
> causing pain.

Maybe you can dig up your old post and we can look at each of
your complaints in detail.

> UTF-8 is an antiquated hack that needs to be eradicated.  It 
> forces all other languages than English to be twice as long, for 
> no good reason, have fun with that when you're downloading text 
> on a 2G connection in the developing world.  It is unnecessarily 
> inefficient, which is precisely why auto-decoding is a problem.  
> It is only a matter of time till UTF-8 is ditched.

You don't download twice the data. First of all, several
languages already had two-byte encodings before UTF-8, and
second, web content is full of HTML markup and is
gzip-compressed anyway.
Take this Thai Wikipedia entry for example:
https://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2
The download of the gzipped HTML is 11% larger in UTF-8 than
in the single-byte Thai TIS-620 encoding, and even that is
dwarfed by the size of JS + images. (I don't have the numbers,
but I expect the effective overhead to be ~2%.)
Ironically, a lot of symbols we take for granted would then
have to be written as HTML entities referencing their Unicode
code points (sic!). Among them are basic things like dashes,
the degree (°) and minute (′) signs, accents in names, the
non-breaking space and footnote markers (↑).
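
Just to put numbers on it, here is a minimal D sketch (using the
article title from the link above as sample text) that compares
the raw UTF-8 byte count with the code-point count, which is
roughly what TIS-620 needs per character:

import std.stdio : writefln;
import std.range : walkLength;

void main()
{
    string thai = "ประเทศไทย";          // article title, "Thailand"
    size_t utf8Bytes  = thai.length;     // UTF-8 code units, i.e. bytes
    size_t codePoints = thai.walkLength; // decoded code points
    writefln("%s bytes in UTF-8 for %s code points",
             utf8Bytes, codePoints);
    // Thai letters live in U+0E01..U+0E5B, so each takes 3 bytes
    // in UTF-8 versus 1 byte in TIS-620, before markup and gzip
    // shrink the difference on a real page.
}

(Note the irony: walkLength only counts code points here because
of the very autodecoding this thread is about.)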

> D devs should lead the way in getting rid of the UTF-8 encoding, 
> not bickering about how to make it more palatable.  I suggested a 
> single-byte encoding for most languages, with double-byte for the 
> ones which wouldn't fit in a byte.  Use some kind of header or 
> other metadata to combine strings of different languages, _rather 
> than encoding the language into every character!_

That would have put D on an island. "Some kind of header"
would be a horrible mess to have in strings, because you have
to account for it when concatenating strings and scan for it
all the time to see whether some interspersed two-byte encoding
appears in the stream. That's hardly better than UTF-8. And
yes, a huge number of websites mix scripts, and a lot of other
text uses the extra symbols Unicode makes available, like ° or
α, β, γ.
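
Just to illustrate the bookkeeping (a purely hypothetical
sketch; the struct layout and codepage ids are my own
invention, not an actual proposal):

import std.stdio : writeln;

struct Segment
{
    ubyte codepage;            // which legacy 1- or 2-byte table applies
    immutable(ubyte)[] bytes;  // raw text encoded in that codepage
}

struct TaggedString
{
    Segment[] segments;
}

// Concatenation is no longer a plain byte append: the header
// lists have to be merged, and every consumer has to walk them
// to know how each run of bytes is decoded.
TaggedString concat(TaggedString a, TaggedString b)
{
    return TaggedString(a.segments ~ b.segments);
}

void main()
{
    immutable(ubyte)[] greek = [0xE1, 0xE2, 0xE3]; // α, β, γ in ISO 8859-7
    auto latinPart = TaggedString([Segment(0, cast(immutable(ubyte)[])"angles: ")]);
    auto greekPart = TaggedString([Segment(7, greek)]); // codepage ids made up
    auto mixed = concat(latinPart, greekPart);
    writeln(mixed.segments.length, " segments just to mix two scripts");
}

With plain UTF-8 the encoding information lives in the bytes
themselves, so appending and slicing stay trivial.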

> The common string-handling use case, by far, is strings with only 
> one language, with a distant second some substrings in a second 
> language, yet here we are putting the overhead into every 
> character to allow inserting characters from an arbitrary 
> language!  This is madness.

No thx, madness was when we couldn't reliably open text files
because the encoding was stored nowhere, and when you had to
compile programs for each of a dozen codepages just so
localized text would render correctly. And your retro codepage
system won't convince the world to drop Unicode either.

> Yes, the complexity of diacritics and combining characters will 
> remain, but that is complexity that is inherent to the variety of 
> written language.  UTF-8 is not: it is just a bad technical 
> decision, likely chosen for ASCII compatibility and some 
> misguided notion that being able to combine arbitrary language 
> strings with no other metadata was worthwhile.  It is not.

The web proves you wrong. Scripts do get mixed often, be it on
Wikipedia, on a foreign-language learning site, or in text
that uses mathematical symbols.

-- 
Marco


