The Case Against Autodecode
Marco Leise via Digitalmars-d
digitalmars-d at puremagic.com
Tue May 31 13:20:46 PDT 2016
On Tue, 31 May 2016 16:29:33 +0000,
Joakim <dlang at joakim.fea.st> wrote:
> Part of it is the complexity of written language, part of it is
> bad technical decisions. Building the default string type in D
> around the horrible UTF-8 encoding was a fundamental mistake,
> both in terms of efficiency and complexity. I noted this in one
> of my first threads in this forum, and as Andrei said at the
> time, nobody agreed with me, with a lot of hand-waving about how
> efficiency wasn't an issue or that UTF-8 arrays were fine.
> Fast-forward years later and exactly the issues I raised are now
> causing pain.
Maybe you can dig up your old post and we can look at each of
your complaints in detail.
> UTF-8 is an antiquated hack that needs to be eradicated. It
> forces all other languages than English to be twice as long, for
> no good reason, have fun with that when you're downloading text
> on a 2G connection in the developing world. It is unnecessarily
> inefficient, which is precisely why auto-decoding is a problem.
> It is only a matter of time till UTF-8 is ditched.
You don't download twice as much data. First of all, some
languages already had two-byte encodings before UTF-8, and
second, web content is full of HTML syntax and gets
gzip-compressed afterwards.
Take this Thai Wikipedia entry for example:
https://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2
The download of the gzipped HTML is 11% larger in UTF-8 than
in the single-byte Thai TIS-620 encoding. And that is dwarfed
by the size of JS + images. (I don't have the numbers, but I
expect the effective overhead to be ~2%.)
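
If you want to sanity-check a comparison like that yourself, here is
a minimal D sketch, not a benchmark: it transcodes a pure-Thai sample
to TIS-620 by using the fixed offset between Unicode's Thai block and
TIS-620's 0xA1..0xFB range, then compares raw and deflate-compressed
sizes via std.zlib. The sample text and the toy transcoder are my own
illustration, not the Wikipedia measurement above; real numbers need
the actual page and gzip.

import std.algorithm : map;
import std.array : array;
import std.stdio : writefln;
import std.zlib : compress;

void main()
{
    // Thai sample text ("Thailand"); D string literals are UTF-8.
    string utf8Text = "ประเทศไทย";

    // Toy transcoder: TIS-620 stores the Thai block at 0xA1..0xFB, a
    // fixed offset from Unicode's U+0E01..U+0E5B, so for pure Thai
    // text every code point becomes a single byte.
    ubyte[] tis620 = utf8Text
        .map!(c => cast(ubyte)(c - 0x0E01 + 0xA1))
        .array;

    writefln("raw bytes:      UTF-8 %s vs TIS-620 %s",
             utf8Text.length, tis620.length);

    // std.zlib.compress uses deflate, close enough to gzip for a
    // relative comparison; on a string this short the container
    // overhead dominates, so feed it a whole page for real numbers.
    writefln("deflated bytes: UTF-8 %s vs TIS-620 %s",
             compress(utf8Text).length, compress(tis620).length);
}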
Ironically, a lot of symbols we take for granted would then
have to be written as HTML entities referencing their Unicode
code points (sic!). Among them is basic stuff like dashes, the
degree (°) and minute (′) signs, accents in names, the
non-breaking space, and footnote markers (↑).
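
To make that overhead concrete, here is a little D sketch of my own
(the symbol list is just an illustration): it prints how many bytes
each symbol costs as raw UTF-8 versus as the numeric character
reference a single-byte page would have to use instead.

import std.format : format;
import std.stdio : writefln;
import std.utf : codeLength;

void main()
{
    // Degree, minute and the footnote arrow from the list above:
    // cheap as raw UTF-8, noticeably longer as character references.
    foreach (dchar c; "°′↑")
    {
        auto entity = format("&#x%X;", cast(uint) c);
        writefln("U+%04X  UTF-8: %s byte(s)  %s: %s bytes",
                 cast(uint) c, codeLength!char(c), entity, entity.length);
    }
}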
> D devs should lead the way in getting rid of the UTF-8 encoding,
> not bickering about how to make it more palatable. I suggested a
> single-byte encoding for most languages, with double-byte for the
> ones which wouldn't fit in a byte. Use some kind of header or
> other metadata to combine strings of different languages, _rather
> than encoding the language into every character!_
That would have put D on an island. "Some kind of header"
would be a horrible mess to have in strings, because you would
have to account for it when concatenating strings and scan for
it all the time to see whether a two-byte encoding is
interspersed in the stream. That's hardly better than UTF-8.
And yes, a huge number of websites mix scripts, and a lot of
other text uses the extra available symbols like ° or α, β, γ.
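
As a small illustration of why UTF-8 makes this a non-issue, here is
a D sketch (the sample strings are mine): mixed-script strings
concatenate as plain byte arrays, with no headers to merge and
nothing to re-scan.

import std.stdio : writeln;

void main()
{
    // UTF-8 is self-describing at the byte level, so concatenating
    // strings from different scripts is plain array concatenation.
    // A header-tagged format would have to merge or rewrite its
    // metadata on every such operation.
    string greek = "α, β, γ";
    string thai  = "ประเทศไทย";
    string units = "25 °C, 3′";

    string mixed = greek ~ " / " ~ thai ~ " / " ~ units;
    writeln(mixed);          // prints the mixed-script text unchanged
    writeln(mixed.length);   // just the byte count, no extra metadata
}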
> The common string-handling use case, by far, is strings with only
> one language, with a distant second some substrings in a second
> language, yet here we are putting the overhead into every
> character to allow inserting characters from an arbitrary
> language! This is madness.
No, thanks. Madness was when we couldn't reliably open text
files because the encoding was stored nowhere, and when you had
to compile programs for each of a dozen codepages so that
localized text would be rendered correctly. And your retro
codepage system won't convince the world to drop Unicode either.
> Yes, the complexity of diacritics and combining characters will
> remain, but that is complexity that is inherent to the variety of
> written language. UTF-8 is not: it is just a bad technical
> decision, likely chosen for ASCII compatibility and some
> misguided notion that being able to combine arbitrary language
> strings with no other metadata was worthwhile. It is not.
The web proves you wrong. Scripts do get mixed often, be it on
Wikipedia, on a foreign-language learning site, or in text full
of mathematical symbols.
--
Marco