Why UTF-8/16 character encodings?

H. S. Teoh hsteoh at quickfur.ath.cx
Fri May 24 15:42:37 PDT 2013


On Sat, May 25, 2013 at 01:21:25AM +0400, Dmitry Olshansky wrote:
> 24-May-2013 21:05, Joakim wrote:
[...]
> >This triggered a long-standing bugbear of mine: why are we using
> >these variable-length encodings at all?  Does anybody really care
> >about UTF-8 being "self-synchronizing," i.e. does anybody actually use
> >that in this day and age?  Sure, it's backwards-compatible with ASCII
> >and the vast majority of usage is probably just ASCII, but that means
> >the other languages don't matter anyway.  Not to mention taking the
> >valuable 8-bit real estate for English and dumping the longer
> >encodings on everyone else.
> >
> >I'd just use a single-byte header to signify the language and then
> >put the vast majority of languages in a single byte encoding, with
> >the few exceptional languages with more than 256 characters encoded
> >in two bytes.
> 
> You seem to think that not only is UTF-8 a bad encoding but also that
> one unified encoding (code-space) is bad(?).
> 
> Separate code spaces were the case before Unicode (and UTF-8). The
> problem is not only that without a header the text is meaningless (no
> easy slicing) but that the encoding of the data after the header
> depends on a whole variety of factors - a list of encodings, actually.
> Now everybody has to keep a (code) page per language just to know
> whether it's 2 bytes per char or 1 byte per char or whatever. And that
> still assumes there are no combining marks or region-specific stuff :)

I remember those bad ole days of gratuitously-incompatible encodings. I
hope those days never ever return. You'd get a text file in some unknown
encoding, and the only way to make any sense of it was to guess what
encoding it might be and hope you got lucky. Not only that, the same
language often had multiple encodings, so adding support for a single
new language required supporting several new encodings and being able
to tell them apart -- often with no info on which one it was, if you
were lucky, or, if you were unlucky, with a *wrong* encoding label (for
example, I *still* get email from outdated systems that claim to be
iso-8859 when the text is actually KOI8-R).

Prepending the encoding to the data doesn't help, because it's pretty
much guaranteed that somebody will cut-n-paste some segment of that data
and save it without the encoding header (or worse, some program will try
to paper over broken low-level code by prepending a default encoding
header to everything, regardless of whether the data is actually in that
encoding or not), thus ensuring that nobody will be able to reliably
recognize what encoding it is down the road.


> In fact it was even "better": nobody ever talked about a header, they
> just assumed a codepage from some global setting. Imagine yourself
> creating a font rendering system these days - a hell of an exercise in
> frustration (okay, how do I render 0x88? hmm, if that is in codepage
> XYZ then ...).

Not to mention, if the sysadmin changes the default locale settings, you
may suddenly discover that a bunch of your text files have become
gibberish, because some programs blindly assume that every text file is
in the current locale-specified language.

I tried writing language-agnostic text-processing programs in C/C++
before the widespread adoption of Unicode. It was a living nightmare.
The POSIX spec *seems* to promise language-independence with its locale
functions, but in practice the whole thing is one big inconsistent,
under-specified mess full of implementation-specific behaviours that you
can't rely on.  The APIs basically assume that you set your locale's
language once, never change it, and that every single file you'll ever
want to read is encoded in that particular encoding. If you try to read
another encoding, too bad, you're screwed. There isn't even a standard
for locale names that you could use to switch locales manually inside
your program (yes, there are de facto conventions, but there *are*
systems out there that don't follow them).

And many standard library functions are affected by locale settings
(once you call setlocale, *anything* could change: string comparison,
output encoding, etc.), making it a hairy mess to get input/output of
multiple encodings to work correctly. Basically, you have to write
everything manually, because the standard library can't handle more than
a single encoding correctly (well, not without extreme amounts of pain,
that is). So you're back to manipulating bytes directly. Which means you
have to keep large tables for every single encoding you ever wish to
support, plus encoding-specific code to deal with those evil variant
encodings that are supposedly the same as the official standard but
actually have one or two subtle differences that cause your program to
output embarrassing garbage characters every now and then.
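
To give a concrete (if minimal) illustration of the setlocale problem in
C -- the "de_DE.UTF-8" locale name below is just an assumption, since
locale names themselves aren't standardized -- note how a single
setlocale() call silently changes the behaviour of a completely
unrelated printf() call (and strcoll(), wide-char conversion, etc.
change right along with it):

    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        /* At program start we're in the default "C" locale: the decimal
         * separator is '.', strcoll() is plain byte comparison, etc. */
        printf("%.2f\n", 3.14);                   /* prints 3.14 */

        /* One call later, the *same* printf behaves differently.
         * "de_DE.UTF-8" is an assumed locale name; it may not exist on
         * a given system, hence the NULL check. */
        if (setlocale(LC_ALL, "de_DE.UTF-8") != NULL)
            printf("%.2f\n", 3.14);               /* prints 3,14 */

        return 0;
    }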

For all of its warts, Unicode fixed a WHOLE bunch of these problems, and
made cross-linguistic data sane to handle without pulling out your hair,
many times over.  And now we're trying to go back to that nightmarish
old world again? No way, José!


[...]
> >Make your header a little longer and you could handle those also.
> >Yes, it wouldn't be strictly backwards-compatible with ASCII, but it
> >would be so much easier to internationalize.  Of course, there's also
> >the monoculture we're creating; love this UTF-8 rant by tuomov,
> >author of one of the first tiling window managers for Linux:
> >
> We want monoculture! That is, to understand each other without all
> these "par-le-vu-france?" and codepages of various complexity
> (insanity).

Yeah, those codepages were an utter nightmare to deal with. Everybody
and his neighbour's dog invented their own codepage, sometimes multiple
codepages for a single language, all of which were gratuitously
incompatible with each other. Every codepage had its own peculiarities
and exceptions, and programs had to know how to deal with all of them,
only to get broken again as soon as somebody invented yet another
codepage two years later, or created yet another codepage variant just
for the heck of it.

If you're really concerned about encoding size, just use a compression
library -- they're readily available these days. Internally, the program
can just use UTF-16 for the most part -- UTF-32 is really only necessary
if you're routinely delving outside the BMP, which is very rare.
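
To make the BMP point concrete, here's a minimal sketch in C of how
UTF-16 decoding works (the helper name decode_utf16 is mine, purely for
illustration, and there's no error handling): everything up to U+FFFF is
a single 16-bit unit, and only the rare code points above that need a
surrogate pair.

    #include <stdint.h>
    #include <stdio.h>

    /* Decode one code point from UTF-16; returns the number of 16-bit
     * units consumed (1 inside the BMP, 2 for a surrogate pair). */
    static int decode_utf16(const uint16_t *s, uint32_t *cp)
    {
        if (s[0] >= 0xD800 && s[0] <= 0xDBFF) {        /* high surrogate */
            *cp = 0x10000 + (((uint32_t)s[0] - 0xD800) << 10)
                          + ((uint32_t)s[1] - 0xDC00); /* low surrogate  */
            return 2;
        }
        *cp = s[0];                                    /* BMP: one unit  */
        return 1;
    }

    int main(void)
    {
        /* 'A', a CJK ideograph, and an emoji (the only non-BMP one). */
        const uint16_t text[] = { 0x0041, 0x4E2D, 0xD83D, 0xDE00 };
        uint32_t cp;
        for (size_t i = 0; i < sizeof text / sizeof text[0]; )
        {
            i += decode_utf16(text + i, &cp);
            printf("U+%04X\n", cp);       /* U+0041, U+4E2D, U+1F600 */
        }
        return 0;
    }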

As far as Phobos is concerned, Dmitry's new std.uni module has powerful
code-generation templates that let you write code that operates directly
on UTF-8 without needing to convert to UTF-32 first. Well, OK, maybe
we're not quite there yet, but the foundations are in place, and I'm
looking forward to the day when string functions will no longer
implicitly convert to UTF-32, but will directly manipulate UTF-8 using
optimized state tables generated by std.uni.


> Want small? Use compression schemes, which work perfectly fine and
> get you to the precious 1 byte per codepoint with exceptional speed.
> http://www.unicode.org/reports/tr6/

+1.  Using your own encoding is perfectly fine. Just don't do that for
data interchange. Unicode was created because we *want* a single
standard to communicate with each other without stupid broken encoding
issues that used to be rampant on the web before Unicode came along.

In the bad ole days, HTML could be served in any number of random
encodings, often out of sync with what the server claimed the encoding
was, and browsers would assume arbitrary default encodings that for the
most part *appeared* to work but were actually fundamentally b0rken.
Sometimes webpages would show up mostly intact, but with a few
characters mangled because of deviations / variations in codepage
interpretation, or non-standard characters being used in a particular
encoding. It was a total, utter mess that wasted who knows how many
man-hours of programming time to work around. For data interchange on
the internet, we NEED a universal standard that everyone can agree on.


> >http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06
> >
> >The emperor has no clothes, what am I missing?
> 
> And borrowing the arguments from that rant: locale is borked shit
> when it comes to encodings. Locales should be used for tweaking
> visuals like number formats, date display and so on.
[...]

I found that rant rather incoherent. I didn't find any convincing
arguments as to why we should return to the bad old scheme of codepages
and gratuitous complexity, just a lot of grievances about why
monoculture is "bad" without much supporting evidence.

UTF-8, for all its flaws, is remarkably resilient to mangling -- you can
cut-n-paste any byte sequence and the receiving end can still make some
sense of it. Not like the bad old days of codepages, where you'd just
get one gigantic block of gibberish. A properly-written UTF-8 decoder
can resynchronize and still recover legible data, with maybe only a few
characters truncated at the ends in the worst case. I don't see how any
codepage-based encoding is an improvement over this.
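
In fact the resynchronization is so simple that it fits in a few lines;
here's a minimal C sketch (the function name utf8_resync is mine, just
for illustration). Because continuation bytes always have the form
10xxxxxx, a decoder dropped into the middle of a sequence only needs to
skip forward to the next lead byte and can pick up cleanly from there:

    #include <stdio.h>
    #include <string.h>

    /* Skip any continuation bytes (10xxxxxx) so that decoding can resume
     * at the next code point boundary. Illustrative helper only. */
    static size_t utf8_resync(const unsigned char *s, size_t i, size_t len)
    {
        while (i < len && (s[i] & 0xC0) == 0x80)
            i++;
        return i;
    }

    int main(void)
    {
        /* "héllo wörld" in UTF-8, cut in the middle of the 2-byte 'é':
         * only the stray continuation byte 0xA9 is unrecoverable. */
        const unsigned char mangled[] = { 0xA9, 'l', 'l', 'o', ' ',
                                          'w', 0xC3, 0xB6, 'r', 'l', 'd', 0 };
        size_t start = utf8_resync(mangled, 0,
                                   strlen((const char *)mangled));
        printf("%s\n", (const char *)mangled + start);  /* "llo wörld" */
        return 0;
    }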


T

-- 
There are 10 kinds of people in the world: those who can count in
binary, and those who can't.

