Why UTF-8/16 character encodings?

Joakim joakim at airpost.net
Sat May 25 00:13:50 PDT 2013


On Friday, 24 May 2013 at 22:44:24 UTC, H. S. Teoh wrote:
> I remember those bad ole days of gratuitously-incompatible encodings.
> I wish those days will never ever return again. You'd get a text file
> in some unknown encoding, and the only way to make any sense of it
> was to guess what encoding it might be and hope you get lucky. Not
> only so, the same language often has multiple encodings, so adding
> support for a single new language required supporting several new
> encodings and being able to tell them apart (often with no info on
> which they are, if you're lucky, or if you're unlucky, with *wrong*
> encoding type specs -- for example, I *still* get email from outdated
> systems that claim to be iso-8859 when it's actually KOI8R).
This is an argument for UCS, not UTF-8.

> Prepending the encoding to the data doesn't help, because it's pretty
> much guaranteed somebody will cut-n-paste some segment of that data
> and save it without the encoding type header (or worse, some program
> will try to "fix" broken low-level code by prepending a default
> encoding type to everything, regardless of whether it's actually in
> that encoding or not), thus ensuring nobody will be able to reliably
> recognize what encoding it is down the road.
This problem already exists for UTF-8, whose optional byte order 
mark breaks ASCII compatibility in the process:

http://en.wikipedia.org/wiki/Byte_order_mark

Well, at the very least it prepends garbage bytes to 
otherwise-ASCII data, just as my header would do. ;)
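
To make that concrete: the BOM is nothing but three non-ASCII bytes 
sitting in front of your data. A quick D illustration:

import std.stdio : writefln;

void main()
{
    // "hi" as saved by an editor that prepends a UTF-8 BOM: the first
    // three bytes fall outside ASCII, so a naive ASCII consumer sees
    // garbage in front of the real data.
    immutable ubyte[] withBom = [0xEF, 0xBB, 0xBF, 'h', 'i'];
    foreach (b; withBom)
        writefln("0x%02X  ascii? %s", b, b < 0x80);
}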

> For all of its warts, Unicode fixed a WHOLE bunch of these problems,
> and made cross-linguistic data sane to handle without pulling out
> your hair, many times over.  And now we're trying to go back to that
> nightmarish old world again? No way, José!
No, I'm suggesting going back to one element of that "old world," 
single-byte encodings, but using UCS or some other standardized 
character set to avoid all those incompatible code pages you had 
to deal with.
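
To show the shape of what I mean, here's a minimal hypothetical 
sketch in D -- the layout and names are invented, but the idea is a 
header selecting a 256-code-point window of UCS, after which every 
character is exactly one byte:

// Hypothetical sketch only -- the layout and names are invented.  A
// short header picks a 256-code-point window of UCS, and each payload
// byte is an offset into that window: one byte per character, O(1)
// indexing, no decoding step.
struct WindowedText
{
    uint ucsBase;                // first code point of the window
    immutable(ubyte)[] payload;  // exactly one byte per character

    dchar opIndex(size_t i) const pure
    {
        return cast(dchar)(ucsBase + payload[i]);
    }

    size_t length() const pure { return payload.length; }
}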

> If you're really concerned about encoding size, just use a
> compression library -- they're readily available these days.
> Internally, the program can just use UTF-16 for the most part --
> UTF-32 is really only necessary if you're routinely delving outside
> BMP, which is very rare.
True, but you're still doubling your string size with UTF-16 and 
non-ASCII text.  My concerns are the following, in order of 
importance:

1. Lost programmer productivity due to these dumb variable-length 
encodings.  That is the biggest loss from UTF-8's complexity.

2. Lost speed and memory, either from processing an unnecessarily 
complex variable-length encoding or from translating everything to 
32-bit UTF-32 to get back to constant width (see the sketch after 
this list).

3. Lost bandwidth from using a fatter encoding.
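
On point 2: the byte index and the character index of a UTF-8 string 
diverge as soon as you leave ASCII, so counting or indexing characters 
forces a decode. D's own string type makes this easy to see:

import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    string s = "héllo";    // in UTF-8, the 'é' occupies two bytes
    writeln(s.length);     // 6 -- code units (bytes)
    writeln(s.walkLength); // 5 -- code points, found only by decoding
    // Reaching the Nth character means decoding every byte before it:
    // O(n), versus O(1) indexing in a constant-width encoding.
}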

> As far as Phobos is concerned, Dmitry's new std.uni module has
> powerful code-generation templates that let you write code that
> operates directly on UTF-8 without needing to convert to UTF-32
> first. Well, OK, maybe we're not quite there yet, but the foundations
> are in place, and I'm looking forward to the day when string
> functions will no longer have implicit conversion to UTF-32, but will
> directly manipulate UTF-8 using optimized state tables generated by
> std.uni.
There is no way this can ever be as performant as a 
constant-width single-byte encoding.
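
Even the leanest table-driven walk pays a table lookup and a variable 
stride per character, where a single-byte encoding pays a bare 
increment. A minimal sketch (not Phobos code; the four-bit prefix 
table just follows the standard UTF-8 lead-byte layout):

// Minimal sketch: even the cheapest table-driven UTF-8 walk does a
// lookup and a variable stride per character.  The table maps the top
// four bits of a lead byte to the sequence length; 0 marks a stray
// continuation byte.
immutable ubyte[16] utf8Stride = [1,1,1,1,1,1,1,1, 0,0,0,0, 2,2,3,4];

size_t countChars(const(ubyte)[] s)
{
    size_t n, i;
    while (i < s.length)
    {
        immutable stride = utf8Stride[s[i] >> 4];
        i += stride ? stride : 1;  // a stray continuation byte counts
                                   // as one bad character
        ++n;
    }
    return n;
}
// With a constant-width single-byte encoding, the equivalent is just
// s.length: no lookup, no branch, no decode.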

> +1.  Using your own encoding is perfectly fine. Just don't do that
> for data interchange. Unicode was created because we *want* a single
> standard to communicate with each other without stupid broken
> encoding issues that used to be rampant on the web before Unicode
> came along.
>
> In the bad ole days, HTML could be served in any random number of
> encodings, often out-of-sync with what the server claims the encoding
> is, and browsers would assume arbitrary default encodings that for
> the most part *appeared* to work but are actually fundamentally
> b0rken. Sometimes webpages would show up mostly-intact, but with a
> few characters mangled, because of deviations / variations on
> codepage interpretation, or non-standard characters being used in a
> particular encoding. It was a total, utter mess, that wasted who
> knows how many man-hours of programming time to work around. For data
> interchange on the internet, we NEED a universal standard that
> everyone can agree on.
I disagree.  This is not an indictment of multiple encodings; it 
is an indictment of multiple unspecified or _broken_ encodings.  
Given how difficult UTF-8 is to get right, all you've likely done 
is replace multiple broken encodings with a single encoding that 
has multiple broken implementations.
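
Overlong encodings are a concrete example of how decoders get it 
wrong. A quick check against Phobos's own validator (std.utf.validate 
is real Phobos; the two-byte sequence is the classic overlong '/'):

import std.stdio : writeln;
import std.utf : validate, UTFException;

void main()
{
    // The overlong encoding of '/': forbidden by the spec, yet
    // historically accepted by naive decoders, which turned it into a
    // directory-traversal exploit.  A correct decoder must reject it.
    immutable ubyte[] raw = [0xC0, 0xAF];
    auto bad = cast(string) raw;
    try
    {
        validate(bad);
        writeln("accepted an overlong sequence -- broken decoder");
    }
    catch (UTFException)
    {
        writeln("rejected, as the spec requires");
    }
}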

> UTF-8, for all its flaws, is remarkably resilient to mangling -- you
> can cut-n-paste any byte sequence and the receiving end can still
> make some sense of it.  Not like the bad old days of codepages where
> you just get one gigantic block of gibberish. A
> properly-synchronizing UTF-8 function can still recover legible data,
> maybe with only a few characters at the ends truncated in the worst
> case. I don't see how any codepage-based encoding is an improvement
> over this.
Have you ever used this self-synchronizing feature of UTF-8?  Have 
you ever heard of anyone using it?  There is no reason why this 
kind of limited data-integrity checking should be rolled into the 
encoding.  Maybe this made sense two decades ago when everyone had 
plans to stream text or something, but nobody does that nowadays.  
Just put a checksum in your header and you're good to go.
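
That checksummed header is a few lines with Phobos's std.digest.crc 
(the frame layout here is made up for illustration):

import std.digest.crc : crc32Of;

// Hypothetical framing, names invented: a 4-byte CRC32 up front gives
// the integrity check people credit to UTF-8's self-synchronization,
// and it actually detects corruption instead of merely skipping it.
ubyte[] frame(const(ubyte)[] payload)
{
    auto framed = new ubyte[](4 + payload.length);
    immutable crc = crc32Of(payload);  // ubyte[4]
    framed[0 .. 4] = crc[];
    framed[4 .. $] = payload[];
    return framed;
}

bool verify(const(ubyte)[] framed)
{
    if (framed.length < 4) return false;
    const crc = crc32Of(framed[4 .. $]);
    return framed[0 .. 4] == crc[];
}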

Unicode is still a "codepage-based encoding"; nothing has changed 
in that regard.  All UCS did was standardize a bunch of 
pre-existing code pages, so that some of the redundancy was taken 
out.  Unfortunately, the UTF-8 encoding then bloated the 
transmission format and tempted devs to use this unnecessarily 
complex format for processing too.

