Why UTF-8/16 character encodings?

Joakim joakim at airpost.net
Fri May 24 10:05:55 PDT 2013


On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
> toUpper/lower cannot be made in place if it should handle all 
> Unicode. Some characters will change their length when converted 
> to/from uppercase. Examples of these are the German double S 
> and some Turkish I.
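Right, the length change is real and goes both ways. A quick illustration in Python (not D, but the Unicode case mappings are the same; the Turkish I case is locale-sensitive and not shown here):

```python
s = "straße"           # contains the German sharp s (U+00DF)
u = s.upper()          # full Unicode case mapping: ß -> "SS"
print(u)               # STRASSE
print(len(s), len(u))  # 6 7 -- one code point became two

# It also happens in reverse: the fi ligature (U+FB01) shrinks
# from 3 UTF-8 bytes to 2 when uppercased to "FI".
print(len("ﬁ".encode("utf-8")), len("ﬁ".upper().encode("utf-8")))  # 3 2
```

So an in-place toUpper/toLower is impossible in general no matter which UTF encoding you pick; the buffer itself has to grow or shrink.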

This triggered a long-standing bugbear of mine: why are we using 
these variable-length encodings at all?  Does anybody really care 
about UTF-8 being "self-synchronizing," i.e. does anybody actually 
use it in this day and age?  Sure, it's backwards-compatible 
with ASCII, and the vast majority of usage is probably just ASCII, 
but by that reasoning the other languages don't matter anyway.  Not to 
mention taking the valuable 8-bit real estate for English and 
dumping the longer encodings on everyone else.
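To be fair about what self-synchronization buys you, here's the property in a few lines of Python (the resync helper is my own sketch, not any standard API): continuation bytes always match the bit pattern 10xxxxxx, so from an arbitrary offset you can find a character boundary without decoding from the start.

```python
def resync(buf, i):
    # Back up from an arbitrary offset to the start of the current
    # UTF-8 sequence.  Every continuation byte is in 0x80-0xBF
    # (top bits 10), so lead bytes are always distinguishable:
    # that is the "self-synchronizing" property.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

data = "héllo".encode("utf-8")    # b'h\xc3\xa9llo'
print(resync(data, 2))            # 1: offset 2 is mid-character, so
                                  # back up to the 0xC3 lead byte
print(data[1:3].decode("utf-8"))  # é
```

My question stands, though: outside of seeking in damaged or truncated streams, how often does anyone actually rely on this?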

I'd just use a single-byte header to signify the language and 
then put the vast majority of languages in a single byte 
encoding, with the few exceptional languages with more than 256 
characters encoded in two bytes.  OK, that doesn't cover 
multi-language strings, but that is what, .000001% of usage?  
Make your header a little longer and you could handle those also. 
Yes, it wouldn't be strictly backwards-compatible with ASCII, 
but it would be so much easier to internationalize.  Of course, 
there's also the monoculture we're creating; love this UTF-8 rant 
by tuomov, author of one of the first tiling window managers for 
Linux:

http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06
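To be concrete about the header scheme I'm proposing, here's a toy sketch in Python (the page table, the 0x01 language id, and both function names are made up for illustration; the idea is similar in spirit to ISO 2022 escape sequences or Unicode's SCSU compression scheme):

```python
# One header byte selects a 256-slot "language page"; each character
# is then a single byte offset into that page's block of code points.
# The table below is purely hypothetical.
PAGES = {0x00: 0x0000,   # Basic Latin / Latin-1 block
         0x01: 0x0400}   # Cyrillic block, for example

def encode(lang, text):
    base = PAGES[lang]
    out = bytearray([lang])
    for ch in text:
        off = ord(ch) - base
        if not 0 <= off < 256:
            raise ValueError("character outside this language page")
        out.append(off)
    return bytes(out)

def decode(data):
    base = PAGES[data[0]]
    return "".join(chr(base + b) for b in data[1:])

msg = encode(0x01, "привет")
print(len(msg))      # 7: one header byte plus one byte per character
print(decode(msg))   # привет
```

The same Russian word is 12 bytes in UTF-8; here it's 7, and every character is fixed-width, so indexing and in-place transforms within a page stay trivial.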

The emperor has no clothes; what am I missing?


More information about the Digitalmars-d mailing list