Why UTF-8/16 character encodings?
Joakim
joakim at airpost.net
Fri May 24 10:05:55 PDT 2013
On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
> toUpper/lower cannot be made in place if it should handle all
> Unicode. Some characters will change their length when converted
> to/from uppercase. Examples of these are the German sharp S
> and some Turkish I.
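For anyone who hasn't run into this: here is a minimal sketch of the length change, shown in Python since its case mapping follows the same Unicode rules (D's std.uni behaves the same way):

```python
# Uppercasing can grow a string, so an in-place toUpper over a
# fixed-size buffer cannot handle full Unicode.
s = "straße"           # contains the German sharp S (ß)
u = s.upper()          # ß uppercases to "SS"
print(s, "->", u)      # straße -> STRASSE
print(len(s), len(u))  # 6 7 -- one code point became two
```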
This triggered a long-standing bugbear of mine: why are we using
these variable-length encodings at all? Does anybody really care
about UTF-8 being "self-synchronizing," i.e., does anybody actually
use that in this day and age? Sure, it's backwards-compatible
with ASCII and the vast majority of usage is probably just ASCII,
but that means the other languages don't matter anyway. Not to
mention taking the valuable 8-bit real estate for English and
dumping the longer encodings on everyone else.
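(For reference, the self-synchronizing property being dismissed here is that every UTF-8 continuation byte starts with the bits 10, so a decoder landing at an arbitrary offset can always find the nearest code-point boundary. A minimal Python sketch:)

```python
def sync_back(data: bytes, i: int) -> int:
    """Step backwards from offset i to the start of the enclosing
    UTF-8 code point: continuation bytes are always 0b10xxxxxx,
    so lead bytes are easy to spot without decoding from byte 0."""
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

buf = "héllo".encode("utf-8")  # é is two bytes: 0xC3 0xA9
print(sync_back(buf, 2))       # 2 points at é's continuation byte -> 1
print(sync_back(buf, 3))       # 3 points at 'l', already a boundary -> 3
```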
I'd just use a single-byte header to signify the language and
then put the vast majority of languages in a single byte
encoding, with the few exceptional languages with more than 256
characters encoded in two bytes. OK, that doesn't cover
multi-language strings, but that is what, .000001% of usage?
Make your header a little longer and you could handle those also.
Yes, it wouldn't be strictly backwards-compatible with ASCII,
but it would be so much easier to internationalize. Of course,
there's also the monoculture we're creating; I love this UTF-8 rant
by tuomov, author of one of the first tiling window managers for
Linux:
http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06
The emperor has no clothes, what am I missing?