Why UTF-8/16 character encodings?

Dmitry Olshansky dmitry.olsh at gmail.com
Fri May 24 14:21:25 PDT 2013


24-May-2013 21:05, Joakim wrote:
> On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
>> toUpper/lower cannot be made in place if it should handle all Unicode.
>> Some characters will change their length when converted to/from
>> uppercase. Examples of these are the German double S and some Turkish I.
>
> This triggered a long-standing bugbear of mine: why are we using these
> variable-length encodings at all?  Does anybody really care about UTF-8
> being "self-synchronizing," ie does anybody actually use that in this
> day and age?  Sure, it's backwards-compatible with ASCII and the vast
> majority of usage is probably just ASCII, but that means the other
> languages don't matter anyway.  Not to mention taking the valuable 8-bit
> real estate for English and dumping the longer encodings on everyone else.
>
> I'd just use a single-byte header to signify the language and then put
> the vast majority of languages in a single byte encoding, with the few
> exceptional languages with more than 256 characters encoded in two
> bytes.
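
Side note on the case-mapping point quoted at the top: the length change 
is trivial to see. A rough, untested D sketch (nothing D-specific about 
the idea):

    import std.stdio;

    void main()
    {
        wstring lower = "straße"w;   // 6 UTF-16 code units
        wstring upper = "STRASSE"w;  // its uppercase form: 7 code units
        // code-unit counts differ, so no strict in-place conversion
        writeln(lower.length, " -> ", upper.length);
    }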

You seem to think that not only is UTF-8 a bad encoding, but that a 
single unified encoding (code space) is bad as well(?)

Separate code spaces were the norm before Unicode (and UTF-8). The 
problem is not only that without the header the text is meaningless (no 
easy slicing), but that the encoding of the data after the header 
depends on a variety of factors - in effect, on a whole list of 
encodings. Everybody then has to keep a (code) page per language just to 
know whether it's 2 bytes per char, 1 byte per char, or whatever. And 
that still assumes there are no combining marks or region-specific 
quirks :)
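
Contrast that with UTF-8's self-synchronization, which is exactly what 
makes header-free slicing possible: continuation bytes always look like 
10xxxxxx, so from any byte offset you can back up to a code point 
boundary. A minimal sketch (untested, the idea is all that matters):

    // Resynchronize to the start of a code point from an arbitrary byte offset.
    size_t snapToCodePointStart(const(char)[] s, size_t i)
    {
        while (i > 0 && (s[i] & 0xC0) == 0x80)  // skip continuation bytes
            --i;
        return i;
    }

No codepage table, no header, no global state - the bytes themselves 
tell you where characters start.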

In fact it was even "better": nobody ever talked about a header, they 
just assumed a codepage from some global setting. Imagine writing a font 
rendering system under that regime these days - a hell of an exercise in 
frustration (okay, how do I render 0x88? Hmm, if that is in codepage XYZ 
then ...).
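
To make the 0x88 example concrete (mappings quoted from memory - 
double-check the real tables, the point is only that the byte alone is 
ambiguous):

    // The same byte means different things under different legacy codepages.
    dchar decode0x88(string codepage)
    {
        switch (codepage)
        {
            case "windows-1252": return '\u02C6'; // modifier circumflex accent
            case "windows-1251": return '\u20AC'; // euro sign
            default: assert(0, "need yet another table...");
        }
    }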

> OK, that doesn't cover multi-language strings, but that is what,
> .000001% of usage?

This just shows you don't care about multilingual text at all. Imagine 
any language tutor/translator/dictionary on the Web. For instance, most 
languages need to intersperse ASCII (keep e.g. HTML markup in mind as 
well). Books often feature citations in the original language (or e.g. 
Latin) alongside translations.
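
A quick illustration of that interspersing, assuming nothing beyond 
Phobos: ASCII markup stays single-byte, the non-Latin content is 
multi-byte, and no header or mode switch appears anywhere in the string:

    import std.stdio;
    import std.range;

    void main()
    {
        string s = "<p lang=\"ru\">привет</p>";
        writeln(s.length);      // UTF-8 code units (bytes)
        writeln(s.walkLength);  // code points
    }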

Now also take into account math symbols, currency symbols and beyond. 
These days cultures mix in wild combinations, so you might need to 
display text even if you can't read it. Unicode is not just about 
"encoding characters from all languages"; it needs to provide a 
universal representation of the symbols used in writing systems at large.

> Make your header a little longer and you could handle
> those also.  Yes, it wouldn't be strictly backwards-compatible with
> ASCII, but it would be so much easier to internationalize.  Of course,
> there's also the monoculture we're creating; love this UTF-8 rant by
> tuomov, author of one of the first tiling window managers for Linux:
>
We want a monoculture! That is, we want to understand each other without 
all the "parlez-vous français?" and codepages of varying 
complexity (insanity).

If you want small, use compression schemes - they work perfectly well 
and get you to the precious 1 byte per code point with exceptional speed:
http://www.unicode.org/reports/tr6/
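
Very roughly, the windowing idea behind it (this is NOT the actual TR6 
algorithm, just the intuition for why 1 byte per code point is 
achievable):

    // Toy sketch: code points inside the active 128-wide window cost one byte,
    // ASCII passes through, and a real SCSU encoder would switch windows
    // instead of giving up.
    ubyte[] toyWindowEncode(dstring text, dchar windowBase)
    {
        ubyte[] buf;
        foreach (dchar c; text)
        {
            if (c < 0x80)
                buf ~= cast(ubyte) c;                        // ASCII: 1 byte
            else if (c >= windowBase && c < windowBase + 0x80)
                buf ~= cast(ubyte)(0x80 + (c - windowBase)); // in-window: 1 byte
            else
                assert(0, "real SCSU switches windows here");
        }
        return buf;
    }

e.g. toyWindowEncode("привет"d, '\u0400') comes out at one byte per letter.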

> http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06
>
> The emperor has no clothes, what am I missing?

And borrowing an argument from that very rant: locale is a borked mess 
when it comes to encodings. Locales should be used for tweaking 
presentation - number formatting, date display and so on.

-- 
Dmitry Olshansky

