Why UTF-8/16 character encodings?

H. S. Teoh hsteoh at quickfur.ath.cx
Sat May 25 13:42:10 PDT 2013


On Sat, May 25, 2013 at 10:07:41AM +0200, Joakim wrote:
[...]
> The vast majority of non-English alphabets in UCS can be encoded in
> a single byte.  It is your exceptions that are not relevant.

I'll have you know that Chinese, Korean, and Japanese account for a
significant percentage of the world's population, and therefore
arguments about "vast majority" are kinda missing the forest for the
trees. If you count the number of *alphabets* that can be encoded in a
single byte, you can get a majority, but that in no way reflects actual
usage.

[...]
> >The only alternatives to a variable width encoding I can see are:
> >- Single code page per string
> >This is completely useless because now you can't concatenate
> >strings of different code pages.
> I wouldn't be so fast to ditch this.  There is a real argument to be
> made that strings of different languages are sufficiently different
> that there should be no multi-language strings.  Is this the best
> route?  I'm not sure, but I certainly wouldn't dismiss it out of hand.

This is so patently absurd I don't even know how to begin to answer...
have you actually dealt with any significant amount of text at all? A
large amount of text in today's digital world is at least bilingual, if
not more. Even in pure English text, you occasionally need a foreign
letter to transcribe a borrowed or quoted word, e.g., "cliché",
"naïve", etc. Under your scheme, it would be impossible to encode any
text that contains even a single instance of such words. All it takes is
*one* word in a 500-page text and your scheme breaks down, and we're
back to the bad ole days of codepages. And yes, you can say "well, just
include é and ï in the English code page". But then all it takes is a
single math formula that requires a Greek letter, and your text is no
longer encodable. By the time you pull in all the French, German, and
Greek letters plus the math symbols, you might as well just go back to
UTF-8.
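To make that concrete, here's a minimal D sketch (nothing beyond
Phobos' std.stdio) showing a single UTF-8 string mixing Latin,
accented, Greek, and CJK text, with no code-page switching or escape
state anywhere:

    import std.stdio;

    void main()
    {
        // One UTF-8 string holding several scripts at once.
        string s = "cliché, naïve, π ≈ 3.14159, 日本語";
        // Concatenation needs no code-page bookkeeping.
        string t = s ~ " -- all in one encoding";
        writeln(t);
        writeln("code units: ", t.length); // byte count, not characters
    }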

The alternative is to have embedded escape sequences for the rare
foreign letter/word that you might need, but then you're back to being
unable to slice the string at will, since slicing it at the wrong place
will produce gibberish.
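To be fair, UTF-8 shares that caveat: slice mid-sequence and you get
garbage. The difference is that UTF-8 is self-synchronizing, so a bad
slice is at least detectable (and you can resync at the next lead
byte), whereas a stateful escape scheme can silently scramble
everything after the cut. A small sketch using Phobos' std.utf.validate:

    import std.stdio;
    import std.utf : validate, UTFException;

    void main()
    {
        string s = "naïve";   // 'ï' occupies two bytes in UTF-8
        auto bad = s[0 .. 3]; // cuts through the middle of 'ï'
        try
            validate(bad);    // spots the truncated sequence
        catch (UTFException e)
            writeln("invalid UTF-8 slice: ", e.msg);
    }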

I'm not saying UTF-8 (or UTF-16, etc.) is a panacea -- there are things
about it that are annoying, but it's certainly better than the scheme
you're proposing.
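One of those annoyances, for instance: in D, .length on a string counts
UTF-8 code units, not code points, so counting characters means
reaching for walkLength. A quick illustration, nothing more:

    import std.stdio;
    import std.range : walkLength;

    void main()
    {
        string s = "日本語";   // 3 code points, 9 UTF-8 code units
        writeln(s.length);     // 9 -- code units (bytes)
        writeln(s.walkLength); // 3 -- code points, via range decoding
    }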


T

-- 
You only live once.
