Why UTF-8/16 character encodings?

Joakim joakim at airpost.net
Sun May 26 02:59:19 PDT 2013


For some reason this posting by H. S. Teoh shows up on the 
mailing list but not on the forum.

On Sat May 25 13:42:10 PDT 2013, H. S. Teoh wrote:
> On Sat, May 25, 2013 at 10:07:41AM +0200, Joakim wrote:
>> The vast majority of non-English alphabets in UCS can be encoded
>> in a single byte.  It is your exceptions that are not relevant.
>
> I'll have you know that Chinese, Korean, and Japanese account for a
> significant percentage of the world's population, and therefore
> arguments about "vast majority" are kinda missing the forest for
> the trees. If you count the number of *alphabets* that can be
> encoded in a single byte, you can get a majority, but that in no
> way reflects actual usage.
Not just "a majority": the vast majority of alphabets, representing 
85% of the world's population.

>> >The only alternatives to a variable width encoding I can see are:
>> >- Single code page per string
>> >This is completely useless because now you can't concatenate
>> >strings of different code pages.
>> I wouldn't be so fast to ditch this.  There is a real argument to
>> be made that strings of different languages are sufficiently
>> different that there should be no multi-language strings.  Is this
>> the best route?  I'm not sure, but I certainly wouldn't dismiss it
>> out of hand.
>
> This is so patently absurd I don't even know how to begin to
> answer... have you actually dealt with any significant amount of
> text at all? A large amount of text in today's digital world is at
> least bilingual, if not more. Even in pure English text, you
> occasionally need a foreign letter in order to transcribe a
> borrowed/quoted word, e.g., "cliché", "naïve", etc. Under your
> scheme, it would be impossible to encode any text that contains
> even a single instance of such words. All it takes is *one* word in
> a 500-page text and your scheme breaks down, and we're back to the
> bad ole days of codepages. And yes you can say "well just include é
> and ï in the English code page". But then all it takes is a single
> math formula that requires a Greek letter, and your text is
> non-encodable anymore. By the time you pull in all the French,
> German, Greek letters and math symbols, you might as well just go
> back to UTF-8.
I think you misunderstand what this implies.  I mentioned it 
earlier as another possibility to Walter, "keep all your strings 
in a single language, with a different format to compose them 
together."  Nobody is talking about disallowing alphabets other 
than English or going back to code pages.  The fundamental 
question is whether it makes sense to combine all these different 
alphabets and their idiosyncratic rules into a single string and 
encoding.

There is a good argument to be made that the differences outweigh 
the similarities and you'd be better off keeping each 
language/alphabet in its own string.  It's a question of 
modeling, just like a class hierarchy.  As I said, I'm not sure 
this is the best route, but it has some real strengths.
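
To make that concrete, here is a minimal sketch in D of what such a 
model might look like.  The Language enum and the LangString/Text 
types are hypothetical illustrations of the idea, not anything that 
exists in Phobos:

import std.stdio;

// Which single-byte table a string's payload uses (hypothetical).
enum Language : ubyte { english, french, greek /* ... */ }

// A string whose bytes all come from one language's single-byte table.
struct LangString
{
    Language lang;
    immutable(ubyte)[] payload;
}

// Multi-language text is a composition of single-language runs,
// rather than one string in a variable-width encoding.
struct Text
{
    LangString[] runs;
}

void main()
{
    // "cliché" becomes an English run plus a French run, instead of
    // mixed encodings inside a single string.
    immutable(ubyte)[] eAcute = [0xE9]; // é in Latin-1
    auto t = Text([
        LangString(Language.english, cast(immutable(ubyte)[]) "clich"),
        LangString(Language.french, eAcute),
    ]);
    writeln(t.runs.length, " runs");
}

Concatenating two Texts is then just appending runs, and each run 
keeps its own language's rules, which is the modeling win being 
claimed here.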

> The alternative is to have embedded escape sequences for the rare
> foreign letter/word that you might need, but then you're back to
> being unable to slice the string at will, since slicing it at the
> wrong place will produce gibberish.
No one has presented this as a viable option.

> I'm not saying UTF-8 (or UTF-16, etc.) is a panacea -- there are
> things about it that are annoying, but it's certainly better than
> the scheme you're proposing.
I disagree.

On Saturday, 25 May 2013 at 20:52:41 UTC, H. S. Teoh wrote:
> And just how exactly does that help with slicing? If anything, it
> makes slicing way hairier and more error-prone than UTF-8. In fact,
> this one point alone already defeated any performance gains you may
> have had with a single-byte encoding. Now you can't do *any*
> slicing at all without convoluted algorithms to determine what
> encoding is where at the endpoints of your slice, and the resulting
> slice must have new headers to indicate the start/end of every
> different-language substring. By the time you're done with all
> that, you're going way slower than processing UTF-8.
There are no convoluted algorithms: it's a simple check for whether 
the string contains any two-byte encodings, a check which can be done 
once and cached.  If the string is single-byte all the way through, 
there are no problems whatsoever with slicing.  If two-byte languages 
are included, the slice function has to do a little arithmetic before 
slicing, plus a few more arithmetic ops to create the new header for 
the slice.  The point is that these operations will be much faster 
than decoding every code point in order to slice UTF-8.
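
As a rough illustration, here is a hypothetical D sketch of that 
arithmetic; the Header layout and all the names are my own invention 
for the example.  Note that the lookup walks per-run headers, not per 
code point:

struct Header
{
    size_t start;   // byte offset where this language run begins
    bool twoByte;   // true if the run uses a two-byte encoding
}

struct EncodedString
{
    immutable(ubyte)[] data;
    Header[] headers;    // one per language run
    bool pureSingleByte; // checked once and cached
}

// Convert a character index to a byte offset.  For a pure single-byte
// string this is the identity; otherwise it's a little arithmetic
// over the run headers, with no per-code-point decoding.
size_t byteOffset(const ref EncodedString s, size_t charIndex)
{
    if (s.pureSingleByte)
        return charIndex; // chars == bytes: slicing is trivial

    size_t charsSeen = 0;
    foreach (i, h; s.headers)
    {
        immutable runEnd = (i + 1 < s.headers.length)
            ? s.headers[i + 1].start
            : s.data.length;
        immutable width = h.twoByte ? 2 : 1;
        immutable runChars = (runEnd - h.start) / width;
        if (charIndex < charsSeen + runChars)
            return h.start + (charIndex - charsSeen) * width;
        charsSeen += runChars;
    }
    return s.data.length;
}

void main()
{
    // "ab" in a one-byte run, then one character in a two-byte run.
    auto s = EncodedString([0x61, 0x62, 0x01, 0x02],
                           [Header(0, false), Header(2, true)],
                           false);
    assert(byteOffset(s, 1) == 1);
    assert(byteOffset(s, 2) == 2);
}

A slice is then data[byteOffset(i) .. byteOffset(j)] plus a few ops 
to rebuild the headers for the new slice, versus scanning every code 
point to find the nth character in UTF-8.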

> Again I say, I'm not 100% sold on UTF-8, but what you're proposing
> here is far worse.
Well, I'm glad you realize there are some problems with UTF-8 :), 
even if you dismiss my alternative out of hand.

