Why UTF-8/16 character encodings?

Sat May 25 00:33:14 PDT 2013

On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
> One of the first, and best, decisions I made for D was it would 
> be Unicode front to back.
That is why I asked this question here.  I think D is still one 
of the few programming languages with such Unicode support.

> This is more a problem with the algorithms taking the easy way 
> than a problem with UTF-8. You can do all the string 
> algorithms, including regex, by working with the UTF-8 directly 
> rather than converting to UTF-32. Then the algorithms work at 
> full speed.
I call BS on this.  There's no way that working with a 
variable-width encoding can run at the same "full speed" as a 
constant-width encoding.  Perhaps you mean that the slowdown is 
minimal, but I doubt that too.
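
To make the disagreement concrete, here is a minimal Python sketch 
(my own illustration, not code from the quoted post; find_utf8 and 
nth_code_point_offset are hypothetical helper names).  Substring 
search can run on the raw UTF-8 bytes with no decoding, while 
locating the n-th character needs a linear scan, which a 
constant-width encoding would replace with plain indexing.

```python
# Sample text and needle, both held as raw UTF-8 bytes.
text = "naïve café 日本語 text".encode("utf-8")
needle = "日本語".encode("utf-8")


def find_utf8(haystack: bytes, needle: bytes) -> int:
    """Byte offset of needle in haystack, or -1 if absent.

    This is a plain byte search with no decoding. A match of a
    well-formed needle in well-formed UTF-8 always starts on a
    code point boundary, because lead bytes and continuation
    bytes use disjoint bit patterns.
    """
    return haystack.find(needle)


def nth_code_point_offset(data: bytes, n: int) -> int:
    """Byte offset of the n-th code point (0-based).

    Needs a scan from the start, since code points vary in width.
    """
    count = 0
    for i, b in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else
        # starts a new code point.
        if (b & 0xC0) != 0x80:
            if count == n:
                return i
            count += 1
    raise IndexError(n)


print(find_utf8(text, needle))          # 13
print(nth_code_point_offset(text, 11))  # 13
```

Which of the two operations dominates in real string code is the 
part that is actually in dispute here.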

> That was the go-to solution in the 1980's, they were called 
> "code pages". A disaster.
My understanding is that code pages were a "disaster" because 
they weren't standardized and often badly implemented.  If you 
used UCS with a single-byte encoding, you wouldn't have that 
problem.

> > with the few exceptional languages with more than 256
> > characters encoded in two bytes.
>
> Like those rare languages Japanese, Korean, Chinese, etc. This 
> too was done in the 80's with "Shift-JIS" for Japanese, and 
> some other wacky scheme for Korean, and a third nutburger one 
> for Chinese.
Of course, you have to have more than one byte for those 
languages, because they have more than 256 characters.  So there 
will be no compression gain over UTF-8/16 there, but a big 
reduction in parsing complexity with a simpler encoding, 
particularly when dealing with multi-language strings.
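
The size side of this can be checked with a few lines of Python.  
This is only a rough sketch: cp1251 and shift_jis are existing 
legacy codecs used as stand-ins for a one-byte and a two-byte 
encoding, not the single-byte UCS scheme proposed here, and 
encoded_sizes is a hypothetical helper name.

```python
def encoded_sizes(text: str, encodings: list[str]) -> dict[str, int]:
    """Map each encoding name to the byte length of text in it."""
    return {enc: len(text.encode(enc)) for enc in encodings}


# Six Cyrillic letters: one byte each in cp1251, two in UTF-8/16.
print(encoded_sizes("Привет", ["cp1251", "utf-8", "utf-16-le"]))

# Three kanji: two bytes each in Shift-JIS and UTF-16, three in UTF-8.
print(encoded_sizes("日本語", ["shift_jis", "utf-8", "utf-16-le"]))
```

The counts cover only the encoded text itself, not any header or 
marker that says which encoding is in use.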

> I've had the misfortune of supporting all that in the old 
> Zortech C++ compiler. It's AWFUL. If you think it's simpler, 
> all I can say is you've never tried to write internationalized 
> code with it.
Heh, I'm not saying "let's go back to badly defined code pages" 
just because I'm saying "let's go back to single-byte encodings."  
The two are separate arguments.

> UTF-8 is heavenly in comparison. Your code is automatically 
> internationalized. It's awesome.
At what cost?  Most programmers completely punt on Unicode, 
because they just don't want to deal with the complexity.  
Perhaps you can deal with it and don't mind the performance loss, 
but I suspect you're in the minority.