Why UTF-8/16 character encodings?
Joakim
joakim at airpost.net
Sat May 25 00:33:14 PDT 2013
On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
> One of the first, and best, decisions I made for D was it would
> be Unicode front to back.
That is why I asked this question here. I think D is still one
of the few programming languages with such Unicode support.
> This is more a problem with the algorithms taking the easy way
> than a problem with UTF-8. You can do all the string
> algorithms, including regex, by working with the UTF-8 directly
> rather than converting to UTF-32. Then the algorithms work at
> full speed.
I call BS on this. There's no way working on a variable-width
encoding can be as "full speed" as a constant-width encoding.
Perhaps you mean that the slowdown is minimal, but I doubt that
too.
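
A minimal Python sketch of what is being argued here, with made-up names (`nth_code_point_offset` is hypothetical, not a library function) and an arbitrary sample string. It shows the two operations the two sides seem to have in mind: finding the n-th code point, which is a computed offset in UTF-32 but a linear scan in UTF-8, and substring search, which can run on the raw UTF-8 bytes without decoding. It measures nothing about actual speed.

```python
def nth_code_point_offset(data: bytes, n: int) -> int:
    """Return the byte offset of the n-th code point (0-based) in valid UTF-8."""
    count = 0
    for offset, byte in enumerate(data):
        # Continuation bytes have the form 0b10xxxxxx; any other byte
        # starts a new code point.
        if byte & 0xC0 != 0x80:
            if count == n:
                return offset
            count += 1
    raise IndexError("code point index out of range")


text = "na\u00efve caf\u00e9"          # 10 code points, two of them non-ASCII
utf8 = text.encode("utf-8")           # variable width: 1 or 2 bytes each here
utf32 = text.encode("utf-32-le")      # constant width: 4 bytes each

# Constant width: code point 6 sits at a computed offset, no scan needed.
fixed = utf32[6 * 4 : 6 * 4 + 4].decode("utf-32-le")

# Variable width: the same lookup has to walk the bytes from the start.
scanned_offset = nth_code_point_offset(utf8, 6)

# Substring search works directly on the encoded bytes, because a valid
# UTF-8 needle can only match at a code point boundary in valid UTF-8.
needle = "caf\u00e9".encode("utf-8")
found_offset = utf8.find(needle)

print(fixed, scanned_offset, found_offset)
```

Random access by code point index favors the constant-width encoding, while searching does not need that access at all. Which of the two dominates in real string code is the open question in this exchange.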
> That was the go-to solution in the 1980's, they were called
> "code pages". A disaster.
My understanding is that code pages were a "disaster" because
they weren't standardized and often badly implemented. If you
used UCS with a single-byte encoding, you wouldn't have that
problem.
> > with the few exceptional languages with more than 256
> > characters encoded in two bytes.
>
> Like those rare languages Japanese, Korean, Chinese, etc. This
> too was done in the 80's with "Shift-JIS" for Japanese, and
> some other wacky scheme for Korean, and a third nutburger one
> for Chinese.
Of course, you have to have more than one byte for those
languages, because they have more than 256 characters. So there
will be no compression gain over UTF-8/16 there, but a big gain
in parsing complexity with a simpler encoding, particularly when
dealing with multi-language strings.
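
A minimal sketch of the size side of this argument. It assumes Python's built-in `iso8859_5` and `shift_jis` codecs as stand-ins for a per-language single-byte or double-byte encoding; these are existing legacy encodings, not the scheme proposed above. The sample strings are arbitrary, and the helper name `encoded_sizes` is made up for illustration.

```python
def encoded_sizes(text: str, legacy_codec: str) -> dict:
    """Byte length of `text` under a legacy codec, UTF-8 and UTF-16."""
    return {
        legacy_codec: len(text.encode(legacy_codec)),
        "utf-8": len(text.encode("utf-8")),
        # utf-16-le avoids counting the 2-byte byte order mark.
        "utf-16": len(text.encode("utf-16-le")),
    }


# Fewer than 256 characters in the alphabet: one byte each in the legacy codec.
russian = encoded_sizes("\u041f\u0440\u0438\u0432\u0435\u0442", "iso8859_5")

# More than 256 characters: the legacy codec needs two bytes each as well.
japanese = encoded_sizes("\u3053\u3093\u306b\u3061\u306f", "shift_jis")

print(russian)
print(japanese)
```

For these two strings, the Cyrillic sample takes half as many bytes in the single-byte codec as in UTF-8 or UTF-16. The Japanese sample takes the same number of bytes in Shift-JIS as in UTF-16, and fewer than in UTF-8. The figures say nothing about parsing complexity, which is the other half of the argument.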
> I've had the misfortune of supporting all that in the old
> Zortech C++ compiler. It's AWFUL. If you think it's simpler,
> all I can say is you've never tried to write internationalized
> code with it.
Heh, I'm not saying "let's go back to badly defined code pages"
just because I'm saying "let's go back to single-byte encodings."
The two are separate arguments.
> UTF-8 is heavenly in comparison. Your code is automatically
> internationalized. It's awesome.
At what cost? Most programmers completely punt on Unicode,
because they just don't want to deal with the complexity.
Perhaps you can deal with it and don't mind the performance loss,
but I suspect you're in the minority.