Why UTF-8/16 character encodings?

Walter Bright newshound2 at digitalmars.com
Fri May 24 18:48:56 PDT 2013


On 5/24/2013 1:37 PM, Joakim wrote:
> This leads to Phobos converting every UTF-8 string to UTF-32, so that
> it can easily run its algorithms on a constant-width 32-bit character set, and
> the resulting performance penalties.

This is more a problem with the algorithms taking the easy way than a problem 
with UTF-8. You can do all the string algorithms, including regex, by working 
with the UTF-8 directly rather than converting to UTF-32. Then the algorithms 
work at full speed.
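For example, here is a rough sketch in D (not the Phobos implementation; 
containsBytes is a made-up name) of a substring search that works on the UTF-8 
bytes directly. Since UTF-8 is self-synchronizing, matching the raw bytes of a 
valid needle against a valid haystack can never produce a false hit in the 
middle of a multi-byte sequence, so nothing ever needs to be decoded:

// Sketch only: byte-wise substring search over UTF-8, no decoding to dchar.
bool containsBytes(const(char)[] haystack, const(char)[] needle)
{
    if (needle.length == 0)
        return true;
    if (needle.length > haystack.length)
        return false;
    foreach (i; 0 .. haystack.length - needle.length + 1)
    {
        // Slices compare byte-for-byte; multi-byte sequences either line
        // up completely or not at all.
        if (haystack[i .. i + needle.length] == needle)
            return true;
    }
    return false;
}

unittest
{
    assert(containsBytes("naïve café", "café"));
    assert(!containsBytes("naïve café", "tea"));
}

The same idea extends to regex: the engine can match on the UTF-8 bytes and 
only decode where it actually has to.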


> Yes, it wouldn't be strictly backwards-compatible with ASCII, but it would be
> so much easier to internationalize.

That was the go-to solution in the 1980s; they were called "code pages". A 
disaster.


> with the few exceptional languages with more than 256 characters encoded in
> two bytes.

Like those rare languages Japanese, Korean, Chinese, etc. This too was done in 
the '80s, with "Shift-JIS" for Japanese, some other wacky scheme for Korean, 
and a third nutburger one for Chinese.

I've had the misfortune of supporting all that in the old Zortech C++ compiler. 
It's AWFUL. If you think it's simpler, all I can say is you've never tried to 
write internationalized code with it.

UTF-8 is heavenly in comparison. Your code is automatically internationalized. 
It's awesome.

