Why UTF-8/16 character encodings?

Diggory diggsey at googlemail.com
Sat May 25 00:48:04 PDT 2013


I think you are a little confused about what Unicode actually 
is. Unicode has nothing to do with code pages, and nobody uses 
code pages any more except for compatibility with legacy 
applications (with good reason!).

Unicode is:
1) A standardised numbering of a large number of characters
2) A set of standardised algorithms for operating on these 
characters
3) A set of standardised encodings for efficiently encoding 
sequences of these characters
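
To make the split between points 1 and 3 concrete, here is a small 
Python sketch (my own illustration, nothing to do with Phobos): one 
character has exactly one code point, and that code point can be 
written out in several standard encodings. The choice of the euro 
sign as the example is arbitrary.

```python
# One character, one code point (point 1 above).
ch = "\u20ac"            # EURO SIGN, U+20AC
code_point = ord(ch)     # the standardised number assigned to it

# The same code point in three standard encodings (point 3 above).
# The "-le" variants are used so no byte-order mark is prepended.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-le")
utf32 = ch.encode("utf-32-le")

print(hex(code_point), len(utf8), len(utf16), len(utf32))
```

The number never changes; only the byte sequence used to store it 
does.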

You said that Phobos converts UTF-8 strings to UTF-32 before 
operating on them, but that's not true. As it iterates over a 
UTF-8 string it decodes one dchar (code point) at a time rather 
than yielding raw chars. No UTF-32 copy of the string is ever 
built, so that's not in any way inefficient and I don't really 
see the problem.
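
As a rough sketch of what decoding on the fly means, here is a 
Python generator that walks UTF-8 bytes and yields one code point 
at a time without ever building a converted copy of the input. 
This is not Phobos's code; the name iter_code_points is 
hypothetical, and the sketch assumes the input is valid UTF-8 
(it does not check continuation bytes or overlong forms).

```python
def iter_code_points(data: bytes):
    """Yield code points lazily from UTF-8 bytes (assumes valid input)."""
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte tells us how many bytes the sequence occupies
        # and carries the top bits of the code point.
        if lead < 0x80:
            length, cp = 1, lead
        elif lead >> 5 == 0b110:
            length, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:
            length, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:
            length, cp = 4, lead & 0x07
        else:
            raise ValueError("invalid UTF-8 lead byte")
        # Each continuation byte contributes its low 6 bits.
        for b in data[i + 1 : i + length]:
            cp = (cp << 6) | (b & 0x3F)
        i += length
        yield cp


for cp in iter_code_points("a\u00e9\u20ac".encode("utf-8")):
    print(hex(cp))
```

The work per character is a handful of shifts and masks, and the 
only state kept between characters is the current index.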

Also, your complaint that UTF-8 reserves the shortest encodings 
for the English alphabet is not really relevant - the characters 
with longer encodings tend to be rarer (such as special symbols) 
or carry more information (such as Chinese characters, where the 
same sentence takes only about 1/3 the number of characters).
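
A quick way to check this kind of trade-off yourself is to count 
both characters and UTF-8 bytes for the same phrase in two 
languages. The Python sketch below does that; the helper name 
utf8_cost and the two-word sample are my own, and a single short 
greeting says nothing about the average ratio for real text.

```python
def utf8_cost(text: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


english = "hello"
chinese = "\u4f60\u597d"   # the two characters of "ni hao"

print("English:", utf8_cost(english))
print("Chinese:", utf8_cost(chinese))
```

Each of the two Chinese characters takes three bytes in UTF-8, 
but there are fewer of them, so the totals end up close.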


More information about the Digitalmars-d mailing list