Why UTF-8/16 character encodings?
Joakim
joakim at airpost.net
Sat May 25 01:07:41 PDT 2013
On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
> I think you are a little confused about what Unicode actually
> is... Unicode has nothing to do with code pages and nobody uses
> code pages any more except for compatibility with legacy
> applications (with good reason!).
Incorrect.
"Unicode is an effort to include all characters from previous
code pages into a single character enumeration that can be used
with a number of encoding schemes... In practice the various
Unicode character set encodings have simply been assigned their
own code page numbers, and all the other code pages have been
technically redefined as encodings for various subsets of
Unicode."
http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode
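
To make the quoted relationship concrete, here is a minimal sketch (not from the original thread) using Python's standard codecs. The byte value 0xE0 and the two code pages, Windows-1251 and Windows-1252, are arbitrary example choices. It shows the same single byte mapping to two different Unicode code points, depending on which code page is used to interpret it.

```python
# Illustration only: the byte value and both code pages are example choices.
raw = bytes([0xE0])

as_cyrillic = raw.decode("cp1251")  # Windows-1251 (Cyrillic)
as_western = raw.decode("cp1252")   # Windows-1252 (Western European)

# Each code page maps the byte onto a different Unicode code point.
print(f"cp1251: U+{ord(as_cyrillic):04X}")  # U+0430, Cyrillic small letter a
print(f"cp1252: U+{ord(as_western):04X}")   # U+00E0, Latin small a with grave

# The same two characters re-encoded as UTF-8 take two bytes each.
cyrillic_utf8 = as_cyrillic.encode("utf-8")
western_utf8 = as_western.encode("utf-8")
print(cyrillic_utf8, western_utf8)
```

In this sense each legacy code page behaves as an encoding of a small subset of the Unicode character set.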
> Unicode is:
> 1) A standardised numbering of a large number of characters
> 2) A set of standardised algorithms for operating on these
> characters
> 3) A set of standardised encodings for efficiently encoding
> sequences of these characters
What makes you think I'm unaware of this? I have repeatedly
differentiated between UCS (1) and UTF-8 (3).
> You said that Phobos converts UTF-8 strings to UTF-32 before
> operating on them but that's not true. As it iterates over
> UTF-8 strings it iterates over dchars rather than chars, but
> that's not in any way inefficient so I don't really see the
> problem.
And what's a dchar? Let's check:
dchar : unsigned 32 bit UTF-32
http://dlang.org/type.html
Of course that's inefficient: you are translating your whole
encoding over to a 32-bit encoding every time you need to process
it. Walter as much as said so up above.
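
For readers who want to see what "iterating over dchars" involves, the following is a rough sketch of decoding UTF-8 one code point at a time. It is written in Python for illustration and is not Phobos's implementation; the function name `decode_utf8` is hypothetical. It assumes well-formed input and does only minimal validation (it does not reject overlong forms or surrogates).

```python
def decode_utf8(data: bytes):
    """Yield one code point (an int that fits in 32 bits) per UTF-8 sequence."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:              # 0xxxxxxx: single byte (ASCII)
            cp, extra = lead, 0
        elif lead >> 5 == 0b110:     # 110xxxxx: one continuation byte
            cp, extra = lead & 0x1F, 1
        elif lead >> 4 == 0b1110:    # 1110xxxx: two continuation bytes
            cp, extra = lead & 0x0F, 2
        elif lead >> 3 == 0b11110:   # 11110xxx: three continuation bytes
            cp, extra = lead & 0x07, 3
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + extra >= len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        for b in data[i + 1 : i + 1 + extra]:
            if b & 0xC0 != 0x80:     # continuation bytes look like 10xxxxxx
                raise ValueError(f"invalid continuation byte after offset {i}")
            cp = (cp << 6) | (b & 0x3F)
        yield cp
        i += extra + 1


text = "a\u0436\u4e2d\U0001F600"   # 1-, 2-, 3- and 4-byte sequences
code_points = list(decode_utf8(text.encode("utf-8")))
print([f"U+{cp:04X}" for cp in code_points])
```

The source bytes stay in UTF-8 throughout; each step produces a single 32-bit value using a few shifts and masks. Whether that per-character work counts as inefficient is exactly the point in dispute here.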
> Also your complaint that UTF-8 reserves the short characters
> for the English alphabet is not really relevant - the
> characters with longer encodings tend to be rarer (such as
> special symbols) or carry more information (such as Chinese
> characters where the same sentence takes only about 1/3 the
> number of characters).
The vast majority of non-English alphabets in UCS can each be
encoded in a single byte per character. It is your exceptions
that are not relevant.
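
The size difference at issue can be checked with a short sketch using Python's standard codecs. The sample strings and the choice of Windows-1251 and Windows-1253 as the single-byte code pages are assumptions made for illustration, not anything taken from the thread.

```python
# Sample strings are arbitrary illustrations, one per script.
samples = {
    "Latin": "hello",
    "Cyrillic": "привет",
    "Greek": "αβγ",
    "Chinese": "你好",
}

# Characters versus UTF-8 bytes for each sample.
utf8_sizes = {name: len(text.encode("utf-8")) for name, text in samples.items()}
for name, text in samples.items():
    print(f"{name}: {len(text)} chars, {utf8_sizes[name]} bytes in UTF-8")

# The same Cyrillic and Greek text in a legacy single-byte code page.
cyrillic_cp1251 = len(samples["Cyrillic"].encode("cp1251"))
greek_cp1253 = len(samples["Greek"].encode("cp1253"))
print(f"Cyrillic in cp1251: {cyrillic_cp1251} bytes")
print(f"Greek in cp1253: {greek_cp1253} bytes")
```

For these samples, the Cyrillic and Greek text takes two bytes per character in UTF-8 against one in the matching code page, while each Chinese character takes three bytes in UTF-8.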