Why is utf8 the default in D?
Michel Fortin
michel.fortin at michelf.com
Mon Apr 27 04:37:40 PDT 2009
On 2009-04-27 07:04:06 -0400, Frank Benoit <keinfarbton at googlemail.com> said:
> M$ and Java have chosen to use utf16 as their default Unicode character
> encoding. I am sure, the decision was not made without good reasoning.
>
> What are their arguments?
> Why does D propagate utf8 as the default?
>
> E.g.
> Exception.msg
> Object.toString()
> new std.stream.File( char[] )
The argument at the time was that 16 bits were enough to hold every
Unicode code point, so a string could be treated as a plain array of
characters, which simplifies things. Then Unicode was extended to cover
even more characters, and at some point 16 bits became insufficient.
Surrogate pairs were added to reach the higher code points, turning
16-bit Unicode into the variable-size encoding now known as UTF-16.
So it turns out that those who made that choice in the early years of
Unicode made it for reasons that no longer exist. Today, variable-size
UTF-16 makes it as hard to calculate a string's length and do random
access in a string as UTF-8 does. In practice, many frameworks just
ignore the problem and happily count a surrogate pair as two
characters, as they always did, but that behaviour isn't exactly
correct.
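
To make the counting problem concrete, here is a minimal sketch in
present-day D (D2 with Phobos, which is an assumption on my part about
the toolchain). The helper names utf16Units and codePoints are made up
for the example; toUTF16 and walkLength are the Phobos functions doing
the work.

```d
import std.range : walkLength;
import std.utf : toUTF16;

/// Number of UTF-16 code units needed to encode `s` (hypothetical helper).
size_t utf16Units(string s)
{
    return toUTF16(s).length;
}

/// Number of Unicode code points in `s` (hypothetical helper).
/// walkLength decodes the UTF-8 string one code point at a time.
size_t codePoints(string s)
{
    return s.walkLength;
}

void main()
{
    import std.stdio : writeln;

    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the 16-bit range.
    string clef = "\U0001D11E";

    writeln(clef.length);      // UTF-8 code units: 4
    writeln(utf16Units(clef)); // UTF-16 code units: 2 (a surrogate pair)
    writeln(codePoints(clef)); // code points: 1
}
```

A framework that reports the UTF-16 code unit count as the "length"
would say 2 here, even though there is a single character.
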
To get the benefit those framework/language designers thought they'd
get at the time, we'd have to go with UTF-32, but then storing strings
becomes immensely wasteful. On top of that, most data exchange these
days has standardized on UTF-8; you'll rarely encounter UTF-16 in the
wild (and when you do, you have to take care of UTF-16 LE versus BE),
and UTF-32 is rarer still. And that's not counting that perhaps, one
day, Unicode will grow again and fall outside of its 32-bit range...
although that may have to wait until we learn a few extraterrestrial
languages. :-)
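
As a rough illustration of the storage cost and of the LE/BE issue,
here is a small sketch, again assuming present-day D2 and Phobos.
storageBytes is a hypothetical helper written for this example;
nativeToBigEndian and nativeToLittleEndian come from std.bitmanip.

```d
import std.bitmanip : nativeToBigEndian, nativeToLittleEndian;
import std.conv : to;

/// Bytes needed to store `s` as UTF-8, UTF-16 and UTF-32, in that order,
/// without any byte order mark (hypothetical helper).
size_t[3] storageBytes(string s)
{
    return [
        s.length * char.sizeof,
        s.to!wstring.length * wchar.sizeof,
        s.to!dstring.length * dchar.sizeof,
    ];
}

void main()
{
    import std.stdio : writeln;

    // For ASCII text, UTF-16 doubles and UTF-32 quadruples the size.
    writeln(storageBytes("hello")); // [5, 10, 20]

    // The same UTF-16 code unit (U+00E9) serialized in both byte orders.
    ushort unit = 0x00E9;
    writeln(nativeToBigEndian(unit));    // bytes as stored in UTF-16BE
    writeln(nativeToLittleEndian(unit)); // bytes as stored in UTF-16LE
}
```

UTF-8 has no such byte-order variants, since its code units are single
bytes.
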
So the D solution, which is to use UTF-8 everywhere while still
supporting string operations using UTF-16 and UTF-32, looks very good
to me. What I actually do is use UTF-8 everywhere, and sometimes, when
I need to easily manipulate characters, I use UTF-32. And I use UTF-16
for dealing with APIs expecting it, but not for much else.
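
That workflow might look roughly like the following in present-day D2
(an assumption; the syntax differs from D1). nthCodePoint is a made-up
helper for the example; to!dstring and toUTF16z are existing Phobos
functions.

```d
import std.conv : to;
import std.utf : toUTF16z;

/// Returns the n-th code point of a UTF-8 string by converting to UTF-32
/// first (hypothetical helper). In a dstring every element is one code
/// point, so plain indexing is safe.
dchar nthCodePoint(string s, size_t n)
{
    dstring d = s.to!dstring;
    return d[n];
}

void main()
{
    import std.stdio : writeln;

    string s = "na\u00EFve"; // stored as UTF-8; U+00EF takes two bytes

    // Character manipulation: go through UTF-32.
    writeln(nthCodePoint(s, 2));

    // Interop: a zero-terminated UTF-16 string for an API that expects one.
    const(wchar)* w = s.toUTF16z;
    writeln(w[0]);
}
```

The conversion to dstring copies the whole string, so this is best kept
for the places where per-character access is really needed.
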
--
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/