Why is utf8 the default in D?

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Mon Apr 27 04:57:45 PDT 2009


Michel Fortin wrote:
> On 2009-04-27 07:04:06 -0400, Frank Benoit <keinfarbton at googlemail.com> 
> said:
> 
> >> M$ and Java have chosen to use utf16 as their default Unicode character
> >> encoding. I am sure the decision was not made without good reasoning.
>>
>> What are their arguments?
>> Why does D propagate utf8 as the default?
>>
>> E.g.
>> Exception.msg
>> Object.toString()
>> new std.stream.File( char[] )
> 
> The argument at the time was that they were going to work directly with 
> Unicode code points, thus simplifying things. Then Unicode grew to 
> cover even more characters, and at some point 16 bits became 
> insufficient: surrogate pairs were added so the 16-bit encoding could 
> represent the higher code points, turning 16-bit Unicode into the 
> variable-width encoding now known as UTF-16.
> 
> So it turns out that those in the early years of Unicode who made that 
> choice made it for reasons that no longer exist. Today, variable-size 
> UTF-16 makes it as hard to calculate a string length and do random 
> access in a string as UTF-8. In practice, many frameworks just ignore 
> the problem and happily count each half of a surrogate pair as a 
> separate character, as they always did, but that behaviour isn't 
> exactly correct.
> 
> To get the benefit those framework/language designers thought they'd 
> get at the time, we'd have to go with UTF-32, but then storing strings 
> becomes immensely wasteful. And that's not counting that most data 
> exchange these days has standardized on UTF-8; you'll rarely encounter 
> UTF-16 in the wild (and when you do, you have to take care about 
> UTF-16 LE and BE), and UTF-32 is rarer still. Nor that perhaps, one 
> day, Unicode will grow again and fall outside of its 32-bit range... 
> although that may have to wait until we learn a few extraterrestrial 
> languages. :-)
> 
> So the D solution, which is to use UTF-8 everywhere while still 
> supporting string operations using UTF-16 and UTF-32, looks very good to 
> me. What I actually do is use UTF-8 everywhere, and sometimes, when I 
> need to easily manipulate characters, I use UTF-32. And I use UTF-16 
> for dealing with APIs expecting it, but not for much else.
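The surrogate-pair behaviour described above is easy to demonstrate; a minimal sketch (in Python rather than D, purely for illustration):

```python
# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual
# Plane, so UTF-16 must encode it as a surrogate pair of two code units.
s = "\U0001D11E"

print(len(s))                           # 1 code point
print(len(s.encode("utf-16-le")) // 2)  # 2 UTF-16 code units (the surrogate pair)
print(len(s.encode("utf-8")))           # 4 bytes in UTF-8
```

A framework that counts UTF-16 code units reports a length of 2 for this one-character string, which is exactly the "not exactly correct" behaviour mentioned above.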

Well put. I distinctly remember the hubbub around Java's UTF-16 support 
that was going to solve all of strings' problems, followed by the 
embarrassed silence upon the introduction of UTF-32.
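The storage cost that rules out UTF-32 as a default is also easy to see; a quick sketch (again in Python, just for illustration):

```python
# Encoded size of a mostly-ASCII string under the three Unicode encodings.
text = "Hello, Unicode!"  # 15 ASCII characters

for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, len(text.encode(enc)), "bytes")  # 15, 30, 60 bytes respectively
```

For ASCII-heavy text such as source code, markup, and most data interchange, UTF-32 quadruples storage relative to UTF-8, and UTF-16 doubles it.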

Andrei



More information about the Digitalmars-d mailing list