First Impressions!

Thu Nov 30 13:18:37 UTC 2017

On Thursday, 30 November 2017 at 10:19:18 UTC, Walter Bright 
wrote:
> On 11/27/2017 7:01 PM, A Guy With an Opinion wrote:
>> +- Unicode support is good. Although I think D's string type 
>> should have probably been utf16 by default. Especially 
>> considering the utf module states:
>> 
>> "UTF character support is restricted to '\u0000' <= character 
>> <= '\U0010FFFF'."
>> 
>> Seems like the natural fit for me. Plus for the vast majority 
>> of use cases I am pretty guaranteed a char = codepoint. Not 
>> the biggest issue in the world and maybe I'm just being overly 
>> critical here.
>
> Sooner or later your code will exhibit bugs if it assumes that 
> char==codepoint with UTF16, because of surrogate pairs.
>
> https://stackoverflow.com/questions/5903008/what-is-a-surrogate-pair-in-java
>
> As far as I can tell, pretty much the only users of UTF16 are 
> Windows programs. Everyone else uses UTF8 or UCS32.
>
> I recommend using UTF8.

As long as you understand its limitations, I think most bugs can 
be avoided. Where UTF16 breaks down is pretty well defined (code 
points above U+FFFF, which need surrogate pairs), and it is also 
super rare. I think UTF32 would be great too, but it seems like 
just a waste of space 99% of the time. UTF8 isn't horrible, and I 
am not going to avoid D because it uses UTF8 (that would be 
silly), especially when wstring also seems baked into the 
language. However, it can complicate code because you pretty much 
always have to assume character != code point outside of ASCII. I 
can see a reasonable person arguing that forcing you to assume 
character != code point is actually a good thing, and that is a 
valid opinion.
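
To make the char vs. code point distinction concrete, here is a 
minimal sketch in D. The names Counts and measure are made up for 
this post and are not part of Phobos. It only counts code units 
and code points for a few sample strings, and the samples are 
written with escapes so the source file encoding doesn't matter.

```d
import std.conv : to;
import std.range : walkLength;
import std.stdio : writefln;

// Hypothetical helper type: how many units the same text takes
// in each representation.
struct Counts
{
    size_t utf8Units;   // char code units in a string
    size_t utf16Units;  // wchar code units in a wstring
    size_t codePoints;  // decoded code points
}

// Hypothetical helper: .length counts code units, while
// walkLength decodes the string and counts code points.
Counts measure(string s)
{
    return Counts(s.length, s.to!wstring.length, s.walkLength);
}

void main()
{
    foreach (s; ["hello", "na\u00EFve", "\U0001F600"])
    {
        auto c = measure(s);
        writefln("%-8s utf8=%s utf16=%s codepoints=%s",
                 s, c.utf8Units, c.utf16Units, c.codePoints);
    }

    // Pure ASCII: every encoding agrees.
    assert(measure("hello") == Counts(5, 5, 5));
    // U+00EF needs two UTF8 code units but only one UTF16 unit.
    assert(measure("na\u00EFve") == Counts(6, 5, 5));
    // U+1F600 is above U+FFFF, so UTF16 needs a surrogate pair.
    assert(measure("\U0001F600") == Counts(4, 2, 1));
}
```

The last assert is the case being argued about: with wstring the 
unit count only diverges from the code point count above U+FFFF, 
while with string it already diverges at the first non-ASCII 
character.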