The Case Against Autodecode

Thu Jun 2 14:07:19 PDT 2016

On Thursday, 2 June 2016 at 20:52:29 UTC, ag0aep6g wrote:
> On 06/02/2016 10:36 PM, Andrei Alexandrescu wrote:
>> By whom? The "support level 1" folks yonder at the Unicode 
>> standard? :o)
>> -- Andrei
>
> Do they say that level 1 should be the default, and do they 
> give a rationale for that? Would you kindly link or quote that?

The level 2 support description noted that it should be opt-in
because it's slow.
Arguably it should be easier to operate on code units if you know
it's safe to do so, but defaulting to code units everywhere seems
too broken too often, and defaulting to graphemes everywhere seems
too slow too often.
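
To make the trade-off concrete, here is a minimal sketch of the three
levels one string can be counted at. It assumes a reasonably recent
Phobos (std.utf.byCodeUnit, std.uni.byGrapheme, std.range.walkLength);
the sample string is just an illustrative choice.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // "é" spelled as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    string s = "e\u0301";

    // Code units: what the underlying char[] slice actually holds.
    // 'e' is 1 UTF-8 byte, U+0301 is 2.
    assert(s.length == 3);
    assert(s.byCodeUnit.walkLength == 3);

    // Code points: what range primitives see on a string, because
    // front/popFront auto-decode to dchar.
    assert(s.walkLength == 2);

    // Graphemes: what a user would call "one character". This is the
    // most expensive of the three to compute.
    assert(s.byGrapheme.walkLength == 1);
}
```

None of the three numbers is wrong; they answer different questions,
which is why picking one of them as the silent default is hard.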

Now one can argue either consistency for code units (because then
we can treat char[] and friends as a slice) or correctness for
graphemes, but really the more I think about it the more I think
there is no good default and you need to learn Unicode anyways.
The only sad parts here are that 1) we hijacked an array type for
strings, which sucks, and 2) we don't have an API that is
actually good at teaching the user what it does and doesn't do.

The consequence of 1 is that generic code that also wants to deal
with strings will want to special-case them to get rid of
auto-decoding; the consequence of 2 is that we will have tons of
not-actually-correct string handling.
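
As a rough illustration of that special-casing, a generic range
function might look like the sketch below. The name elementCount is
made up for this example; isNarrowString and byCodeUnit are the
Phobos pieces usually used to opt out of auto-decoding.

```d
import std.range.primitives : empty, front, isInputRange, popFront;
import std.traits : isNarrowString;
import std.utf : byCodeUnit;

/// Hypothetical helper: counts the elements of any input range,
/// treating char[]/wchar[] strings as plain arrays of code units.
size_t elementCount(R)(R r)
if (isInputRange!R)
{
    static if (isNarrowString!R)
        auto src = r.byCodeUnit; // special case: skip auto-decoding
    else
        auto src = r;            // every other range is used as-is

    size_t n;
    for (; !src.empty; src.popFront())
        ++n;
    return n;
}

void main()
{
    assert(elementCount([1, 2, 3]) == 3);
    assert(elementCount("e\u0301") == 3);  // UTF-8 code units
    assert(elementCount("e\u0301"d) == 2); // dstring: one element per code point
}
```

Without the static if branch, the string case would quietly iterate
decoded dchars instead of the elements the slice actually contains.
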
I would assume that almost all string handling code out in the
wild is broken anyways (in the code I have encountered I have
never seen attempts to normalize or do anything similar before or
after comparisons, searching, etc.), unless of course YOU or one
of your colleagues wrote it :o) Consider that checking the length
of a string in Java or C# to validate that it is no longer than X
characters is common and wrong, because length()/.Length is the
number of UTF-16 code units in those languages.
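
Both failure modes are easy to reproduce. The sketch below uses D's
std.uni.normalize (which defaults to NFC) and a wstring to stand in
for the UTF-16 strings of Java and C#; the sample strings are only
illustrative.

```d
import std.range : walkLength;
import std.uni : normalize;

void main()
{
    // The same user-visible text "é", encoded two ways.
    string precomposed = "\u00E9";  // single code point U+00E9
    string decomposed  = "e\u0301"; // 'e' + U+0301 COMBINING ACUTE ACCENT

    // A plain comparison sees two different code unit sequences.
    assert(precomposed != decomposed);

    // Normalizing both sides to the same form first makes them equal.
    assert(normalize(precomposed) == normalize(decomposed));

    // UTF-16 length check: a code point outside the BMP is stored as
    // a surrogate pair, so .length reports 2 for one emoji.
    wstring emoji = "\U0001F600"w;
    assert(emoji.length == 2);     // code units
    assert(emoji.walkLength == 1); // code points
}
```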

So really, as bad and alarming as "incorrect string handling" by
default seems, in practice it has not prevented people from
writing working (internationalized!) applications in other
languages that get used way more than D.
One could say we should do it better than them, but I would be
inclined to believe that RCStr provides our opportunity to do so.
Having char[] behave the way it does is an annoying wart, and
maybe at some point we can deprecate/remove that behaviour, but
for now I'd rather see if RCStr is viable than attempt to change
the semantics of all string handling code in D.