The Case Against Autodecode
default0 via Digitalmars-d
digitalmars-d at puremagic.com
Thu Jun 2 14:07:19 PDT 2016
On Thursday, 2 June 2016 at 20:52:29 UTC, ag0aep6g wrote:
> On 06/02/2016 10:36 PM, Andrei Alexandrescu wrote:
>> By whom? The "support level 1" folks yonder at the Unicode
>> standard? :o)
>> -- Andrei
>
> Do they say that level 1 should be the default, and do they
> give a rationale for that? Would you kindly link or quote that?
The level 2 support description noted that it should be opt-in
because it's slow.
Arguably it should be easier to operate on code units if you know
it's safe to do so, but either always working on code units or
always working on graphemes as the default seems to be either too
broken too often or too slow too often.
Now one can argue either consistency for code units (because then
we can treat char[] and friends as a slice) or correctness for
graphemes, but really, the more I think about it the more I think
there is no good default and you need to learn Unicode anyways.
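
To make the three candidate "defaults" concrete, here is a minimal
sketch in Python (not D) that counts the same text at each level. The
helper names are hypothetical, and the grapheme count is a deliberate
approximation: it only skips combining marks and does not implement
the full Unicode segmentation rules (UAX #29), so it will miscount
things like ZWJ emoji sequences.

```python
import unicodedata

def count_utf8_code_units(s: str) -> int:
    # Code units: the bytes of the UTF-8 encoding.
    return len(s.encode("utf-8"))

def count_code_points(s: str) -> int:
    # Code points: Python's len() on str already counts these.
    return len(s)

def count_graphemes_approx(s: str) -> int:
    # ASSUMPTION: rough stand-in for grapheme clusters. Counts only code
    # points whose canonical combining class is 0, i.e. it folds combining
    # marks into the preceding character. Not a UAX #29 implementation.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

# "noël" spelled with 'e' followed by U+0308 COMBINING DIAERESIS
text = "noe\u0308l"

print(count_utf8_code_units(text),
      count_code_points(text),
      count_graphemes_approx(text))  # prints: 6 5 4
```

Three different answers for one four-letter word is the whole problem:
whichever level is the default, code written without knowing which one
it is gets the other two wrong.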
The only sad parts here are that 1) we hijacked an array type for
strings, which sucks, and 2) that we don't have an API that is
actually good at teaching the user what it does and doesn't do.
The consequence of 1 is that generic code that also wants to deal
with strings will want to special-case them to get rid of
auto-decoding; the consequence of 2 is that we will have tons of
not-actually-correct string handling.
I would assume that almost all string handling code that is out
in the wild is broken anyways (in code I have encountered I have
never seen attempts to normalize or do other things before or
after comparisons, searching, etc.), unless of course YOU or one
of your colleagues wrote it :o) Consider that checking the length
of a string in Java or C# to validate that it is no longer than X
characters is often done and wrong, because the length those
languages report is the number of UTF-16 code units.
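
Both failure modes are easy to reproduce. The sketch below uses Python
rather than Java or C#, so the UTF-16 code unit count is computed by
hand to mimic what those languages report; `utf16_code_units` and
`equal_normalized` are hypothetical helper names, and NFC is just one
reasonable choice of normalization form.

```python
import unicodedata

def utf16_code_units(s: str) -> int:
    # Mimics the length Java and C# report: UTF-16 code units, not characters.
    return len(s.encode("utf-16-le")) // 2

def equal_normalized(a: str, b: str) -> bool:
    # Compare after normalizing both sides to NFC (an assumed choice of form).
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

emoji = "\U0001F600"      # one code point outside the BMP
composed = "\u00e9"       # 'é' as a single code point
decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

# A "max 1 character" check based on UTF-16 length rejects a single emoji.
print(utf16_code_units(emoji), len(emoji))     # prints: 2 1

# Two spellings of the same text compare unequal until normalized.
print(composed == decomposed)                  # prints: False
print(equal_normalized(composed, decomposed))  # prints: True
```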
So really, as bad and alarming as "incorrect string handling by
default" seems, in practice it has not prevented people from
writing working (internationalized!) applications in other
languages that get used way more than D.
One could say we should do it better than them, but I would be
inclined to believe that RCStr provides our opportunity to do so.
Having char[] be what it is is an annoying wart, and maybe at
some point we can deprecate/remove that behaviour, but for now I'd
rather see if RCStr is viable than attempt to change the semantics
of all string handling code in D.