The Case Against Autodecode
cym13 via Digitalmars-d
digitalmars-d at puremagic.com
Thu Jun 2 14:38:39 PDT 2016
On Thursday, 2 June 2016 at 20:29:48 UTC, Andrei Alexandrescu
wrote:
> On 06/02/2016 04:22 PM, cym13 wrote:
>>
>> A:“We should decode to code points”
>> B:“No, decoding to code points is a stupid idea.”
>> A:“No it's not!”
>> B:“Can you show a concrete example where it does something
>> useful?”
>> A:“Sure, look at that!”
>> B:“This isn't working at all, look at all those
>> counter-examples!”
>> A:“It may not work for your examples but look how easy it is to
>> find code points!”
>
> With autodecoding all of std.algorithm operates correctly on
> code points. Without it all it does for strings is gibberish.
> -- Andrei
Allow me to try another angle:
- There are different levels of Unicode support and you don't want to
support them all transparently. That's understandable.
- The level you choose to support is the code point level. There are
many good arguments about why this isn't a good default but you won't
change your mind. I don't like that at all and I'm not alone but let's
forget the entirety of the vocal D community for a moment.
- A huge part of Unicode characters can be normalized to fit your
definition. That way not everything works (far from it) but a
sufficiently big subset does.
- On the other hand, without normalization it just doesn't make any
sense from a user perspective. The ö example has clearly shown that
much; you even admitted it yourself by stating that many
counter-arguments would have worked had the string been normalized.
- The most prominent problem is with graphemes that can have different
representations: those that can't be normalized to a single code point
can't be searched as dchars either.
- If autodecoding to code points is to stay, and in an effort to find a
compromise, then normalizing should be done by default. Sure, it would
take some more time, but it wouldn't break any code (I think) and would
actually make things more correct. They still wouldn't be correct, but
I feel that something as crazy as Unicode cannot be tackled
generically anyway.
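
To make the normalization points above concrete, here is a minimal
sketch. It is written in Python rather than D, purely for illustration:
a Python str is compared code point by code point, which I am assuming
is a fair stand-in for searching an autodecoded range of dchars. The
helper names (contains_code_point, contains_after_nfc) and the sample
strings are made up for this example and are not Phobos functions.

```python
import unicodedata

# Two spellings of the same user-perceived character "ö".
PRECOMPOSED = "\u00f6"   # U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
DECOMPOSED = "o\u0308"   # U+006F followed by U+0308 COMBINING DIAERESIS


def contains_code_point(text, needle):
    """Code-point-level search: compare one code point at a time."""
    return any(ch == needle for ch in text)


def contains_after_nfc(text, needle):
    """Same search, but on the NFC-normalized form of the text."""
    return contains_code_point(unicodedata.normalize("NFC", text), needle)


# "schön" written with the decomposed form of ö.
word = "sch" + DECOMPOSED + "n"

# Without normalization the precomposed ö is not found.
found_raw = contains_code_point(word, PRECOMPOSED)

# After NFC the two code points are composed into one, so it is found.
found_nfc = contains_after_nfc(word, PRECOMPOSED)

# A grapheme with no precomposed form: q + combining diaeresis.
# NFC leaves it as two code points, so it can never be searched for
# as a single code point, normalized or not.
q_umlaut_len = len(unicodedata.normalize("NFC", "q\u0308"))

print(found_raw, found_nfc, q_umlaut_len)
```

The first search fails and the second succeeds, which is the "ö"
situation from earlier in the thread. The last value stays at two code
points, which is the case where even normalizing by default would not
help.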