Scott Meyers' DConf 2014 keynote "The Last Thing D Needs"

Dmitry Olshansky via Digitalmars-d-announce digitalmars-d-announce at puremagic.com
Thu May 29 11:25:18 PDT 2014


29-May-2014 02:10, Jonathan M Davis via Digitalmars-d-announce wrote:
> On Tue, 27 May 2014 06:42:41 -1000
> Andrei Alexandrescu via Digitalmars-d-announce
> <digitalmars-d-announce at puremagic.com> wrote:
>
>  >
>  > http://www.reddit.com/r/programming/comments/26m8hy/scott_meyers_dconf_2014_keynote_the_last_thing_d/
>  >
>  > https://news.ycombinator.com/newest (search that page, if not found
>  > click "More" and search again)
>  >
>  > https://www.facebook.com/dlang.org/posts/855022447844771
>  >
>  > https://twitter.com/D_Programming/status/471330026168651777
>
> Fortunately, for the most part, I think that we've avoided the types of
> inconsistencies that Scott describes for C++, but we do definitely have some
> of our own. The ones that come to mind at the moment are:

I won't comment on the other points, but the Unicode one caught my eye.
>
> 6. The situation with ranges and string is kind of ugly, with them being
> treated as ranges of code points. I don't know what the correct solution to
> this is, since treating them as ranges of code units promotes efficiency but
> makes code more error-prone, whereas treating them as ranges of graphemes
> would just cost too much.

This is a gross oversimplification of the matter. There is no "more
correct" or "less correct" here. Each algorithm requires its own level of
consideration. If there is a simple truism about Unicode, it is:

Never operate on a single character, rather operate on slices of text.
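
A small illustration of why per-character processing breaks down, sketched
in Python purely for convenience (the example and the helper name
upper_one_char are mine, not anything from Phobos or this thread): full case
mapping can change the length of a string, so a function that maps one
character to exactly one character cannot express it.

```python
# Upper-casing is defined on text, not on isolated characters:
# U+00DF (ß) upper-cases to the two-character sequence "SS".
word = "straße"

# Whole-slice operation: the result is one code point longer than the input.
upper = word.upper()

# A hypothetical per-character API that must return exactly one character
# for each input character has no correct answer for "ß".
def upper_one_char(c):
    u = c.upper()
    # Forced to give up and leave c unchanged when the mapping expands.
    return u if len(u) == 1 else c

per_char = "".join(upper_one_char(c) for c in word)
```

Here the slice-level call can grow the text as needed, while the
character-at-a-time version silently leaves the "ß" in place.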

To sum up the situation:

The Unicode standard defines *all* of its algorithms in terms of code
points, and some additionally use grapheme clusters. It says nothing about
code units beyond the mapping of code units --> code points. So whether or
not you should actually decode is up to the implementation.
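
To make the three levels concrete, here is a minimal sketch in Python
(again only for convenience, not D). The sample string and the helper
approx_clusters are my own assumptions; the helper only glues combining
marks onto the preceding code point and is NOT the full grapheme cluster
segmentation that Unicode specifies.

```python
import unicodedata

# "noël" spelled with a combining diaeresis: n, o, e, U+0308, l
text = "noe\u0308l"

utf8_units = len(text.encode("utf-8"))            # UTF-8 code units (bytes)
utf16_units = len(text.encode("utf-16-le")) // 2  # UTF-16 code units
code_points = len(text)                           # Python str counts code points

def approx_clusters(s):
    # Rough stand-in for grapheme clusters: start a new cluster at every
    # code point that is not a combining mark. An approximation only.
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

clusters = approx_clusters(text)

# Reversing by code point detaches the diaeresis from "e" and moves it
# onto "l"; reversing by cluster keeps each perceived character intact.
reversed_by_code_point = text[::-1]
reversed_by_cluster = "".join(reversed(clusters))
```

The same text has 6 UTF-8 code units, 5 code points and 4 approximate
clusters, and an operation such as reversal gives a different (and
differently wrong or right) answer depending on which level it works at.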


> Ranges of code points is _mostly_ correct but still incorrect and _more_
> efficient than graphemes but still quite a bit less efficient than code
> units. So, it's kind of like it's got the best and worst of both worlds.
> The current situation causes inconsistencies with everything else (forcing
> us to use isNarrowString all over the place) and definitely requires
> frequent explaining, but it does prevent some classes of problems. So, I
> don't know. I used to be in favor of the current situation, but at this
> point, if we could change it, I think that I'd argue in favor of just
> treating them as ranges of code units and then have wrappers for ranges of
> code points or graphemes.

Agreed. The simple dream of automatically decoding UTF and staying 
"Unicode correct" is a failure.

> It seems like the current situation promotes either using ubyte[] (if you
> care about efficiency) or the new grapheme facilities in std.uni if you
> care about correctness, whereas just using strings as ranges of dchar is
> probably a bad idea unless you just don't want to deal with any of the
> Unicode stuff, don't care all that much about efficiency, and are willing
> to have bugs in the areas where operating at the code point level is
> incorrect.

The worst thing about the current situation is that any generic code that
works on UTF ranges has to jump through an unbelievable number of hoops to
undo the "string has no length" madness.

I think what we should do is define a StringRange or some such, which
would at least make the current special case of string more generic.

-- 
Dmitry Olshansky

