std.algorithm.remove and principle of least astonishment

Sun Nov 21 17:56:15 PST 2010

On Sunday 21 November 2010 17:27:06 Andrei Alexandrescu wrote:
> On 11/21/10 7:11 PM, Michel Fortin wrote:
> > On 2010-11-20 18:58:33 -0500, Andrei Alexandrescu
> > <SeeWebsiteForEmail at erdani.org> said:
> >> D strings exhibit no such problems. They expose their implementation -
> >> array of code units. Having that available is often handy. They also
> >> obey a formal interface - bidirectional ranges.
> > 
> > It's convenient that char[] and wchar[] expose a dchar bidirectional
> > range interface... but only when a dchar bidirectional range is what you
> > want to use. If you want to iterate over code units (lower-level
> > representation), or graphemes (upper-level representation), then it gets
> > in your way.
> 
> I agree.
> 
> > There is no easy notion of "character" in unicode. A code point is *not*
> > a character. One character can span multiple code points. I fear
> > treating dchars as "the default character unit" is repeating same kind
> > of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and
> > treating each 2-byte code unit as a character. I mean, what's the point
> > of working with the intermediary representation (code points) when it
> > doesn't represent a character?
> 
> I understand the concern, and that's why I strongly support formal
> abstractions that are supported by, but largely independent from,
> representations. If graphemes are to be modeled, D is in better shape
> than other languages. What we need to do is define a range byGrapheme()
> that accepts char[], wchar[], or dchar[].
> 
> > Instead, I think it'd be better that the level one wants to work at be
> > made explicit. If one wants to work with code points, he just rolls a
> > code-point bidirectional range on top of the string. If one wants to
> > work with graphemes (user-perceived characters), he just rolls a
> > grapheme bidirectional range on top of the string. In other words:

We could always define an abstract Character type (or whatever you want to call
it) which holds one user-perceived character, regardless of whether it takes a
single code point or a multi-code-point grapheme, and make it relatively easy to
iterate over Characters rather than dchars. It would be nice if they abolished
graphemes, though... D's handling of Unicode is a huge improvement over other
languages, but by treating dchar as a full character essentially everywhere, we
may well be opening ourselves up to a variety of grapheme-related bugs which
will be subtle and hard to find. I'm not sure what the correct solution to that
is.
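
To make the concern concrete, here is a minimal sketch of how the three levels
disagree on the length of one user-perceived character. It assumes a current
Phobos in which std.utf.byCodeUnit and std.uni.byGrapheme exist; both postdate
this thread, so treat it as an illustration rather than what was available at
the time. The sample string is my own choice.

```d
import std.range : walkLength;
import std.uni : byGrapheme;   // assumption: a Phobos release that ships byGrapheme
import std.utf : byCodeUnit;   // assumption: a Phobos release that ships byCodeUnit

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one user-perceived
    // character made of two code points.
    string s = "e\u0301";

    // Code units: 1 byte for 'e' plus 2 bytes for U+0301 in UTF-8.
    assert(s.length == 3);
    assert(s.byCodeUnit.walkLength == 3);

    // Code points: the range interface of string decodes to dchar.
    assert(s.walkLength == 2);

    // Graphemes: what a reader would call one character.
    assert(s.byGrapheme.walkLength == 1);
}
```

Any code that takes s.walkLength as "the number of characters" gets 2 here,
which is exactly the kind of subtle bug I'm worried about.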

- Jonathan M Davis