std.algorithm.remove and principle of least astonishment

Sun Nov 21 17:27:06 PST 2010

On 11/21/10 7:11 PM, Michel Fortin wrote:
> On 2010-11-20 18:58:33 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail at erdani.org> said:
>
>> D strings exhibit no such problems. They expose their implementation -
>> array of code units. Having that available is often handy. They also
>> obey a formal interface - bidirectional ranges.
>
> It's convenient that char[] and wchar[] expose a dchar bidirectional
> range interface... but only when a dchar bidirectional range is what you
> want to use. If you want to iterate over code units (lower-level
> representation), or graphemes (upper-level representation), then it gets
> in your way.

I agree.

> There is no easy notion of "character" in Unicode. A code point is *not*
> a character. One character can span multiple code points. I fear
> treating dchars as "the default character unit" is repeating the same kind
> of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and
> treating each 2-byte code unit as a character. I mean, what's the point
> of working with the intermediary representation (code points) when it
> doesn't represent a character?

I understand the concern, and that's why I strongly support formal 
abstractions that are supported by, but largely independent from, 
representations. If graphemes are to be modeled, D is in better shape 
than other languages. What we need to do is define a range byGrapheme() 
that accepts char[], wchar[], or dchar[].
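
A minimal sketch of what such a range could look like in use. It assumes the std.uni.byGrapheme function that later Phobos releases provide; whether that function matches the design intended here is an assumption, and graphemeCount is a hypothetical helper introduced only for illustration.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme; // assumption: available in later Phobos releases

// Hypothetical helper: counts user-perceived characters in any string type.
size_t graphemeCount(S)(S s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'e' followed by U+0301 (combining acute accent): one visible character.
    string  s8  = "e\u0301";
    wstring s16 = "e\u0301"w;
    dstring s32 = "e\u0301"d;

    writeln(s8.length);         // 3 code units in UTF-8
    writeln(s8.walkLength);     // 2 code points
    writeln(graphemeCount(s8)); // 1 grapheme

    // The same abstraction sits on top of all three representations.
    assert(graphemeCount(s16) == 1);
    assert(graphemeCount(s32) == 1);
}
```

The helper is a template, so the grapheme view stays independent of whether the underlying array holds char, wchar, or dchar.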

> Instead, I think it'd be better that the level one wants to work at be
> made explicit. If one wants to work with code points, he just rolls a
> code-point bidirectional range on top of the string. If one wants to
> work with graphemes (user-perceived characters), he just rolls a
> grapheme bidirectional range on top of the string. In other words:
>
> string str = "hello";
> foreach (cu; str) {} // code unit iteration
> foreach (cp; str.codePoints) {} // code point iteration, bidirectional range of dchar
> foreach (gr; str.graphemes) {} // grapheme iteration, bidirectional range of graphemes
>
> That'd be much cleaner than having some sort of hybrid
> code-point/code-unit array/range.
>
> Here's a nice reference about Unicode graphemes, word segmentation, and
> related algorithms.
> <http://unicode.org/reports/tr29/>

I agree except for the fact that in my experience you want to iterate 
over code points much more often than over code units. Iterating by code 
unit by default is almost always wrong. That's why D's strings offer the 
bidirectional interface by default. I have reasons to believe it was a 
good decision.
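
A small sketch of the two views a D string offers: the array view (code units) and the range view (code points). It assumes the Phobos behavior in which the range primitives decode narrow strings to dchar; the sample string is arbitrary.

```d
import std.range;
import std.stdio : writeln;

void main()
{
    // U+00E9 is one code point but two UTF-8 code units.
    string s = "h\u00E9llo";

    writeln(s.length);     // 6: array view, counts code units
    writeln(s.walkLength); // 5: range view, counts code points

    // The range interface yields decoded code points...
    static assert(is(ElementType!string == dchar));
    // ...while indexing exposes the underlying representation.
    static assert(is(typeof(s[0]) == immutable(char)));

    // A plain foreach over the array sees code units unless the
    // element type is spelled out as dchar.
    size_t units, points;
    foreach (char c; s) ++units;
    foreach (dchar c; s) ++points;
    assert(units == 6 && points == 5);
}
```

Algorithms written against the range primitives therefore see whole code points, while code that needs the raw representation can still index or slice the array.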

Andrei