std.algorithm.remove and principle of least astonishment
Michel Fortin
michel.fortin at michelf.com
Sun Nov 21 17:11:23 PST 2010
On 2010-11-20 18:58:33 -0500, Andrei Alexandrescu
<SeeWebsiteForEmail at erdani.org> said:
> D strings exhibit no such problems. They expose their implementation -
> array of code units. Having that available is often handy. They also
> obey a formal interface - bidirectional ranges.
It's convenient that char[] and wchar[] expose a dchar bidirectional
range interface... but only when a dchar bidirectional range is what
you want to use. If you want to iterate over code units (lower-level
representation), or graphemes (upper-level representation), then it
gets in your way.
There is no easy notion of "character" in Unicode. A code point is
*not* a character. One character can span multiple code points. I fear
treating dchars as "the default character unit" is repeating the same kind
of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and
treating each 2-byte code unit as a character. I mean, what's the point
of working with the intermediary representation (code points) when it
doesn't represent a character?
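To make the point concrete, here is a minimal sketch that counts the same text at the three levels. It assumes a recent Phobos, where std.utf.byDchar and std.uni.byGrapheme are available (neither name comes from this thread); Counts and countLevels are hypothetical names used only for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byDchar;

// Hypothetical helper type: sizes of one string at three levels.
struct Counts
{
    size_t codeUnits;   // UTF-8 code units (bytes)
    size_t codePoints;  // Unicode code points
    size_t graphemes;   // user-perceived characters
}

// Hypothetical helper: measures `s` at each level.
Counts countLevels(string s)
{
    return Counts(
        s.length,                   // length of the underlying char array
        s.byDchar.walkLength,       // decode to code points, then count
        s.byGrapheme.walkLength);   // group into graphemes, then count
}

void main()
{
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: one character
    // to the user, two code points, three UTF-8 code units.
    assert(countLevels("e\u0301") == Counts(3, 2, 1));

    // Pure ASCII text gives the same count at every level.
    assert(countLevels("hello") == Counts(5, 5, 5));
}
```

Only the last of the three numbers matches what a user would call the length of the text.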
Instead, I think it'd be better to make explicit the level one wants
to work at. If one wants to work with code points, he just rolls a
code-point bidirectional range on top of the string. If one wants to
work with graphemes (user-perceived characters), he just rolls a
grapheme bidirectional range on top of the string. In other words:
string str = "hello";
foreach (cu; str) {}            // code unit iteration
foreach (cp; str.codePoints) {} // code point iteration, bidirectional range of dchar
foreach (gr; str.graphemes) {}  // grapheme iteration, bidirectional range of graphemes
That'd be much cleaner than having some sort of hybrid
code-point/code-unit array/range.
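For comparison, a rough sketch of the same three explicit levels using range adapters that, as far as I know, exist in later Phobos releases: std.utf.byCodeUnit, std.utf.byDchar and std.uni.byGrapheme. These are not the codePoints and graphemes properties proposed above, just the closest equivalents, and codePointsPerGrapheme is a hypothetical helper added for illustration.

```d
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

// Hypothetical helper: number of code points in each grapheme of `s`.
size_t[] codePointsPerGrapheme(string s)
{
    size_t[] result;
    foreach (gr; s.byGrapheme)
        result ~= gr.length;    // Grapheme.length counts code points
    return result;
}

void main()
{
    // "noel" with U+0308 COMBINING DIAERESIS placed after the 'e'
    string str = "noe\u0308l";

    foreach (char cu; str.byCodeUnit) {}   // code unit iteration
    foreach (dchar cp; str.byDchar) {}     // code point iteration
    foreach (gr; str.byGrapheme) {}        // grapheme iteration

    // The third grapheme is made of two code points: 'e' plus the diaeresis.
    assert(codePointsPerGrapheme(str) == [size_t(1), 1, 2, 1]);
}
```

Each loop names the level it works at, so nothing depends on what a bare string happens to decode to by default.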
Here's a nice reference about Unicode graphemes, word segmentation, and
related algorithms:
<http://unicode.org/reports/tr29/>
--
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/