std.algorithm.remove and principle of least astonishment

Michel Fortin michel.fortin at michelf.com
Sun Nov 21 17:11:23 PST 2010


On 2010-11-20 18:58:33 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

> D strings exhibit no such problems. They expose their implementation - 
> array of code units. Having that available is often handy. They also 
> obey a formal interface - bidirectional ranges.

It's convenient that char[] and wchar[] expose a dchar bidirectional 
range interface... but only when a dchar bidirectional range is what 
you want to use. If you want to iterate over code units (lower-level 
representation), or graphemes (upper-level representation), then it 
gets in your way.

There is no easy notion of "character" in Unicode. A code point is 
*not* a character. One character can span multiple code points: "é", 
for instance, can be written as 'e' followed by the combining acute 
accent U+0301. I fear treating dchars as "the default character unit" 
is repeating the same kind of mistake earlier frameworks made by 
adopting UCS-2 (now UTF-16) and treating each 2-byte code unit as a 
character. I mean, what's the point of working with the intermediary 
representation (code points) when it doesn't represent a character?
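
To make the three levels concrete, here is a minimal sketch that counts 
the same string at each of them. It assumes a recent Phobos providing 
std.utf.byCodeUnit, std.utf.byDchar and std.uni.byGrapheme; the helper 
names countCodeUnits, countCodePoints and countGraphemes are made up 
for this illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

// Hypothetical helpers, named for this illustration only.

// Number of UTF-8 code units (bytes) in the string.
size_t countCodeUnits(string s) { return s.byCodeUnit.walkLength; }

// Number of code points (dchars) after decoding.
size_t countCodePoints(string s) { return s.byDchar.walkLength; }

// Number of grapheme clusters (user-perceived characters).
size_t countGraphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible "é".
    string s = "e\u0301";

    assert(countCodeUnits(s) == 3);  // 1 byte for 'e', 2 bytes for U+0301
    assert(countCodePoints(s) == 2); // 'e' and the combining accent
    assert(countGraphemes(s) == 1);  // a single user-perceived character
}
```

The three counts differ for one and the same "character", so neither 
the code unit nor the code point is a reliable stand-in for it.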

Instead, I think it'd be better to make explicit the level one wants to 
work at. If one wants to work with code points, one just layers a 
code-point bidirectional range on top of the string. If one wants to 
work with graphemes (user-perceived characters), one just layers a 
grapheme bidirectional range on top of the string. In other words:

	string str = "hello";
	// code unit iteration
	foreach (cu; str) {}
	// code point iteration, bidirectional range of dchar
	foreach (cp; str.codePoints) {}
	// grapheme iteration, bidirectional range of graphemes
	foreach (gr; str.graphemes) {}

That'd be much cleaner than having some sort of hybrid 
code-point/code-unit array/range.
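
As a rough sketch of what such explicit views could look like, the 
wrappers below borrow the names codePoints and graphemes from the 
snippet above and add a codeUnits counterpart. All three are 
hypothetical: they are not Phobos functions, and mapping them onto 
std.utf.byCodeUnit, std.utf.byDchar and std.uni.byGrapheme (assumed 
available in a recent Phobos) is only one possible way to get them. 
Whether the resulting ranges are bidirectional, as proposed, is not 
checked here.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

// Hypothetical explicit views over a string; not part of Phobos.
auto codeUnits(string s)  { return s.byCodeUnit; } // range of code units
auto codePoints(string s) { return s.byDchar; }    // range of dchar
auto graphemes(string s)  { return s.byGrapheme; } // range of Grapheme

void main()
{
    // "noël" with the ë written as 'e' + U+0308 COMBINING DIAERESIS.
    string str = "noe\u0308l";

    size_t units, points, clusters;
    foreach (cu; str.codeUnits)  ++units;    // code unit iteration
    foreach (cp; str.codePoints) ++points;   // code point iteration
    foreach (gr; str.graphemes)  ++clusters; // grapheme iteration

    assert(units == 6);    // n, o, e, 2 bytes for U+0308, l
    assert(points == 5);   // n, o, e, U+0308, l
    assert(clusters == 4); // n, o, ë, l
}
```

With UFCS the call sites read exactly like the property syntax in the 
snippet above, and the level being iterated is visible at each loop.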

Here's a nice reference about Unicode graphemes, word segmentation, and 
related algorithms (UAX #29, Unicode Text Segmentation):
<http://unicode.org/reports/tr29/>

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/



More information about the Digitalmars-d mailing list