std.algorithm.remove and principle of least astonishment

Mon Nov 22 03:57:36 PST 2010

On Sun, 21 Nov 2010 20:11:23 -0500
Michel Fortin <michel.fortin at michelf.com> wrote:

> On 2010-11-20 18:58:33 -0500, Andrei Alexandrescu 
> <SeeWebsiteForEmail at erdani.org> said:
> 
> > D strings exhibit no such problems. They expose their implementation - 
> > array of code units. Having that available is often handy. They also 
> > obey a formal interface - bidirectional ranges.
> 
> It's convenient that char[] and wchar[] expose a dchar bidirectional 
> range interface... but only when a dchar bidirectional range is what 
> you want to use. If you want to iterate over code units (lower-level 
> representation), or graphemes (upper-level representation), then it 
> gets in your way.

True.

> There is no easy notion of "character" in unicode. A code point is 
> *not* a character. One character can span multiple code points. I fear 
> treating dchars as "the default character unit" is repeating same kind 
> of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and 
> treating each 2-byte code unit as a character. I mean, what's the point 
> of working with the intermediary representation (code points) when it 
> doesn't represent a character?

True, but only partially. The error of using UTF-16 code units as if they were code points is far less serious in practice, because code points above U+FFFF have almost no chance of ever appearing in any text a programmer will have to deal with. (This error was in fact initially caused by the standard's authors, who first thought FFFF was enough, so that 16-bit tools and encodings were created and used.)
But I fully agree with "what's the point of working with the intermediary representation (code points) when it doesn't represent a character?". *This* is wrong and may cause much damage. In practice it means applications simply do not work correctly: it is a logical error, and one that can hardly be detected automatically.
A side issue is that at present we mostly deal with source texts for which precomposed characters exist, _and_ text-producing tools usually use them. So programmers who ignore the issue may think they are right. But both of those facts may soon cease to hold.
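
To illustrate the kind of logical error I mean, here is a small sketch. It assumes a Phobos whose std.uni provides normalize and byGrapheme (an assumption on my side; use whatever your library offers). The same user-perceived character, written precomposed or decomposed, does not compare equal at the code point level, and nothing warns you about it:

```d
import std.algorithm.searching : canFind;
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    string precomposed = "\u00E9";   // "é" as a single code point
    string decomposed  = "e\u0301";  // "e" followed by a combining acute accent

    // Same character for the user, different code sequences for the program.
    assert(precomposed != decomposed);

    // A code-point-level search silently misses the decomposed form.
    assert(!"cafe\u0301".canFind(precomposed));

    // At grapheme level both forms are one element.
    assert(precomposed.byGrapheme.walkLength == 1);
    assert(decomposed.byGrapheme.walkLength == 1);

    // After normalisation (here NFC) the two forms compare equal.
    assert(normalize!NFC(decomposed) == precomposed);
}
```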

> Instead, I think it'd be better that the level one wants to work at be 
> made explicit. If one wants to work with code points, he just rolls a 
> code-point bidirectional range on top of the string. If one wants to 
> work with graphemes (user-perceived characters), he just rolls a 
> grapheme bidirectional range on top of the string. In other words:
> 
> 	string str = "hello";
> 	foreach (cu; str) {}            // code unit iteration
> 	foreach (cp; str.codePoints) {} // code point iteration, bidirectional range of dchar
> 	foreach (gr; str.graphemes) {}  // grapheme iteration, bidirectional range of graphemes
> 
> That'd be much cleaner than having some sort of hybrid 
> code-point/code-unit array/range.
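
For comparison, here is roughly what the three explicit levels could look like. The names codePoints and graphemes in the quoted code are proposals; the sketch below instead assumes a Phobos that offers std.utf.byCodeUnit, std.utf.byDchar and std.uni.byGrapheme, which is an assumption on my side:

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    // "noël" with a decomposed ë: 'e' followed by U+0308 (combining diaeresis)
    string str = "noe\u0308l";

    assert(str.byCodeUnit.walkLength == 6); // UTF-8 code units (U+0308 takes 2)
    assert(str.byDchar.walkLength == 5);    // code points
    assert(str.byGrapheme.walkLength == 4); // user-perceived characters
}
```

Each level gives a different length for the same text, which is why the level should be chosen explicitly.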

Yes, but the ability to iterate over graphemes, while the internal representation is a string of code points or code units, is *not* what we need:
	text.count(c);
would have to construct graphemes on the fly over the whole string. Every text processing routine applied to a given text would have to do this on all or part of the text (indexing, for instance, would do it only up to the given index). This means every routine would repeat the job of constructing a string of graphemes (and normalising it), a job that should be done only once. I hope that's clear.
This is the reason why we need a proper Text type as a string of graphemes. The same abstraction offered by dchar (from code units to code points) is needed at a higher level (from code points to graphemes). Each element would be what I call a "stack", a mini-array of dchars. Then we can deal with it just like plain ASCII or Latin-1 text.

c c c c c c c c c            dstring = dchar[] --> coded string

    c
  c c   c
c c c c c                    text = stack[]    --> logical string
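
A minimal sketch of such a type, to make the diagram concrete. Text, stacks and countOf are hypothetical names of mine, and I assume a Phobos that provides std.uni.byGrapheme and normalize; each stack is stored as a dstring here for simplicity:

```d
import std.array : array;
import std.uni : byGrapheme, normalize, NFC;

// Hypothetical Text type: normalised and split into "stacks" (graphemes)
// once, at construction, so later routines work on plain array elements.
struct Text
{
    dstring[] stacks; // one mini-array of dchars per grapheme

    this(string source)
    {
        // Normalise first so that equivalent sequences yield equal stacks.
        foreach (g; normalize!NFC(source).byGrapheme)
            stacks ~= g[].array.idup;
    }

    size_t length() const { return stacks.length; }

    dstring opIndex(size_t i) const { return stacks[i]; }

    // Counting is a plain element comparison, no decoding on the fly.
    size_t countOf(dstring stack) const
    {
        size_t n;
        foreach (s; stacks)
            if (s == stack)
                ++n;
        return n;
    }
}

void main()
{
    auto t = Text("e\u0301te\u0301");  // "été" written with decomposed accents
    assert(t.length == 3);             // three user-perceived characters
    assert(t[1] == "t"d);              // constant-time indexing
    assert(t.countOf("\u00E9"d) == 2); // NFC turned each e + accent into U+00E9
}
```

The cost of building graphemes and normalising is paid once in the constructor; indexing and counting afterwards are ordinary array operations.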

> Here's a nice reference about unicode graphemes, word segmentation, and 
> related algorithms.

> <http://unicode.org/reports/tr29/>

I once implemented (in Lua) the algorithm used to construct graphemes out of code points, as a base for a grapheme-level Text type with all common text processing routines (*). I plan to do this in and for D in a short while. As said, it should be simpler thanks to D's true string types, which already abstract away the lower-level issues.

(*) Actually, once one has a string of <graphemes/codes/code-units>, the routines are the same whatever the kind of element. There could be a generic version in std.string.
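
As a rough sketch of what "generic" means here: countOf below is a hypothetical name, not an existing std.string function, and the grapheme array is written by hand as dstrings. The same routine body serves all three levels:

```d
// Hypothetical generic routine: the body does not care whether the
// elements are code units, code points or graphemes.
size_t countOf(Items, E)(Items items, E item)
{
    size_t n;
    foreach (e; items)
        if (e == item)
            ++n;
    return n;
}

void main()
{
    // Code units: foreach over a string yields chars.
    assert(countOf("banana", 'a') == 3);

    // Code points: a dstring is an array of dchars.
    assert(countOf("banana"d, 'n') == 2);

    // Graphemes: each element is a "stack", here stored as a dstring.
    dstring[] stacks = ["e\u0301"d, "t"d, "e\u0301"d];
    assert(countOf(stacks, "e\u0301"d) == 2);
}
```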

-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com