std.algorithm.remove and principle of least astonishment

Mon Nov 22 03:32:13 PST 2010

On Sun, 21 Nov 2010 19:27:06 -0600
Andrei Alexandrescu <SeeWebsiteForEmail at erdani.org> wrote:

> > There is no easy notion of "character" in unicode. A code point is *not*
> > a character. One character can span multiple code points. I fear
> > treating dchars as "the default character unit" is repeating same kind
> > of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and
> > treating each 2-byte code unit as a character. I mean, what's the point
> > of working with the intermediary representation (code points) when it
> > doesn't represent a character?  
> 
> I understand the concern, and that's why I strongly support formal 
> abstractions that are supported by, but largely independent from, 
> representations. If graphemes are to be modeled, D is in better shape 
> than other languages. What we need to do is define a range byGrapheme() 
> that accepts char[], wchar[], or dchar[].

Sure, D helps a lot. I agree that abstraction levels should be independent of the internal representation in the general case (I think that is one major aspect and advantage of ranges, isn't it?). But it yields a huge efficiency issue in this very case: if one deals with a text at the level of graphemes while the representation is a string of code points, then every little routine has to reconstruct the graphemes on the fly. For instance, indexing 3 times will do 3 times the job of constructing a string of graphemes (up to the given indices).
Thus, when one has to do text processing, even of the simplest kind, it is necessary to use a dedicated type (or any kind of tool using a high-level representation). (This is analogous to the need to first decode code units into code points, only once, before dealing with code points -- but at a higher level.)
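
To make the cost concrete, here is a rough sketch, in Python only so that it stays self-contained. It is not a proposal for the D API. The names iter_graphemes and grapheme_at are made up for the illustration, and the clustering rule (a base code point plus the combining marks that follow it) is a deliberate simplification of real grapheme segmentation as specified in Unicode UAX #29.

```python
import unicodedata


def iter_graphemes(text):
    """Lazily yield clusters: a base code point plus its trailing combining marks.

    Simplified rule for illustration only; real segmentation follows UAX #29.
    """
    cluster = ""
    for cp in text:
        if cluster and unicodedata.combining(cp):
            # Combining mark: attach it to the cluster being built.
            cluster += cp
        else:
            # New base code point: emit the previous cluster, start a new one.
            if cluster:
                yield cluster
            cluster = cp
    if cluster:
        yield cluster


def grapheme_at(text, index):
    """Index by grapheme on the fly: rescans from the start on every call."""
    for i, cluster in enumerate(iter_graphemes(text)):
        if i == index:
            return cluster
    raise IndexError(index)


# 'e' + combining acute, 'a' + combining grave, 'o': 5 code points, 3 clusters.
text = "e\u0301a\u0300o"

# Approach 1: every lookup redoes the segmentation up to the requested index.
on_the_fly = [grapheme_at(text, i) for i in range(3)]

# Approach 2: segment once into a dedicated representation, then index freely.
clusters = list(iter_graphemes(text))
precomputed = [clusters[i] for i in range(3)]
```

Both approaches give the same clusters, but the first one walks the code points again on each call, while the second pays for the segmentation only once. That is the whole argument for a dedicated text type on top of a byGrapheme-style range.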
See also answer to Michel's post.

Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com