std.algorithm.remove and principle of least astonishment

Mon Nov 22 05:24:33 PST 2010

On 2010-11-22 06:57:36 -0500, spir <denis.spir at gmail.com> said:

> On Sun, 21 Nov 2010 20:11:23 -0500
> Michel Fortin <michel.fortin at michelf.com> wrote:
> 
>> On 2010-11-20 18:58:33 -0500, Andrei Alexandrescu
>> <SeeWebsiteForEmail at erdani.org> said:
>> 
>>> D strings exhibit no such problems. They expose their implementation -
>>> array of code units. Having that available is often handy. They also
>>> obey a formal interface - bidirectional ranges.
>> 
>> It's convenient that char[] and wchar[] expose a dchar bidirectional
>> range interface... but only when a dchar bidirectional range is what
>> you want to use. If you want to iterate over code units (lower-level
>> representation), or graphemes (upper-level representation), then it
>> gets in your way.
> 
> True.
> 
>> There is no easy notion of "character" in unicode. A code point is
>> *not* a character. One character can span multiple code points. I fear
>> treating dchars as "the default character unit" is repeating same kind
>> of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and
>> treating each 2-byte code unit as a character. I mean, what's the point
>> of working with the intermediary representation (code points) when it
>> doesn't represent a character?
> 
> True, but only partially. The error of using utf16 to represent code
> points is far less serious in practice, because code points > ffff have
> about no chance to ever be present in any text one programmer will ever
> have to deal with. (This error was in fact initially caused by the
> standard people who first thought ffff was enough, so that 16-bit tools
> and encodings were created and used.)
> But I fully agree with "what's the point of working with the
> intermediary representation (code points) when it doesn't represent a
> character?". *This* is wrong and may cause much damage. Actually, it
> means apps simply do not work correctly; a logical error; and one that
> can hardly be automatically detected.
> A side-issue is that in present times we mostly deal with source texts
> for which there exist precomposed characters, _and_ text-producing tools
> usually use them. So that programmers who ignore the issue may think
> they are right. But both of those facts may soon be wrong.
> 
>> Instead, I think it'd be better that the level one wants to work at be
>> made explicit. If one wants to work with code points, he just rolls a
>> code-point bidirectional range on top of the string. If one wants to
>> work with graphemes (user-perceived characters), he just rolls a
>> grapheme bidirectional range on top of the string. In other words:
>> 
>> 	string str = "hello";
>> 	foreach (cu; str) {}            // code unit iteration
>> 	// code point iteration: bidirectional range of dchar
>> 	foreach (cp; str.codePoints) {}
>> 	// grapheme iteration: bidirectional range of graphemes
>> 	foreach (gr; str.graphemes) {}
>> 
>> That'd be much cleaner than having some sort of hybrid
>> code-point/code-unit array/range.
> 
> Yop, but the ability to iterate over graphemes, while the internal
> representation is of a string of codes, or code units, is *not* what we
> need:
> 	text.count(c);
> would have to construct graphemes on the fly on the whole string.

I agree there might be a use case for a special data type allowing fast 
random access to graphemes and able to retain the precise count of 
graphemes. But if what you do only requires iterating over all 
graphemes, a wrapper range that converts to graphemes on the fly might 
be less overhead than building a separate data structure.
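
As a rough illustration of such a wrapper range, here is a minimal
sketch. It assumes a standard library that provides `std.uni.byGrapheme`
and `std.utf.byDchar`; the `str.graphemes` and `str.codePoints`
properties in my earlier snippet were only a proposed spelling, and
`graphemeCount` is a hypothetical helper named for this example.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byDchar;

// Hypothetical helper: counts user-perceived characters by walking a lazy
// grapheme range; no array of graphemes is built along the way.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one grapheme
    string str = "e\u0301";

    assert(str.length == 3);              // UTF-8 code units
    assert(str.byDchar.walkLength == 2);  // code points
    assert(graphemeCount(str) == 1);      // graphemes, built on the fly
}
```

The same string gives three different lengths depending on the level you
ask for, which is why I'd rather have the level spelled out at the call
site.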

In fact, a separate data structure holding graphemes will probably
require more memory, and a larger working set fits worse in the
processor's cache. Compare the cost of a cache miss with the one or two
comparisons needed to check whether the next code point belongs to the
same grapheme, and you might actually find that converting code points
to graphemes on the fly is faster for long strings. As long as you
don't need random access to the graphemes, I don't think you need a
separate data structure.
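
For contrast, here is a sketch of the separate-structure approach, under
the same assumption that `std.uni.byGrapheme` and its `Grapheme` type
are available. `graphemeTable` is a hypothetical helper, and the last
assertion only compares the size of the array elements with the size of
the UTF-8 text; it is not a benchmark.

```d
import std.array : array;
import std.uni : byGrapheme, Grapheme;

// Hypothetical helper: materializes every grapheme so they can be indexed.
Grapheme[] graphemeTable(string s)
{
    return s.byGrapheme.array;
}

void main()
{
    string str = "noe\u0308l"; // "noël" written with a combining diaeresis
    auto table = graphemeTable(str);

    assert(table.length == 4);     // the grapheme count is now stored
    assert(table[2].length == 2);  // third grapheme: 'e' + U+0308
    // The table occupies more bytes than the UTF-8 text it was built from.
    assert(table.length * Grapheme.sizeof > str.length);
}
```

You get indexing and a stored count, but you pay for them with the
extra memory discussed above.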

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/