VLERange: a range in between BidirectionalRange and RandomAccessRange

Sat Jan 15 20:45:01 PST 2011

On 2011-01-15 18:59:27 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

> I'm unclear on where this is converging to. At this point the 
> commitment of the language and its standard library to (a) UTF array 
> representation and (b) code points conceptualization is quite strong. 
> Changing that would be quite difficult and disruptive, and the benefits 
> are virtually nonexistent for most of D's user base.

There's still a disagreement about whether a string or a code unit 
array should be the default string representation, and whether 
iterating on a code unit array should give you code unit or grapheme 
elements. Of those who participated in the discussion, I don't 
think anyone is disputing the idea that a grapheme element is better 
than a dchar element for iterating over a string.

> It may be more realistic to consider using what we have as back-end for 
> grapheme-oriented processing.
> For example:
> 
> struct Grapheme(Char) if (isSomeChar!Char)
> {
>      private const Char[] rep;
>      ...
> }
> 
> auto byGrapheme(S)(S s) if (isSomeString!S)
> {
>     ...
> }
> 
> string s = "Hello";
> foreach (g; byGrapheme(s))
> {
>      ...
> }

No doubt it's easier to implement it that way. The problem is that in 
most cases it won't be used. How many people really know what a 
grapheme is? Of those, how many will forget to use byGrapheme at one time 
or another? And so in most programs string manipulation will misbehave 
in the presence of combining characters or unnormalized strings.
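
To make the failure concrete, here is a minimal sketch of what goes 
wrong with code-unit and code-point views of a string. It is only an 
illustration: the names precomposed and decomposed are made up for the 
example, and the only library call assumed is walkLength from std.range.

```d
import std.range : walkLength;
import std.stdio : writeln;

// Two spellings of what a user sees as the single character "é".
// Both constant names are hypothetical, chosen for this example.
enum string precomposed = "\u00E9";   // U+00E9, one code point
enum string decomposed  = "e\u0301";  // 'e' + U+0301 COMBINING ACUTE ACCENT

void main()
{
    // Code-unit view: .length counts UTF-8 code units.
    writeln(precomposed.length);      // 2
    writeln(decomposed.length);       // 3

    // Code-point view: iterating a string as a range yields dchar.
    writeln(precomposed.walkLength);  // 1
    writeln(decomposed.walkLength);   // 2

    // Neither view sees one user-perceived character in both cases,
    // and a plain comparison treats the two spellings as different.
    writeln(precomposed == decomposed); // false
}
```

Both strings display as one character, yet neither count nor 
comparison agrees unless the code works on graphemes and normalizes 
first.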

If you want to help D programmers write correct code when it comes to 
Unicode manipulation, you need to help them iterate on real characters 
(graphemes), and you need the algorithms to apply to real characters 
(graphemes), not the approximation of a Unicode character that is a 
code point.

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/