VLERange: a range in between BidirectionalRange and RandomAccessRange

Sun Jan 16 11:29:04 PST 2011

On 1/15/11 10:45 PM, Michel Fortin wrote:
> On 2011-01-15 18:59:27 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail at erdani.org> said:
>
>> I'm unclear on where this is converging to. At this point the
>> commitment of the language and its standard library to (a) UTF array
>> representation and (b) code points conceptualization is quite strong.
>> Changing that would be quite difficult and disruptive, and the
>> benefits are virtually nonexistent for most of D's user base.
>
> There's still a disagreement about whether a string or a code unit array
> should be the default string representation, and whether iterating on a
> code unit array should give you code unit or grapheme elements. Of those
> who participated in the discussion, I don't think anyone is
> disputing the idea that a grapheme element is better than a dchar
> element for iterating over a string.

Disagreement as that might be, a simple fact that needs to be taken into 
account is that as of right now all of Phobos uses UTF arrays for string 
representation and dchar as element type.

Besides, for one I do dispute the idea that a grapheme element is better 
than a dchar element for iterating over a string. The grapheme has the 
attractiveness of being theoretically clean but at the same time is 
woefully inefficient and helps languages that few D users need to work 
with. At least that's my perception, and we need some serious numbers 
instead of convincing rhetoric to make a big decision.

It's all a matter of picking one's trade-offs. Clearly ASCII is out as 
no serious amount of non-English text can be trafficked without 
diacritics. So switching to UTF makes a lot of sense, and that's what D did.

When I introduced std.range and std.algorithm, they'd handle char[] and 
wchar[] no differently than any other array. A lot of algorithms simply 
did the wrong thing by default, so I attempted to fix that situation by 
defining byDchar(). So instead of passing some string str to an 
algorithm, one would pass byDchar(str).

A couple of weeks went by in testing that state of affairs, and before 
long I figured that I needed to insert byDchar() virtually _everywhere_. 
There were a couple of algorithms (e.g. Boyer-Moore) that happened to 
work with arrays for subtle reasons (needless to say, they won't work 
with graphemes at all). But by and large the situation was that the 
simple and intuitive code was wrong and that the correct code 
necessitated inserting byDchar().

So my next decision, which understandably some of the people who didn't 
go through the experiment may find unintuitive, was to make byDchar() 
the default. This cleaned up a lot of crap in std itself and saved a lot 
of crap in the yet-unwritten client code.
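
To make the difference concrete, here is a minimal sketch of what "decoding 
by default" means for a range algorithm. It assumes a Phobos release that 
provides std.utf.byCodeUnit (a name that does not appear in this thread); 
the helper names codeUnitCount and codePointCount are hypothetical and 
exist only for illustration.

```d
import std.range : walkLength;
import std.utf : byCodeUnit;

// Counts raw UTF-8 code units by explicitly opting out of decoding.
// (Assumption: std.utf.byCodeUnit is available in the Phobos in use.)
size_t codeUnitCount(string s)
{
    return s.byCodeUnit.walkLength;
}

// Counts code points: range algorithms see a string as a range of dchar.
size_t codePointCount(string s)
{
    return s.walkLength;
}

void main()
{
    // U+00E9 (precomposed e-acute) takes two UTF-8 code units.
    string s = "h\u00E9llo";

    assert(s.length == 6);          // array length is in code units
    assert(codeUnitCount(s) == 6);  // same view, expressed as a range
    assert(codePointCount(s) == 5); // the default view is code points
}
```

The simple call (walkLength on the string itself) gives the code point 
answer, and the code unit view is the one that has to be requested.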

I think it's reasonable to understand why I'm happy with the current 
state of affairs. It is better than anything we've had before and better 
than everything else I've tried.

Now, thanks to the effort people have spent in this group (thank you!), 
I have an understanding of the grapheme issue. I guarantee that 
grapheme-level iteration will carry a high cost, both in efficiency 
and in changes to std. The languages that need composing 
characters for producing meaningful text are few and far between, so it 
makes sense to confine support for them to libraries that are not the 
default, unless we find ways to not disrupt everyone else.

>> It may be more realistic to consider using what we have as back-end
>> for grapheme-oriented processing.
>> For example:
>>
>> struct Grapheme(Char) if (isSomeChar!Char)
>> {
>> private const Char[] rep;
>> ...
>> }
>>
>> auto byGrapheme(S)(S s) if (isSomeString!S)
>> {
>> ...
>> }
>>
>> string s = "Hello";
>> foreach (g; byGrapheme(s))
>> {
>> ...
>> }
>
> No doubt it's easier to implement it that way. The problem is that in
> most cases it won't be used. How many people really know what a
> grapheme is?

How many people really should care?

> Of those, how many will forget to use byGrapheme at one time
> or another? And so in most programs string manipulation will misbehave
> in the presence of combining characters or unnormalized strings.

But most strings don't contain combining characters and are not unnormalized.
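
For reference, the case under dispute can be sketched as follows. This is 
only an illustration, assuming a Phobos release that provides 
std.uni.byGrapheme (it is not the byGrapheme sketched in the quoted text 
above); the helper names codePointCount and graphemeCount are hypothetical.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Counts code points, the element type range algorithms see for strings.
size_t codePointCount(string s)
{
    return s.walkLength;
}

// Counts user-perceived characters.
// (Assumption: std.uni.byGrapheme is available in the Phobos in use.)
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string precomposed = "\u00E9"; // e-acute as a single code point
    string combining = "e\u0301";  // 'e' followed by COMBINING ACUTE ACCENT

    // Code point iteration sees the two spellings differently...
    assert(codePointCount(precomposed) == 1);
    assert(codePointCount(combining) == 2);

    // ...while grapheme iteration sees one character in both cases.
    assert(graphemeCount(precomposed) == 1);
    assert(graphemeCount(combining) == 1);

    // Without normalization the two spellings also compare unequal.
    assert(precomposed != combining);
}
```

For text that contains no combining sequences, the two counts agree, 
which is the common case I am referring to.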

> If you want to help D programmers write correct code when it comes to
> Unicode manipulation, you need to help them iterate on real characters
> (graphemes), and you need the algorithms to apply to real characters
> (graphemes), not the approximation of a Unicode character that is a code
> point.

I don't think the situation is as clear-cut, as grave, and as urgent as 
you say.

Andrei