VLERange: a range in between BidirectionalRange and RandomAccessRange

Sun Jan 16 13:20:13 PST 2011

On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

> On 1/15/11 10:45 PM, Michel Fortin wrote:
>> No doubt it's easier to implement it that way. The problem is that in
>> most cases it won't be used. How many people really know what is a
>> grapheme?
> 
> How many people really should care?

I think the only people who should *not* care are those who have 
validated that the input does not contain any combining code points. If 
you know the input *can't* contain combining code points, then it's 
safe to ignore them.
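
To make "validated" concrete, here is a minimal sketch of such a check. It is written in Python rather than D only because Python's str is indexed by code point, much like iterating by dchar. The function name is hypothetical, and the check only looks for mark code points (general categories Mn, Mc, Me); it does not cover other multi-code-point sequences.

```python
import unicodedata

def has_combining_code_points(text: str) -> bool:
    """Return True if any code point is a mark (category Mn, Mc, or Me).

    Hypothetical helper: an approximation, not a full grapheme check.
    """
    return any(unicodedata.category(ch).startswith("M") for ch in text)

plain = "noel"              # ASCII only
precomposed = "no\u00ebl"   # U+00EB, a single code point for e-diaeresis
decomposed = "noe\u0308l"   # 'e' followed by U+0308 COMBINING DIAERESIS

print(has_combining_code_points(plain))
print(has_combining_code_points(precomposed))
print(has_combining_code_points(decomposed))
```

Only the last string should be reported as containing combining code points; the precomposed form passes the check even though it renders identically.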

If we don't make correct Unicode handling the default, someday someone 
is going to ask a developer to fix a problem where his system doesn't 
handle some text correctly. Later that day, he'll come to the 
realization that almost none of his D code and none of the D libraries 
he uses handle Unicode correctly, and he'll say: can't fix this. His 
peer working on a similar Objective-C program will have a good laugh.

Sure, correct Unicode handling is slower and more complicated to 
implement, but at least you know you'll get the right results.

>> Of those, how many will forget to use byGrapheme at one time
>> or another? And so in most programs string manipulation will misbehave
>> in the presence of combining characters or unnormalized strings.
> 
> But most strings don't contain combining characters or unnormalized strings.

I think we should expect combining marks to be used more and more as 
our OS text systems and fonts start supporting them better. They might 
be rare today, but what do you know about tomorrow?

A few years ago, many Unicode symbols didn't even show up correctly on 
Windows. Today, we have Unicode domain names and people are starting to 
put funny symbols in them (for instance: <http://◉.ws>). I haven't seen it 
yet, but we'll surely see combining characters in domain names soon 
enough (if only as a way to make fun of programs that can't handle 
Unicode correctly). Well, let me be the first to make fun of such 
programs: <http://☺̭̏.michelf.com/>.

Also, not all combining characters are marks meant to be used by some 
foreign languages. Some are used for mathematics, for instance. Or you 
could use U+20E0 COMBINING ENCLOSING CIRCLE BACKSLASH as an overlay 
indicating some kind of prohibition.
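
As a small illustration (again in Python, used here only as a convenient way to query the Unicode character database), the overlay is an enclosing mark that follows its base character as a separate code point. The base letter "P" is an arbitrary choice for the example.

```python
import unicodedata

prohibition = "\u20e0"

# Look up the character's official name and general category.
name = unicodedata.name(prohibition)
category = unicodedata.category(prohibition)
print(name)      # COMBINING ENCLOSING CIRCLE BACKSLASH
print(category)  # Me (mark, enclosing)

# One user-perceived symbol, but two code points.
no_parking = "P" + prohibition
print(len(no_parking))  # 2
```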

>> If you want to help D programmers write correct code when it comes to
>> Unicode manipulation, you need to help them iterate on real characters
>> (graphemes), and you need the algorithms to apply to real characters
>> (graphemes), not the approximation of a Unicode character that is a code
>> point.
> 
> I don't think the situation is as clean cut, as grave, and as urgent as 
> you say.

I agree it's probably not as clean cut as I say (I'm trying to keep 
complicated things simple here), but it's something important to decide 
early because the cost of changing it increases as more code is written.

Quoting the first part of the same post (out of order):

> Disagreement as that might be, a simple fact that needs to be taken 
> into account is that as of right now all of Phobos uses UTF arrays for 
> string representation and dchar as element type.
> 
> Besides, for one I do dispute the idea that a grapheme element is 
> better than a dchar element for iterating over a string. The grapheme 
> has the attractiveness of being theoretically clean but at the same 
> time is woefully inefficient and helps languages that few D users need 
> to work with. At least that's my perception, and we need some serious 
> numbers instead of convincing rhetoric to make a big decision.

You'll no doubt get more performance from a grapheme-aware specialized 
algorithm working directly on code points than by iterating on 
graphemes returned as string slices. But both will give *correct* 
results.

Implementing a specialized algorithm of this kind becomes an 
optimization, and it's likely you'll want an optimized version for most 
string algorithms.
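
A rough sketch of the two approaches, using counting of user-perceived characters as the algorithm. This is Python, not D, and all function names are hypothetical. Cluster boundaries are approximated as "a base code point plus any marks that follow it", which is much simpler than the full Unicode segmentation rules.

```python
import unicodedata

def is_mark(ch):
    """True for code points in general category Mn, Mc, or Me."""
    return unicodedata.category(ch).startswith("M")

def graphemes(text):
    """Yield approximate grapheme clusters as string slices."""
    start = 0
    for i in range(1, len(text)):
        if not is_mark(text[i]):
            yield text[start:i]
            start = i
    if text:
        yield text[start:]

def count_via_slices(text):
    """Generic approach: iterate on grapheme slices and count them."""
    return sum(1 for _ in graphemes(text))

def count_specialized(text):
    """Specialized approach: count the code points that start a cluster,
    without building any slice."""
    return sum(1 for i, ch in enumerate(text) if i == 0 or not is_mark(ch))

word = "noe\u0308l"  # 5 code points, 4 user-perceived characters
print(list(graphemes(word)))
print(count_via_slices(word), count_specialized(word))
```

Both functions agree on the count; the second one simply avoids materializing the slices, which is the kind of optimization meant above.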

I'd like to have some numbers too about performance, but I have none at 
this time.

> It's all a matter of picking one's trade-offs. Clearly ASCII is out as 
> no serious amount of non-English text can be trafficked without 
> diacritics. So switching to UTF makes a lot of sense, and that's what D 
> did.
> 
> When I introduced std.range and std.algorithm, they'd handle char[] and 
> wchar[] no differently than any other array. A lot of algorithms simply 
> did the wrong thing by default, so I attempted to fix that situation by 
> defining byDchar(). So instead of passing some string str to an 
> algorithm, one would pass byDchar(str).
> 
> A couple of weeks went by in testing that state of affairs, and before 
> late I figured that I need to insert byDchar() virtually _everywhere_. 
> There were a couple of algorithms (e.g. Boyer-Moore) that happened to 
> work with arrays for subtle reasons (needless to say, they won't work 
> with graphemes at all). But by and large the situation was that the 
> simple and intuitive code was wrong and that the correct code 
> necessitated inserting byDchar().
> 
> So my next decision, which understandably some of the people who didn't 
> go through the experiment may find unintuitive, was to make byDchar() 
> the default. This cleaned up a lot of crap in std itself and saved a 
> lot of crap in the yet-unwritten client code.

But were your algorithms *correct* in the first place? I'd argue that 
by making byDchar the default you've not saved yourself from any crap 
because dchar isn't the right layer of abstraction.
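
To show what I mean, here is a small sketch of operations that go wrong at the code point level. It uses Python, whose str is a sequence of code points, as a stand-in for iterating a D string by dchar; the variable names are only for illustration.

```python
import unicodedata

composed = "no\u00ebl"      # e-diaeresis as one code point (U+00EB)
decomposed = "noe\u0308l"   # 'e' followed by U+0308 COMBINING DIAERESIS

# Both render the same, but code point comparison says they differ.
print(composed == decomposed)
print(len(composed), len(decomposed))

# Normalizing first makes the comparison behave as a user would expect.
print(unicodedata.normalize("NFC", decomposed) == composed)

# Reversing by code point moves the diaeresis onto the wrong letter:
# the mark now follows 'l' instead of 'e'.
reversed_cp = decomposed[::-1]
print(reversed_cp == "l\u0308eon")
```

Comparison, length, and reversal all operate on valid code points here, yet the results are not what someone thinking in characters would expect.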

> I think it's reasonable to understand why I'm happy with the current 
> state of affairs. It is better than anything we've had before and 
> better than everything else I've tried.

It is indeed easy to understand why you're happy with the current state 
of affairs: you never had to deal with multi-code-point characters and 
can't imagine yourself having to deal with them on a semi-frequent 
basis. Other people won't be so happy with this state of affairs, but 
they'll probably notice only after most of their code has been written 
unaware of the problem.

> Now, thanks to the effort people have spent in this group (thank you!), 
> I have an understanding of the grapheme issue. I guarantee that 
> grapheme-level iteration will have a high cost incurred to it: 
> efficiency and changes in std. The languages that need composing 
> characters for producing meaningful text are few and far between, so it 
> makes sense to confine support for them to libraries that are not the 
> default, unless we find ways to not disrupt everyone else.

We all are more aware of the problem now, that's a good thing. :-)

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/