VLERange: a range in between BidirectionalRange and RandomAccessRange

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Sun Jan 16 15:58:54 PST 2011


On 1/16/11 3:20 PM, Michel Fortin wrote:
> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail at erdani.org> said:
>
>> On 1/15/11 10:45 PM, Michel Fortin wrote:
>>> No doubt it's easier to implement it that way. The problem is that in
> most cases it won't be used. How many people really know what a
> grapheme is?
>>
>> How many people really should care?
>
> I think the only people who should *not* care are those who have
> validated that the input does not contain any combining code point. If
> you know the input *can't* contain combining code points, then it's safe
> to ignore them.

I agree. Now let me ask again: how many people really should care?
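For readers wondering what "validating the input" would look like: the check is a scan for code points in the Unicode Mark categories (Mn, Mc, Me). The thread is about D, but the idea is language-agnostic; here is a sketch in Python (the `has_marks` name is mine):

```python
import unicodedata

def has_marks(s: str) -> bool:
    """True if s contains any combining/mark code point (Mn, Mc, or Me).

    Note: unicodedata.combining() alone is not enough -- enclosing marks
    such as U+20E0 have combining class 0 but category 'Me'.
    """
    return any(unicodedata.category(ch).startswith("M") for ch in s)

print(has_marks("resume"))        # False
print(has_marks("re\u0301sume"))  # True: U+0301 COMBINING ACUTE ACCENT
```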

> If we don't make correct Unicode handling the default, someday someone
> is going to ask a developer to fix a problem where his system doesn't
> handle some text correctly. Later that day, he'll come to the
> realization that almost none of his D code and none of the D libraries
> he uses handle Unicode correctly, and he'll say: can't fix this. His peer
> working on a similar Objective-C program will have a good laugh.
>
> Sure, correct Unicode handling is slower and more complicated to
> implement, but at least you know you'll get the right results.

I love the increased precision, but again I'm not sure how many people 
ever manipulate text with combining characters. Meanwhile they'll 
complain that D is slower than other languages.

>>> Of those, how many will forget to use byGrapheme at one time
>>> or another? And so in most programs string manipulation will misbehave
>>> in the presence of combining characters or unnormalized strings.
>>
>> But most strings don't contain combining characters or unnormalized
>> strings.
>
> I think we should expect combining marks to be used more and more as our
> OS text systems and fonts start supporting them better. They might be
> rare today, but what do you know about tomorrow?

I don't think languages will acquire more diacritics soon. I do hope, of 
course, that D applications gain more usage in the Arabic, Hebrew etc. 
world.
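The "unnormalized strings" problem is easy to demonstrate: "é" can be one code point (U+00E9) or two (U+0065 U+0301), and code-point-level code sees two different strings of different lengths. A Python illustration (the behavior is the same in any Unicode-aware library):

```python
import unicodedata

composed = "\u00e9"    # é as one code point: U+00E9
decomposed = "e\u0301" # é as 'e' + U+0301 COMBINING ACUTE ACCENT

print(len(composed), len(decomposed))  # 1 2 -- same character, different lengths
print(composed == decomposed)          # False at the code-point level

# NFC normalization reconciles the two encodings...
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# ...but not every grapheme has a precomposed form, so normalization
# alone does not remove the need for grapheme-aware iteration.
print(len(unicodedata.normalize("NFC", "x\u0301")))  # still 2 code points
```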

> A few years ago, many Unicode symbols didn't even show up correctly on
> Windows. Today, we have Unicode domain names and people start putting
> funny symbols in them (for instance: <http://◉.ws>). I haven't seen it
> yet, but we'll surely see combining characters in domain names soon
> enough (if only as a way to make fun of programs that can't handle
> Unicode correctly). Well, let me be the first to make fun of such
> programs: <http://☺̭̏.michelf.com/>.

Would you bet the language on that?

> Also, not all combining characters are marks meant to be used by some
> foreign languages. Some are used for mathematics for instance. Or you
> could use 20E0 COMBINING ENCLOSING CIRCLE BACKSLASH as an overlay
> indicating some kind of prohibition.
>
>
>>> If you want to help D programmers write correct code when it comes to
>>> Unicode manipulation, you need to help them iterate on real characters
>>> (graphemes), and you need the algorithms to apply to real characters
>>> (graphemes), not the approximation of a Unicode character that is a code
>>> point.
>>
>> I don't think the situation is as clean cut, as grave, and as urgent
>> as you say.
>
> I agree it's probably not as clean cut as I say (I'm trying to keep
> complicated things simple here), but it's something important to decide
> early because the cost of changing it increases as more code is written.

Agreed.
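Michel's U+20E0 example above is also a case where a combining-class check falls short: enclosing marks have canonical combining class 0 yet are still marks. A quick Python check:

```python
import unicodedata

prohibited = "A\u20e0"  # 'A' overlaid with an enclosing circle backslash

print(len(prohibited))                 # 2 code points, one visual character
print(unicodedata.name("\u20e0"))      # COMBINING ENCLOSING CIRCLE BACKSLASH
print(unicodedata.category("\u20e0"))  # 'Me' -- enclosing mark
print(unicodedata.combining("\u20e0")) # 0 -- combining class alone won't flag it
```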

> Quoting the first part of the same post (out of order):
>
>> Disagreement as that might be, a simple fact that needs to be taken
>> into account is that as of right now all of Phobos uses UTF arrays for
>> string representation and dchar as element type.
>>
>> Besides, for one I do dispute the idea that a grapheme element is
>> better than a dchar element for iterating over a string. The grapheme
>> has the attractiveness of being theoretically clean but at the same
>> time is woefully inefficient and helps languages that few D users need
>> to work with. At least that's my perception, and we need some serious
>> numbers instead of convincing rhetoric to make a big decision.
>
> You'll no doubt get more performance from a grapheme-aware specialized
> algorithm working directly on code points than by iterating on graphemes
> returned as string slices. But both will give *correct* results.
>
> Implementing a specialized algorithm of this kind becomes an
> optimization, and it's likely you'll want an optimized version for most
> string algorithms.
>
> I'd like to have some numbers too about performance, but I have none at
> this time.

I spent a fair amount of time comparing the speed of ASCII-level vs. 
Unicode-aware code. The fact of the matter is that the overhead is 
measurable and often 
high. Also it occurs at a very core level. For starters, the grapheme 
itself is larger and has one extra indirection. I am confident the 
marginal overhead for graphemes would be considerable.
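The "very core level" overhead is visible in UTF-8 decoding itself: every element costs a branch on the lead byte plus accumulation of continuation bytes, where ASCII iteration is a plain array walk. A simplified decoder sketch in Python (not D's actual `std.utf.decode`; error handling for invalid sequences is omitted):

```python
def decode_utf8_codepoints(b: bytes):
    """Yield code points from UTF-8 bytes, showing the per-element
    branching that byte-level (ASCII) iteration never pays.
    Simplified: assumes well-formed input, no validation."""
    i = 0
    while i < len(b):
        lead = b[i]
        if lead < 0x80:            # 1-byte sequence (ASCII)
            cp, n = lead, 1
        elif lead < 0xE0:          # 2-byte sequence
            cp, n = lead & 0x1F, 2
        elif lead < 0xF0:          # 3-byte sequence
            cp, n = lead & 0x0F, 3
        else:                      # 4-byte sequence
            cp, n = lead & 0x07, 4
        for j in range(i + 1, i + n):
            cp = (cp << 6) | (b[j] & 0x3F)  # fold in continuation bytes
        i += n
        yield cp

print(list(decode_utf8_codepoints("h\u00e9!".encode())))  # [104, 233, 33]
```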

>> It's all a matter of picking one's trade-offs. Clearly ASCII is out as
>> no serious amount of non-English text can be trafficked without
>> diacritics. So switching to UTF makes a lot of sense, and that's what
>> D did.
>>
>> When I introduced std.range and std.algorithm, they'd handle char[]
>> and wchar[] no differently than any other array. A lot of algorithms
>> simply did the wrong thing by default, so I attempted to fix that
>> situation by defining byDchar(). So instead of passing some string str
>> to an algorithm, one would pass byDchar(str).
>>
>> A couple of weeks went by in testing that state of affairs, and before
>> long I figured that I needed to insert byDchar() virtually _everywhere_.
>> There were a couple of algorithms (e.g. Boyer-Moore) that happened to
>> work with arrays for subtle reasons (needless to say, they won't work
>> with graphemes at all). But by and large the situation was that the
>> simple and intuitive code was wrong and that the correct code
>> necessitated inserting byDchar().
>>
>> So my next decision, which understandably some of the people who
>> didn't go through the experiment may find unintuitive, was to make
>> byDchar() the default. This cleaned up a lot of crap in std itself and
>> saved a lot of crap in the yet-unwritten client code.
>
> But were your algorithms *correct* in the first place? I'd argue that by
> making byDchar the default you've not saved yourself from any crap
> because dchar isn't the right layer of abstraction.

It was correct for all but a couple languages. Again: most of today's 
languages don't ever need combining characters.
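The "subtle reasons" Boyer-Moore happened to work on raw arrays: UTF-8 is self-synchronizing, so a whole-substring match at the byte level can never begin or end inside a code point's encoding. Per-element operations enjoy no such guarantee. Illustrated in Python with raw bytes:

```python
haystack = "na\u00efve caf\u00e9".encode("utf-8")
needle = "caf\u00e9".encode("utf-8")

# Whole-substring search over raw UTF-8 bytes is sound: continuation
# bytes can never be mistaken for lead bytes, so a byte-level hit is
# always aligned to real code-point boundaries.
print(haystack.find(needle))  # 7 ("na\u00efve " is 7 bytes: \u00ef takes two)

# Per-element operations are NOT sound: the "first character" of
# "\u00e9tude" is not its first byte.
print("\u00e9tude".encode()[0:1])  # b'\xc3' -- half of \u00e9's encoding
```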

>> I think it's reasonable to understand why I'm happy with the current
>> state of affairs. It is better than anything we've had before and
>> better than everything else I've tried.
>
> It is indeed easy to understand why you're happy with the current state
> of affairs: you never had to deal with multi-code-point characters and
> can't imagine yourself having to deal with them on a semi-frequent
> basis.

Do you, and can you?

> Other people won't be so happy with this state of affairs, but
> they'll probably notice only after most of their code has been written
> unaware of the problem.

They can't be unaware and write said code.
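For concreteness, the kind of misbehavior at stake in this exchange: reversing a string one code point at a time detaches combining marks from their bases. In Python:

```python
s = "e\u0301x"       # \u00e9 (e + combining acute) followed by x
backwards = s[::-1]  # reverses code points, not graphemes

print(backwards)                # the acute accent now sits on the x
assert backwards == "x\u0301e"  # accent migrated to the wrong base
```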

>> Now, thanks to the effort people have spent in this group (thank
>> you!), I have an understanding of the grapheme issue. I guarantee that
>> grapheme-level iteration will incur a high cost:
>> efficiency and changes in std. The languages that need composing
>> characters for producing meaningful text are few and far between, so
>> it makes sense to confine support for them to libraries that are not
>> the default, unless we find ways to not disrupt everyone else.
>
> We all are more aware of the problem now, that's a good thing. :-)

All I wish is that it's not blown out of proportion. It ranks rather low 
on my list of the library issues that D has right now.
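For a sense of what grapheme-level iteration involves, and why it costs more than code-point iteration, here is a deliberately simplified clustering sketch in Python. Real segmentation follows Unicode UAX #29 and also handles Hangul jamo, ZWJ emoji sequences, etc.; this version only glues marks onto the preceding base:

```python
import unicodedata

def naive_clusters(s: str):
    """Yield grapheme-ish clusters: a base code point plus trailing marks.
    Simplified sketch -- ignores Hangul, ZWJ sequences, prepended marks."""
    cluster = ""
    for ch in s:
        if cluster and unicodedata.category(ch).startswith("M"):
            cluster += ch   # extra classification + accumulation per element
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster

print(list(naive_clusters("re\u0301sume\u0301")))
# ['r', 'e\u0301', 's', 'u', 'm', 'e\u0301'] -- 6 clusters, 8 code points
```

Even this toy version does a table lookup and a branch per code point, plus allocation of variable-length slices, which is the extra indirection and size overhead mentioned above.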


Andrei

