VLERange: a range in between BidirectionalRange and RandomAccessRange
Michel Fortin
michel.fortin at michelf.com
Mon Jan 17 08:34:24 PST 2011
On 2011-01-16 18:58:54 -0500, Andrei Alexandrescu
<SeeWebsiteForEmail at erdani.org> said:
> On 1/16/11 3:20 PM, Michel Fortin wrote:
>> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
>> <SeeWebsiteForEmail at erdani.org> said:
>>
>>> On 1/15/11 10:45 PM, Michel Fortin wrote:
>>>> No doubt it's easier to implement it that way. The problem is that in
>>>> most cases it won't be used. How many people really know what is a
>>>> grapheme?
>>>
>>> How many people really should care?
>>
>> I think the only people who should *not* care are those who have
>> validated that the input does not contain any combining code point. If
>> you know the input *can't* contain combining code points, then it's safe
>> to ignore them.
>
> I agree. Now let me ask again: how many people really should care?
As I said: all those people who are not validating the inputs to make
sure they don't contain combining code points. As far as I know, no one
is doing that, so that means everybody should use algorithms capable of
handling multi-code-point graphemes. If someone indeed is doing this
validation, he'll probably also be smart enough to make his algorithms
work with dchars.
That said, no one should really have to care but those who implement
the string manipulation functions. The idea behind making the grapheme
the element type is to make it easier to write grapheme-aware string
manipulation functions, even if you don't know about graphemes. But the
reality is probably more mixed than that.
- - -
I gave some thought to all this, and came to an interesting realization
that made me refine the proposal. The new proposal is perhaps as
disruptive as the first, but in a different way.
But first, let's state a few facts to reframe the current discussion:
Fact 1: most people don't know Unicode very well
Fact 2: most people are confused by code units, code points, graphemes,
and what is a 'character'
Fact 3: most people won't bother with all this; they'll just use the
basic language facilities and assume everything works correctly as long
as it works correctly for them
Now, let's define two goals:
Goal 1: make most people's string operations work correctly
Goal 2: make most people's string operations work fast
To me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not
sure we agree on this, but let's continue.
From the above 3 facts, we can deduce that a user won't want to bother
with using byDchar, byGrapheme, or byWhatever when using algorithms. You
were annoyed by having to write byDchar everywhere, so you changed the
element type to always be dchar and you don't have to write byDchar
anymore. That's understandable and perfectly reasonable.
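To make the status quo concrete, here's a quick sketch of what that
element-type decision means in practice, assuming the std.array range
primitives for narrow strings (this is just an illustration, not part of
the proposal):

    import std.array : empty, front, popFront;
    import std.stdio : writeln;

    void main()
    {
        string s = "déjà";
        while (!s.empty)
        {
            // front decodes the next code point, so the element type is dchar
            static assert(is(typeof(s.front) == dchar));
            writeln(s.front);
            s.popFront();
        }
    }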
The problem is of course that it doesn't give you correct results. What
you really want most of the time is to work with graphemes; dchar just
happens to be a good approximation that works most of the time.
Iterating by grapheme is somewhat problematic, and it degrades
performance. Same for comparing graphemes for normalized equivalence.
That's all true. I'm not too sure what we can do about that. It can be
optimized, but it's very understandable that some people won't be
satisfied by the performance and will want to avoid graphemes.
Speaking of optimization, I do understand that iterating by grapheme
using the range interface won't give you the best performance. It's
certainly convenient as it enables the reuse of existing algorithms
with graphemes, but more specialized algorithms and interfaces might be
more suited.
One observation I made with having dchar as the default element type is
that not all algorithms really need to deal with dchar. If I'm
searching for code point 'a' in a UTF-8 string, decoding code units
into code points is a waste of time. Why? Because the only way UTF-8
can represent code point 'a' is with the single code unit 'a'. And
almost the same optimization applies to graphemes: if you're searching
for 'a' in a grapheme-aware manner in a UTF-8 string, all you have to
do is search for the UTF-8 code unit 'a', then check whether that code
unit is followed by a combining mark code point to confirm it is really
an 'a' and not a composed grapheme. Iterating the string by code unit
is enough for these cases, and it'd increase performance by a lot.
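Here's a rough sketch of that kind of search, just to make the idea
concrete. findGraphemeA and isCombiningMark are made-up names for the
illustration, and isCombiningMark is a grossly simplified stand-in for
a real Unicode combining-mark check:

    import std.utf : decode;

    // Hypothetical, grossly simplified check; the real thing would
    // consult the Unicode tables (this only covers U+0300..U+036F).
    bool isCombiningMark(dchar c)
    {
        return c >= 0x0300 && c <= 0x036F;
    }

    // Search for the grapheme 'a' by scanning code units; decoding only
    // happens right after a candidate match, to peek at what follows.
    size_t findGraphemeA(string s)
    {
        foreach (i; 0 .. s.length)
        {
            if (s[i] != 'a')
                continue;
            size_t next = i + 1;
            if (next == s.length)
                return i;                      // 'a' at the very end
            if (!isCombiningMark(decode(s, next)))
                return i;                      // plain 'a', not 'a' + mark
        }
        return s.length;                       // not found
    }

    void main()
    {
        string s = "a\u0300 e a";       // first grapheme is 'a' + combining grave
        assert(findGraphemeA(s) == 6);  // finds the lone 'a' at the end
    }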
So making dchar the default type is no doubt convenient because it
abstracts things enough so that generic algorithms can work with
strings, but it has a performance penalty that you don't always need.
My example used UTF-8; it applies even more to UTF-16, and it applies
to grapheme-aware manipulations too.
This penalty with generic algorithms comes from the fact that they take
a predicate of the form "a == 'a'" or "a == b", which is ill-suited for
strings because you always need to fully decode the string (by dchar or
by graphemes) just to call the predicate. Given that comparing
characters for anything other than equality or membership in a set is
very rarely something you do, generic algorithms miss a big
optimization opportunity here.
- - -
So here's what I think we should do:
Todo 1: disallow generic algorithms on naked strings: string-specific
Unicode-aware algorithms should be used instead; they can share the
same name if their usage is similar
Todo 2: to use a generic algorithm with a string, you must dress the
string using one of toDchar, toGrapheme, toCodeUnits; this way your
intentions are clear
Todo 3: string-specific algorithms can be implemented as simple
wrappers around generic algorithms with the string dressed correctly
for the task, or they can implement more sophisticated algorithms to
increase performance (see the sketch after this list)
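Here's a minimal sketch of the shape I have in mind. toDchar is the
hypothetical adaptor named in Todo 2, implemented as an eager stand-in
just for the sketch, and the string-specific count is only a thin
wrapper as described in Todo 3:

    import std.algorithm;
    import std.conv : to;

    // Hypothetical adaptor from Todo 2, eager stand-in for the sketch;
    // a real toDchar would be a lazy range over the original string.
    dstring toDchar(string s)
    {
        return to!dstring(s);
    }

    // Todo 3: a string-specific version sharing the generic algorithm's
    // name, implemented as a wrapper that dresses the string. A smarter
    // version could scan code units directly or be grapheme-aware.
    size_t count(string s, dchar c)
    {
        return std.algorithm.count(toDchar(s), c);
    }

    void main()
    {
        string s = "déjà vu";
        assert(count(s, 'é') == 1);  // resolves to the string-specific wrapper
    }

Note that this wrapper still compares code points: a decomposed 'é'
(e + combining acute) would not match, which is exactly the gap the
grapheme-aware implementations from Todo 1 would eventually close.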
There are two major benefits to this approach:
Benefit 1: if indeed you really don't want the performance penalty that
comes with checking for composed graphemes, you can bypass it at some
specific places in your code using byDchar, or you can disable it
altogether by modifying the string-specific algorithms and recompiling
Phobos.
Benefit 2: we don't have to rush to implementing graphemes in the
Unicode-aware algorithms. Just make sure the interface for
string-specific algorithms *can* accept graphemes, and we can roll out
support for them at a later time once we have a decent implementation.
Also, all this leaves open the question of what to do when someone uses
the string directly as a range. In my opinion, it should either iterate
on code units (because the string is actually an array, and because
that's what foreach does) or simply disallow iteration (asking that you
dress the string first using toCodeUnits, toDchar, or toGrapheme).
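For reference, here is what foreach already does today when you don't
ask for a specific element type; this is current behavior, not part of
the proposal:

    import std.stdio : writefln;

    void main()
    {
        string s = "é";                   // two UTF-8 code units: 0xC3 0xA9
        foreach (c; s)                    // c is immutable(char): a code unit
            writefln("%02X", cast(ubyte) c);
        foreach (dchar d; s)              // asking for dchar decodes code points
            writefln("U+%04X", cast(uint) d);  // prints U+00E9
    }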
Do you like that more?
--
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/