VLERange: a range in between BidirectionalRange and RandomAccessRange
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Mon Jan 17 09:33:04 PST 2011
On 1/17/11 10:34 AM, Michel Fortin wrote:
> On 2011-01-16 18:58:54 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail at erdani.org> said:
>
>> On 1/16/11 3:20 PM, Michel Fortin wrote:
>>> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
>>> <SeeWebsiteForEmail at erdani.org> said:
>>>
>>>> On 1/15/11 10:45 PM, Michel Fortin wrote:
>>>>> No doubt it's easier to implement it that way. The problem is that in
>>>>> most cases it won't be used. How many people really know what a
>>>>> grapheme is?
>>>>
>>>> How many people really should care?
>>>
>>> I think the only people who should *not* care are those who have
>>> validated that the input does not contain any combining code point. If
>>> you know the input *can't* contain combining code points, then it's safe
>>> to ignore them.
>>
>> I agree. Now let me ask again: how many people really should care?
>
> As I said: all those people who are not validating the inputs to make
> sure they don't contain combining code points.
The question (which I see you keep on dodging :o)) is how much text
contains combining code points.
I have worked in NLP for years, and still do. I even worked on Arabic
text (albeit Romanized). I work with Wikipedia. I use Unicode all the
time, but I have yet to have trouble with a combining character. I was
only vaguely aware of their existence up until this discussion, but I
just waved them away and guess what - it worked for me.
It does not serve us well to rigidly claim that the only good way of
doing anything Unicode is to care about graphemes. Even NSString exposes
the underlying UTF-16 encoding and provides dedicated functions for
grapheme-based processing. For one thing, if you care about the width of
a word in printed text (one of the cases where graphemes are important),
you need font information. And - surprise! - some fonts do NOT support
combining characters and render the base character and the combining
mark side by side instead of composing them, so the "wrong" method of
counting characters is more informative.
> As far as I know, no one
> is doing that, so that means everybody should use algorithms capable of
> handling multi-code-point graphemes. If someone indeed is doing this
> validation, he'll probably also be smart enough to make his algorithms
> work with dchars.
I am not sure everybody should use graphemes.
> That said, no one should really have to care but those who implement the
> string manipulation functions. The idea behind making the grapheme the
> element type is to make it easier to write grapheme-aware string
> manipulation functions, even if you don't know about graphemes. But the
> reality is probably more mixed than that.
The reality is indeed more mixed. Inevitably, at some point the API needs
to answer the question: "what is the first character of this string?"
Transparency is not possible, and changing the answer breaks all string
code out there.
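To make the "first character" question concrete, here is a small
illustration (my sketch, using today's Phobos auto-decoding behavior,
not code from this thread):

import std.range.primitives : front;
import std.stdio : writeln;

void main()
{
    // "é" spelled as 'e' followed by U+0301 COMBINING ACUTE ACCENT
    string s = "e\u0301tude";

    // Under the current scheme, front yields one decoded code point (a
    // dchar), so the "first character" is plain 'e' and the combining
    // accent is left behind.
    writeln(s.front);

    // A grapheme-aware scheme would have to answer "é" here instead,
    // which is exactly why the element type cannot be changed
    // transparently without breaking existing string code.
}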
> - - -
>
> I gave some thought about all this, and came to an interesting
> realizations that made me refine the proposal. The new proposal is
> disruptive perhaps as much as the first, but in a different way.
>
> But first, let's state a few facts to reframe the current discussion:
>
> Fact 1: most people don't know Unicode very well
> Fact 2: most people are confused by code units, code points, graphemes,
> and what is a 'character'
> Fact 3: most people won't bother with all this, they'll just use the
> basic language facilities and assume everything works correctly if it
> works correctly for them
Nice :o).
> Now, let's define two goals:
>
> Goal 1: make most people's string operations work correctly
> Goal 2: make most people's string operations work fast
Goal 3: don't break all existing code
Goal 4: make most people's string-based code easy to write and understand
> To me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not
> sure we agree on this, but let's continue.
I think we disagree about what "most" means. For you it means "people
who don't understand Unicode well but deal with combining characters
anyway". For me it's "the largest percentage of D users across various
writing systems".
> From the above 3 facts, we can deduce that a user won't want to bother
> with byDchar, byGrapheme, or byWhatever when using algorithms. You
> were annoyed by having to write byDchar everywhere, so you changed the
> element type to always be dchar and you don't have to write byDchar
> anymore. That's understandable and perfectly reasonable.
>
> The problem is of course that it doesn't give you correct results. Most
> of the time what you really want is to use graphemes; dchar just happens
> to be a good approximation that works most of the time.
Again, it's a matter of tradeoffs. I chose dchar because char was plain
_wrong_ most of the time, not because char was a pretty darn good
approximation that worked for most people most of the time. The fact
remains that dchar _is_ a pretty darn good approximation that also has
pretty darn good speed. So I'd say that I _still_ want to use dchar most
of the time.
Committing to graphemes would complicate APIs for _everyone_ and would
make things slower for _everyone_ for the sake of combining characters
that _never_ occur in _most_ people's text. This is bad design, pure and
simple. A good design is to cater for the majority and provide dedicated
APIs for the few.
> Iterating by grapheme is somewhat problematic, and it degrades
> performance.
Yes.
> Same for comparing graphemes for normalized equivalence.
Yes, although I think you can optimize code such that comparing two
strings wholesale only has a few more comparisons on the critical path.
That would still be slower, but not as slow as iterating by grapheme in
a naive implementation.
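For what it's worth, a rough sketch of that flavor of optimization (my
illustration, not code from the thread; normalize lives in today's
std.uni, which postdates this discussion):

import std.uni : normalize, NFC;

// Canonical equality with a cheap fast path: identical code-unit
// sequences are always canonically equivalent, so the expensive
// normalization only runs when the plain comparison fails.
bool canonicallyEqual(string a, string b)
{
    if (a == b)
        return true;                    // fast path: straight code-unit compare
    return normalize!NFC(a) == normalize!NFC(b);    // slow path
}

A real implementation would interleave the comparison with on-the-fly
canonical reordering instead of normalizing both sides up front, which
is where the "few more comparisons on the critical path" would come from.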
> That's all true. I'm not too sure what we can do about that. It can be
> optimized, but it's very understandable that some people won't be
> satisfied by the performance and will want to avoid graphemes.
I agree.
> Speaking of optimization, I do understand that iterating by grapheme
> using the range interface won't give you the best performance. It's
> certainly convenient as it enables the reuse of existing algorithms with
> graphemes, but more specialized algorithms and interfaces might be more
> suited.
Even the specialized algorithms will be significantly slower.
> One observation I made with having dchar as the default element type is
> that not all algorithms really need to deal with dchar. If I'm searching
> for code point 'a' in a UTF-8 string, decoding code units into code
> points is a waste of time. Why? Because the only way to represent code
> point 'a' in UTF-8 is with the single code unit 'a'.
Right. That's why many algorithms in std are specialized for such cases.
> And guess what? Almost the same
> optimization can apply to graphemes: if you're searching for 'a' in a
> grapheme-aware manner in a UTF-8 string, all you have to do is search
> for the UTF-8 code unit 'a', then check whether that 'a' code unit is
> followed by a combining mark code point to confirm it is really an 'a',
> not a composed grapheme. Iterating the string by code unit is enough
> for these cases, and it'd increase performance by a lot.
Unfortunately it all breaks as soon as you go beyond one code point. You
can't search efficiently, you can't compare efficiently. Boyer-Moore and
friends are out.
I'm not saying that we shouldn't implement the correct operations! I'm
just not convinced they should be the default.
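For concreteness, a sketch of the single-code-point trick Michel
describes (my illustration; isMark and decode are today's std.uni and
std.utf names, and a complete solution would follow the full
grapheme-break rules rather than just checking for marks):

import std.uni : isMark;
import std.utf : decode;

// Grapheme-aware search for an ASCII needle in a UTF-8 string without
// decoding the whole haystack: scan code units, and only decode the code
// point *after* a match to rule out a trailing combining mark.
size_t findAsciiGrapheme(string haystack, char needle)
{
    assert(needle < 0x80);              // the trick only works for ASCII needles
    foreach (i; 0 .. haystack.length)
    {
        if (haystack[i] != needle)
            continue;
        size_t next = i + 1;
        if (next == haystack.length)
            return i;                   // match at the very end of the string
        if (!isMark(decode(haystack, next)))
            return i;                   // not followed by a combining mark
    }
    return haystack.length;             // not found
}

And, per the paragraph above, this is exactly where it stops: once the
needle itself spans more than one code point, the decoding comes back
and Boyer-Moore-style skipping no longer applies.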
> So making dchar the default type is no doubt convenient because it
> abstracts things enough so that generic algorithms can work with
> strings, but it has a performance penalty that you don't always need. I
> gave an example using UTF-8; it applies even more to UTF-16. And it
> applies to grapheme-aware manipulations too.
It is true that UTF manipulation incurs overhead. The tradeoff has many
dimensions: UTF-16 is bulkier and less cache friendly, ASCII is not
sufficient for most people, the UTF decoding overhead is not that
high... it's difficult to find the sweetest spot.
> This penalty with generic algorithms comes from the fact that they take
> a predicate of the form "a == 'a'" or "a == b", which is ill-suited for
> strings because you always need to fully decode the string (by dchar or
> by graphemes) for the purpose of calling the predicate. Given that
> comparing characters for anything other than equality or set membership
> is something you very rarely do, generic algorithms miss a big
> optimization opportunity here.
How can we improve that? You can't argue for an inefficient scheme just
because what we have isn't as efficient as it could possibly be.
> - - -
>
> So here's what I think we should do:
>
> Todo 1: disallow generic algorithms on naked strings: string-specific
> Unicode-aware algorithms should be used instead; they can share the same
> name if their usage is similar
I don't understand this. We already do this, and by "Unicode-aware" we
understand using dchar throughout. This is transparent to client code.
> Todo 2: to use a generic algorithm with a string, you must dress the
> string using one of toDchar, toGrapheme, toCodeUnits; this way your
> intentions are clear
Breaks a lot of existing code. Won't fly with Walter unless it solves
world hunger. Nevertheless I have to say I like it; the #1 thing I'd
change about the built-in strings is that they implicitly are two things
at the same time. Asking for a representation should be explicit.
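As it happens, something close to this "dressing" exists in today's
Phobos (byCodeUnit in std.utf, byGrapheme in std.uni); a sketch of how
the explicit style reads, with the caveat that these names postdate the
thread:

import std.range.primitives : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    string s = "noe\u0301l";            // "noël", with a combining acute accent

    writeln(s.byCodeUnit.walkLength);   // 6 -- UTF-8 code units
    writeln(s.walkLength);              // 5 -- code points, the implicit dchar view
    writeln(s.byGrapheme.walkLength);   // 4 -- user-perceived characters
}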
> Todo 3: string-specific algorithms can be implemented as simple wrappers
> for generic algorithms with the string dressed correctly for the task,
> or they can implement more sophisticated algorithms to increase performance
One thing I like about the current scheme is that all
bidirectional-range algorithms work out of the box with all strings, and
lend themselves to optimization whenever you want.
This will have trouble passing Walter's wanking test. Mine too; every
time I need to write a bunch of forwarding functions, that's a signal
something went wrong somewhere. Remember MFC? :o)
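A tiny example of that out-of-the-box behavior (again just an
illustration using current Phobos):

import std.algorithm : find;
import std.range : retro;
import std.stdio : writeln;

void main()
{
    // Narrow strings already present a bidirectional range of dchar, so
    // generic range algorithms work on them without any wrapper layer.
    writeln("résumé".retro);            // émusér
    writeln(find("résumé", 'u'));       // umé
}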
> There are two major benefits to this approach:
>
> Benefit 1: if indeed you really don't want the performance penalty that
> comes with checking for composed graphemes, you can bypass it at some
> specific places in your code using byDchar, or you can disable it
> altogether by modifying the string-specific algorithms and recompiling
> Phobos.
>
> Benefit 2: we don't have to rush to implementing graphemes in the
> Unicode-aware algorithms. Just make sure the interface for
> string-specific algorithms *can* accept graphemes, and we can roll out
> support for them at a later time once we have a decent implementation.
I'm not seeing the drawbacks. Hurts everyone for the sake of a few,
breaks existing code, makes all string processing a mess, would-be users
will throw their hands in the air seeing the simplest examples, but
we'll have the satisfaction of high-fiving one another telling
ourselves that we did the right thing.
> Also, all this is leaving the question open as to what to do when
> someone uses the string as a range. In my opinion, it should either
> iterate on code units (because the string is actually an array, and
> because that's what foreach does) or simply disallow iteration (asking
> that you dress the string first using toCodeUnit, toDchar, or toGrapheme).
>
> Do you like that more?
This is not about liking. I like doing the right thing as much as you
do, and I think Phobos shows that. Clearly doing the right thing through
and through is handling combining characters appropriately. The problem
is keeping all desiderata in careful balance.
Andrei