VLERange: a range in between BidirectionalRange and RandomAccessRange

Michel Fortin michel.fortin at michelf.com
Mon Jan 17 12:29:48 PST 2011


On 2011-01-17 12:33:04 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

> On 1/17/11 10:34 AM, Michel Fortin wrote:
>> As I said: all those people who are not validating the inputs to make
>> sure they don't contain combining code points.
> 
> The question (which I see you keep on dodging :o)) is how much text 
> contains combining code points.

Not much, right now.

The problem is that the answer to this question is likely to change as 
Unicode support improves in operating systems and applications. 
Shouldn't we future-proof Phobos?


> It does not serve us well to rigidly claim that the only good way of 
> doing anything Unicode is to care about graphemes.

For the time being we can probably afford it.


> Even NSString exposes the UTF16 underlying encoding and provides 
> dedicated functions for grapheme-based processing. For one thing, if 
> you care about the width of a word in printed text (one of the case 
> where graphemes are important), you need font information. And - 
> surprise! - some fonts do NOT support combining characters and print 
> signs next to one another instead of juxtaposing them, so the "wrong" 
> method of counting characters is more informative.

Generally what OS X does in those cases is display that character in 
another font. That said, counting graphemes is never a good way to 
tell how much space some text will take (unless the application 
enforces a fixed width per grapheme). It's more useful for telling the 
number of characters in a text document, similar to a word count.
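
To illustrate the distinction (a minimal sketch; byGrapheme here 
stands for the kind of grapheme range I mention near the end of this 
message):

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme;

    void main()
    {
        // "é" in decomposed form: 'e' + U+0301 COMBINING ACUTE ACCENT
        string s = "e\u0301toile";

        writeln(s.length);                // 8 UTF-8 code units
        writeln(s.walkLength);            // 7 code points
        writeln(s.byGrapheme.walkLength); // 6 graphemes: what a user
                                          // calls the "character count"
    }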


>> That said, no one should really have to care but those who implement the
>> string manipulation functions. The idea behind making the grapheme the
>> element type is to make it easier to write grapheme-aware string
>> manipulation functions, even if you don't know about graphemes. But the
>> reality is probably more mixed than that.
> 
> The reality is indeed more mixed. Inevitably at some point the API 
> needs to answer the question: "what is the first character of this 
> string?" Transparency is not possible. You break all string code out 
> there.

I'm not sure what you mean by that.


>> - - -
>> 
>> I gave some thought about all this, and came to an interesting
>> realizations that made me refine the proposal. The new proposal is
>> disruptive perhaps as much as the first, but in a different way.
>> 
>> But first, let's state a few facts to reframe the current discussion:
>> 
>> Fact 1: most people don't know Unicode very well
>> Fact 2: most people are confused by code units, code points, graphemes,
>> and what is a 'character'
>> Fact 3: most people won't bother with all this, they'll just use the
>> basic language facilities and assume everything works correctly if it
>> works correctly for them
> 
> Nice :o).
> 
>> Now, let's define two goals:
>> 
>> Goal 1: make most people's string operations work correctly
>> Goal 2: make most people's string operations work fast
> 
> Goal 3: don't break all existing code
> Goal 4: make most people's string-based code easy to write and understand

Those are worthy goals too.


>> To me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not
>> sure we agree on this, but let's continue.
> 
> I think we disagree about what "most" means. For you it means "people 
> who don't understand Unicode well but deal with combining characters 
> anyway". For me it's "the largest percentage of D users across various 
> writing systems".

It's not just D users, it's also the users of programs written by D 
users. I can't count how many times I've seen accented characters 
mishandled on websites and elsewhere, and I have an aversion to doing 
the same thing to people of other cultures and languages.

If the operating system supports combining marks, users expect that 
applications running on it will deal with them correctly too, and 
they'll (rightfully) blame your application if it doesn't. The same 
goes for websites.

I understand that in some situations you don't want to deal with 
graphemes even if you theoretically should, but I don't think it should 
be the default.


>> One observation I made with having dchar as the default element type is
>> that not all algorithms really need to deal with dchar. If I'm searching
>> for code point 'a' in a UTF-8 string, decoding code units into code
>> points is a waste of time. Why? Because the only way to represent code
>> point 'a' is by having code point 'a'.
> 
> Right. That's why many algorithms in std are specialized for such cases.
> 
>> And guess what? Almost the same
>> optimization can apply to graphemes: if you're searching for 'a' in a
>> grapheme-aware manner in a UTF-8 string, all you have to do is search
>> for the UTF-8 code unit 'a', then check whether that 'a' code unit is
>> followed by a combining mark code point to confirm it is really an 'a',
>> not a composed grapheme. Iterating the string by code unit is enough for
>> these cases, and it'd increase performance by a lot.
> 
> Unfortunately it all breaks as soon as you go beyond one code point. 
> You can't search efficiently, you can't compare efficiently. 
> Boyer-Moore and friends are out.

Ok. Say you were searching for the needle "étoilé" in a UTF-8 
haystack. I see two ways to extend the optimization described above:

1. search for the easy part "toil", then check its surrounding 
graphemes to confirm it's really "étoilé";
2. search for a code point matching 'é' or 'e', then confirm that the 
code points following it form the right graphemes.

Implementing the second one can be done by converting the needle to a 
regular expression operating at the code-unit level. With that you can 
search efficiently for the needle directly in code units, without 
having to decode or normalize the whole haystack.
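
Here's a minimal sketch of the first strategy (names are mine; it 
checks only the first "toil" hit and only the two normalization forms 
of 'é', and doesn't reject extra trailing combining marks, so it shows 
the idea rather than a complete implementation):

    import std.algorithm.searching : endsWith, find, startsWith;
    import std.string : representation;

    bool containsEtoile(string haystack)
    {
        // The two normalizations of 'é': precomposed and decomposed.
        static immutable eAcute = ["\u00E9", "e\u0301"];

        // 1. Search for the all-ASCII core "toil" directly in the
        //    code units; no decoding of the haystack required.
        auto tail = haystack.representation.find("toil".representation);
        if (tail.length == 0)
            return false;

        auto before = haystack[0 .. haystack.length - tail.length];
        auto after  = cast(string) tail["toil".length .. $];

        // 2. Only now look at the surrounding graphemes to confirm
        //    the match really is "étoilé".
        foreach (pre; eAcute)
            foreach (post; eAcute)
                if (before.endsWith(pre) && after.startsWith(post))
                    return true;
        return false;
    }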


>> This penalty with generic algorithms comes from the fact that they take
>> a predicate of the form "a == 'a'" or "a == b", which is ill-suited for
>> strings because you always need to fully decode the string (by dchar or
>> by graphemes) for the purpose of calling the predicate. Given that
>> comparing characters for something other than equality or membership in
>> a set is rarely done, generic algorithms miss a big optimization
>> opportunity here.
> 
> How can we improve that? You can't argue for an inefficient scheme just 
> because what we have isn't as efficient as it could possibly be.

You ask what's inefficient about generic algorithms having customizable 
predicates? You can't implement the above optimization if you can't 
guarantee the predicate is "==". That said, perhaps we can detect "==" 
and only apply the optimization then.

Being able to specify the predicate doesn't gain you much for strings, 
because a < 'a' doesn't make much sense. All you need to check for is 
equality with some value, or membership in a given character set, both 
of which can use the optimization above.
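
Here's a sketch of that detection (illustrative only; in real Phobos 
this dispatch would live inside the existing find):

    import std.algorithm.searching : find;
    import std.functional : binaryFun;
    import std.traits : isSomeString;

    /// Sketch: take the fast code-unit path when the predicate is
    /// known to be plain equality, otherwise decode and compare.
    auto myFind(string pred = "a == b", H, N)(H haystack, N needle)
        if (isSomeString!H && isSomeString!N)
    {
        static if (pred == "a == b")
        {
            // Equality on strings is equality on their code units, so
            // a plain substring search needs no decoding.
            return haystack.find(needle);
        }
        else
        {
            // Any other predicate needs the decoded code points.
            return haystack.find!(binaryFun!pred)(needle);
        }
    }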


>> So here's what I think we should do:
>> 
>> Todo 1: disallow generic algorithms on naked strings: string-specific
>> Unicode-aware algorithms should be used instead; they can share the same
>> name if their usage is similar
> 
> I don't understand this. We already do this, and by "Unicode-aware" we 
> understand using dchar throughout. This is transparent to client code.

That's probably because you haven't understood the intent (I might not 
have made it very clear either).

The problem I see currently is that you rely on dchar being the element 
type. That should be an implementation detail, not something client 
code can see or rely on. By making it an implementation detail, you can 
later make grapheme-aware algorithms the default without changing the 
API. Since you're the gatekeeper to Phobos, you can make this change 
conditional on getting an acceptable level of performance out of the 
grapheme-aware algorithms, or on other factors, like the amount of 
combining characters you encounter in the wild over the next few years.

So the general string functions would implement your compromise (using 
dchar) but not commit to it indefinitely. Someone who really wants to 
work in code points can use toDchar, someone who wants to deal with 
graphemes uses toGraphemes, and someone who doesn't care chooses 
nothing and gets the default compromise behaviour.

All you need to do for this is document it, and try to make sure the 
string APIs don't force the implementation to work with code points.
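
In code, the client-visible surface could be as small as this (a 
sketch; the adapter names are the ones from this proposal, and the 
byCodeUnit/byUTF/byGrapheme ranges backing them are stand-ins, as any 
implementation with these element types would do):

    import std.uni : byGrapheme;
    import std.utf : byCodeUnit, byUTF;

    auto toCodeUnits(C)(C[] str) { return str.byCodeUnit; }   // code units
    auto toDchar(C)(C[] str)     { return str.byUTF!dchar; }  // code points
    auto toGraphemes(C)(C[] str) { return str.byGrapheme; }   // graphemes

    void main()
    {
        import std.range : walkLength;

        assert("e\u0301".toCodeUnits.walkLength == 3); // UTF-8 code units
        assert("e\u0301".toDchar.walkLength == 2);     // code points
        assert("e\u0301".toGraphemes.walkLength == 1); // one grapheme: 'é'
    }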


>> Todo 2: to use a generic algorithm with a string, you must dress the
>> string using one of toDchar, toGrapheme, toCodeUnits; this way your
>> intentions are clear
> 
> Breaks a lot of existing code.
> 
> Won't fly with Walter unless it solves world hunger. Nevertheless I 
> have to say I like it; the #1 thing I'd change about the built-in 
> strings is that they implicitly are two things at the same time. Asking 
> for representation should be explicit.

No, it doesn't break anything. This is just the continuation of what I 
tried to explain above: if you want to be sure you're working with 
graphemes or dchars, say so.

Also, it says nothing about iteration or foreach, so I'm not sure why 
it wouldn't fly with Walter. That can stay as it is, except for one 
thing: you and Walter should really get on the same wavelength 
regarding ElementType!(char[]) and foreach(c; string). I don't care 
that much which is the default, but they absolutely need to be the same.
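
For the record, this is the mismatch, as the language and Phobos 
behave today:

    import std.range : ElementType;

    static assert(is(ElementType!(char[]) == dchar)); // ranges decode

    void main()
    {
        size_t count;
        foreach (c; "é")  // c is inferred as char: code units, no decoding
            ++count;
        assert(count == 2); // two UTF-8 code units for one code point
    }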


>> Todo 3: string-specific algorithms can be implemented as simple wrappers
>> for generic algorithms, with the string dressed correctly for the task,
>> or they can implement more sophisticated algorithms to increase performance
> 
> One thing I like about the current scheme is that all 
> bidirectional-range algorithms work out of the box with all strings, 
> and lend themselves to optimization whenever you want to.

I like this as the default behaviour too. I think however that you 
should restrict the algorithms that work out of the box to those which 
can also work with graphemes. This way you can change the behaviour in 
the future and support graphemes through a simple upgrade of Phobos.

Algorithms that don't work with graphemes would still work via toDchar.

So what doesn't work with graphemes? Predicates such as "a < b". 
That's pretty much it.
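
To make Todo 3 concrete (a sketch; toGraphemes is the adapter from 
above):

    import std.algorithm.searching : canFind;

    /// String-specific wrapper: the element type never leaks into the
    /// signature, so a later Phobos could swap the dchar compromise
    /// for graphemes without breaking any caller.
    bool contains(string haystack, string needle)
    {
        // Today: the dchar-based compromise.
        return haystack.canFind(needle);
        // Tomorrow, possibly:
        // return haystack.toGraphemes.canFind(needle.toGraphemes);
    }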


> This will have trouble passing Walter's wanking test. Mine too; every 
> time I need to write a bunch of forwarding functions, that's a signal 
> something went wrong somewhere. Remember MFC? :o)

The idea is that we write the API as it would apply to graphemes, but 
we implement it using dchar for the time being. Some function 
signatures might have to differ a bit.
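
For instance (a sketch with an invented name): the signature below 
speaks only in strings, so whether the implementation steps through 
code points, as here, or through graphemes is invisible to the caller.

    import std.uni : isWhite;
    import std.utf : decodeFront;

    /// Remove leading whitespace; the API never mentions dchar.
    string stripLeftWhite(string s)
    {
        while (s.length != 0)
        {
            auto rest = s;
            immutable c = rest.decodeFront; // advances rest one code point
            if (!isWhite(c))
                break;
            s = rest;
        }
        return s;
    }
    // stripLeftWhite("  été") == "été"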


>> Do you like that more?
> 
> This is not about liking. I like doing the right thing as much as you 
> do, and I think Phobos shows that. Clearly doing the right thing 
> through and through is handling combining characters appropriately. The 
> problem is keeping all desiderata in careful balance.

Well then, don't you find this balanced enough? I'm not asking that 
everything be done with graphemes. I'm not even asking that anything be 
done with graphemes by default. I'm only asking that we keep the API 
clean enough that we can switch to graphemes by default in the future 
without having to rewrite all the code everywhere to use byGrapheme. 
Isn't this the right balance?


-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/


