VLERange: a range in between BidirectionalRange and RandomAccessRange

Michel Fortin michel.fortin at michelf.com
Fri Jan 14 07:50:02 PST 2011


On 2011-01-13 23:23:10 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

> On 1/13/11 7:09 PM, Michel Fortin wrote:
>> That's forgetting that most of the time people care about graphemes
>> (user-perceived characters), not code points.
> 
> I'm not so sure about that. What do you base this assessment on? Denis 
> wrote a library that according to him does grapheme-related stuff 
> nobody else does. So apparently graphemes is not what people care about 
> (although it might be what they should care about).

Apple implemented all these things in the NSString class in Cocoa. They 
did all this work on Unicode at the beginning of Mac OS X, at a time 
when making such changes wouldn't break anything.

It's a hard thing to change later when you have code that depends on 
the old behaviour. It's a complicated matter and not so many people 
understand the issues, so it's no wonder many languages just deal with 
code points.


> This might be a good time to see whether we need to address graphemes 
> systematically. Could you please post a few links that would educate me 
> and others in the mysteries of combining characters?

As usual, Wikipedia offers a good summary and a couple of references. 
Here's the part about combining characters: 
<http://en.wikipedia.org/wiki/Combining_character>.

There are basically four ranges of code points which are combining:
- Combining Diacritical Marks (0300–036F)
- Combining Diacritical Marks Supplement (1DC0–1DFF)
- Combining Diacritical Marks for Symbols (20D0–20FF)
- Combining Half Marks (FE20–FE2F)

A code point followed by one or more code points in these ranges is 
conceptually a single character (a grapheme).
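To make that concrete, here's a minimal D sketch of the simplified 
rule. Note that isCombining only covers the four blocks listed above; 
the real segmentation rules (UAX #29) are based on the 
Grapheme_Cluster_Break property and cover more cases, so treat this as 
an illustration, not a complete implementation:

import std.stdio;

// Simplified check covering only the four blocks listed above.
bool isCombining(dchar c)
{
    return (c >= 0x0300 && c <= 0x036F)   // Combining Diacritical Marks
        || (c >= 0x1DC0 && c <= 0x1DFF)   // ... Supplement
        || (c >= 0x20D0 && c <= 0x20FF)   // ... for Symbols
        || (c >= 0xFE20 && c <= 0xFE2F);  // Combining Half Marks
}

void main()
{
    string s = "e\u0301tude";   // "étude" with a decomposed first letter

    size_t graphemes = 0;
    foreach (dchar c; s)        // foreach over a string decodes code points
    {
        if (!isCombining(c))
            ++graphemes;        // each non-combining code point starts a grapheme
    }
    writeln(graphemes);         // prints 5, not 6: 'e' + U+0301 count as one
}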

But to compare strings correctly, you need to take canonical 
equivalence into account. Wikipedia describes it in its Unicode 
Normalization article: <http://en.wikipedia.org/wiki/Unicode_normalization>. 
The full algorithm specification can be found here: 
<http://unicode.org/reports/tr15/>. The canonical form has both a 
composed and a decomposed variant, the first trying to use pre-combined 
characters when possible, the second not using any pre-combined 
characters. Combining marks aren't the only concern: there are also a 
few single-code-point characters which have a duplicate somewhere else 
in the code point table.
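To see why this matters when comparing, here's a small D example with 
two canonically equivalent encodings of the same character; a naive 
code-point comparison treats them as different strings:

import std.stdio;

void main()
{
    // The same user-perceived character "é", encoded two canonically
    // equivalent ways:
    string composed   = "\u00E9";    // U+00E9 LATIN SMALL LETTER E WITH ACUTE
    string decomposed = "e\u0301";   // 'e' + U+0301 COMBINING ACUTE ACCENT

    // A raw comparison sees different code points, hence different strings:
    writeln(composed == decomposed); // false

    // Converting both to the same canonical form (NFC or NFD) before
    // comparing is what makes them equal. Single-code-point duplicates
    // exist too: U+212B ANGSTROM SIGN is canonically equivalent to
    // U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE.
}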

Also, there are two normalizations: the canonical one (described above) 
and the compatibility one, which is more lax (making the ligature "ﬂ" 
equivalent to "fl", for instance). If a user searches for some text in 
a document, it's probably better to search using the compatibility 
normalization so that "ﬂower" (with the ligature) and "flower" (without 
the ligature) can match each other. If you also want to search 
case-insensitively, then you'll need to implement the collation 
algorithm, but that's going further.
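Here's a rough D illustration of the ligature case; the code itself 
only shows that the raw strings differ, the comments describe what the 
compatibility forms would do:

import std.stdio;

void main()
{
    string withLigature    = "\uFB02ower";  // U+FB02 LATIN SMALL LIGATURE FL + "ower"
    string withoutLigature = "flower";

    // U+FB02 has only a *compatibility* decomposition to "fl", so
    // canonical normalization (NFC/NFD) leaves the two strings distinct:
    writeln(withLigature == withoutLigature);   // false

    // Compatibility normalization (NFKC or NFKD) maps both to "flower",
    // which is why a search feature would typically normalize both the
    // query and the text with NFKC/NFKD before matching.
}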

If you're wondering which direction to take, this official FAQ seems 
like a good resource (especially the first few questions):
<http://www.unicode.org/faq/normalization.html>

One important thing to note is that most of the time, strings already 
come in the normalized, pre-composed form. So the normalization 
algorithm should be optimized for the case where it has nothing to do. 
That's what is said in section 1.3, Description of the Normalization 
Algorithm, of the specification: 
<http://www.unicode.org/reports/tr15/#Description_Norm>.
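As a D sketch of that fast path (isCombining is the simplified check 
from earlier, and slowPathNFC is a hypothetical stand-in for the full 
decompose/reorder/recompose step; a real quick check also has to watch 
for singletons and Hangul jamo, so this is deliberately simplified):

import std.stdio;

bool isCombining(dchar c)   // same simplified check as before
{
    return (c >= 0x0300 && c <= 0x036F)
        || (c >= 0x1DC0 && c <= 0x1DFF)
        || (c >= 0x20D0 && c <= 0x20FF)
        || (c >= 0xFE20 && c <= 0xFE2F);
}

// Hypothetical normalizer built around the recommended fast path:
// scan first, and only do real work when something suspicious turns up.
string normalizeNFC(string s)
{
    foreach (dchar c; s)
    {
        if (isCombining(c))
            return slowPathNFC(s);  // rare case: actual normalization needed
    }
    return s;                       // common case: return the input untouched
}

// Placeholder for the full canonical decompose/reorder/compose algorithm.
string slowPathNFC(string s) { return s; }

void main()
{
    writeln(normalizeNFC("déjà vu"));  // pre-composed input takes the fast path
}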


-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/


