VLERange: a range in between BidirectionalRange and RandomAccessRange
Michel Fortin
michel.fortin at michelf.com
Fri Jan 14 07:50:02 PST 2011
On 2011-01-13 23:23:10 -0500, Andrei Alexandrescu
<SeeWebsiteForEmail at erdani.org> said:
> On 1/13/11 7:09 PM, Michel Fortin wrote:
>> That's forgetting that most of the time people care about graphemes
>> (user-perceived characters), not code points.
>
> I'm not so sure about that. What do you base this assessment on? Denis
> wrote a library that according to him does grapheme-related stuff
> nobody else does. So apparently graphemes is not what people care about
> (although it might be what they should care about).
Apple implemented all these things in the NSString class in Cocoa. They
did all this work on Unicode at the beginning of Mac OS X, at a time
when making such changes wouldn't break anything.
It's a hard thing to change later, once you have code that depends on the
old behaviour. It's a complicated matter and not many people
understand the issues, so it's no wonder many languages just deal with
code points.
> This might be a good time to see whether we need to address graphemes
> systematically. Could you please post a few links that would educate me
> and others in the mysteries of combining characters?
As usual, Wikipedia offers a good summary and a couple of references.
Here's the part about combining characters:
<http://en.wikipedia.org/wiki/Combining_character>.
There are basically four ranges of code points which are combining:
- Combining Diacritical Marks (0300–036F)
- Combining Diacritical Marks Supplement (1DC0–1DFF)
- Combining Diacritical Marks for Symbols (20D0–20FF)
- Combining Half Marks (FE20–FE2F)
A code point followed by one or more code points in these ranges is
conceptually a single character (a grapheme).
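As a rough illustration (Python here, just because it's concise; this is only
the naive "base code point plus trailing combining marks" rule described
above, not the full grapheme-cluster segmentation of UAX #29):

    COMBINING_RANGES = [
        (0x0300, 0x036F),  # Combining Diacritical Marks
        (0x1DC0, 0x1DFF),  # Combining Diacritical Marks Supplement
        (0x20D0, 0x20FF),  # Combining Diacritical Marks for Symbols
        (0xFE20, 0xFE2F),  # Combining Half Marks
    ]

    def is_combining(cp):
        return any(lo <= cp <= hi for lo, hi in COMBINING_RANGES)

    def graphemes(s):
        # Group each base code point with the combining marks that follow it.
        cluster = ""
        for ch in s:
            if cluster and is_combining(ord(ch)):
                cluster += ch
            else:
                if cluster:
                    yield cluster
                cluster = ch
        if cluster:
            yield cluster

    # "e" + COMBINING ACUTE ACCENT is one grapheme even though it's two code points:
    print(list(graphemes("e\u0301tude")))  # ['é', 't', 'u', 'd', 'e']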
But for comparing strings correctly, you need to determine canonical
equivalence. Wikipedia describes it in its Unicode Normalization
article <http://en.wikipedia.org/wiki/Unicode_normalization>. The full
algorithm specification can be found here:
<http://unicode.org/reports/tr15/>. The canonical form
has both a composed and a decomposed variant, the first trying to use
pre-combined characters when possible, the second not using any
pre-combined character. Combining marks are not the only concern: there
are also a few single-code-point characters which have a duplicate somewhere
else in the code point table.
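For example (Python again, using its unicodedata module for brevity): the
pre-combined 'é' and the decomposed 'e' + combining acute accent are
canonically equivalent, but they only compare equal once both sides are
normalized to the same form:

    import unicodedata

    composed   = "\u00E9"    # 'é' as one pre-combined code point
    decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

    print(composed == decomposed)                       # False: raw code points differ
    print(unicodedata.normalize("NFC", composed) ==
          unicodedata.normalize("NFC", decomposed))     # True: canonically equivalent

    # NFD goes the other way, fully decomposing the pre-combined form:
    print([hex(ord(c)) for c in unicodedata.normalize("NFD", composed)])
    # ['0x65', '0x301']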
Also, there are two normalizations: the canonical one (described above)
and the compatibility one which is more lax (making the ligature "ﬂ"
equivalent to "fl", for instance). If a user searches for some
text in a document, it's probably better to search using the
compatibility normalization so that "ﬂower" (with ligature) and
"flower" (without ligature) can match each other. If you want to search
case-insensitively, then you'll need to implement the collation
algorithm, but that's going a step further.
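Sticking with the same module, the compatibility forms (NFKC/NFKD) fold the
ligature while the canonical forms leave it alone, which is what makes them
suitable for that kind of search:

    import unicodedata

    with_ligature    = "\uFB02ower"   # "ﬂower", using the 'ﬂ' ligature U+FB02
    without_ligature = "flower"

    # Canonical normalization preserves the ligature...
    print(unicodedata.normalize("NFC", with_ligature) == without_ligature)   # False
    # ...compatibility normalization folds it to plain "fl".
    print(unicodedata.normalize("NFKC", with_ligature) == without_ligature)  # True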
If you're wondering which direction to take, this official FAQ seems
like a good resource (especially the first few questions):
<http://www.unicode.org/faq/normalization.html>
One important thing to note is that most of the time, strings already come
in the normalized pre-composed form, so the normalization
algorithm should be optimized for the case where it has nothing to do. That's
what is said in section 1.3, Description of the Normalization Algorithm,
in the specification:
<http://www.unicode.org/reports/tr15/#Description_Norm>.
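In other words, a normalizer should first run a cheap "is it already
normalized?" check and only do real work when that fails. Sketched with the
same module (the direct quick-check function is a later addition to Python,
3.8+):

    import unicodedata

    def to_nfc(s):
        # Fast path: most real-world strings are already in NFC,
        # so only run the full algorithm when the quick check fails.
        if unicodedata.is_normalized("NFC", s):
            return s
        return unicodedata.normalize("NFC", s)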
--
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/