VLERange: a range in between BidirectionalRange and RandomAccessRange

Nick Sabalausky a at a.a
Thu Jan 13 22:26:41 PST 2011


"Andrei Alexandrescu" <SeeWebsiteForEmail at erdani.org> wrote in message 
news:igoj6s$17r6$1 at digitalmars.com...
>
> I'm not so sure about that. What do you base this assessment on? Denis 
> wrote a library that according to him does grapheme-related stuff nobody 
> else does. So apparently graphemes is not what people care about (although 
> it might be what they should care about).
>

It's what they want, they just don't know it.

Graphemes are what many people *think* code points are.

>
> This might be a good time to see whether we need to address graphemes 
> systematically. Could you please post a few links that would educate me 
> and others in the mysteries of combining characters?
>

Maybe someone else has a link to an explanation (I don't), but it's 
basically just this:

Three levels of abstraction from lowest to highest:
- Code Unit (i.e., the fixed-size piece an encoding is built from, such as 
a single byte in UTF-8)
- Code Point (i.e., what Unicode assigns distinct numbers to)
- Grapheme (i.e., what we think of as a "character")

A code point can be made up of one or more code units. Likewise, a grapheme 
can be made up of one or more code points.
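
To make the code unit / code point split concrete, here's a rough sketch in Python (my choice of language for illustration, nothing D-specific). The helper name count_units and the sample strings are made up for this example. It relies on Python's str being indexed by code point, and on UTF-8 and UTF-16 code units being 1 and 2 bytes wide respectively.

```python
def count_units(s):
    """Return (code points, UTF-8 code units, UTF-16 code units) for s."""
    code_points = len(s)                            # Python str is indexed by code point
    utf8_units = len(s.encode("utf-8"))             # UTF-8 code units are 1 byte each
    utf16_units = len(s.encode("utf-16-le")) // 2   # UTF-16 code units are 2 bytes each
    return code_points, utf8_units, utf16_units


# Hypothetical sample strings, each one a single code point.
samples = {
    "ASCII letter f": "f",
    "u with umlaut (U+00FC)": "\u00fc",
    "musical G clef (U+1D11E)": "\U0001d11e",
}

for label, text in samples.items():
    print(label, count_units(text))
```

Each sample is one code point, but the number of code units depends on the encoding: the G clef needs four UTF-8 code units or two UTF-16 code units.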

There are (at least) two types of code points:

- Regular ones, such as letters, digits, and punctuation.

- "Combining Characters", such as accent marks (or, if you're familiar with 
Japanese, the little marks in the upper-right corner that change an "s" to 
a "z" or an "h" to a "p"; or German's umlaut, the two dots above a 
vowel). I.e., things that are not characters in their own right, but merely 
modify other characters. These can often (always?) be thought of as 
overlays.

If a code point representing a "combining character" exists in a string, 
then instead of being displayed as a character it merely modifies whatever 
code-point came before it.
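
A deliberately simplified sketch of that "attach to whatever came before" rule, in Python. The function name rough_graphemes is made up, and treating "any code point whose general category starts with M" as a combining character is my assumption for illustration. The real segmentation rules (Unicode's UAX #29) cover more cases than this, so take it as an approximation only.

```python
import unicodedata


def rough_graphemes(s):
    """Group each combining mark with the code point that precedes it.

    Approximation only: a code point counts as "combining" here when its
    general category is a Mark category (Mn, Mc, or Me).
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # glue the mark onto the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


# 'u' followed by U+0308 COMBINING DIAERESIS ends up in one cluster.
print(rough_graphemes("fu\u0308nf"))
```

A mark at the very start of a string has nothing to attach to, so this sketch just leaves it as its own cluster.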

So, for instance, if you want to store the German word for five (in all 
lower-case), there are two ways to do it:

[ 'f', {u with the umlaut}, 'n', 'f' ]

Or:

[ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

Those *both* get rendered exactly the same, and both represent the same 
four-letter sequence. In the second example, the 'u' and the {umlaut 
combining character} combine to form one grapheme. The two 'f's and the 'n' 
just happen to be single-code-point graphemes.
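
Here are those two spellings written out in Python, as a sketch. The variable names are mine. U+00FC is the precombined u-with-umlaut and U+0308 is the combining diaeresis. unicodedata.normalize is the standard library's Unicode normalization function: NFC composes where it can, NFD decomposes.

```python
import unicodedata

precomposed = "f\u00fcnf"    # [ 'f', {u with the umlaut}, 'n', 'f' ]
decomposed = "fu\u0308nf"    # [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

# Different code point counts, and a naive comparison says they differ.
print(len(precomposed), len(decomposed))
print(precomposed == decomposed)

# Normalizing both to the same form makes them compare equal.
nfc = unicodedata.normalize("NFC", decomposed)    # compose
nfd = unicodedata.normalize("NFD", precomposed)   # decompose
print(nfc == precomposed, nfd == decomposed)
```

So anything comparing or searching strings code point by code point will treat the two spellings as different unless it normalizes first.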

Note that while some characters exist in pre-combined form (such as the {u 
with the umlaut} above), legend has it there are others that can only be 
represented using a combining character.
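
One way to poke at that claim, sketched in Python: ask for the composed (NFC) form and see whether the combining character goes away. As far as I know there is no precombined "q with umlaut" code point, so I'm using that as the example; the choice of 'q' is my assumption, not something from a reference.

```python
import unicodedata

with_precombined = "u\u0308"      # 'u' + combining diaeresis
without_precombined = "q\u0308"   # 'q' + combining diaeresis (assumed to have no precombined form)

for text in (with_precombined, without_precombined):
    composed = unicodedata.normalize("NFC", text)
    # List the code points that remain after composing.
    print([f"U+{ord(ch):04X}" for ch in composed])
```

The 'u' case collapses to the single code point U+00FC, while the 'q' case stays as two code points even in composed form.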

It's also my understanding, though I'm not certain, that sometimes multiple 
combining characters can be used together on the same "root" character.
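
A small Python sketch of what stacking looks like at the code point level. The particular combination (an 'o' with a circumflex above and a dot below) is just an example I picked, and the variable names are made up.

```python
import unicodedata

# One "root" character followed by two combining characters.
stacked = "o\u0302\u0323"

# Everything after the root whose general category is a Mark category.
marks = [ch for ch in stacked[1:] if unicodedata.category(ch).startswith("M")]

print(len(stacked))                          # code points in the string
print([unicodedata.name(ch) for ch in marks])
```

That's three code points which a renderer with combining support should draw as a single character.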

Caveat: There may very well be further complications that I'm not aware of. 
Heck, knowing Unicode, there probably are.



