VLERange: a range in between BidirectionalRange and RandomAccessRange
Nick Sabalausky
a at a.a
Thu Jan 13 22:26:41 PST 2011
"Andrei Alexandrescu" <SeeWebsiteForEmail at erdani.org> wrote in message
news:igoj6s$17r6$1 at digitalmars.com...
>
> I'm not so sure about that. What do you base this assessment on? Denis
> wrote a library that according to him does grapheme-related stuff nobody
> else does. So apparently graphemes is not what people care about (although
> it might be what they should care about).
>
It's what they want, they just don't know it.
Graphemes are what many people *think* code points are.
>
> This might be a good time to see whether we need to address graphemes
> systematically. Could you please post a few links that would educate me
> and others in the mysteries of combining characters?
>
Maybe someone else has a link to an explanation (I don't), but it's
basically just this:
Three levels of abstraction from lowest to highest:
- Code Unit (i.e., the fixed-size unit of an encoding: a byte in UTF-8, a
16-bit value in UTF-16)
- Code Point (i.e., what Unicode assigns distinct numbers to)
- Grapheme (i.e., what we think of as a "character")
A code-point can be made up of one or more code-units. Likewise, a grapheme
can be made up of one or more code-points.
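To make the first two levels concrete, here is a small sketch in Python (chosen only for illustration; the string literals are my own examples). It counts code points and code units for the same text under two encodings.

```python
# Precomposed u-with-umlaut is the single code point U+00FC.
word = "f\u00fcnf"
# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the 16-bit range.
clef = "\U0001D11E"

def counts(text):
    """Return (code points, UTF-8 code units, UTF-16 code units)."""
    code_points = len(text)                          # Python str is indexed by code point
    utf8_units = len(text.encode("utf-8"))           # UTF-8 code units are 8 bits each
    utf16_units = len(text.encode("utf-16-le")) // 2 # UTF-16 code units are 16 bits each
    return code_points, utf8_units, utf16_units

print(counts(word))  # (4, 5, 4)
print(counts(clef))  # (1, 4, 2)
```

The same code point can take a different number of code units depending on the encoding, which is why the two levels have to be kept apart.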
There are (at least) two types of code points:
- Regular ones, such as letters, digits, and punctuation.
- "Combining Characters", such as accent marks (or if you're familiar with
Japanese, the little things in the upper-right corner that change an "s" to
a "z" or an "h" to a "p". Or like German's umlaut - the two dots above a
vowel). I.e., things that are not characters in their own right, but merely
modify other characters. These can often (always?) be thought of as being
like overlays.
If a code point representing a "combining character" exists in a string,
then instead of being displayed as a character it merely modifies whatever
code-point came before it.
So, for instance, if you want to store the German word for five (in all
lower-case), there are two ways to do it:
[ 'f', {u with the umlaut}, 'n', 'f' ]
Or:
[ 'f', 'u', {umlaut combining character}, 'n', 'f' ]
Those *both* get rendered exactly the same, and both represent the same
four-letter sequence. In the second example, the 'u' and the {umlaut
combining character} combine to form one grapheme. The f's and n's just
happen to be single-code-point graphemes.
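A quick way to see the two spellings side by side, sketched in Python with the standard `unicodedata` module (the variable names are mine). It also shows that Unicode normalization can convert one form into the other, which is what a comparison would need to do if it wants to treat them as equal.

```python
import unicodedata

precomposed = "f\u00fcnf"   # 'f', U+00FC (u with umlaut), 'n', 'f'
decomposed = "fu\u0308nf"   # 'f', 'u', U+0308 COMBINING DIAERESIS, 'n', 'f'

# Same rendered word, different code point sequences.
print(len(precomposed), len(decomposed))  # 4 5
print(precomposed == decomposed)          # False

# NFC composes where a precomposed form exists; NFD decomposes.
as_nfc = unicodedata.normalize("NFC", decomposed)
as_nfd = unicodedata.normalize("NFD", precomposed)
print(as_nfc == precomposed)              # True
print(as_nfd == decomposed)               # True
```

So a naive code-point-by-code-point comparison reports the two as different, even though a reader sees the same four letters.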
Note that while some characters exist in pre-combined form (such as the {u
with the umlaut} above), legend has it there are others that can only be
represented using a combining character.
It's also my understanding, though I'm not certain, that sometimes multiple
combining characters can be used together on the same "root" character.
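Here is a rough sketch, in Python, of grouping code points into graphemes by gluing each combining mark onto whatever came before it. The function name `rough_graphemes` is made up for this example, and the rule is a deliberate simplification: it only looks at the canonical combining class and is not the full grapheme cluster algorithm from the Unicode standard.

```python
import unicodedata

def rough_graphemes(text):
    """Group code points so each combining mark joins the preceding cluster.

    Simplification: a code point counts as "combining" here when its
    canonical combining class is nonzero. Real segmentation has more rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # overlay: attach to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

# 'u' plus one combining mark forms a single cluster.
print(len(rough_graphemes("fu\u0308nf")))          # 4

# Two combining marks stacked on one base: dot above (U+0307) and
# dot below (U+0323) both attach to the 'q'.
stacked = rough_graphemes("q\u0307\u0323")
print(len(stacked), len(stacked[0]))               # 1 3
```

With this grouping, the five-code-point spelling of the word comes out as four clusters, matching what a reader would count as characters.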
Caveat: There may very well be further complications that I'm not aware of.
Heck, knowing Unicode, there probably are.