VLERange: a range in between BidirectionalRange and RandomAccessRange

spir denis.spir at gmail.com
Fri Jan 14 05:34:32 PST 2011


On 01/14/2011 07:26 AM, Nick Sabalausky wrote:
> "Andrei Alexandrescu"<SeeWebsiteForEmail at erdani.org>  wrote in message
> news:igoj6s$17r6$1 at digitalmars.com...
>>
>> I'm not so sure about that. What do you base this assessment on? Denis
>> wrote a library that according to him does grapheme-related stuff nobody
>> else does. So apparently graphemes is not what people care about (although
>> it might be what they should care about).
>>
>
> It's what they want, they just don't know it.
>
> Graphemes are what many people *think* code points are.
>
>>
>> This might be a good time to see whether we need to address graphemes
>> systematically. Could you please post a few links that would educate me
>> and others in the mysteries of combining characters?
>>
>
> Maybe someone else has a link to an explanation (I don't), but it's
> basically just this:

If anyone finds a pointer to such an explanation, bravo, and thank you.
(You will certainly not find it in Unicode literature, for instance.)
Nick's explanation below is good and concise. (Just 2 notes added.)

> Three levels of abstraction from lowest to highest:
> - Code Unit (ie, encoding)
> - Code Point (ie, what Unicode assigns distinct numbers to)
> - Grapheme (ie, what we think of as a "character")
>
> A code-point can be made up of one or more code-units. Likewise, a grapheme
> can be made up of one or more code-points.
>
> There are (at least) two types of code points:
>
> - Regular ones, such as letters, digits, and punctuation.
>
> - "Combining Characters", such as accent marks (or if you're familiar with
> Japanese, the little things in the upper-right corner that change an "s" to
> a "z" or an "h" to a "p". Or like German's umlaut - the two dots above a
> vowel). Ie, things that are not characters in their own right, but merely
> modify other characters. These can often (always?) be thought of as being
> like overlays.

You can also say there are 2 kinds of characters: simple ones like "u" and
composite ones like "ü" or "ṵ̈̈". The former are coded with a single (base)
code, the latter with one (rarely more) base code and an arbitrary number of
combining codes.
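
To make the three levels concrete, here is a small sketch in Python (chosen only for illustration; the helper name `describe` is made up for this example). It counts code points and code units for a simple character and for the two codings of a composite one. It does not count graphemes, which the standard library has no direct function for.

```python
def describe(text):
    """Return (code points, UTF-8 code units, UTF-16 code units) for text."""
    code_points = len(text)                            # Python str is a sequence of code points
    utf8_units = len(text.encode("utf-8"))             # one code unit = 1 byte
    utf16_units = len(text.encode("utf-16-le")) // 2   # one code unit = 2 bytes
    return code_points, utf8_units, utf16_units

simple = "u"              # simple character, single base code
precomposed = "\u00fc"    # u with umlaut as a single precombined code
decomposed = "u\u0308"    # base code 'u' + COMBINING DIAERESIS

for label, text in [("simple", simple),
                    ("precomposed", precomposed),
                    ("decomposed", decomposed)]:
    print(label, describe(text))
```

All three strings are a single character for the reader, yet the counts differ at both lower levels.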

For the majority of _common_ characters made of 2 or 3 codes (Western
language letters, Korean Hangul syllables,...), precombined codes have
been added to the set. Thus, they can be coded with a single code like
simple characters.
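
A quick way to see precombined codes at work is Unicode normalization, sketched here with Python's standard `unicodedata` module (the helper name `codes` is invented for this example). NFD splits a precombined code into its base and combining codes; NFC recombines them where a precombined code exists. The "q" with umlaut is just a pair picked as having no precombined code.

```python
import unicodedata

def codes(text):
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in text]

u_umlaut = "\u00fc"   # precombined u with umlaut
hangul = "\ud55c"     # precombined Hangul syllable

# Decomposition: one precombined code becomes 2 or 3 codes.
u_parts = codes(unicodedata.normalize("NFD", u_umlaut))
hangul_parts = codes(unicodedata.normalize("NFD", hangul))

# Composition: the sequence collapses back to the single code.
recomposed = unicodedata.normalize("NFC", "u\u0308")

# No precombined code exists for this pair, so NFC leaves 2 codes.
q_umlaut = unicodedata.normalize("NFC", "q\u0308")

print(u_parts, hangul_parts, codes(recomposed), codes(q_umlaut))
```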

[Also note, to keep things from being too simple ;-), a few combining
codes called "prepend" come _before_ the base in the raw code sequence...]

> If a code point representing a "combining character" exists in a string,
> then instead of being displayed as a character it merely modifies whatever
> code-point came before it.
>
> So, for instance, if you want to store the German word for five (in all
> lower-case), there are two ways to do it:
>
> [ 'f', {u with the umlaut}, 'n', 'f' ]
>
> Or:
>
> [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

Note: the second form is the base form for Unicode. There are reasons why
it was chosen (see my text), and why UCS does not and simply cannot
provide precomposed codes for all possible composite characters.
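
The practical consequence of having two codings for the same word can be sketched in Python with the standard `unicodedata` module (the function name `same_text` is hypothetical). A plain comparison works on code points and sees two different strings; comparing after normalizing both sides to the decomposed form (NFD) treats them as the same text.

```python
import unicodedata

precomposed = "f\u00fcnf"   # 'f', {u with the umlaut}, 'n', 'f'
decomposed = "fu\u0308nf"   # 'f', 'u', {umlaut combining character}, 'n', 'f'

def same_text(a, b):
    """Compare two strings after bringing both to the decomposed form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

raw_equal = precomposed == decomposed                   # code-point comparison
normalized_equal = same_text(precomposed, decomposed)   # comparison after NFD
lengths = (len(precomposed), len(decomposed))           # lengths in code points

print(raw_equal, normalized_equal, lengths)
```

Any code that searches, compares or counts on raw code points hits the same mismatch.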

> Those *both* get rendered exactly the same, and both represent the same
> four-letter sequence. In the second example, the 'u' and the {umlaut
> combining character} combine to form one grapheme. The f's and n's just
> happen to be single-code-point graphemes.
>
> Note that while some characters exist in pre-combined form (such as the {u
> with the umlaut} above), legend has it there are others that can only be
> represented using a combining character.
>
> It's also my understanding, though I'm not certain, that sometimes multiple
> combining characters can be used together on the same "root" character.

There is no logical limit, only practical ones, such as how to display 3
diacritics above the same base. You can invent a script for a mythical
folk's language if you like :-)
Also, some examples of real-language characters (Hebrew, IIRC) in
Unicode test data sets hold up to 8 codes.
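
Grouping a base code with any number of following combining codes can be sketched as below. This is a deliberately naive approximation in Python, not the full Unicode segmentation algorithm: it relies only on `unicodedata.combining` (nonzero for combining marks) and ignores "prepend" codes, Hangul jamo sequences and other special cases. The function name `naive_clusters` is made up for this example.

```python
import unicodedata

def naive_clusters(text):
    """Split text into base + combining-mark groups (rough approximation)."""
    clusters = []
    for ch in text:
        # A combining mark attaches to the group that precedes it.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

word = "fu\u0308nf"                 # 5 codes, 4 characters
stacked = "u\u0330\u0308\u0308"     # 1 base code + 3 combining codes

print(len(word), len(naive_clusters(word)))
print(len(stacked), len(naive_clusters(stacked)))
```

With this grouping, the word counts as 4 characters and the stacked example as 1, however many combining codes follow the base.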

> Caveat: There may very well be further complications that I'm not aware of.
> Heck, knowing Unicode, there probably are.

Denis
_________________
vita es estrany
spir.wikidot.com


