VLERange: a range in between BidirectionalRange and RandomAccessRange

Sun Jan 16 14:12:51 PST 2011

On 01/14/2011 08:20 PM, Nick Sabalausky wrote:
> "spir"<denis.spir at gmail.com>  wrote in message
> news:mailman.619.1295012086.4748.digitalmars-d at puremagic.com...
>>
>> If anyone finds a pointer to such an explanation, bravo, and thank you.
>> (You will certainly not find it in Unicode literature, for instance.)
>> Nick's explanation below is good and concise. (Just 2 notes added.)
>
> Yea, most Unicode explanations seem to talk all about "code-units vs
> code-points" and then they'll just have a brief note like "There's also
> other things like digraphs and combining codes." And that'll be all they
> mention.
>
> You're right about the Unicode literature. It's the usual standards-body
> documentation, same as W3C: "Instead of only some people understanding how
> this works, let's encode the documentation in legalese (and have twenty
> only-slightly-different versions) to make sure that nobody understands how
> it works."

If anyone is interested, ICU's documentation is far more readable (and 
intended for programmers). ICU is *the* reference library for dealing 
with Unicode (an IBM open source product, with C/C++/Java interfaces), 
used by many other products in the background.
ICU: http://site.icu-project.org/
user guide: http://userguide.icu-project.org/
section about text segmentation: 
http://userguide.icu-project.org/boundaryanalysis

Note that, just like Unicode, they treat forming graphemes (grouping 
code points into user-perceived characters) as just one particular case 
of text segmentation, which they call "boundary analysis" (but they had 
the good idea of saying "character" instead of "grapheme").

The only mention I found in ICU's docs of the issue we have discussed at 
such length here is (at http://userguide.icu-project.org/strings):
"Handling Lengths, Indexes, and Offsets in Strings

The length of a string and all indexes and offsets related to the string 
are always counted in terms of UChar code units, not in terms of UChar32 
code points. (This is the same as in common C library functions that use 
char * strings with multi-byte encodings.)

Often, a user thinks of a "character" as a complete unit in a language, 
like an 'Ä', while it may be represented with multiple Unicode code 
points including a base character and combining marks. (See the Unicode 
standard for details.) This often requires users to index and pass 
strings (UnicodeString or UChar *) with multiple code units or code 
points. It cannot be done with single-integer character types. Indexing 
of such "characters" is done with the BreakIterator class (in C: ubrk_ 
functions).

Even with such "higher-level" indexing functions, the actual index 
values will be expressed in terms of UChar code units. When more than 
one code unit is used at a time, the index value changes by more than 
one at a time. [...]"

(ICU's UChar is like D's wchar, a UTF-16 code unit; UChar32 is like dchar.)
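
To make the distinction in the quoted passage concrete, here is a small sketch. It is in Python only for brevity (neither ICU nor D is involved), and the helper name utf16_units is made up for the illustration. It counts UTF-16 code units, which is the unit ICU uses for lengths and indexes, and compares that with the number of code points.

```python
def utf16_units(s: str) -> int:
    """Count the UTF-16 code units needed for s (hypothetical helper)."""
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" adds no byte-order mark.
    return len(s.encode("utf-16-le")) // 2

# 'A' followed by COMBINING DIAERESIS: a user sees a single 'Ä'.
decomposed = "A\u0308"
# MUSICAL SYMBOL G CLEF lies outside the BMP, so UTF-16 needs a surrogate pair.
clef = "\U0001D11E"

for label, s in (("decomposed A-diaeresis", decomposed), ("G clef", clef)):
    # len() on a Python str counts code points, not code units.
    print(label, "| code points:", len(s), "| UTF-16 code units:", utf16_units(s))
```

Neither count tells you that each string is a single "character" in the user's sense; that is what the boundary analysis is for.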

>> You can also say there are 2 kinds of characters: simple like "u" &
>> composite "ü" or "ü??". The former are coded with a single (base) code,
>> the latter with one (rarely more) base codes and an arbitrary number of
>> combining codes.
>
> Couple questions about the "more than one base codes":
>
> - Do you know an example offhand?

No. I know this only from it being mentioned in documentation. Unless 
we consider (see below) L jamo as base codes.

> - Does that mean like a ligature where the base codes form a single glyph,
> or does it mean that the combining code either spans or operates over
> multiple glyphs? Or can it go either way?

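IIRC examples like ij in Dutch are only considered "compatibility 
equivalent" to the corresponding ligatures. (The "ss" for "ß" case in 
German is similar in spirit, although there the link comes from case 
folding and collation rather than from a compatibility decomposition.) 
This means they should not be considered equal by default; treating them 
as equal would be an additional feature, and a language- and 
app-dependent one. By contrast, base "e" + combining "^" really is equal 
(canonically equivalent) to "ê".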

>> For a majority of _common_ characters made of 2 or 3 codes (western
>> language letters, korean Hangul syllables,...), precombined codes have
>> been added to the set. Thus, they can be coded with a single code like
>> simple characters.
>>
>
> Out of curiosity, how do decomposed Hangul characters work? (Or do you
> know?) Not actually knowing any Korean, my understanding is that they're a
> set of 1 to 4 phonetic glyphs that are then combined into one glyph. So, is
> it like a series of base codes that automatically combine, or are there
> combining characters involved?

I know nothing about the Korean language except what I studied about its 
writing system for Unicode algorithms (but one can also code said 
algorithms blindly). See http://en.wikipedia.org/wiki/Hangul and, about 
Hangul in Unicode, 
http://en.wikipedia.org/wiki/Korean_language_and_computers. What I 
understand (beware, these are just wild deductions) is that there are 3 
kinds of "jamo" marks (noted L, V, T) that can combine into syllabic 
"graphemes", in initial, medial and final position respectively. These 
marks do correspond, more or less, to vowel or consonant phonemes.
In Unicode, in addition to such jamo, which are simple marks (like base 
letters and diacritics in Latin-based scripts), there are precombined 
codes for LV and LVT combinations (like for "ä" or "û"). We could thus 
think that Hangul syllables are limited to 3 jamo.
But according to Unicode's official "grapheme cluster boundaries" 
algorithm (read: how to group code points into characters) 
(http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries), an L 
jamo can also be followed by, _and_ should be combined with, another L, 
a V, an LV or an LVT code. Similarly, LV or V should be combined with a 
following V or T, and LVT or T with a following T. (Seems logical.) So I 
do not know how complicated a Hangul syllable can be in practice or in 
theory.
If in practice there are whole syllables that follow schemes other than 
L / LV / LVT, then this is another example of a real language's whole 
characters that cannot be coded by a single code point.
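
For the curious, the Hangul part of that algorithm is small enough to sketch. The Python below is only an illustration under stated assumptions, not a full implementation of the Unicode report: it handles just the three Hangul rules (GB6, GB7, GB8 in the report's numbering) and breaks everywhere else, ignoring combining marks and all other rules. The function names are mine, and the code point ranges are the jamo and precomposed-syllable ranges as I understand the current Unicode data.

```python
def hangul_type(ch: str) -> str:
    """Classify a code point as 'L', 'V', 'T', 'LV', 'LVT', or 'X' (other)."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"    # leading (initial) jamo
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"    # vowel (medial) jamo
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"    # trailing (final) jamo
    if 0xAC00 <= cp <= 0xD7A3:
        # Precomposed syllables: 28 slots per LV pair, the first has no T.
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return "X"

# For each type on the left, the types that must NOT be split from it.
NO_BREAK_AFTER = {
    "L": {"L", "V", "LV", "LVT"},   # GB6
    "LV": {"V", "T"},               # GB7
    "V": {"V", "T"},                # GB7
    "LVT": {"T"},                   # GB8
    "T": {"T"},                     # GB8
}

def hangul_clusters(text: str) -> list:
    """Group code points into clusters using only the Hangul rules."""
    clusters = []
    prev = None
    for ch in text:
        cur = hangul_type(ch)
        if clusters and cur in NO_BREAK_AFTER.get(prev, ()):
            clusters[-1] += ch      # no boundary: extend the current cluster
        else:
            clusters.append(ch)     # boundary: start a new cluster
        prev = cur
    return clusters

# Three jamo (L V T) form one cluster; so do two L jamo followed by a V.
print(len(hangul_clusters("\u1112\u1161\u11AB")))
print(len(hangul_clusters("\u1100\u1100\u1161")))
# Two precomposed syllables stay separate.
print(len(hangul_clusters("\uD55C\uAE00")))
```

The second example (L L V) is the interesting one: by these rules it is a single cluster, yet it does not fit the L / LV / LVT scheme covered by precombined codes.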

Denis
_________________
vita es estrany
spir.wikidot.com