VLERange: a range in between BidirectionalRange and RandomAccessRange
spir
denis.spir at gmail.com
Sun Jan 16 14:12:51 PST 2011
On 01/14/2011 08:20 PM, Nick Sabalausky wrote:
> "spir"<denis.spir at gmail.com> wrote in message
> news:mailman.619.1295012086.4748.digitalmars-d at puremagic.com...
>>
>> If anyone finds a pointer to such an explanation, bravo, and thank you.
>> (You will certainly not find it in Unicode literature, for instance.)
>> Nick's explanation below is good and concise. (Just 2 notes added.)
>
> Yea, most Unicode explanations seem to talk all about "code-units vs
> code-points" and then they'll just have a brief note like "There's also
> other things like digraphs and combining codes." And that'll be all they
> mention.
>
> You're right about the Unicode literature. It's the usual standards-body
> documentation, same as W3C: "Instead of only some people understanding how
> this works, let's encode the documentation in legalese (and have twenty
> only-slightly-different versions) to make sure that nobody understands how
> it works."
If anyone is interested, ICU's documentation is far more readable (and
intended for programmers). ICU is *the* reference library for dealing
with Unicode (an IBM open-source product, with C/C++/Java interfaces),
used behind the scenes by many other products.
ICU: http://site.icu-project.org/
user guide: http://userguide.icu-project.org/
section about text segmentation:
http://userguide.icu-project.org/boundaryanalysis
Note that, just like Unicode, they consider forming graphemes (grouping
codes into character representations) to be just a particular case of
text segmentation, which they call "boundary analysis" (but they have
the nice idea of using "character" instead of "grapheme").
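As a quick D-flavoured illustration of the same idea (counting "characters"
rather than code points), here is a minimal sketch. It assumes a byGrapheme
adapter in std.uni doing ICU-style boundary analysis; nothing of the sort is
in Phobos today, so treat it as a sketch of the interface I would want:

import std.stdio;
import std.range : walkLength;
import std.uni : byGrapheme;   // assumed: grapheme ("character") segmentation

void main()
{
    // "ä" and "ê", both fully decomposed: 4 code points, but 2 "characters".
    string s = "a\u0308e\u0302";
    writeln(s.walkLength);             // 4: code points (string iterates by dchar)
    writeln(s.byGrapheme.walkLength);  // 2: graphemes, ICU's "characters"
}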
The only mention I found in ICU's docs of the issue we have discussed
here at length is this (at http://userguide.icu-project.org/strings):
"Handling Lengths, Indexes, and Offsets in Strings
The length of a string and all indexes and offsets related to the string
are always counted in terms of UChar code units, not in terms of UChar32
code points. (This is the same as in common C library functions that use
char * strings with multi-byte encodings.)
Often, a user thinks of a "character" as a complete unit in a language,
like an 'Ä', while it may be represented with multiple Unicode code
points including a base character and combining marks. (See the Unicode
standard for details.) This often requires users to index and pass
strings (UnicodeString or UChar *) with multiple code units or code
points. It cannot be done with single-integer character types. Indexing
of such "characters" is done with the BreakIterator class (in C: ubrk_
functions).
Even with such "higher-level" indexing functions, the actual index
values will be expressed in terms of UChar code units. When more than
one code unit is used at a time, the index value changes by more than
one at a time. [...]"
(ICU's UChar is like D's wchar.)
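To make that concrete in D terms, here is a little sketch; the counts simply
follow from the UTF-8/UTF-16/UTF-32 encodings behind string/wstring/dstring:

import std.stdio;
import std.utf : count;   // counts code points in a string

void main()
{
    // "a" followed by U+0308 COMBINING DIAERESIS: one "character" (grapheme),
    // two code points, and a varying number of code units per encoding.
    string  s8  = "a\u0308";   // UTF-8
    wstring s16 = "a\u0308"w;  // UTF-16, same unit size as ICU's UChar
    dstring s32 = "a\u0308"d;  // UTF-32

    writeln(s8.length);   // 3 code units ('a' is 1 byte, U+0308 is 2 in UTF-8)
    writeln(s16.length);  // 2 code units
    writeln(s32.length);  // 2 code units == 2 code points
    writeln(count(s8));   // 2 code points -- still not 1 "character"
}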
>> You can also say there are 2 kinds of characters: simple like "u" &
>> composite "ü" or "ü??". The former are coded with a single (base) code,
>> the latter with one (rarely more) base codes and an arbitrary number of
>> combining codes.
>
> Couple questions about the "more than one base codes":
>
> - Do you know an example offhand?
No. I know of this only from its being mentioned in documentation.
Unless we consider (see below) L jamo as base codes.
> - Does that mean like a ligature where the base codes form a single glyph,
> or does it mean that the combining code either spans or operates over
> multiple glyphs? Or can it go either way?
IIRC, examples like ij in Dutch are only considered "compatibility
equivalent" to the corresponding ligatures, just like e.g. "ss" for "ß"
in German. Meaning they should not be considered equal by default; that
would be an additional feature, and language- and app-dependent. Unlike
base "e" + combining "^", which really == "ê".
>> For a majority of _common_ characters made of 2 or 3 codes (western
>> language letters, korean Hangul syllables,...), precombined codes have
>> been added to the set. Thus, they can be coded with a single code like
>> simple characters.
>>
>
> Out of curiosity, how do decomposed Hangul characters work? (Or do you
> know?) Not actually knowing any Korean, my understanding is that they're a
> set of 1 to 4 phonetic glyphs that are then combined into one glyph. So, is it
> like a series of base codes that automatically combine, or are there
> combining characters involved?
I know nothing about the Korean language except what I studied about its
writing system for the Unicode algorithms (but one can also code said
algorithms blindly). See http://en.wikipedia.org/wiki/Hangul and, about
Hangul in Unicode,
http://en.wikipedia.org/wiki/Korean_language_and_computers. What I
understand (beware, these are just wild deductions) is that there are 3
kinds of "jamo" scripting marks (noted L, V, T) that can combine into
syllabic "graphemes", respectively in first, middle, and last position.
These marks roughly correspond to vowel or consonant phonemes.
In Unicode, in addition to such jamo, which are simple marks (like base
letters and diacritics in Latin-based scripts), there are precombined
codes for LV and LVT combinations (just as for "ä" or "û"). We could
thus think that Hangul syllables are limited to 3 jamo.
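By the way, those precomposed LV and LVT codes are laid out purely
arithmetically in the Hangul Syllables block, so composing them needs no
lookup table at all. A sketch (constants straight from the Unicode Hangul
composition algorithm; composeHangul is just my ad-hoc name):

import std.stdio;

// Constants of the Unicode Hangul syllable composition algorithm.
enum dchar SBase = 0xAC00;        // first precomposed syllable
enum dchar LBase = 0x1100;        // first leading-consonant jamo (L)
enum dchar VBase = 0x1161;        // first vowel jamo (V)
enum dchar TBase = 0x11A7;        // one before the first trailing jamo (T)
enum int   TCount = 28;           // trailing jamo, including "none"
enum int   NCount = 21 * TCount;  // V*T combinations per leading jamo

// Compose L + V (+ optional T) jamo into the precomposed LV/LVT syllable.
dchar composeHangul(dchar l, dchar v, dchar t = TBase)
{
    int li = l - LBase;
    int vi = v - VBase;
    int ti = t - TBase;
    return cast(dchar)(SBase + li * NCount + vi * TCount + ti);
}

void main()
{
    // U+1112 + U+1161 + U+11AB  ->  U+D55C, the syllable "han"
    assert(composeHangul('\u1112', '\u1161', '\u11AB') == '\uD55C');
    writeln(composeHangul('\u1112', '\u1161', '\u11AB'));
}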
But according to Unicode's official grapheme cluster break algorithm
(read: how to group code points into characters, see
http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries), codes for
L jamo can also be followed by _and_ should be combined with other L, V,
LV or LVT codes. Similarly, LV or V should be combined with a following
V or T, and LVT or T with a following T. (Seems logical.) So I do not
know how complicated a Hangul syllable can be in practice or in theory.
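For what it's worth, here is how I read those three jamo rules (GB6-GB8 in
UAX #29), as an ad-hoc predicate; a real implementation would derive the
class from each code point's Grapheme_Cluster_Break property instead of
taking it as a parameter:

// Grapheme_Cluster_Break classes relevant to Hangul (ad-hoc subset).
enum Jamo { L, V, T, LV, LVT, other }

// Should we NOT break between two adjacent jamo classes? (GB6-GB8)
bool jamoNoBreak(Jamo a, Jamo b)
{
    if (a == Jamo.L)                    // GB6: L x (L | V | LV | LVT)
        return b == Jamo.L || b == Jamo.V || b == Jamo.LV || b == Jamo.LVT;
    if (a == Jamo.LV || a == Jamo.V)    // GB7: (LV | V) x (V | T)
        return b == Jamo.V || b == Jamo.T;
    if (a == Jamo.LVT || a == Jamo.T)   // GB8: (LVT | T) x T
        return b == Jamo.T;
    return false;
}

unittest   // compile with -unittest
{
    assert( jamoNoBreak(Jamo.L,  Jamo.V));   // L + V stay in one syllable
    assert( jamoNoBreak(Jamo.LV, Jamo.T));   // precomposed LV followed by T
    assert(!jamoNoBreak(Jamo.T,  Jamo.L));   // a T before an L starts a new one
}

If I read the rules right, a sequence like L L V T T would still count as one
single "character", so in theory such clusters can grow arbitrarily long.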
If, in practice, whole syllables can follow schemes other than
L / LV / LVT, then this is another example of real-language whole
characters that cannot be coded with a single code point.
Denis
_________________
vita es estrany
spir.wikidot.com