why a part of D community do not want go to D2 ?

spir denis.spir at gmail.com
Thu Nov 11 14:59:36 PST 2010


On Thu, 11 Nov 2010 22:40:47 +0100
Daniel Gibson <metalcaedes at gmail.com> wrote:

> spir schrieb:
> > On Thu, 11 Nov 2010 09:40:05 -0800
> > Andrei Alexandrescu <SeeWebsiteForEmail at erdani.org> wrote:
> > 
> >>> string substring(string s, size_t beg, size_t end) // "logical slice" -
> >>> from code point number beg to code point number end  
> >> That's not implemented and I don't think it would be useful. Usually 
> >> when I want a substring, the calculations up to that point indicate the 
> >> code _unit_ I'm at.
> > 
> > Yes, but a code point does not represent a character; it represents a unicode "abstract character".
> > 
> > import std.stdio, std.string;
> > 
> > void main() {
> >     dstring s = "\u0061\u0302\u006d\u0065"d; // 'a' + combining circumflex + "me"
> >     writeln(s);                   // "âme"
> >     assert(s[0..1] == "a");       // first code point alone is a plain 'a'
> >     assert(s.indexOf("â") == -1); // precomposed 'â' (U+00E2) is not in there
> > }
> > 
> > A "user-perceived character" (also strangely called "grapheme" in unicode docs) can be represented by an arbitrary number of code _points_ (up to 8 in their test data, but there is no actual limit). What a code point represents is, rather, a "scripting mark". In "â", there are 2 of them. For legacy reasons, UCS also includes "precomposed characters", so that "â" can indeed also be represented by a single code. But the above form is valid; it is even arguably the base form for "â" (and most composite chars cannot be represented by a single code).
> > 
> 
> OMG, this is worse than I thought O_O
> I thought "ok, for UTF-8 one code unit is one byte and one 'real', visible 
> character is called a code point and consists of 1-4 code units" - but having 
> "user-perceived characters" that consist of multiple code units is sick.

Most people, even programmers who deal with unicode every day, think the same. This is due to several factors: (1) unicode's misleading use of "abstract character" (one wonders whether it was done on purpose); (2) string processing tools simply ignore all of this; (3) most texts we deal with today hold only common characters that have a single-code representation.
So everybody plays with strings as if (1 code <--> 1 char).
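To make the mismatch concrete, here is a short sketch (Python used purely for illustration; `unicodedata` is in its standard library) showing that the decomposed and precomposed spellings of "âme" have different lengths, and that a naive substring search misses the precomposed form:

```python
import unicodedata

# "âme" spelled with a combining circumflex: 'a' + U+0302 + 'm' + 'e'
decomposed = "\u0061\u0302\u006d\u0065"
# NFC normalization folds 'a' + U+0302 into the precomposed U+00E2
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))            # 4 code points
print(len(composed))              # 3 code points, same visible text
print(decomposed.find("\u00e2"))  # -1: precomposed 'â' is not in there
```

The two strings render identically, yet they differ in length and fail naive equality and search; that is exactly the (1 code <--> 1 char) assumption breaking.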

> Unicode has a way to tell if a sequence of code units (bytes) belongs together 
> or not, so identifying code points isn't too hard.
> But is there a way to identify "graphemes"? Other than a list of rules like "a 
> sequence of the two code points <foo> and <bar> makes up one 'grapheme' <foobar>"?
 
There is indeed an algorithm, and it is not too complicated. But the string of codes itself carries no information that tells you about it (meaning you cannot synchronise at the start or end of a "grapheme" without knowledge of the whole algorithm).
Accordingly, when picking _some_ code point (e.g. the 'a' above), there is no way to tell whether it is a standalone code that happens to represent a whole character ("a"), or just the start of one. Base characters have the same code whether they mean a whole char or the start of a "stack" (a substring representing a whole char). But combining marks have 2 codes: one when combined, one when used alone, as in "in Portuguese, '~' is used to denote a nasal vowel".
(Hope I'm clear.)
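A minimal sketch of that idea (Python, illustration only): `unicodedata.combining()` returns 0 for base characters and a non-zero combining class for marks, so a base code point looks the same standalone or at the start of a stack, and only looking ahead at the next code point reveals the boundary. The `clusters` helper below is a hypothetical, deliberately simplified segmenter that ignores the full set of rules (ZWJ, Hangul jamo, etc.):

```python
import unicodedata

# The base 'a' itself carries no hint that a mark may follow:
print(unicodedata.combining("\u0061"))  # 0: looks like a standalone char
print(unicodedata.combining("\u0302"))  # 230: a combining mark

# The standalone tilde is a *different* code point from the combining one:
print(unicodedata.combining("~"))       # 0: U+007E, spacing tilde
print(unicodedata.combining("\u0303"))  # 230: U+0303, combining tilde

# Naive "grapheme" segmentation: start a new cluster at each code point
# whose combining class is 0 (a sketch, not the full Unicode algorithm).
def clusters(text):
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch  # attach the mark to the previous base char
        else:
            out.append(ch)
    return out

print(clusters("\u0061\u0302\u006d\u0065"))  # ['â', 'm', 'e'] (decomposed 'â')
```

Note that without the lookahead in the loop there is no way to know, at the 'a', whether the cluster is finished.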

The whole set of issues with UCS (the charset), beyond Unicode's other ones, is imo:
1. Actual characters are represented by an arbitrary number of codes.
2. The same character can be represented by different sequences of codes...
3. ...including sequences of the same length, but in a different order ;-)
The first issue is actually good: it would be stupid, and indeed impossible, to try to give a code to every possible combination. Also, the present scheme allows _creating_ characters for our own use that will be rendered correctly (yes!). (But I would kill any designer colleague who allowed points 2 and 3. ;-)
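Points 2 and 3 can be demonstrated in a few lines (Python again, just an illustration): the same character admits precomposed and decomposed spellings, and two marks with different combining classes may appear in either order while remaining canonically equivalent under normalization:

```python
import unicodedata

# Issue 2: same character, different code sequences
assert "\u00e2" != "a\u0302"  # precomposed vs decomposed 'â'
assert unicodedata.normalize("NFC", "a\u0302") == "\u00e2"

# Issue 3: same marks, same length, different order
s1 = "a\u0323\u0302"  # 'a' + dot below (class 220) + circumflex (class 230)
s2 = "a\u0302\u0323"  # same two marks, swapped
assert s1 != s2
# NFD canonical reordering sorts marks by combining class, so both
# spellings normalize to the same sequence:
assert unicodedata.normalize("NFD", s1) == unicodedata.normalize("NFD", s2)
print("all equivalences hold")
```

This is why any code comparing or searching text has to normalize first, or it will miss matches that look identical on screen.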

Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com


