Why does part of the D community not want to go to D2?

Daniel Gibson metalcaedes at gmail.com
Thu Nov 11 13:40:47 PST 2010


spir wrote:
> On Thu, 11 Nov 2010 09:40:05 -0800
> Andrei Alexandrescu <SeeWebsiteForEmail at erdani.org> wrote:
> 
>>> string substring(string s, size_t beg, size_t end) // "logical slice" -
>>> from code point number beg to code point number end  
>> That's not implemented and I don't think it would be useful. Usually 
>> when I want a substring, the calculations up to that point indicate the 
>> code _unit_ I'm at.
> 
> Yes, but a code unit does not represent a character; it represents a Unicode "abstract character".
> 
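> import std.stdio, std.string;
> 
> void main() {
>     dstring s = "\u0061\u0302\u006d\u0065"d; // 'a', combining circumflex, 'm', 'e'
>     writeln(s);     // "âme"
>     assert(s[0..1] == "a");         // the first element is a bare 'a'
>     assert(s.indexOf("â") == -1);   // the precomposed "â" is not found in s
> }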
> 
> A "user-perceived character" (also strangely called "grapheme" in the Unicode docs) can be represented by an arbitrary number of code _units_ (up to 8 in their test data, but there is no actual limit). What a code unit represents is, say, a "scripting mark". In "â", there are 2 of them. For legacy reasons, UCS also includes "precomposed characters", so "â" can indeed also be represented by a single code. But the above form is valid; it's even arguably the base form for "â" (and most composite chars cannot be represented by a single code).
> 

OMG, this is worse than I thought O_O
I thought "ok, for UTF-8 one code unit is one byte, and one 'real', visible 
character is called a code point and consists of 1-4 code units" - but having 
"user-perceived characters" that consist of multiple code points is sick.
Unicode has a way to tell if a sequence of code units (bytes) belongs together 
or not, so identifying code points isn't too hard.
But is there a way to identify "graphemes", other than a list of rules like "a 
sequence of the two code points <foo> and <bar> makes up one 'grapheme' <foobar>"?
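
On the question of identifying graphemes: Unicode Standard Annex #29 defines 
grapheme cluster boundaries through rules over per-code-point properties, not 
through an enumerated list of sequences. As a minimal sketch, assuming a Phobos 
release that provides std.uni.byGrapheme (it did not exist when this thread was 
written), the three levels can be counted as below. The helper names codeUnits, 
codePoints and graphemes are made up for illustration.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

// Hypothetical helpers, named for illustration only.

// Number of UTF-8 code units (bytes) in s.
size_t codeUnits(string s) { return s.length; }

// Number of code points; walkLength decodes the UTF-8 sequence as it iterates.
size_t codePoints(string s) { return s.walkLength; }

// Number of grapheme clusters ("user-perceived characters").
size_t graphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    // 'a' + U+0302 COMBINING CIRCUMFLEX ACCENT + 'm' + 'e', as in spir's example
    string s = "a\u0302me";

    writeln(codeUnits(s));   // 5: U+0302 takes two bytes in UTF-8
    writeln(codePoints(s));  // 4
    writeln(graphemes(s));   // 3: 'a' and the accent form one cluster
}
```

Slicing at a grapheme boundary would then mean walking byGrapheme and summing 
the lengths of the clusters, not indexing by code unit or code point.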


