why a part of D community do not want go to D2 ?
Daniel Gibson
metalcaedes at gmail.com
Thu Nov 11 13:40:47 PST 2010
spir wrote:
> On Thu, 11 Nov 2010 09:40:05 -0800
> Andrei Alexandrescu <SeeWebsiteForEmail at erdani.org> wrote:
>
>>> string substring(string s, size_t beg, size_t end) // "logical slice" -
>>> from code point number beg to code point number end
>> That's not implemented and I don't think it would be useful. Usually
>> when I want a substring, the calculations up to that point indicate the
>> code _unit_ I'm at.
>
> Yes, but a code unit does not represent a character; it represents a Unicode "abstract character".
>
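> import std.stdio;  // writeln
> import std.string; // indexOf
>
> void main() {
>     dstring s = "\u0061\u0302\u006d\u0065"d; // 'a' + combining circumflex (U+0302) + "me"
>     writeln(s);                    // displays as "âme"
>     assert(s[0..1] == "a");        // the first element is a bare 'a', not "â"
>     assert(s.indexOf("â") == -1);  // the precomposed "â" (U+00E2) is not found
> }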
>
> A "user-perceived character" (also strangely called "grapheme" in Unicode docs) can be represented by an arbitrary number of code _units_ (up to 8 in their test data, but there is no actual limit). What a code unit represents is, say, a "scripting mark". In "â", there are 2 of them. For legacy reasons, UCS also includes "precombined characters", so that "â" can also be represented by a single code, indeed. But the above form is valid, it's even arguably the base form for "â" (and most composite chars cannot be represented by a single code).
>
OMG, this is worse than I thought O_O
I thought "OK, for UTF-8 one code unit is one byte, and one 'real', visible
character is called a code point and consists of 1-4 code units" - but having
"user-perceived characters" that consist of multiple code points is sick.
UTF-8 has a way to tell whether a sequence of code units (bytes) belongs
together (lead bytes and continuation bytes have distinct bit patterns), so
identifying code points isn't too hard.
But is there a way to identify "graphemes", other than a list of rules like "a
sequence of the two code points <foo> and <bar> makes up one 'grapheme' <foobar>"?
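
To make the three levels concrete, here is a minimal sketch in D, not a
definitive answer. It assumes a Phobos release whose std.uni provides
byGrapheme, which segments text following the grapheme cluster boundary rules
of Unicode Standard Annex #29 (so the rules are indeed a list, but a
standardized one). The names countCodePoints and countGraphemes are made up for
this illustration and are not library functions.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

// Count code points in UTF-8 without decoding: every byte that is NOT a
// continuation byte (bit pattern 10xxxxxx) starts a new code point.
// Assumes the input is valid UTF-8.
size_t countCodePoints(const(char)[] utf8)
{
    size_t n = 0;
    foreach (char c; utf8)          // iterates code units (bytes)
        if ((c & 0xC0) != 0x80)     // skip continuation bytes
            ++n;
    return n;
}

// Count user-perceived characters (grapheme clusters).
// Assumption: std.uni.byGrapheme is available in the Phobos version used.
size_t countGraphemes(const(char)[] utf8)
{
    return utf8.byGrapheme.walkLength;
}

void main()
{
    // 'a' + combining circumflex (U+0302) + 'm' + 'e', as in the quoted example
    string s = "\u0061\u0302\u006d\u0065";

    writeln("code units:  ", s.length);            // bytes of UTF-8
    writeln("code points: ", countCodePoints(s));
    writeln("graphemes:   ", countGraphemes(s));
}
```

For this input the combining circumflex takes two bytes in UTF-8, so the three
counts differ from each other: code units count storage, code points count
Unicode scalar values, and graphemes count what a reader would call characters.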