why a part of D community do not want go to D2 ?

spir denis.spir at gmail.com
Thu Nov 11 12:34:38 PST 2010


On Thu, 11 Nov 2010 09:40:05 -0800
Andrei Alexandrescu <SeeWebsiteForEmail at erdani.org> wrote:

> > string substring(string s, size_t beg, size_t end) // "logical slice" -
> > from code point number beg to code point number end  
> 
> That's not implemented and I don't think it would be useful. Usually 
> when I want a substring, the calculations up to that point indicate the 
> code _unit_ I'm at.

Yes, but a code unit does not represent a character. At best (in UTF-32, where one code unit is one code point) it represents what Unicode calls an "abstract character".

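import std.stdio  : writeln;
import std.string : indexOf;

void main() {
    // 'a' + U+0302 (combining circumflex accent) + 'm' + 'e'
    dstring s = "\u0061\u0302\u006d\u0065"d;
    writeln(s);                     // displays "âme"
    assert(s[0..1] == "a");         // the slice keeps only the base letter
    assert(s.indexOf("â") == -1);   // this "â" is a single code: no match
}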

A "user-perceived character" (also, strangely, called a "grapheme" in the Unicode docs) can be represented by an arbitrary number of codes, i.e. code points, which in a dstring are also the code _units_ (up to 8 in their test data, but there is no actual limit). What a single code represents is, say, a "scripting mark"; in the "â" above there are 2 of them. For legacy reasons, UCS also includes "precomposed characters", so "â" can indeed also be represented by a single code. But the form above is valid; it is even arguably the base form for "â" (and most composite characters cannot be represented by a single code).
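
To make the three levels concrete, here is a minimal sketch that counts code units, code points and user-perceived characters for the same text. It assumes a Phobos release whose std.uni provides byGrapheme (as far as I know this was added after this message was written); graphemeCount is a hypothetical helper name, not an existing library function.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni   : byGrapheme;

/// Hypothetical helper: number of user-perceived characters in s.
/// Assumes std.uni.byGrapheme is available in the installed Phobos.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string s = "a\u0302me";     // same text as above, stored as UTF-8

    writeln(s.length);          // 5: UTF-8 code units (U+0302 takes 2)
    writeln(s.walkLength);      // 4: code points
    writeln(graphemeCount(s));  // 3: characters as a reader sees them
}
```

Only the last number answers the question "how many characters are there?" in the everyday sense.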

In my view, there is a missing level of abstraction in common Unicode string processing libs and types. How to count the "â"s in a text? How to find one? Above, indexOf fails because my editor uses the precomposed code, while the source (here a literal) uses the other form.
To produce meaningful results, and to use simple routines like index, find, count... the way we used to with single-length character sets, there should be a grouping phase on top of decoding; we would then process arrays of "stacks" representing characters, not arrays of codes. To search, it is also necessary to have all characters in normalised form, so that both "â" would match: another phase.
Unicode provides algorithms for both phases of constructing string representations, but everyone seems to ignore the issue... s[0..1] would then return the first character, not the first code of the "stack" representing the first character.
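
The two phases can be sketched roughly as follows, again assuming a Phobos whose std.uni offers normalize, byGrapheme and byCodePoint. The names indexOfNormalised and graphemeSlice are hypothetical helpers made up for illustration, and the choice of NFC as the common form is an assumption; any single normalisation form applied to both sides would do.

```d
import std.array  : array;
import std.conv   : text;
import std.stdio  : writeln;
import std.string : indexOf;
import std.uni    : byCodePoint, byGrapheme, NFC, normalize;

/// Hypothetical helper: search after bringing both strings to NFC.
/// The result indexes code units of the *normalised* haystack.
ptrdiff_t indexOfNormalised(string haystack, string needle)
{
    return normalize!NFC(haystack).indexOf(normalize!NFC(needle));
}

/// Hypothetical helper: "logical slice" from character beg to character end.
string graphemeSlice(string s, size_t beg, size_t end)
{
    auto stacks = s.byGrapheme.array;            // grouping phase
    return stacks[beg .. end].byCodePoint.text;  // flatten back to a string
}

void main()
{
    string s = "a\u0302me";                       // decomposed "âme"

    assert(s.indexOf("\u00E2") == -1);            // raw search misses "â"
    assert(indexOfNormalised(s, "\u00E2") == 0);  // normalised search finds it
    assert(graphemeSlice(s, 0, 1) == "a\u0302");  // whole first character
    writeln(graphemeSlice(s, 1, 3));              // prints "me"
}
```

This regroups the whole string on every call, so it only illustrates the idea; a real string type would do the grouping and normalisation once, at construction.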


Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com


