why a part of D community do not want go to D2 ?

spir denis.spir at gmail.com
Fri Nov 12 01:24:33 PST 2010


On Thu, 11 Nov 2010 18:09:44 -0800
Walter Bright <newshound2 at digitalmars.com> wrote:

> Daniel Gibson wrote:
> > If I'm not mistaken, those functions don't handle these "graphemes", 
> 
> Probably not, since I don't know what the **** a grapheme is. I think the person 
> who thought unicode should be divided up into characters, bytes, code points, 
> code units, glyphs, combining characters, composite characters, and graphemes 
> (and whatever new complexity they dream up this week) should be taken out and 
> screened by the TSA.

I think about the same.

> Unicode is a simple idea, made absurdly complex.

Yes, for sure. The added, non-functional complexity was introduced (people say) to keep some kind of superficial compatibility with legacy charsets, in the (hopeless) hope that this would boost initial adoption of UCS. Thus, we have characters like â (a with circumflex) which are represented in UCS (I'm talking of the higher level, not of encoding, which is what Unicode addresses):
* basically, with 2 code _points_, which is good, because this scheme allows economically representing all possible combinations,
* but also with a single "precombined" code point, to cope with legacy charsets of the latin-N series, which is bad.
In addition to that, characters represented by more than 2 code points suffer another issue, namely that the additional "combining" codes can be put in any order ;-) There is a normal order (and an algorithm to restore it), but it is not imposed.
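To make the two representations concrete, here is a small sketch (in Python, simply because its stdlib `unicodedata` module exposes the normalisation forms; the point itself is language-independent):

```python
import unicodedata

pre = "\u00E2"    # "â" as a single precombined code point (U+00E2)
dec = "a\u0302"   # "â" as base "a" + combining circumflex (U+0302)

# Two different code-point sequences, yet the same character:
assert pre != dec
# NFD (decomposition) maps the precombined form to the 2-code-point one,
# and NFC (composition) maps back:
assert unicodedata.normalize("NFD", pre) == dec
assert unicodedata.normalize("NFC", dec) == pre
```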

These supposed trade-offs helped nothing concretely, because legacy texts require decoding anyway; mapping to two Unicode code points instead of one is nothing for software, and needs to be done only once. For this initial "advantage", we suffer undue complication for the rest of the standard's lifetime.
The consequence in terms of computing is a huge loss:
	*** the mapping 1 character <--> 1 representation does not hold anymore ***
And it's even broken in several ways. Instead, we have
	1 character <--> n representations -- where n is unpredictable
I guess you can imagine the logical consequences...
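For instance (a Python sketch, since `unicodedata` is at hand; any language would show the same): the single character "ệ" (e with circumflex and dot below) has at least five distinct, canonically equivalent code-point sequences:

```python
import unicodedata

# Five different code-point sequences, all meaning the one character "ệ":
forms = [
    "\u1EC7",          # precombined ệ
    "e\u0302\u0323",   # e + combining circumflex + combining dot below
    "e\u0323\u0302",   # same marks in the other order
    "\u00EA\u0323",    # precombined ê + combining dot below
    "\u1EB9\u0302",    # precombined ẹ + combining circumflex
]
assert len(set(forms)) == 5                          # 5 distinct strings...
nfd = {unicodedata.normalize("NFD", f) for f in forms}
assert len(nfd) == 1                                 # ...one normal form
```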

> Furthermore, I think that Phobos handles Unicode in a straightforward manner. 
> All this other crazy stuff should be handled by some 3rd party 100Mb add-on, so 
> none of the rest of us have to suffer under it.

All right. Maybe one day I'll write a library or type that deals with "UText" at the high level (the user level) I'm trying to explain here. Then one can index, slice, search, count, replace, match... just like with plain ASCII. I once did it in Lua (unfinished, but it worked).

The key point to understand is that an array of code points (read: a dstring) is still not a logical string. That is why I introduce the notion of a "stack" (I ripped the word from a Unicode doc). Below, each c is a code:

 c c c c       physical sequence of codes
 a ^ m e       <--> sequence of "marks"

 c
 c c c         logical sequence of "stacks"
 â m e         <--> sequence of characters

One "stack" holds one place in the (horizontal) logical sequence, thus the name.
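A sketch of that grouping in Python (the function name `stacks` is mine; `unicodedata.combining` returns 0 for base code points and a non-zero combining class for marks):

```python
import unicodedata

def stacks(text):
    """Group a code-point sequence into "stacks": each base code point
    together with the combining marks that follow it."""
    result = []
    for ch in text:
        if result and unicodedata.combining(ch) != 0:
            result[-1] += ch       # mark: pile onto the current stack
        else:
            result.append(ch)      # base: open a new stack
    return result

# 4 code points, but 3 logical characters:
assert stacks("a\u0302me") == ["a\u0302", "m", "e"]
```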

This requires, for any source text or text slice (eg a substring to be searched):
* decoding, properly speaking --> array of codes in the source charset (eg latin-3)
* code mapping --> array of unicode codes ~= dstring
  (unnecessary if the source encoding is utfN, ascii, or latin-1, since the code maps are identical)
* grouping code points representing single chars --> array of "stacks" (mini-dstrings)
* normalising to decomposed form, called NFD (eg "â" is 2 codes)
* sorting the codes inside each stack
Actually, the last 3 steps can be done in one go (but it's not a big gain, if any).
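The whole pipeline can be sketched in a few lines of Python (an illustration only: `to_stacks` and the choice of iso8859-3 as the example source charset are mine; note that NFD normalisation already performs the canonical reordering of combining marks, so the last two steps collapse into one call):

```python
import unicodedata

def to_stacks(raw, charset="iso8859-3"):
    # steps 1-2: decode the source bytes into Unicode code points
    text = raw.decode(charset)
    # steps 4-5: decompose to NFD -- this also sorts the combining
    # marks inside each stack into canonical order
    text = unicodedata.normalize("NFD", text)
    # step 3: group each base code point with its trailing marks
    result = []
    for ch in text:
        if result and unicodedata.combining(ch) != 0:
            result[-1] += ch
        else:
            result.append(ch)
    return result

assert to_stacks("\u00E2me".encode("latin-1"), "latin-1") == ["a\u0302", "m", "e"]
```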

Then, we have restored the sensible one-to-one mapping 1 character <--> 1 representation (like in ASCII, latin-1, and all charsets that represent characters with single-length forms). Thus, we can safely perform any kind of process.
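For example (Python again, with the same grouping as above): at the code-point level a search for "a" wrongly matches inside a decomposed "â", and the length is not the number of characters; at the stack level both come out right:

```python
import unicodedata

text = unicodedata.normalize("NFD", "\u00E2me")   # "âme" -> a, U+0302, m, e

# raw code-point level: broken
assert "a" in text          # false positive: matches inside "â"
assert len(text) == 4       # 4 code points, yet only 3 characters

# stack level: sane
stacks = []
for ch in text:
    if stacks and unicodedata.combining(ch) != 0:
        stacks[-1] += ch
    else:
        stacks.append(ch)
assert "a" not in stacks    # no false match
assert len(stacks) == 3     # "â", "m", "e"
```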


Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com



More information about the Digitalmars-d mailing list