why a part of D community do not want go to D2 ?

spir denis.spir at gmail.com
Fri Nov 12 00:50:11 PST 2010


On Thu, 11 Nov 2010 23:17:05 +0000 (UTC)
retard <re at tard.com.invalid> wrote:

> Thu, 11 Nov 2010 23:59:36 +0100, spir wrote:
> 
> > (3) most texts we deal with
> > today only hold common characters that have a single-code
> > representation. So that everybody plays with strings as if (1 code <-->
> > 1 char).
> 
> That might be true for many americans. But even then the single byte 
> can't express all characters you need in everyday communication. There 
> are countless people with é or ë or ü in their last name. ” and “ are 
> probably not among the first 128-256 codes. Using e instead of ë or é 
> might work to some extent, but ü and u are pronounced differently. Some 
> use ue instead.

I meant _codes_ (code points). Not code _units_, and even less bytes.

The character <I with dot above and dot below> (should you ever want to use it ;-) needs 2 or 3 code _points_ to be represented in memory or in storage. Try:
	writeln (""); // --> Ị̇
If your output system is sufficiently capable, you get an I with dot above and dot below! (I recommend the DejaVu font series.) And, as you see, the type dstring is used, meaning each element is a dchar holding a whole code point. Right? Yet it is a single character requiring 3 codes.
Even more troubling: if I choose a lowercase 'i' instead, then since <i with dot below> exists as a precomposed code, I have the choice between 2 or 3 codes.
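
A minimal sketch of that counting. The escapes are an assumed spelling of the sequence (U+0323 COMBINING DOT BELOW, U+0307 COMBINING DOT ABOVE, U+1ECB the precomposed lowercase <i with dot below>), the variable names are purely illustrative, and byGrapheme assumes a Phobos recent enough to provide it in std.uni:

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

// Assumed spellings; the names below are illustrative only.
dstring upper  = "I\u0323\u0307"d; // I + dot below + dot above
dstring lower3 = "i\u0323\u0307"d; // i + dot below + dot above
dstring lower2 = "\u1ECB\u0307"d;  // precomposed <i with dot below> + dot above

void main()
{
    // .length on a dstring counts dchars, i.e. code points.
    writeln(upper.length);   // 3
    writeln(lower3.length);  // 3
    writeln(lower2.length);  // 2

    // Same character for a reader, yet the arrays differ element by element.
    writeln(lower3 == lower2); // false

    // Grouping codes into user-perceived characters gives a single one.
    writeln(upper.byGrapheme.walkLength); // 1
}
```

So even with dstring, .length counts codes, not characters, and == compares the codes one by one, not what the reader sees.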

An "abstract character", as introduced by UCS and represented by a code, is *not* what we think of as a "character". It is an abstract "mark", such as the 'I', the combining dot above, the combining dot below, all inside "Ị̇".
Also, it's important to realise that there is no formal definition of "character", and even less a universal one. A character is whatever the users of a writing system consider as such.
I know, UCS / Unicode terminology is misleading. Instead of helping, it increases confusion.

What you are referring to is a lower-level issue, namely the encoding of the code points themselves (here, 3) into code units, and then into bytes, in a concrete form (say, in a file). Depending on the encoding (I consider only UTF-8/16/32 here), there may be 1 to 4 (UTF-8), 1 or 2 (UTF-16), or exactly 1 (UTF-32) code units per code point.
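
The same 3 code points can be counted in code units under each encoding. This is only a sketch: the escapes are again an assumed spelling of the character, the variable names are illustrative, and U+1D11E (MUSICAL SYMBOL G CLEF) is an arbitrary example of a code point outside the basic plane:

```d
import std.stdio : writeln;

// One text, three encodings. In D, .length counts code units:
// char (UTF-8), wchar (UTF-16), dchar (UTF-32).
string  utf8  = "I\u0323\u0307";
wstring utf16 = "I\u0323\u0307"w;
dstring utf32 = "I\u0323\u0307"d;

void main()
{
    writeln(utf8.length);  // 5 code units (1 + 2 + 2)
    writeln(utf16.length); // 3 code units
    writeln(utf32.length); // 3 code units, one per code point

    // Size in bytes = number of code units * size of one unit.
    writeln(utf16.length * wchar.sizeof); // 6
    writeln(utf32.length * dchar.sizeof); // 12

    // A single code point beyond U+FFFF shows the upper bounds.
    writeln("\U0001D11E".length);  // 4 in UTF-8
    writeln("\U0001D11E"w.length); // 2 in UTF-16 (a surrogate pair)
    writeln("\U0001D11E"d.length); // 1 in UTF-32
}
```

None of these counts says anything about the number of characters: that question sits one level above code points.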


Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com


