To Walter, about char[] initialization by FF

Walter Bright newshound at digitalmars.com
Sat Jul 29 19:16:12 PDT 2006


Andrew Fedoniouk wrote:
>> In D, char[] is a UTF-8 sequence. It's well defined, and therefore 
>> portable. It supports every human language.
> 
> What does "UTF-8 ... supports ... every human language" mean?
> 
> It allows encoding them, yes.

We both know what UTF-8 is and does.

> But runtime support means quite a different thing,
> and I am pretty sure you know what I mean here.

I'm sure there are bugs in the library UTF-8 support. But they are bugs, 
are fixable, and not fundamental problems. As you find any, please post 
them to bugzilla.


> In Java, as we know, UTF-8 is used to represent
> string literals inside .class files, but once loaded they
> become vectors of Java chars: Unicode BMP code points
> (ushort). This serves almost all character cases.
> There are exceptions: it is not trivial to process
> single-byte encoded data efficiently there; you need
> to rewrite the whole set of functions to handle this.
> 
> Please don't think that UTF-8 is a panacea.

I don't. But it's way better than C/C++, because you can rely on it and 
your code will work with different languages out of the box.


> For example, in China they use the GB2312 encoding
> to represent the almost 7000 Chinese characters in active use now.
> This is strictly a 2-byte encoding, and
> don't even try to ask them to switch to UTF-8
> (3 bytes as a rule). This will increase their internet
> traffic by 1/3.
> 
> The same applies to Europe. E.g., in Russia
> there are 32 characters in the alphabet, and a
> one-byte encoding is quite enough for
> English/Russian text. It makes no sense
> to send two bytes over the wire (Russian in UTF-8)
> instead of one for sites like lib.ru.
> 
> Sorry, but people there pay for each byte
> downloaded from the Internet. This applies
> to almost all countries except the US and Canada.

If one needs to use a custom encoding, use ubyte[] or ushort[]. If one 
needs to be universal, use char[], wchar[], or dchar[]. And for what 
it's worth, D isn't a web transmission protocol. I don't see any problem 
with a D program converting its input from Format X to UTF for internal 
processing, and then converting its output back to X or Y or Z.
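
A rough sketch of that round trip, assuming a current D compiler with Phobos' std.encoding (the API shown here is today's, not necessarily what the library offered when this was written). Latin-1 stands in for "Format X", and the helper names fromLatin1 and toLatin1 are made up for illustration:

```d
import std.encoding : Latin1String, transcode;
import std.uni : toUpper;

// Hypothetical helper: decode Latin-1 bytes ("Format X") into UTF-8.
string fromLatin1(immutable(ubyte)[] raw)
{
    string utf8;
    transcode(cast(Latin1String) raw, utf8);
    return utf8;
}

// Hypothetical helper: encode UTF-8 text back into Latin-1 bytes.
immutable(ubyte)[] toLatin1(string utf8)
{
    Latin1String encoded;
    transcode(utf8, encoded);
    return cast(immutable(ubyte)[]) encoded;
}

void main()
{
    // Input in the custom encoding travels as ubyte[]: "café" in Latin-1.
    immutable(ubyte)[] input = [0x63, 0x61, 0x66, 0xE9];

    // Convert to UTF-8 for internal processing.
    string text = fromLatin1(input);
    assert(text == "caf\u00E9");
    assert(text.length == 5); // 'é' occupies two UTF-8 code units

    // Process as ordinary Unicode text.
    string processed = text.toUpper();

    // Convert back to Format X for output: "CAFÉ" in Latin-1.
    immutable(ubyte)[] expected = [0x43, 0x41, 0x46, 0xC9];
    assert(toLatin1(processed) == expected);
}
```

The one-byte-per-character form exists only at the program's edges; everything in between works on char[]. A real program would also have to decide what to do with characters that Format X cannot represent.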


