To Walter, about char[] initialization by FF

Sat Jul 29 20:45:28 PDT 2006

Andrew Fedoniouk wrote:
>>> Please don't think that UTF-8 is a panacea.
>> I don't. But it's way better than C/C++, because you can rely on it and 
>> your code will work with different languages out of the box.
> 
> Sorry but this is a bit optimistic.
> 
> D/samples/wc.exe out of the box will fail on Russian texts.
> It will fail on almost all Eastern texts, even if they
> are UTF-8 encoded. The meaning of 'word'
> is different there.

No matter. It is far easier to write a UTF-8 isword function than one 
that will work on every possible character encoding.
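
To illustrate, here is a minimal sketch in C of a word counter that works on UTF-8 input. The names utf8_decode, is_word_char and count_words are hypothetical, not taken from D/samples/wc. The word test is an assumption made for brevity: ASCII letters and digits count, and so does any non-ASCII code point outside a few space and punctuation ranges. A real isword would consult Unicode character properties, and the decoder does not reject overlong forms or encoded surrogates.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from s (n bytes available, n > 0).
   Stores the code point in *cp and returns the number of bytes consumed.
   Malformed input yields U+FFFD and consumes a single byte. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    assert(s != NULL && cp != NULL && n > 0);
    unsigned char b = s[0];
    size_t len;

    if (b < 0x80) { *cp = b; return 1; }                    /* ASCII */
    else if ((b & 0xE0) == 0xC0) { len = 2; *cp = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { len = 3; *cp = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { len = 4; *cp = b & 0x07; }
    else { *cp = 0xFFFD; return 1; }                        /* stray byte */

    if (len > n) { *cp = 0xFFFD; return 1; }                /* truncated */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        *cp = (*cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    return len;
}

/* Assumed, simplified word test; not a full Unicode classification. */
int is_word_char(uint32_t cp)
{
    if (cp < 0x80) return isalnum((int)cp) != 0;
    if (cp == 0x00A0 || cp == 0x3000 || cp == 0xFFFD) return 0;
    if (cp >= 0x2000 && cp <= 0x206F) return 0;  /* General Punctuation block */
    return 1;
}

/* Count runs of word characters in n bytes of UTF-8 text. */
size_t count_words(const char *text, size_t n)
{
    const unsigned char *s = (const unsigned char *)text;
    size_t words = 0, i = 0;
    int in_word = 0;

    while (i < n) {
        uint32_t cp;
        i += utf8_decode(s + i, n - i, &cp);
        int w = is_word_char(cp);
        if (w && !in_word) words++;
        in_word = w;
    }
    return words;
}
```

Only is_word_char would need to change to handle scripts where words are not separated by spaces; the decoding stays the same for every language.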

> Having the statement "string literals in D are only
> UTF-8 encoded" is not conceptually better than
> "string literals in C are encoded using the codepage defined
> by pragma(codepage,...)".

It is conceptually better because UTF-8 is completely defined and covers 
all human languages. Code pages are not completely defined, do not cover 
Asian languages, rely on non-standard compiler extensions, and in fact 
you cannot even rely on *ASCII* being supported by any particular C or 
C++ compiler. (It could be EBCDIC or any encoding invented by the 
compiler vendor.)

Code pages have another disastrous problem - it's impossible to mix 
languages. I have an academic text in front of me written in a mixture 
of German, French, and Latin. How's that going to work with code pages?

Code pages are obsolete, yesterday's technology, and I'm not sorry to see 
them go.

> The same, by the way, applies to most Java compilers:
> they accept texts in various single-byte encodings.
> (Why *I* am telling this to *you*? :-)

The compiler may accept it as an extension, but the Java *language* is 
defined to work with UTF-16 source text only. (Java calls the 16-bit 
code units 'char's, even though one character may need more than one of 
them.)
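
As a sketch of what "more than one char" means in UTF-16, the C function below encodes a single code point into 16-bit units. The name utf16_encode is hypothetical; the arithmetic is the standard surrogate-pair scheme, and the input is assumed to be a valid Unicode scalar value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16.
   Writes 1 or 2 code units to out and returns how many were written. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    /* Assumption: cp is a scalar value (in range, not itself a surrogate). */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000) {            /* Basic Multilingual Plane: one unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                 /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For example, U+0416 (a Cyrillic letter) fits in one unit, while U+1D11E (a musical symbol) becomes the pair 0xD834 0xDD1E, so code that treats each 16-bit unit as a whole character will split it.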