To Walter, about char[] initialization by FF
Walter Bright
newshound at digitalmars.com
Sat Jul 29 20:45:28 PDT 2006
Andrew Fedoniouk wrote:
>>> Please don't think that UTF-8 is a panacea.
>> I don't. But it's way better than C/C++, because you can rely on it and
>> your code will work with different languages out of the box.
>
> Sorry but this is a bit optimistic.
>
> D/samples/wc.exe out of the box will fail on Russian texts.
> It will fail on almost all Eastern texts, even if they
> are in UTF-8 encoding. The meaning of 'word'
> is different there.
No matter, it is far easier to write a UTF-8 isword function than one
that will work on all possible character encoding methods.
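To make that concrete, here is a rough sketch in Python of a word counter that relies only on the input being UTF-8. The names is_word_char and count_words are made up for illustration, and "word character" is assumed to mean any Unicode letter or digit. It is not what samples/wc does, and it does not settle the objection about scripts written without spaces: a run of CJK ideographs would still count as a single word here.

```python
def is_word_char(ch: str) -> bool:
    # Assumption: a "word character" is any Unicode letter or digit.
    return ch.isalnum()


def count_words(data: bytes) -> int:
    # The only encoding we have to handle is UTF-8; invalid input raises
    # UnicodeDecodeError instead of being silently misread.
    text = data.decode("utf-8")
    count = 0
    in_word = False
    for ch in text:
        if is_word_char(ch):
            if not in_word:
                # First character of a new run of word characters.
                count += 1
                in_word = True
        else:
            in_word = False
    return count


sample = "Привет, мир! Hello world".encode("utf-8")
print(count_words(sample))
```

The same loop handles Russian and English text without knowing in advance which one it is reading, because the encoding never changes.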
> Having the statement "string literals in D are only
> UTF-8 encoded" is not conceptually better than
> "string literals in C are encoded using the codepage defined
> by pragma(codepage,...)".
It is conceptually better because UTF-8 is completely defined and covers
all human languages. Codepages are not completely defined, do not cover
Asian languages, rely on non-standard compiler extensions, and in fact
you cannot even rely on *ASCII* being supported by any particular C or
C++ compiler. (It could be EBCDIC or any encoding invented by the
compiler vendor.)
Code pages have another disastrous problem - it's impossible to mix
languages. I have an academic text in front of me written in a mixture
of German, French, and Latin. How's that going to work with code pages?
Code pages are obsolete, yesterday's technology, and I'm not sorry to see
them go.
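A small Python sketch of the mixing problem, under stated assumptions: the sample string is a hypothetical German/Russian mix chosen for illustration (not the academic text mentioned above), the helper name encodable is made up, and cp1252 and cp1251 stand in for "a Western code page" and "a Cyrillic code page".

```python
def encodable(text: str, encoding: str) -> bool:
    # True if every character of text has a representation in encoding.
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample: a German word and two Russian words in one string.
mixed = "Straße и улица"

results = {enc: encodable(mixed, enc) for enc in ("cp1252", "cp1251", "utf-8")}
print(results)
```

Neither single-byte code page can hold the whole string, since each one lacks the other's letters, while UTF-8 encodes it without any per-language switch.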
> The same, by the way, applies to most Java compilers:
> they accept texts in various single-byte encodings.
> (Why *I* am telling this to *you*? :-)
The compiler may accept it as an extension, but the Java *language* is
defined to work with UTF-16 source text only. (Java calls the 16-bit code
units 'char's, even though a single character may need more than one char.)
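The char versus character distinction can be sketched by counting UTF-16 code units. This is written in Python rather than Java purely for illustration, and the helper name utf16_code_units is made up; each code unit corresponds to what Java would call one char.

```python
def utf16_code_units(text: str) -> int:
    # Encode without a byte order mark; every UTF-16 code unit is 2 bytes.
    return len(text.encode("utf-16-le")) // 2


bmp = "a"              # fits in a single 16-bit code unit
clef = "\U0001D11E"    # MUSICAL SYMBOL G CLEF, outside the 16-bit range

print(utf16_code_units(bmp))
print(utf16_code_units(clef))
```

The second character is one code point but takes two code units (a surrogate pair), which is the multi-char case.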