To Walter, about char[] initialization by FF

Andrew Fedoniouk news at terrainformatica.com
Sat Jul 29 23:02:54 PDT 2006


"Walter Bright" <newshound at digitalmars.com> wrote in message 
news:eah9st$2v1o$1 at digitaldaemon.com...
> Andrew Fedoniouk wrote:
>>>> Please don't think that UTF-8 is a panacea.
>>> I don't. But it's way better than C/C++, because you can rely on it and 
>>> your code will work with different languages out of the box.
>>
>> Sorry but this is a bit optimistic.
>>
>> D/samples/wc.exe from the box will fail on russian texts.
>> It will fail on almost all Eastern texts. Even they
>> will be in UTF-8 encoding. Meaning of 'word'
>> is different there.
>
> No matter, it is far easier to write a UTF-8 isword function than one that 
> will work on all possible character encoding methods.
>

Sorry, but have you tried to write such a function (isword)?

(You need the whole set of character classification tables
to accomplish this; UTF-8 will not help you.)
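
To make the point concrete, here is a minimal sketch in Python (used only
for brevity; the function names are made up for illustration). Decoding
UTF-8 only yields code points. Deciding whether a code point belongs to a
word still needs a classification table, here the Unicode Character
Database as exposed by the standard unicodedata module.

```python
import unicodedata

def is_word_char_ascii(ch):
    # Naive test in the spirit of a C-style wc: ASCII letters only.
    return ("a" <= ch <= "z") or ("A" <= ch <= "Z")

def is_word_char_unicode(ch):
    # Looks up the general category in the Unicode Character Database;
    # categories starting with "L" are letters.
    return unicodedata.category(ch).startswith("L")

def count_words(text, is_word_char):
    # Counts maximal runs of characters accepted by is_word_char.
    count = 0
    in_word = False
    for ch in text:
        if is_word_char(ch):
            if not in_word:
                count += 1
            in_word = True
        else:
            in_word = False
    return count

raw = "мир и война".encode("utf-8")   # Russian text as UTF-8 bytes
text = raw.decode("utf-8")            # decoding gives code points, nothing more

print(count_words(text, is_word_char_ascii))    # 0
print(count_words(text, is_word_char_unicode))  # 3
```

Even the table-driven version is not enough for scripts written without
spaces: it counts a run of Chinese characters as a single word, because
nothing in the encoding or in the letter category marks word boundaries.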

>
>> Having statement "string literals in D are only
>> UTF-8 encoded" is not conceptually better than
>> "string literals in C are encoded by using codepage defined
>> by pragma(codepage,...)".
>
> It is conceptually better because UTF-8 is completely defined and covers 
> all human languages. Codepages are not completely defined, do not cover 
> asian languages, rely on non-standard compiler extensions, and in fact you 
> cannot even rely on *ASCII* being supported by any particular C or C++ 
> compiler. (It could be EBCDIC or any encoding invented by the compiler 
> vendor.)
>
> Code pages have another disastrous problem - it's impossible to mix 
> languages. I have an academic text in front of me written in a mixture of 
> german, french, and latin. How's that going to work with code pages?

I am not saying that you should avoid using UTF-8 encoding.
If you have a mix of, say, English, Russian and Chinese on some page,
the only way to deliver it to the user is to use some (universal)
Unicode transport encoding.
But rendering this thing on the screen is a completely different
story.

Consider this: attribute names in HTML (SGML) are represented by
ASCII codes only, so you don't need UTF-8 processing to deal with them
at all. Generally speaking, you also cannot use UTF-8 for storing
attribute values. Attribute values participate in CSS selector analysis,
and some selectors require char-by-char access (char as a code point,
not as a D char).
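
A small sketch of the difference, in Python for brevity (the helper name
and the attribute value are made up for illustration). Indexing the UTF-8
bytes directly lands in the middle of a character, and getting the n-th
code point means scanning from the start of the string.

```python
def code_point_at(utf8, index):
    # Returns the index-th code point of a UTF-8 byte string by scanning
    # from the start; continuation bytes look like 0b10xxxxxx and are skipped.
    seen = -1
    start = None
    for pos, byte in enumerate(utf8):
        if byte & 0xC0 != 0x80:       # lead byte or ASCII: a new code point begins
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                return utf8[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(index)
    return utf8[start:].decode("utf-8")

value = "цвет".encode("utf-8")   # hypothetical attribute value, 4 code points

print(len(value))                # 8
print(value[1:2])                # b'\x86'
print(code_point_at(value, 1))   # в
```

Storing the value as an array of code points instead makes that access a
plain index operation, at the cost of more memory per character.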

There are only a few academic cases where you can use UTF-8 literally
(as a sequence of UTF-8 bytes) *at runtime*. D source code compilation
is one such case: you can store the content of string literals in UTF-8
form, because you don't need to analyze their content.

>
> Code pages are obsolete yesterday's technology, and I'm not sorry to see 
> them go.

Sorry, but the US is the first country that will ask "what a ...?" when
told to always send four bytes instead of one.

UTF-8 encoding is "traffic friendly" only for 1/10 of the population
on Earth (English-speaking people).
Others just don't want to pay that price.
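
A rough way to see the price, sketched in Python (the sample strings and
the choice of legacy code pages are mine, picked only for illustration).
The exact factor depends on the script: nothing extra for ASCII, two bytes
per Cyrillic letter, three per common Chinese character.

```python
# Each entry: language -> (sample text, a legacy encoding commonly used for it)
samples = {
    "English": ("hello", "ascii"),
    "Russian": ("привет", "koi8_r"),
    "Chinese": ("你好", "gbk"),
}

sizes = {}
for name, (text, legacy) in samples.items():
    # Byte counts on the wire: legacy code page vs UTF-8.
    sizes[name] = (len(text.encode(legacy)), len(text.encode("utf-8")))
    print(name, sizes[name])

# English (5, 5)
# Russian (6, 12)
# Chinese (4, 6)
```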

Whether you are sorry or not is irrelevant to the existence of code pages.
They will be around forever, until all of us speak Esperanto.

(Currently I am doing right-to-left support in the engine, Arabic and
Hebrew. Trust me, I probably have more things to say "sorry" about.)

>
>> Same by the way applied to most of Java compilers
>> they accepts texts in various singlebyte encodings.
>> (Why *I* am telling this to *you*? :-)
>
> The compiler may accept it as an extension, but the Java *language* is 
> defined to work with UTF-16 source text only. (Java calls them 'char's, 
> even though there may be multi-char encodings.)

Walter, where did you get that magic UTF-16?

The doc: http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html
says that the input of the Java compiler is a sequence of Unicode
characters (code points). How this input sequence is encoded (UTF-8,
UTF-16, KOI8-R) does not matter at all, and the spec is silent about it:
a human is within his/her rights to choose whatever encoding his/her
terminal/keyboard supports.
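
A quick illustration, in Python for brevity (the source line is made up).
The same text stored in three encodings gives three different byte
streams, but after decoding the compiler's lexer would see one and the
same sequence of code points. As far as I know this is what javac's
-encoding option is for: it only tells the compiler how to decode the file.

```python
source = 'String имя = "значение";'   # made-up line of Java-like source text

encodings = ["utf-8", "utf-16", "koi8_r"]
encoded = {enc: source.encode(enc) for enc in encodings}

# The byte streams differ from one encoding to another...
distinct_streams = len(set(encoded.values()))
print(distinct_streams)               # 3

# ...but each decodes to the same sequence of code points.
decoded = {encoded[enc].decode(enc) for enc in encodings}
print(decoded == {source})            # True
```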

Andrew Fedoniouk.
http://terrainformatica.com




