To Walter, about char[] initialization by FF

kris foo at bar.com
Sat Jul 29 23:34:33 PDT 2006


Is there a doctor in the house?



Andrew Fedoniouk wrote:
> "Walter Bright" <newshound at digitalmars.com> wrote in message 
> news:eah9st$2v1o$1 at digitaldaemon.com...
> 
>>Andrew Fedoniouk wrote:
>>
>>>>>Please don't think that UTF-8 is a panacea.
>>>>
>>>>I don't. But it's way better than C/C++, because you can rely on it and 
>>>>your code will work with different languages out of the box.
>>>
>>>Sorry but this is a bit optimistic.
>>>
>>>D/samples/wc.exe out of the box will fail on Russian texts.
>>>It will fail on almost all Eastern texts, even if they
>>>are in UTF-8 encoding. The meaning of 'word'
>>>is different there.
>>
>>No matter, it is far easier to write a UTF-8 isword function than one that 
>>will work on all possible character encoding methods.
>>
> 
> 
> Sorry, have you tried to write such a function (isword)?
> 
> (You need the whole set of character classification tables
> to accomplish this - UTF-8 will not help you.)
> 
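To make the classification-table point concrete, here is a minimal sketch in
Python (not the D wc sample itself). The helper names are hypothetical, and
treating Unicode general categories L* and N* as "word characters" is an
assumption chosen for illustration, not a complete definition of a word.

```python
import unicodedata

def is_word_char_ascii(ch):
    # ASCII-only test: knows nothing about letters outside A-Z, a-z, 0-9.
    return ("a" <= ch <= "z") or ("A" <= ch <= "Z") or ("0" <= ch <= "9")

def is_word_char_unicode(ch):
    # Uses the Unicode character database shipped with Python.
    # Assumption: general categories starting with L (letter) or N (number)
    # count as word characters.
    return unicodedata.category(ch)[0] in ("L", "N")

def count_words(text, is_word_char):
    # Count maximal runs of word characters.
    count = 0
    in_word = False
    for ch in text:
        if is_word_char(ch):
            if not in_word:
                count += 1
                in_word = True
        else:
            in_word = False
    return count

russian = "\u043f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440"  # "privet mir", two words
chinese = "\u4f60\u597d\u4e16\u754c"  # "ni hao shi jie", two words, no space between them

print(count_words(russian, is_word_char_ascii))
print(count_words(russian, is_word_char_unicode))
print(count_words(chinese, is_word_char_unicode))
```

The ASCII test finds no words at all in the Russian sample, the table-based
test finds both, and even the table-based test sees the Chinese sample as a
single run because nothing in the encoding marks the word boundary.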
> 
>>>The statement "string literals in D are only
>>>UTF-8 encoded" is not conceptually better than
>>>"string literals in C are encoded using the codepage defined
>>>by pragma(codepage,...)".
>>
>>It is conceptually better because UTF-8 is completely defined and covers 
>>all human languages. Codepages are not completely defined, do not cover 
>>asian languages, rely on non-standard compiler extensions, and in fact you 
>>cannot even rely on *ASCII* being supported by any particular C or C++ 
>>compiler. (It could be EBCDIC or any encoding invented by the compiler 
>>vendor.)
>>
>>Code pages have another disastrous problem - it's impossible to mix 
>>languages. I have an academic text in front of me written in a mixture of 
>>German, French, and Latin. How's that going to work with code pages?
> 
> 
> I am not saying that you should avoid using UTF-8 encoding.
> If you have a mix of, say, English, Russian and Chinese on some page,
> the only way to deliver it to the user is to use some (universal)
> Unicode transport encoding.
> But rendering it on the screen is a completely different
> story.
> 
> Consider this: attribute names in HTML (SGML) are represented by
> ASCII codes only - you don't need UTF-8 processing to deal with them at all.
> Generally speaking, you also cannot use UTF-8 for storing attribute values.
> Attribute values participate in CSS selector analysis, and some selectors
> require char-by-char access (char as a code point, not a D char).
> 
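The difference between indexing UTF-8 bytes and indexing code points can be
sketched as follows, in Python rather than D. The sample string and the
code_points helper are hypothetical, chosen only to show that byte offsets
and code point offsets diverge once a non-ASCII character appears.

```python
# "naive cafe" with i-diaeresis (U+00EF) and e-acute (U+00E9), precomposed.
text = "na\u00efve caf\u00e9"
encoded = text.encode("utf-8")

def code_points(data):
    # Decode the whole UTF-8 byte sequence so each element is one code point.
    return list(data.decode("utf-8"))

# 10 code points, but 12 bytes: the two accented letters take 2 bytes each.
print(len(text), len(encoded))

# Byte index 2 is only the lead byte (0xC3) of the 2-byte sequence for U+00EF.
print(hex(encoded[2]))

# Code point index 2 is the complete character.
print(code_points(encoded)[2] == "\u00ef")
```

Anything that needs the Nth character, such as the selector matching
mentioned above, has to decode first or walk the bytes from the start.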
> There are only a few academic cases where you can use UTF-8 literally
> (as a sequence of UTF-8 bytes) *at runtime*. D source code compilation
> is one such case - you can store the content of string literals in UTF-8
> form - you don't need to analyze their content.
> 
> 
>>Code pages are obsolete yesterday's technology, and I'm not sorry to see 
>>them go.
> 
> 
> Sorry, but the US is the first country that will ask "what a ...?" when
> asked to always send four bytes instead of one.
> 
> UTF-8 encoding is "traffic friendly" only for 1/10 of the population
> on Earth (English-speaking people).
> Others just don't want to pay that price.
> 
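The size trade-off can be checked directly. This is a small sketch in Python;
the sample words and the encoded_size helper are hypothetical, and the legacy
encodings picked (KOI8-R, GB2312) are just examples of single-language code
pages, not a claim about what any particular system uses.

```python
def encoded_size(text, encoding):
    # Number of bytes needed to store or send the text in the given encoding.
    return len(text.encode(encoding))

english = "hello"
russian = "\u043f\u0440\u0438\u0432\u0435\u0442"  # "privet", 6 Cyrillic letters
chinese = "\u4f60\u597d"                          # "ni hao", 2 Han characters

for name, text, legacy in [
    ("English", english, "ascii"),
    ("Russian", russian, "koi8_r"),
    ("Chinese", chinese, "gb2312"),
]:
    print(
        name,
        "legacy:", encoded_size(text, legacy),
        "utf-8:", encoded_size(text, "utf-8"),
        "utf-16:", encoded_size(text, "utf-16-le"),
        "utf-32:", encoded_size(text, "utf-32-le"),
    )
```

For these samples UTF-8 costs nothing extra for English, doubles the Russian
text relative to KOI8-R, and takes three bytes per Han character where GB2312
and UTF-16 take two. UTF-32 is the "four bytes instead of one" case.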
> Whether you are sorry or not is irrelevant to the existence of code pages.
> They will be here forever, until all of us speak Esperanto.
> 
> (Currently I am doing right-to-left support in the engine - Arabic and
> Hebrew - trust me, I probably have more things to say "sorry" about.)
> 
> 
>>>By the way, the same applies to most Java compilers:
>>>they accept texts in various single-byte encodings.
>>>(Why am *I* telling this to *you*? :-)
>>
>>The compiler may accept it as an extension, but the Java *language* is 
>>defined to work with UTF-16 source text only. (Java calls them 'char's, 
>>even though there may be multi-char encodings.)
> 
> 
> Walter, where did you get that magic UTF-16?
> 
> Doc: http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html
> mentions that the input of a Java compiler is a sequence of Unicode
> characters (code points). How this input sequence is encoded - UTF-8,
> UTF-16, KOI8-R - does not matter at all, and the spec is silent about it.
> A human is within his/her rights to choose the encoding his/her
> terminal/keyboard supports.
> 
> Andrew Fedoniouk.
> http://terrainformatica.com
> 
> 


