To Walter, about char[] initialization by FF
kris
foo at bar.com
Sat Jul 29 23:34:33 PDT 2006
Is there a doctor in the house?
Andrew Fedoniouk wrote:
> "Walter Bright" <newshound at digitalmars.com> wrote in message
> news:eah9st$2v1o$1 at digitaldaemon.com...
>
>>Andrew Fedoniouk wrote:
>>
>>>>>Please don't think that UTF-8 is a panacea.
>>>>
>>>>I don't. But it's way better than C/C++, because you can rely on it and
>>>>your code will work with different languages out of the box.
>>>
>>>Sorry but this is a bit optimistic.
>>>
>>>D/samples/wc.exe will fail on Russian texts out of the box.
>>>It will fail on almost all Eastern texts, even if they
>>>are UTF-8 encoded. The meaning of 'word'
>>>is different there.
>>
>>No matter, it is far easier to write a UTF-8 isword function than one that
>>will work on all possible character encoding methods.
>>
>
>
> Sorry, have you tried to write such a function (isword)?
>
> (You need the whole set of character classification tables
> to accomplish this - UTF-8 will not help you)
>
>
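For what it's worth, the point about classification tables can be sketched in a few lines. This is Python rather than D, purely as an illustration: is_word_char and count_words are made-up names, and treating letters, marks, decimal digits and connector punctuation as "word" characters is my assumption, not what wc does. Decoding the UTF-8 is the easy part; the classification comes from the Unicode character database that ships with the runtime.

```python
import unicodedata


def is_word_char(ch: str) -> bool:
    """Assumed definition: letters (L*), marks (M*), decimal digits (Nd)
    and connector punctuation (Pc) count as word characters."""
    cat = unicodedata.category(ch)  # lookup in the Unicode character database
    return cat[0] in ("L", "M") or cat in ("Nd", "Pc")


def count_words(data: bytes) -> int:
    """Count maximal runs of word characters in UTF-8 encoded input."""
    text = data.decode("utf-8")  # UTF-8 only gets us to code points
    count = 0
    in_word = False
    for ch in text:
        if is_word_char(ch):
            if not in_word:
                count += 1
                in_word = True
        else:
            in_word = False
    return count


if __name__ == "__main__":
    for sample in ("hello, world", "привет мир", "你好世界"):
        print(sample, count_words(sample.encode("utf-8")))
```

Note that the Chinese sample comes out as a single "word", since it has no spaces. Even with the tables, word segmentation for such texts needs more than per-character classification.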
>>>Having the statement "string literals in D are only
>>>UTF-8 encoded" is not conceptually better than
>>>"string literals in C are encoded using the codepage defined
>>>by pragma(codepage,...)".
>>
>>It is conceptually better because UTF-8 is completely defined and covers
>>all human languages. Codepages are not completely defined, do not cover
>>Asian languages, rely on non-standard compiler extensions, and in fact you
>>cannot even rely on *ASCII* being supported by any particular C or C++
>>compiler. (It could be EBCDIC or any encoding invented by the compiler
>>vendor.)
>>
>>Code pages have another disastrous problem - it's impossible to mix
>>languages. I have an academic text in front of me written in a mixture of
>>German, French, and Latin. How's that going to work with code pages?
>
>
> I am not saying that you should avoid using UTF-8 encoding.
> If you have a mix of, say, English, Russian and Chinese on some page,
> the only way to deliver this to the user is to use some (universal)
> Unicode transport encoding.
> But rendering this thing on the screen is a completely different
> story.
>
> Consider this: attribute names in HTML (SGML) are represented by
> ASCII codes only - you don't need UTF-8 processing to deal with them at all.
> Generally speaking, you also cannot use UTF-8 for storing attribute values.
> Attribute values participate in CSS selector analysis, and some selectors
> require char-by-char (char as a code point and not a D char) access.
>
> There are only a few academic cases where you can use UTF-8 literally
> (as a sequence of UTF-8 bytes) *at runtime*. D source code compilation
> is one such case - you can store the content of string literals in UTF-8
> form - you don't need to analyze their content.
>
>
>>Code pages are obsolete yesterday's technology, and I'm not sorry to see
>>them go.
>
>
> Sorry, but the US is the first country that will ask "what a ...?" when
> required to always send four bytes instead of one.
>
> UTF-8 encoding is "traffic friendly" only for 1/10 of the population
> on Earth (English-speaking people).
> Others just don't want to pay that price.
>
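The sizes being argued about are easy to check. A minimal sketch in Python, where encoded_sizes and the sample strings are made up for illustration and KOI8-R stands in for "a code page":

```python
def encoded_sizes(text, encodings=("utf-8", "utf-16-le", "utf-32-le")):
    """Map each encoding name to the number of bytes needed for text."""
    return {enc: len(text.encode(enc)) for enc in encodings}


if __name__ == "__main__":
    samples = {"English": "hello", "Russian": "привет", "Chinese": "你好"}
    for name, text in samples.items():
        print(name, len(text), "code points:", encoded_sizes(text))
    # A single-byte code page only covers its own repertoire.
    print("Russian in KOI8-R:", len("привет".encode("koi8_r")), "bytes")
```

The fixed four bytes per character is UTF-32. UTF-8 takes one byte per ASCII letter, two per Cyrillic letter and three per common CJK character, so against a single-byte code page the Russian text doubles in size rather than quadruples.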
> Whether you are sorry or not is irrelevant to the existence of code pages.
> They will be around forever, until all of us speak Esperanto.
>
> ( Currently I am doing right-to-left support in the engine - Arabic and
> Hebrew - trust me - I probably have more things to say "sorry" about )
>
>
>>>The same, by the way, applies to most Java compilers:
>>>they accept texts in various single-byte encodings.
>>>(Why am *I* telling this to *you*? :-)
>>
>>The compiler may accept it as an extension, but the Java *language* is
>>defined to work with UTF-16 source text only. (Java calls them 'char's,
>>even though there may be multi-char encodings.)
>
>
> Walter, where did you get that magic UTF-16?
>
> Doc: http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html
> mentions that the input of a Java compiler is a sequence of Unicode (code
> points). How this input sequence is encoded - utf-8, utf-16, koi8r - does not
> matter at all, and the spec is silent about this - a human is within his/her
> rights to choose the encoding his/her terminal/keyboard supports.
>
> Andrew Fedoniouk.
> http://terrainformatica.com
>
>