To Walter, about char[] initialization by FF

Walter Bright newshound at digitalmars.com
Sun Jul 30 02:08:48 PDT 2006


Andrew Fedoniouk wrote:
> "Walter Bright" <newshound at digitalmars.com> wrote in message 
> news:eah9st$2v1o$1 at digitaldaemon.com...
>> Andrew Fedoniouk wrote:
>>>>> Please don't think that UTF-8 is a panacea.
>>>> I don't. But it's way better than C/C++, because you can rely on it and 
>>>> your code will work with different languages out of the box.
>>> Sorry, but this is a bit optimistic.
>>>
>>> D/samples/wc.exe out of the box will fail on Russian texts.
>>> It will fail on almost all Eastern texts, even when they
>>> are in UTF-8 encoding, because the meaning of 'word'
>>> is different there.
>> No matter, it is far easier to write a UTF-8 isword function than one that 
>> will work on all possible character encoding methods.
> Sorry, did you try to write such a function (isword)?

I have written isUniAlpha, which is the same thing.
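
To illustrate the idea (a sketch only, not my actual implementation -
it uses std.uni.isAlpha, the current Phobos name for what I called
isUniAlpha):

import std.uni : isAlpha;

// Count words in UTF-8 text, where a "word" is a run of Unicode alphas.
// A foreach over a string with a dchar loop variable decodes the UTF-8
// into code points automatically.
size_t countWords(string text)
{
    size_t words = 0;
    bool inWord = false;
    foreach (dchar c; text)
    {
        if (isAlpha(c))
        {
            if (!inWord) ++words;
            inWord = true;
        }
        else
            inWord = false;
    }
    return words;
}

It gives the same answer for English and Russian text alike, with no
classification tables beyond the one Unicode set.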

> (You need the whole set of character classification tables
> to accomplish this - UTF-8 will not help you)

With code pages, it isn't so straightforward (especially if you've got 
things like Shift-JIS too). Worse, a program can't even accept a text 
file unless you tell it what page the text is in.

> I am not saying that you should avoid use of UTF-8 encoding.
> If you have a mix of, say, English, Russian, and Chinese on some page,
> the only way to deliver it to the user is to use some (universal)
> Unicode transport encoding.
> But rendering that thing on the screen is a completely different
> story.

Fortunately, rendering is the job of the operating system - and I don't 
see how rendering with code pages would be any easier.

> Consider this: attribute names in HTML (SGML) are represented by
> ASCII codes only - you don't need UTF-8 processing to deal with them at all.
> You also cannot use UTF-8 for storing attribute values, generally speaking.
> Attribute values participate in CSS selector analysis, and some selectors
> require char-by-char (char as a code point, not a D char) access.

I'd be surprised at that, since UTF-8 is a documented, supported HTML 
page encoding method. But if UTF-8 doesn't work for you, you can use 
wchar (UTF-16) or dchar (UTF-32), or ubyte (for anything else).
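
For instance (present-day D spelling - string, wstring, and dstring
are the immutable char[], wchar[], and dchar[] types):

import std.stdio;

void main()
{
    string  s8  = "naïve";  // UTF-8:  6 code units
    wstring s16 = "naïve";  // UTF-16: 5 code units
    dstring s32 = "naïve";  // UTF-32: 5 code units
    ubyte[] raw = [0x6E, 0x61, 0xEF, 0x76, 0x65]; // same text as Latin-1 bytes
    writeln(s8.length, " ", s16.length, " ", s32.length, " ", raw.length); // 6 5 5 5
}

Same text, four representations; pick whichever the processing calls for.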

> There are only a few academic cases where you can use UTF-8 literally
> (as a sequence of UTF-8 bytes) *at runtime*. D source code compilation
> is one of them - you can store the content of string literals in UTF-8
> form, since you don't need to analyze their content.

D identifiers can contain Unicode alphas, which means the UTF-8 must be decoded.
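
That decode step looks something like this (a sketch, not the dmd
lexer; it assumes std.utf.decode and std.uni.isAlpha from present-day
Phobos):

import std.uni : isAlpha;
import std.utf : decode;

// Scan one identifier starting at src[i]; advances i past it.
// decode() extracts one code point from the UTF-8 and bumps the index.
string scanIdentifier(string src, ref size_t i)
{
    size_t start = i;
    while (i < src.length)
    {
        size_t j = i;
        dchar c = decode(src, j);
        if (c != '_' && !isAlpha(c))  // digit handling omitted for brevity
            break;
        i = j;
    }
    return src[start .. i];
}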

The DMC++ compiler supports various code page source file possibilities, 
including some of the Asian-language multibyte encodings. I find that 
UTF-8 is a lot easier to work with, as the UTF-8 designers learned from 
the mistakes of those earlier multibyte encodings.
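
One of those lessons: UTF-8 is self-synchronizing. Any byte tells you
whether it starts a character, so backing up to the start of the
current character is a short scan - something Shift-JIS can't offer,
since a trail byte there can look like a lead byte. A sketch:

// Continuation bytes in UTF-8 are always 0b10xxxxxx, so finding the
// start of the character containing s[i] never needs outside context.
size_t characterStart(const(ubyte)[] s, size_t i)
{
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        --i;
    return i;
}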

>> Code pages are obsolete yesterday's technology, and I'm not sorry to see 
>> them go.
> Sorry, but the US is the first country that will ask "what the ...?" at a
> demand to always send four bytes instead of one.
> UTF-8 encoding is "traffic friendly" only for 1/10 of the population
> of the Earth (English-speaking people).
> Others just don't want to pay that price.

I'll make a prediction that the huge benefits of UTF will outweigh the 
downside, and that code pages will increasingly fall into disuse. Note 
that JavaScript, Java, C#, Ruby, etc., are all Unicode languages (Ruby 
also supports EUC or SJIS, but not other code pages). Windows is 
(internally) completely Unicode (the code page face it shows is done by 
a translation layer on I/O).

In an increasingly multicultural and global economy, applications that 
cannot simultaneously handle multiple languages are going to be at a 
severe disadvantage.

Another problem with code pages: when you're presented with a text 
file, what code page is it in? There's no way for a program to tell 
unless there's some separate transmission of associated metadata. With 
UTF, that's no problem.
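
A program can even check for itself, since code page text is almost
never valid UTF-8 by accident - a sketch using std.utf.validate:

import std.utf : validate, UTFException;

// Returns true if data is well-formed UTF-8. There is no analogous
// check possible for code page text - the bytes carry no signature.
bool looksLikeUTF8(const(ubyte)[] data)
{
    try
    {
        validate(cast(const(char)[]) data);  // throws on malformed UTF-8
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}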

> Sorry or not, it is irrelevant to the existence of code pages.
> They will be around forever, until all of us speak Esperanto.
> 
> (Currently I am doing right-to-left support in the engine - Arabic
> and Hebrew -
> trust me, I probably have more things to say "sorry" about)

No problem, I believe you <g>.

>>> The same, by the way, applies to most Java compilers -
>>> they accept texts in various single-byte encodings.
>>> (Why am *I* telling this to *you*? :-)
>> The compiler may accept it as an extension, but the Java *language* is 
>> defined to work with UTF-16 source text only. (Java calls them 'char's, 
>> even though there may be multi-char encodings.)
> 
> Walter, where did you get that magic UTF-16?
> 
> Doc: http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html
> mentions that the input of the Java compiler is a sequence of Unicode code points.
> How this input sequence is encoded - UTF-8, UTF-16, KOI8-R - does not
> matter at all, and the spec is silent about it; a human is within his/her
> rights to choose whatever encoding his/her terminal/keyboard supports.

Java Language Specification Third Edition Chapter 3.2: "The Java 
programming language represents text in sequences of 16-bit code units, 
using the UTF-16 encoding."

It is, of course, entirely reasonable for a Java compiler to have 
extensions to recognize other encodings and automatically convert them 
internally to UTF-16 before lexical analysis.

"One Encoding to rule them all, One Encoding to replace them,
One Encoding to handle them all and in the darkness bind them"
-- UTF Tolkien


