Ceci n'est pas une char
Anders F Björklund
afb at algonet.se
Fri Apr 7 00:18:39 PDT 2006
Georg Wrede wrote:
>>> For the general case, UTF-32 is a pretty wasteful Unicode encoding
>>> just to have that privilege ?
>>
>> I'm not sure there is a "general case", so it's hard to say. Some
>> programmers have to deal with MBCS every day; others can go for years
>> without ever having to worry about anything but vanilla ASCII.
>
> True!! Folks in Boise, Idaho, vs. folks in non-British Europe or the
> Far East.
I don't think so. UTF-8 is good for us in "non-British" Europe, and
UTF-16 is good in the East. UTF-32 is good for... direct indexing of
codepoints ?
As long as the "exceptions" (multi-byte sequences and surrogate pairs)
are taken care of, there is really no difference between the three
encoding forms (or five, counting byte orders) - it's all Unicode.
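
For instance (a minimal sketch, assuming the std.utf of the day; the
literal and the names are mine), the very same codepoints come back
out no matter which encoding form holds them:

import std.stdio;
import std.utf;

void main()
{
    char[]  u8  = "naïve";     // UTF-8: 'ï' takes two code units
    wchar[] u16 = toUTF16(u8); // UTF-16: one code unit each, here
    dchar[] u32 = toUTF32(u8); // UTF-32: one code unit per codepoint

    // decode() steps over the multi-byte sequences, so walking the
    // UTF-8 string yields the codepoints stored one-to-one in UTF-32:
    size_t i = 0;
    foreach (dchar c; u32)
        assert(decode(u8, i) == c);

    writefln("%d / %d / %d code units", u8.length, u16.length, u32.length);
}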
I prefer UTF-8 - because it is ASCII-compatible and endian-independent,
but UTF-16 is not a bad choice if you handle a lot of non-ASCII chars.
Just as long as other layers play along with the embedded NULs, and you
write a proper byte-order mark (BOM) when storing it. It seemed to work
for Java ?
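
Storing it could look something like this (a sketch only - storeUTF16
is a made-up name, and I assume std.file and std.utf as they are now):

import std.file;
import std.utf;

// Prepend U+FEFF (the byte-order mark) before writing UTF-16,
// so that a reader can detect the endianness of the code units.
void storeUTF16(char[] text, char[] filename)
{
    wchar[] w = "\uFEFF" ~ toUTF16(text);
    // note: for ASCII text every other byte in w is a NUL, which
    // truncates the buffer if some layer treats it as a C string
    std.file.write(filename, cast(void[]) w);
}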
The argument was just against *UTF-32* as a storage type, nothing more.
(As was rationalized in http://www.unicode.org/faq/utf_bom.html#UTF32)
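
The FAQ's point is easy to reproduce (sketch, same assumptions):

import std.stdio;
import std.utf;

void main()
{
    char[] text = "Björklund"; // 9 codepoints, mostly ASCII

    // bytes needed to store the same text in each encoding form:
    writefln("UTF-8:  %d bytes", text.length);                         // 10
    writefln("UTF-16: %d bytes", toUTF16(text).length * wchar.sizeof); // 18
    writefln("UTF-32: %d bytes", toUTF32(text).length * dchar.sizeof); // 36
}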
--anders
PS.
Thought that having std UTF type aliases would have helped, but I dunno:
module std.stdutf;

/* UTF code units */

alias char  utf8_t;  // UTF-8
alias wchar utf16_t; // UTF-16
alias dchar utf32_t; // UTF-32
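
In use they would read like this (a sketch; aliases repeated so that
it stands alone, and the conversions still go through std.utf):

import std.utf;

alias char  utf8_t;  // as proposed above
alias wchar utf16_t;
alias dchar utf32_t;

void main()
{
    utf8_t[]  u8  = "hello";     // really just char[]
    utf16_t[] u16 = toUTF16(u8); // really just wchar[]
    utf32_t[] u32 = toUTF32(u8); // really just dchar[]
}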
It's a little confusing anyway, many "char*" routines don't accept UTF ?