Ceci n'est pas une char
Anders F Björklund
afb at algonet.se
Fri Apr 7 00:18:39 PDT 2006
Georg Wrede wrote:
>>> For the general case, UTF-32 is a pretty wasteful Unicode encoding
>>> just to have that privilege ?
>>
>> I'm not sure there is a "general case", so it's hard to say. Some
>> programmers have to deal with MBCS every day; others can go for years
>> without ever having to worry about anything but vanilla ASCII.
>
> True!! Folks in Boise, Idaho, vs. folks in non-British Europe or the
> Far East.
I don't think so. UTF-8 is good for us in "non-British" Europe, and
UTF-16 is good in the East. UTF-32 is good for... direct indexing of
codepoints ?
As long as the "exceptions" (multi-byte sequences and surrogate pairs)
are taken care of, there is really no difference between the three
encoding forms (or five, counting byte orders) - it's all Unicode.
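
For instance (a minimal sketch, assuming the std.utf of the day; the
literal and the names are mine), the very same codepoints come back
out no matter which encoding form holds them:

import std.stdio;
import std.utf;

void main()
{
    char[]  u8  = "naïve";     // UTF-8: 'ï' takes two code units
    wchar[] u16 = toUTF16(u8); // UTF-16: one code unit each, here
    dchar[] u32 = toUTF32(u8); // UTF-32: one code unit per codepoint

    // decode() steps over the multi-byte sequences, so walking the
    // UTF-8 string yields the codepoints stored one-to-one in UTF-32:
    size_t i = 0;
    foreach (dchar c; u32)
        assert(decode(u8, i) == c);

    writefln("%d / %d / %d code units", u8.length, u16.length, u32.length);
}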
I prefer UTF-8 - because it is ASCII-compatible and endian-independent,
but UTF-16 is not a bad choice if you handle a lot of non-ASCII chars.
Just as long as other layers play along with the embedded NULs, and you
write a proper byte-order mark (BOM) when storing it. It seemed to work
for Java ?
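
Storing it could look something like this (a sketch only - storeUTF16
is a made-up name, and I assume std.file and std.utf as they are now):

import std.file;
import std.utf;

// Prepend U+FEFF (the byte-order mark) before writing UTF-16,
// so that a reader can detect the endianness of the code units.
void storeUTF16(char[] text, char[] filename)
{
    wchar[] w = "\uFEFF" ~ toUTF16(text);
    // note: for ASCII text every other byte in w is a NUL, which
    // truncates the buffer if some layer treats it as a C string
    std.file.write(filename, cast(void[]) w);
}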
The argument was just against *UTF-32* as a storage type, nothing more.
(As was rationalized in http://www.unicode.org/faq/utf_bom.html#UTF32)
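
The FAQ's point is easy to reproduce (sketch, same assumptions):

import std.stdio;
import std.utf;

void main()
{
    char[] text = "Björklund"; // 9 codepoints, mostly ASCII

    // bytes needed to store the same text in each encoding form:
    writefln("UTF-8:  %d bytes", text.length);                         // 10
    writefln("UTF-16: %d bytes", toUTF16(text).length * wchar.sizeof); // 18
    writefln("UTF-32: %d bytes", toUTF32(text).length * dchar.sizeof); // 36
}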
--anders
PS.
Thought that having std UTF type aliases would have helped, but I dunno:
module std.stdutf;

/* UTF code units */

alias char  utf8_t;  // UTF-8
alias wchar utf16_t; // UTF-16
alias dchar utf32_t; // UTF-32
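
In use they would read like this (a sketch; aliases repeated so that
it stands alone, and the conversions still go through std.utf):

import std.utf;

alias char  utf8_t;  // as proposed above
alias wchar utf16_t;
alias dchar utf32_t;

void main()
{
    utf8_t[]  u8  = "hello";     // really just char[]
    utf16_t[] u16 = toUTF16(u8); // really just wchar[]
    utf32_t[] u32 = toUTF32(u8); // really just dchar[]
}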
It's a little confusing anyway, many "char*" routines don't accept UTF ?