Improving D's support of code-pages

Deewiant deewiant.doesnotlike.spam at gmail.com
Sun Aug 19 02:20:24 PDT 2007


Kirk McDonald wrote:
> The idiom is this: A string not known to be encoded in UTF-8, -16, or -32 
> should be stored as a ubyte[]. All internal string manipulation should be 
> done in one of the Unicode encoding types (char[], wchar[], or dchar[]), and 
> all input and output should be done with the ubyte[] type.

I asked about this when Tango was first announced, and was dismayed that this
wasn't the case. Good that somebody else has the same thought.

I tried doing this in an application manually, but it resulted in so many casts
(ubyte[] to char[] for the standard library functions, the other way for their
return values) that I gave up. It's the same way for both Phobos and Tango.
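
To illustrate the cast noise described above, here is a minimal sketch written
for a current D compiler. firstField is a hypothetical stand-in for any library
routine that accepts only char[]; it is not a Phobos or Tango function, and the
Latin-1 sample data is likewise just an assumption for the example.

```d
// Hypothetical stand-in for a library routine that accepts only char[].
// It scans code units without decoding, so non-UTF-8 bytes pass through.
char[] firstField(char[] s, char sep)
{
    foreach (i, c; s)
    {
        if (c == sep)
            return s[0 .. i];
    }
    return s;
}

void main()
{
    // "café;thé" in Latin-1: not valid UTF-8, so it is kept as ubyte[].
    ubyte[] raw = [0x63, 0x61, 0x66, 0xE9, 0x3B, 0x74, 0x68, 0xE9];

    // One cast on the way in, another on the way out.
    ubyte[] field = cast(ubyte[]) firstField(cast(char[]) raw, ';');

    ubyte[] expected = [0x63, 0x61, 0x66, 0xE9];
    assert(field == expected);
}
```

Every call site that touches the library needs the same pair of casts, which is
what makes the idiom tedious to apply by hand.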

> This implies something else: Printing a ubyte[] should cause those actual 
> bytes to be printed directly. While it is currently possible to do this with 
> e.g. std.cstream.dout.write(), it would be very convenient to do this with 
> writef, especially combined with encode().

Tango still doesn't have out-of-the-box support for just sending bytes to
output, although I'm doing my best to get the code I've written for it added.

One problem is, as you said in another post, that std.format.doFormat and
tango.text.convert.Format are Unicode-aware. Dealing with non-Unicode data in a
D app is very difficult without conversion to UTF-(8|16|32), which is
potentially expensive.

Another problem is that essentially every C binding out there uses 'char' where
it really means 'ubyte'. Without implicit casts from char to ubyte and vice
versa, this really doesn't work in practice, and with them, the theory breaks
down.
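
As a small sketch of the C binding issue, assuming a current D compiler and the
druntime binding core.stdc.string.strlen: the C function only cares about raw
bytes, but its D declaration takes const(char)*, so ubyte data needs an explicit
cast. cLength is a hypothetical helper introduced only for this example.

```d
import core.stdc.string : strlen;

// Hypothetical helper: measures a NUL-terminated byte buffer with C's strlen.
size_t cLength(const(ubyte)[] nulTerminated)
{
    // The binding declares its parameter as const(char)*, so the cast is
    // required even though the data is not known to be UTF-8.
    return strlen(cast(const(char)*) nulTerminated.ptr);
}

void main()
{
    // "thé" in Latin-1 plus a terminating NUL; not valid UTF-8.
    ubyte[] raw = [0x74, 0x68, 0xE9, 0x00];
    assert(cLength(raw) == 3);
}
```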

All in all it's a very complicated problem, as you've noted. If you can find a
good and actually working solution, great. But I don't think it's here yet.

-- 
Remove ".doesnotlike.spam" from the mail address.


