Improving D's support of code-pages
kris
foo at bar.com
Sun Aug 19 00:38:55 PDT 2007
Kirk -
It's not a stupid idea, but you may not have all the necessary pieces?
For example, this kind of processing should probably not be bound to an
application by default (bloat?) and thus you'd perhaps need some
mechanism to (dynamically) attach custom processing onto a stream?
Tango supports this via stream filters, and deewiant (for example) has
an output filter for doing specific code-page conversion. Tango also has
UnicodeFile as a template for converting between internal utf8/16/32 and
an external UTF representation (all 8 varieties) along with BOM support;
much as you were describing earlier.
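
To give a rough idea of the shape such a filter takes (the Sink interface
below is just a stand-in, not Tango's actual OutputStream, and the
transcoding itself is handed in as a delegate):

// Stand-in for an output-stream abstraction; Tango's real
// OutputStream interface is richer than this.
interface Sink
{
    void write (void[] src);
}

// Filter that transcodes utf8 text into some target code-page
// before handing the bytes to the wrapped sink.
class CodepageFilter : Sink
{
    private Sink next;
    private ubyte[] delegate(char[]) transcode;

    this (Sink next, ubyte[] delegate(char[]) transcode)
    {
        this.next = next;
        this.transcode = transcode;
    }

    void write (void[] src)
    {
        // treat the incoming bytes as utf8, convert, then forward
        next.write (this.transcode (cast(char[]) src));
    }
}

The point is that the conversion lives in the filter rather than in the
core IO path, so applications that never ask for it don't pay for it.
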
The console is a PITA when it comes to encodings, especially when
redirection is involved. Thus, we decided long ago that Tango would be
utf8-only for console IO, and for all variations thereof ... that gives it a
known state. From there, either a filter or a replacement console-device
can be injected into the IO framework for customization purposes.
Unix has a good lib for code-page support, called iconv. The IBM ICU
project also has extensive code-page support, along with a bus,
helicopter, cruise-liner, and a kitchen-sink, all wrapped up in a very
powerful (UTF16) API. But the latter is too heavyweight to be embedded
in a core library, which is why those wrappers still reside in Mango
rather than Tango. On the other hand, Tango does have a code-page API
much like what you suggest, as a free-function lightweight converter.
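
To give a feel for the iconv route, the binding really is tiny; something
like this, hand-written from iconv(3) with all error handling omitted
(toLatin1 is just an illustrative helper, not an existing function):

// Hand-written declarations for the C iconv API; see iconv(3).
extern (C)
{
    alias void* iconv_t;

    iconv_t iconv_open  (char* tocode, char* fromcode);
    size_t  iconv       (iconv_t cd, char** inbuf,  size_t* inbytesleft,
                                     char** outbuf, size_t* outbytesleft);
    int     iconv_close (iconv_t cd);
}

// Convert utf8 text to latin-1; latin-1 output never needs more
// bytes than the utf8 input, so one buffer of the same length will do.
ubyte[] toLatin1 (char[] src)
{
    auto cd = iconv_open ("ISO-8859-1", "UTF-8");

    auto dst     = new ubyte[src.length];
    auto inPtr   = src.ptr;
    auto outPtr  = cast(char*) dst.ptr;
    size_t inLeft  = src.length,
           outLeft = dst.length;

    iconv (cd, &inPtr, &inLeft, &outPtr, &outLeft);
    iconv_close (cd);

    return dst[0 .. $ - outLeft];
}

That is about the level of weight the core library can afford; the ICU
wrappers stay in Mango for everything beyond plain conversion.
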
- Kris
Kirk McDonald wrote:
> Walter Bright wrote:
>> Kirk McDonald wrote:
>>
>>> Pardon? I haven't said anything about stdio behaving differently
>>> whether it's printing to the console or not. writefln() would
>>> /always/ attempt to encode in the console's encoding.
>>
>>
>> Ok, I misunderstood.
>>
>> Now, what if stdout is reopened to be a file?
>
> I've been thinking about these issues more carefully. It is harder than
> I initially thought. :-)
>
> Ignoring my ideas of implicitly encoding writefln's output, I regard the
> encode/decode functions as vital. These alone would improve the current
> situation immensely.
>
> Printing ubyte[] arrays as the "raw bytes" therein when using writef()
> is basically nonsense, thanks to the fact that doFormat itself is
> Unicode aware. I should have realized this sooner. However, you can
> still write them with dout.write(). This should be adequate.
>
> Here is another proposal regarding implicit encoding, slightly modified
> from my first one:
>
> The Stream class should be modified to have an encoding attribute. This
> should usually be null. If it is present, output should be encoded into
> that encoding. (To facilitate this, the encoding module should provide a
> doEncode function, analogous to the doFormat function, which has a void
> delegate(ubyte) or possibly a void delegate(ubyte[]) callback.)
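
Something along those lines seems workable. As a guess at the shape (the
signature is invented here, and the body only handles latin-1, purely to
show how the sink delegate would be fed):

// Guessed-at shape for doEncode: walk the utf8 input, convert,
// and hand the converted bytes to the supplied sink delegate.
void doEncode (char[] text, char[] encoding, void delegate(ubyte[]) sink)
{
    assert (encoding == "latin-1");          // toy: one target only

    ubyte[] buffer;
    foreach (dchar c; text)                  // foreach decodes utf8 for us
        buffer ~= cast(ubyte) (c < 0x100 ? c : '?');

    sink (buffer);
}

// e.g. doEncode (s, "latin-1", (ubyte[] b){ dout.write (b); });
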
>
> Next, std.stdio.writef should be modified to write to the object
> referenced by std.cstream.dout, instead of the FILE* stdout. The next
> step is obvious: std.cstream.dout's encoding attribute should be set to
> the console's encoding. Finally, though dout should obviously remain a
> CFile instance, it should be stored in a Stream reference.
>
> If another Stream object is substituted for dout, then the behavior of
> writefln (and anything else relying on dout) would be redirected.
> Whether the output is still implicitly encoded would depend entirely on
> this new object's encoding attribute.
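
A toy version of the arrangement might look like this, reusing the
doEncode sketched above (none of it is existing Phobos code; it just
mirrors the proposal):

// Everything funnels through a Stream reference, and the object
// behind it decides whether any conversion happens.
abstract class Stream
{
    char[] encoding;                 // null means "pass utf8 through"

    final void print (char[] s)
    {
        if (encoding is null)
            rawWrite (cast(ubyte[]) s);
        else
            doEncode (s, encoding, (ubyte[] b){ rawWrite (b); });
    }

    abstract void rawWrite (ubyte[] data);
}

Stream dout;     // the proposal would initialise this to the console CFile

// Swapping in another Stream redirects writefln and friends; the new
// object's own encoding attribute decides whether output is converted.
void redirect (Stream replacement)
{
    dout = replacement;
}
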
>
> It occurs to me that this could be somewhat slow. Examination of the
> source reveals that every printed character from dout is the result of a
> virtual method call. However, I do wonder how important the performance
> of printing to the console really is.
>
> Thoughts? Is this a thoroughly stupid idea?
>