Improving D's support of code-pages

Sat Aug 18 14:56:03 PDT 2007

Kirk McDonald wrote:
> Walter Bright wrote:
> 
>> Kirk McDonald wrote:
>>
>>> ----
>>> Additions to Phobos
>>> ----
>>>
>>> The first things Phobos needs are the following functions. (Their 
>>> basic interface has been cribbed from Python.)
>>>
>>> char[] decode(ubyte[] str, string encoding, string error="strict");
>>> wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
>>> dchar[] ddecode(ubyte[] str, string encoding, string error="strict");
>>>
>>> ubyte[] encode(char[] str, string encoding, string error="strict");
>>> ubyte[] encode(wchar[] str, string encoding, string error="strict");
>>> ubyte[] encode(dchar[] str, string encoding, string error="strict");
>>
>> If you (or someone else) wants to write these, I'll put them in.
>>
> 
> It is not a small amount of work. Perhaps I will take a look at how big 
> a problem it is (after the conference).
> 
>>> ----
>>> Improvements to Phobos
>>> ----
>>>
>>> The behavior of writef (and perhaps of D's formatting in general) 
>>> must be altered.
>>>
>>> Currently, printing a char[] causes D to output the raw bytes in the 
>>> string. As I previously mentioned, this is not a good thing. On many 
>>> platforms, this can easily result in garbage being printed to the 
>>> screen.
>>>
>>> I propose changing writef to check the console's encoding, and to 
>>> attempt to encode the output in that encoding. Then it can simply 
>>> output the resulting raw bytes. Checking this encoding is a 
>>> platform-specific operation, but essentially every platform 
>>> (particularly Linux, Windows, and OS X) has a way to do it. If the 
>>> string cannot be encoded in that encoding, the exception thrown by 
>>> encode() should be allowed to propagate and terminate the program (or 
>>> be caught by the user). If the user wishes to avoid that exception, 
>>> they should call encode() explicitly themselves. For this reason, 
>>> Phobos will also need a function for retrieving the console's default 
>>> encoding made available to the user.
>>
>> There's a big problem with this - what if the output is being sent to 
>> a file?
>
> Files have no inherent encoding; only the console does. In this way, 
> writing to a file is different from writing to the console. The user 
> must explicitly provide an encoding when writing to a file; or, if they 
> are writing a char[], wchar[], or dchar[], the encoding will be UTF-8, 
> UTF-16, or UTF-32, respectively. (Writing a char[] implies an encoding, 
> while writing a ubyte[] does not.)
> 

I should clarify this: When stdout is treated as a file, it should 
behave like any other file: writing to it writes raw bytes. But writef 
does not treat stdout as a file, so it should attempt to encode its 
output into the console's default encoding.
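
As a rough illustration of that distinction, here is a minimal sketch in Python rather than D, since the proposed interface is cribbed from Python and the Phobos functions do not exist yet. The helper names console_encoding, encode_for_console, and write_raw are made up for this example, and the UTF-8 fallback is an assumption, not part of the proposal.

```python
import io
import sys


def console_encoding(stream):
    """Return the encoding a stream reports.

    Falls back to UTF-8 when the stream reports none; that fallback
    is an assumption of this sketch, not part of the proposal.
    """
    return getattr(stream, "encoding", None) or "utf-8"


def encode_for_console(text, encoding, errors="strict"):
    """writef-style path: encode text into the console's encoding.

    With errors="strict", a character the encoding cannot represent
    raises UnicodeEncodeError, which the caller may catch or allow
    to propagate.
    """
    return text.encode(encoding, errors)


def write_raw(binary_file, data):
    """File-style path: the bytes are written unchanged."""
    binary_file.write(data)


if __name__ == "__main__":
    text = "na\u00efve"  # contains U+00EF, which is not ASCII

    # File-style: the caller picks the encoding, then writes raw bytes.
    buf = io.BytesIO()
    write_raw(buf, text.encode("utf-8"))

    # Console-style: encode into whatever the console reports.
    encoding = console_encoding(sys.stdout)
    try:
        data = encode_for_console(text, encoding)
    except UnicodeEncodeError:
        # A user who wants to avoid the exception encodes explicitly
        # with a different error policy.
        data = encode_for_console(text, encoding, "replace")
    print(len(data), "bytes for console encoding", encoding)
```

The same string yields different bytes depending on the target encoding, which is why printing the raw UTF-8 bytes of a char[] to a non-UTF-8 console can produce garbage.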

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org