Improving D's support of code-pages

Sat Aug 18 14:53:31 PDT 2007

Walter Bright wrote:
> Kirk McDonald wrote:
> 
>> ----
>> Additions to Phobos
>> ----
>>
>> The first thing Phobos needs are the following functions. (Their basic 
>> interface has been cribbed from Python.)
>>
>> char[] decode(ubyte[] str, string encoding, string error="strict");
>> wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
>> dchar[] ddecode(ubyte[] str, string encoding, string error="strict");
>>
>> ubyte[] encode(char[] str, string encoding, string error="strict");
>> ubyte[] encode(wchar[] str, string encoding, string error="strict");
>> ubyte[] encode(dchar[] str, string encoding, string error="strict");
> 
> 
> If you (or someone else) wants to write these, I'll put them in.
> 

It is not a small amount of work. Perhaps I will take a look at how big 
of a problem it is (after the conference).
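
To give a sense of the scale, here is a minimal sketch of the simplest
possible case: Latin-1 only, with only the "strict" error behavior. The
names decodeLatin1 and encodeLatin1 are made up for illustration and
are not part of Phobos, and the code assumes a D2 compiler with its
standard library. The real functions would also need a table of
encodings and the other error modes, which is where the work is.

```d
import std.array : appender;
import std.exception : enforce;

/// Hypothetical helper (not in Phobos): Latin-1 bytes to UTF-8.
/// Every Latin-1 byte is the code point of the same value, so
/// decoding cannot fail.
string decodeLatin1(const(ubyte)[] bytes)
{
    auto result = appender!string();
    foreach (b; bytes)
        result.put(cast(dchar) b); // the appender re-encodes as UTF-8
    return result.data;
}

/// Hypothetical helper (not in Phobos): UTF-8 to Latin-1 bytes.
/// Mimics error="strict": throws if a code point does not fit.
ubyte[] encodeLatin1(const(char)[] str)
{
    ubyte[] result;
    foreach (dchar c; str) // iterating by dchar decodes the UTF-8
    {
        enforce(c <= 0xFF, "code point not representable in Latin-1");
        result ~= cast(ubyte) c;
    }
    return result;
}

void main()
{
    ubyte[] raw = [0x63, 0x61, 0x66, 0xE9]; // "cafe" with e-acute, in Latin-1
    string s = decodeLatin1(raw);
    assert(s == "caf\u00E9");
    assert(encodeLatin1(s) == raw);
}
```

Passing a string containing, say, U+20AC (the euro sign) to
encodeLatin1 throws, which is the behavior I would expect from
error="strict".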

>> ----
>> Improvements to Phobos
>> ----
>>
>> The behavior of writef (and perhaps of D's formatting in general) must 
>> be altered.
>>
>> Currently, printing a char[] causes D to output the raw bytes in the 
>> string. As I previously mentioned, this is not a good thing. On many 
>> platforms, this can easily result in garbage being printed to the screen.
>>
>> I propose changing writef to check the console's encoding, and to 
>> attempt to encode the output in that encoding. Then it can simply 
>> output the resulting raw bytes. Checking this encoding is a 
>> platform-specific operation, but essentially every platform 
>> (particularly Linux, Windows, and OS X) has a way to do it. If the 
>> string cannot be encoded in that encoding, the exception thrown by 
>> encode() should be allowed to propagate and terminate the program (or 
>> be caught by the user). If the user wishes to avoid that exception, 
>> they should call encode() explicitly themselves. For this reason, 
>> Phobos will also need a function for retrieving the console's default 
>> encoding made available to the user.
> 
> 
> There's a big problem with this - what if the output is being sent to a 
> file?

Files have no inherent encoding; only the console does. In this way, 
writing to a file is different from writing to the console. The user 
must explicitly provide an encoding when writing to a file; or, if they 
are writing a char[], wchar[], or dchar[], the encoding will be UTF-8, 
-16, or -32, respectively. (Writing a char[] implies an encoding, while 
writing a ubyte[] does not.)
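
A small sketch of that last point, assuming a D2 compiler and its
standard library (std.string.representation, std.utf.toUTF16 and
std.utf.toUTF32). The same four characters become different bytes
depending on which string type is written, while a ubyte[] is just
bytes, and only the programmer knows what they mean.

```d
import std.string : representation;
import std.utf : toUTF16, toUTF32;

void main()
{
    string s = "caf\u00E9"; // four characters, the last one non-ASCII

    // The string types carry their encoding with them.
    immutable(ubyte)[] utf8 = s.representation; // the UTF-8 bytes of s
    wstring utf16 = s.toUTF16;
    dstring utf32 = s.toUTF32;

    assert(utf8.length == 5);  // U+00E9 takes two bytes in UTF-8
    assert(utf16.length == 4); // one 16-bit code unit per character here
    assert(utf32.length == 4); // one 32-bit code unit per character

    // A ubyte[] carries no encoding. These happen to be the Latin-1
    // bytes for the same text, but nothing in the type says so.
    immutable ubyte[] latin1 = [0x63, 0x61, 0x66, 0xE9];
    assert(latin1.length == 4);
    assert(utf8 != latin1);
}
```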

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org