Improving D's support of code-pages

Kirk McDonald kirklin.mcdonald at gmail.com
Sat Aug 18 14:33:30 PDT 2007


D's support for Unicode is a wonderful thing. The ability to comprehend 
UTF-8, -16, and -32 strings in a straightforward, native fashion is 
invaluable. However, the outside world consists of more than encodings 
of Unicode. The ability to deal with code pages in a straightforward 
manner should be considered absolutely vital.

I will be describing what I think is the optimal way of dealing with 
code-pages. This includes some changes and additions to Phobos (or 
Tango, if that is your preferred platform; to be truthful I am unsure of 
the state of code pages in that library), as well as describing a new D 
idiom.

(I am aware that Mango has some bindings to the ICU code-page conversion 
libraries, but this sort of functionality /really/ belongs in the 
standard library.)

The idiom is this: A string not known to be encoded in UTF-8, -16, or 
-32 should be stored as a ubyte[]. All internal string manipulation 
should be done in one of the Unicode encoding types (char[], wchar[], or 
dchar[]), and all input and output should be done with the ubyte[] type. 
There are some exceptions to this, of course. If you're reading input 
which you know to be in one of D's Unicode encoding types, or writing 
something out in one of those formats, naturally there's no reason you 
shouldn't just read into or write from the D type directly.

However, in many real-world situations, you are not reading something in 
a Unicode encoding, nor do you always want to write one out. This is 
particularly the case when writing something to the console. Not all 
Windows command lines or Linux shells are set up to handle UTF-8, though 
this is very common on Linux. My Windows command-line, however, uses the 
default of the ancient CP437, and this is not uncommon. The point is 
that, on many systems, outputting raw UTF-8 results in garbage.

----
Additions to Phobos
----

The first thing Phobos needs are the following functions. (Their basic 
interface has been cribbed from Python.)

char[] decode(ubyte[] str, string encoding, string error="strict");
wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
dchar[] ddecode(ubyte[] str, string encoding, string error="strict");

ubyte[] encode(char[] str, string encoding, string error="strict");
ubyte[] encode(wchar[] str, string encoding, string error="strict");
ubyte[] encode(dchar[] str, string encoding, string error="strict");

What follows is a description of these functions. For the sake of 
simplicity, I will only be referring to the char[] versions of these 
functions. The wchar[] and dchar[] versions should operate in an 
identical fashion.

Let's say you've read in a file and stored it in a ubyte[]:

ubyte[] file = something();

You're already in a bit of a situation here if you don't know the 
encoding of the file. If you've gotten this far without knowing it, or 
knowing how to get it, you probably need to re-think your design.

Let's say you know the file is in Latin-1. Since all of D's 
string-processing facilities expect to deal with a Unicode encoding, you 
want to convert this to UTF-8. You should just be able to decode it:

char[] str = decode(file, "latin-1");

And, ta-da! Your string is now converted to UTF-8, and all of D's string 
processing abilities can be brought to bear.
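As noted above, this interface is cribbed from Python, whose built-in codecs already behave exactly this way. A minimal illustration (in Python rather than D, since the proposed Phobos functions do not yet exist):

```python
# Python's bytes.decode() is the model for the proposed decode():
# raw bytes in a known code page become a Unicode string.
raw = bytes([0xE9, 0x74, 0xE9])   # "été" encoded in Latin-1
text = raw.decode("latin-1")      # now a proper Unicode string
print(text)                       # prints "été"
```

The D version would differ only in spelling: `decode(file, "latin-1")` returning a `char[]` instead of a method on the bytes themselves.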

Now let's say that, after you've done whatever you were going to with 
the string, you want to write it back out in Latin-1. This is just a 
simple call to encode:

ubyte[] new_file = encode(str, "latin-1");

But wait! What if the UTF-8 string contains characters which are not 
valid Latin-1 characters? This is where the 'error' parameter comes into 
play. (Note that the error parameter is present in both encode and 
decode.) This parameter has three valid values:

  * "strict" causes an exception to be thrown. This is the default.
  * "ignore" causes the invalid characters to simply be ignored, and 
elided from the returned string.
  * "replace" causes the invalid characters to be replaced with a 
suitable replacement character. When calling decode, this should be the 
official U+FFFD REPLACEMENT CHARACTER. When calling encode, something 
specific to the code-page would have to be chosen; a '?' would be 
appropriate in the various ASCII-based code pages.
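Python's standard error handlers, on which these three values are modeled, demonstrate each behavior concretely:

```python
text = "été"

# "strict" (the default) raises on characters the code page cannot represent.
try:
    text.encode("ascii")
except UnicodeEncodeError:
    pass  # expected: 'é' is not an ASCII character

# "ignore" simply drops the offending characters from the result.
assert text.encode("ascii", "ignore") == b"t"

# "replace" substitutes a code-page-appropriate character ('?' for ASCII).
assert text.encode("ascii", "replace") == b"?t?"

# On decode, "replace" yields U+FFFD REPLACEMENT CHARACTER.
assert b"\xff".decode("utf-8", "replace") == "\ufffd"
```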

Using strings rather than an enum means this functionality could be 
extended by the user in the future (as Python allows).

Latin-1 is not a very interesting encoding. The situation gets more 
interesting if we are talking about a multi-byte encoding, such as 
UTF-16. So let's say we're reading a file encoded in UTF-16:

ubyte[] utf16_file = whatever();
char[] str = decode(utf16_file, "utf-16");

While you /could/ simply cast the ubyte[] to a wchar[], this code has 
the advantage of totally separating the encoding your program's input 
arrives in from the type with which you represent the data internally.

Using UTF-16 also means you might have errors during decoding, if there 
are invalid UTF-16 code units in the input string.
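Again using Python's existing codecs as a stand-in for the proposed functions, the UTF-16 round trip and the invalid-input case look like this:

```python
# A UTF-16 (little-endian) byte stream, as might be read from a file.
utf16_file = "hi".encode("utf-16-le")   # b'h\x00i\x00'
assert utf16_file.decode("utf-16-le") == "hi"

# A lone high surrogate (D800) is an invalid UTF-16 sequence, so
# strict decoding raises, while "replace" yields U+FFFD.
bad = b"\x00\xd8"
try:
    bad.decode("utf-16-le")
except UnicodeDecodeError:
    pass  # expected under the "strict" default
assert bad.decode("utf-16-le", "replace") == "\ufffd"
```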

These functions might fit into std.string, although a new module such as 
std.codepages would work, as well.

----
Improvements to Phobos
----

The behavior of writef (and perhaps of D's formatting in general) must 
be altered.

Currently, printing a char[] causes D to output the raw bytes in the 
string. As I previously mentioned, this is not a good thing. On many 
platforms, this can easily result in garbage being printed to the screen.

I propose changing writef to check the console's encoding, and to 
attempt to encode the output in that encoding. Then it can simply output 
the resulting raw bytes. Checking this encoding is a platform-specific 
operation, but essentially every platform (particularly Linux, Windows, 
and OS X) has a way to do it. If the string cannot be encoded in that 
encoding, the exception thrown by encode() should be allowed to 
propagate and terminate the program (or be caught by the user). If the 
user wishes to avoid that exception, they should call encode() 
explicitly themselves. For this reason, Phobos will also need to expose 
a function for retrieving the console's default encoding to the user.
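As a sketch of this platform-specific query (again in Python, whose `sys.stdout.encoding` and `locale.getpreferredencoding()` play the role the proposed Phobos function would):

```python
import locale
import sys

# The console's/locale's preferred encoding: often cp437 or cp1252 on a
# Western Windows console, and typically UTF-8 on a modern Linux shell.
console_enc = sys.stdout.encoding or locale.getpreferredencoding()
assert console_enc  # some non-empty encoding name is always available

# The proposed writef behavior: transcode, then emit the raw bytes.
# ("replace" is used here so the sketch never throws; the proposal's
# default would be "strict".)
raw = "héllo".encode(console_enc, "replace")
assert isinstance(raw, bytes)
```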

This implies something else: Printing a ubyte[] should cause those 
actual bytes to be printed directly. While it is currently possible to 
do this with e.g. std.cstream.dout.write(), it would be very convenient 
to do this with writef, especially combined with encode().

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org



More information about the Digitalmars-d mailing list