Improving D's support of code-pages

Sun Aug 19 02:20:46 PDT 2007

Kirk McDonald wrote:

> However, in many real-world situations, you are not reading something in 
> a Unicode encoding, nor do you always want to write one out. This is 
> particularly the case when writing something to the console. Not all 
> Windows command lines or Linux shells are set up to handle UTF-8, though 
> this is very common on Linux. My Windows command-line, however, uses the 
> default of the ancient CP437, and this is not uncommon. The point is 
> that, on many systems, outputting raw UTF-8 results in garbage.

It was my understanding that D by design only supports UTF environments,
and the behaviour on legacy systems (CP437/ISO-8859-1) is "undefined"...
It's not only output: if you run on such a system and try to read the
args (char[][]), you can get a UTF exception because they are malformed.

That is, the current behaviour is to read the raw bytes and pretend that
they are UTF, whether that's true or not (hence exceptions and/or garbage).
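
To make the failure mode concrete, here is a minimal sketch of checking a
string before treating it as UTF-8. It is written against present-day
Phobos (std.utf.validate); the helper name isValidUtf8 is made up for
illustration and is not part of any library.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: true if `s` is well-formed UTF-8, false otherwise.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

A program started on a CP437 or ISO-8859-1 console could run each element
of args through such a check and fall back to an explicit conversion
instead of dying on the first non-ASCII byte.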

> ----
> Improvements to Phobos
> ----
> 
> The behavior of writef (and perhaps of D's formatting in general) must 
> be altered.
> 
> Currently, printing a char[] causes D to output the raw bytes in the 
> string. As I previously mentioned, this is not a good thing. On many 
> platforms, this can easily result in garbage being printed to the screen.

By design, I thought. As usual everything "works" for ASCII characters.

That's not a bad trade-off between the whatever-the-system-uses of C
and the let's-include-every-weird-encoding-ever-in-the-core-library of Java?

> I propose changing writef to check the console's encoding, and to 
> attempt to encode the output in that encoding. Then it can simply output 
> the resulting raw bytes. Checking this encoding is a platform-specific 
> operation, but essentially every platform (particularly Linux, Windows, 
> and OS X) has a way to do it. If the string cannot be encoded in that 
> encoding, the exception thrown by encode() should be allowed to 
> propagate and terminate the program (or be caught by the user). If the 
> user wishes to avoid that exception, they should call encode() 
> explicitly themselves. For this reason, Phobos will also need a function 
> for retrieving the console's default encoding made available to the user.

Probably not a bad idea (Java does something like this), but it would
bloat the standard library. Adding support for common legacy encodings
like CP437/CP1252/ISO-8859-1/MacRoman wouldn't be unthinkable in principle,
but it's hard to "draw the line" and much easier to only support UTF-8?

If you want some code for doing such conversions, I have old "mapping"
and "libiconv" modules on my home page at http://www.algonet.se/~afb/d/

/// converts an 8-bit charset encoded string into Unicode (UTF-8)
char[] decode_string(ubyte[] string, wchar[256] mapping);

/// converts a Unicode (UTF-8) string into an 8-bit charset encoding
ubyte[] encode_string(char[] string, wchar[256] mapping);

(http://www.digitalmars.com/d/archives/digitalmars/D/12967.html)
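
For illustration, a table-driven conversion of this kind could look
roughly like the following. This is a sketch in present-day D, not the
code from the modules above: the names decodeString, encodeString and
latin1Table are invented here, and the linear search in encodeString is
chosen for brevity rather than speed.

```d
import std.utf : encode;

/// Decode 8-bit charset bytes to UTF-8 using a 256-entry table that maps
/// each byte value to a Unicode code point.
string decodeString(const(ubyte)[] input, ref const wchar[256] mapping)
{
    char[] result;
    foreach (b; input)
    {
        char[4] buf;
        immutable n = encode(buf, cast(dchar) mapping[b]);
        result ~= buf[0 .. n];
    }
    return result.idup;
}

/// Encode UTF-8 into the 8-bit charset described by `mapping`.
/// Throws if a character has no entry in the table.
ubyte[] encodeString(const(char)[] input, ref const wchar[256] mapping)
{
    ubyte[] result;
    foreach (dchar c; input) // decodes UTF-8 one code point at a time
    {
        ptrdiff_t found = -1;
        foreach (i, w; mapping)
        {
            if (w == c)
            {
                found = i;
                break;
            }
        }
        if (found < 0)
            throw new Exception("character not representable in target charset");
        result ~= cast(ubyte) found;
    }
    return result;
}

/// Example table: ISO-8859-1, where byte value and code point coincide.
wchar[256] latin1Table()
{
    wchar[256] table;
    foreach (i, ref w; table)
        w = cast(wchar) i;
    return table;
}
```

Other code pages would only need a different 256-entry table; the two
functions stay the same.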

   /// allocate a converter between charsets fromcode and tocode
   extern (C) iconv_t iconv_open (char *tocode, char *fromcode);

   /// convert inbuf to outbuf, advancing both pointers and updating
   /// inbytesleft and outbytesleft to the bytes remaining; returns the
   /// number of non-reversible conversions, or -1 on error.
   extern (C) size_t iconv (iconv_t cd, void **inbuf,
			   size_t *inbytesleft,
			   void **outbuf,
			   size_t *outbytesleft);

Mapping ISO-8859-1 (Latin-1) to UTF-8 is by far the easiest, see:
http://www.prowiki.org/wiki4d/wiki.cgi?CharsAndStrs (under 8-bit)
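
The reason Latin-1 is so easy is that its 256 byte values are exactly the
first 256 Unicode code points, so no table is needed. A minimal sketch in
present-day D (the function name latin1ToUtf8 is my own, not taken from
the wiki page):

```d
/// Convert ISO-8859-1 bytes to UTF-8. Bytes below 0x80 are ASCII and are
/// copied as-is; every other byte becomes a two-byte UTF-8 sequence.
string latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    result.reserve(input.length * 2); // worst case: every byte doubles
    foreach (b; input)
    {
        if (b < 0x80)
        {
            result ~= cast(char) b;
        }
        else
        {
            result ~= cast(char)(0xC0 | (b >> 6));   // lead byte: top 2 bits
            result ~= cast(char)(0x80 | (b & 0x3F)); // continuation: low 6 bits
        }
    }
    return result.idup;
}
```

For example, the Latin-1 byte 0xE9 (e with acute accent) becomes the two
bytes 0xC3 0xA9.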

> This implies something else: Printing a ubyte[] should cause those 
> actual bytes to be printed directly. While it is currently possible to 
> do this with e.g. std.cstream.dout.write(), it would be very convenient 
> to do this with writef, especially combined with encode().

Printing ubytes would be nice; currently that's easiest with printf...

But adding codepages to D feels a little like adding 16-bit support :-)

--anders