Improving D's support of code-pages

Mon Aug 20 12:23:00 PDT 2007

Regan Heath wrote:
> Kirk McDonald wrote:
> 
>> Leandro Lucarella wrote:
>>
>>> Kirk McDonald, el 18 de agosto a las 14:33 me escribiste:
>>>
>>>> char[] decode(ubyte[] str, string encoding, string error="strict");
>>>> wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
>>>> dchar[] ddecode(ubyte[] str, string encoding, string error="strict");
>>>
>>>
>>>
>>> Why isn't error an enum instead of a string?
>>>
>>
>> Perhaps it would be useful to allow the user to define new 
>> error-handlers somehow, and provide a callback for them. (Python 
>> allows something like this.) This would allow you to, for instance, 
>> provide a different replacement character than the one provided by 
>> "replace".
> 
> 
> Not a bad idea.
> 
> I would like to suggest alternate function signatures:
> 
> //The error code for the callback
> enum DecodeMode { ..no idea what goes here.. }
> 
> //The callback types
> typedef char function(DecodeMode,char) DecodeCHandler;
> typedef wchar function(DecodeMode,wchar) DecodeWHandler;
> typedef dchar function(DecodeMode,dchar) DecodeDHandler;
> 
> //The decode functions
> uint decode(byte[] str, char[] dst, string encoding, DecodeCHandler 
> handler);
> uint decode(byte[] str, wchar[] dst, string encoding, DecodeWHandler 
> handler);
> uint decode(byte[] str, dchar[] dst, string encoding, DecodeDHandler 
> handler);
> 
> Technically 'char' in C is a signed byte, not an unsigned one therefore 
> byte[] is more accurate.
> 

I don't agree with this last part. For starters, as far as I know the 
signedness of plain 'char' in C is implementation-defined. In any case, 
we're talking about chunks of arbitrary, homogeneous binary data, so I 
think ubyte[] is most appropriate.

Here's another approach to the error handler thing:

typedef int error_t;

alias void delegate(string encoding, dchar, ref ubyte[])
     encode_error_handler;
alias void delegate(string encoding, ubyte[], size_t, ref dchar)
     decode_error_handler;

error_t register_error(encode_error_handler dg1, decode_error_handler dg2);

error_t Strict, Ignore, Replace;

The register_error function would return a new, unique ID for a given 
error handler. A handler only wanting to handle encoding or decoding 
could simply pass null for the one it doesn't want to handle.

The encode_error_handler receives the encoding and the Unicode character 
that could not be encoded. It also has a 'ref ubyte[]' argument, which 
should be set to whatever the replacement bytes are. (It could be 
passed in as a slice over an internal buffer. Reducing its length should 
never cause an allocation.)

The decode_error_handler receives the encoding, the ubyte[] buffer, and 
the index of the byte in it which could not be decoded. It also has 
a 'ref dchar' argument, which should be set to whatever the replacement 
character is.

Strict, Ignore, and Replace could be implemented like this:

static this {
     Strict = register_error(
         delegate void(string encoding, dchar c, ref ubyte[] dest) {
             throw new EncodeError(format(
                 "Could not encode character \\u%x in encoding '%s'.",
                 c, encoding));
         },
         delegate void(string encoding, ubyte[] buf, size_t idx,
                 ref dchar dest) {
             throw new DecodeError(format(
                 "Could not decode \\x%x from encoding '%s'.",
                 buf[idx], encoding));
         }
     );

     Ignore = register_error(
         delegate void(string encoding, dchar c, ref ubyte[] dest) {
             dest = null;
         },
         delegate void(string encoding, ubyte[] buf, size_t idx,
                 ref dchar dest) {
             dest = 0; // This would probably have to be special-cased.
         }
     );

     Replace = register_error(
         delegate void(string encoding, dchar c, ref ubyte[] dest) {
             dest.length = 1;
             dest[0] = '?';
         },
         delegate void(string encoding, ubyte[] buf, size_t idx,
                 ref dchar dest) {
             dest = '\uFFFD'; // The Unicode REPLACEMENT CHARACTER
         }
     );
}
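
To make the registration scheme a little more concrete, here is a rough 
sketch of how register_error and a caller might fit together. The 
array-backed registry, the HandlerPair struct and the decodeAscii helper 
are assumptions made up for illustration and are not part of the 
proposal. The sketch uses current D syntax (alias rather than typedef).

```d
alias error_t = int;

alias encode_error_handler =
    void delegate(string encoding, dchar c, ref ubyte[] dest);
alias decode_error_handler =
    void delegate(string encoding, ubyte[] buf, size_t idx, ref dchar dest);

// Hypothetical registry: a handler ID is simply an index into this array.
private struct HandlerPair
{
    encode_error_handler onEncode;
    decode_error_handler onDecode;
}

private HandlerPair[] registry;

// Stores the pair and returns a new, unique ID for it.
// Either delegate may be null if that direction is not handled.
error_t register_error(encode_error_handler dg1, decode_error_handler dg2)
{
    registry ~= HandlerPair(dg1, dg2);
    return cast(error_t)(registry.length - 1);
}

// Hypothetical stand-in for a real codec: decodes 7-bit ASCII and asks
// the registered handler what to do with any byte >= 0x80.
dchar[] decodeAscii(ubyte[] buf, error_t handler)
{
    dchar[] result;
    foreach (idx, b; buf)
    {
        if (b < 0x80)
        {
            result ~= cast(dchar) b;
            continue;
        }
        auto dg = registry[handler].onDecode;
        if (dg is null)
            throw new Exception("handler does not support decoding");
        dchar replacement;
        dg("ascii", buf, idx, replacement);
        result ~= replacement;
    }
    return result;
}

unittest
{
    // A decode-only handler with a custom replacement character.
    error_t underscore = register_error(null,
        delegate void(string encoding, ubyte[] buf, size_t idx,
                ref dchar dest) {
            dest = '_';
        });

    ubyte[] input = [0x68, 0x69, 0xFF];
    assert(decodeAscii(input, underscore) == "hi_"d);
}
```

This covers the original motivation: a user who wants a different 
replacement character than the one "replace" provides registers one 
more handler and passes its ID.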

> I think you still want to use an enum to represent the cases the 
> callback needs to handle (assuming there is more than one) the same 
> handler function could be used for both encode and decode then.
> 
> I think you want to pass the destination buffers, allowing 
> re-use/preallocation for efficiency.
> 

The implementation could use doEncode and doDecode functions, analogous 
to doFormat, for efficiency.

void doEncode(void delegate(ubyte[]) dg, char[], string encoding,
	error_t handler);
void doEncode(void delegate(ubyte[]) dg, wchar[], string encoding,
	error_t handler);
void doEncode(void delegate(ubyte[]) dg, dchar[], string encoding,
	error_t handler);

void doDecode(void delegate(dchar str) dg, ubyte[], string encoding,
	error_t handler);

The ubyte[] arguments in the callbacks could be slices over an internal 
buffer. No allocation is necessary.
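
As a rough illustration of the sink-style interface, the sketch below 
encodes to Latin-1 through a fixed-size stack buffer and hands each 
filled slice to the callback. Several things here are assumptions for 
the sake of a small example: the 64-byte buffer, the "latin-1" encoding 
name, the use of const(dchar)[] so that literals can be passed directly, 
and the compile-time Strict/Replace constants, which stand in for the 
IDs that register_error would hand out at run time.

```d
alias error_t = int;

// Stand-ins only; in the proposal these IDs come from register_error.
enum error_t Strict = 0;
enum error_t Replace = 1;

void doEncode(void delegate(ubyte[]) dg, const(dchar)[] str,
        string encoding, error_t handler)
{
    if (encoding != "latin-1")
        throw new Exception("unsupported encoding: " ~ encoding);

    ubyte[64] buffer;   // internal buffer; the sink only ever sees slices
    size_t used = 0;

    foreach (dchar c; str)
    {
        // Flush a full buffer before writing the next byte.
        if (used == buffer.length)
        {
            dg(buffer[0 .. used]);
            used = 0;
        }

        if (c <= 0xFF)
            buffer[used++] = cast(ubyte) c;
        else if (handler == Replace)
            buffer[used++] = cast(ubyte) '?';
        else
            throw new Exception("could not encode character");
    }

    // Flush whatever is left.
    if (used > 0)
        dg(buffer[0 .. used]);
}

unittest
{
    ubyte[] collected;
    // The slice is only valid during the call, so the sink copies it.
    doEncode((ubyte[] chunk) { collected ~= chunk; },
        "caf\u00E9 \u20AC"d, "latin-1", Replace);

    ubyte[] expected = [0x63, 0x61, 0x66, 0xE9, 0x20, 0x3F];
    assert(collected == expected);
}
```

The encoder itself never allocates. Whether the caller allocates is up 
to the sink, which could just as well write each chunk straight to a 
file or socket.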

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org