Improving D's support of code-pages
Kirk McDonald
kirklin.mcdonald at gmail.com
Mon Aug 20 12:23:00 PDT 2007
Regan Heath wrote:
> Kirk McDonald wrote:
>
>> Leandro Lucarella wrote:
>>
>>> Kirk McDonald, el 18 de agosto a las 14:33 me escribiste:
>>>
>>>> char[] decode(ubyte[] str, string encoding, string error="strict");
>>>> wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
>>>> dchar[] ddecode(ubyte[] str, string encoding, string error="strict");
>>>
>>>
>>>
>>> Why isn't error an enum instead of a string?
>>>
>>
>> Perhaps it would be useful to allow the user to define new
>> error-handlers somehow, and provide a callback for them. (Python
>> allows something like this.) This would allow you to, for instance,
>> provide a different replacement character than the one provided by
>> "replace".
>
>
> Not a bad idea.
>
> I would like to suggest alternate function signatures:
>
> //The error code for the callback
> enum DecodeMode { ..no idea what goes here.. }
>
> //The callback types
> typedef char function(DecodeMode,char) DecodeCHandler;
> typedef wchar function(DecodeMode,wchar) DecodeWHandler;
> typedef dchar function(DecodeMode,dchar) DecodeDHandler;
>
> //The decode functions
> uint decode(byte[] str, char[] dst, string encoding, DecodeCHandler
> handler);
> uint decode(byte[] str, wchar[] dst, string encoding, DecodeWHandler
> handler);
> uint decode(byte[] str, dchar[] dst, string encoding, DecodeDHandler
> handler);
>
> Technically 'char' in C is a signed byte, not an unsigned one therefore
> byte[] is more accurate.
>
I don't agree with this last part. For starters, I had thought the
signedness of 'char' in C was implementation-defined. In any case, we're
talking about chunks of arbitrary, homogeneous binary data, so I think
ubyte[] is most appropriate.

Here's another approach to the error handler thing:

typedef int error_t;
alias void delegate(string encoding, dchar, ref ubyte[])
    encode_error_handler;
alias void delegate(string encoding, ubyte[], size_t, ref dchar)
    decode_error_handler;
error_t register_error(encode_error_handler dg1, decode_error_handler dg2);
error_t Strict, Ignore, Replace;

The register_error function would return a new, unique ID for a given
error handler. A handler only wanting to handle encoding or decoding
could simply pass null for the one it doesn't want to handle.
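
One way register_error might hand out those IDs is to append each handler
pair to a table and return its index. This is only a sketch: HandlerPair and
the handlers table are hypothetical names, and it is written for a current D
compiler, so error_t is a plain alias rather than a typedef.

```d
import std.stdio : writeln;

alias int error_t;
alias void delegate(string encoding, dchar, ref ubyte[]) encode_error_handler;
alias void delegate(string encoding, ubyte[], size_t, ref dchar) decode_error_handler;

// Hypothetical storage: one slot per registered handler pair.
struct HandlerPair
{
    encode_error_handler onEncode; // may be null
    decode_error_handler onDecode; // may be null
}

HandlerPair[] handlers; // the index into this table is the error_t ID

error_t register_error(encode_error_handler dg1, decode_error_handler dg2)
{
    handlers ~= HandlerPair(dg1, dg2);
    return cast(error_t)(handlers.length - 1);
}

void main()
{
    // A decode-only handler passes null for the encoding side.
    error_t questionMark = register_error(null,
        delegate void(string encoding, ubyte[] buf, size_t idx, ref dchar dest) {
            dest = '?';
        });
    error_t noop = register_error(null, null);

    assert(questionMark != noop); // every registration gets its own ID
    assert(handlers[questionMark].onEncode is null);

    dchar replacement;
    ubyte[] data = [0x41, 0xFF];
    handlers[questionMark].onDecode("ascii", data, 1, replacement);
    writeln(replacement); // prints "?"
}
```

A codec that is handed an error_t would look up handlers[id] and call
whichever side it needs, after checking it for null.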

The encode_error_handler receives the encoding and the Unicode character
that could not be encoded. It also has a 'ref ubyte[]' argument, which
should be set to whatever the replacement character is. (It could be
passed in as a slice over an internal buffer. Reducing its length should
never cause an allocation.)

The decode_error_handler receives the encoding, the ubyte[] buffer, and
the index of the byte in it which could not be decoded. It also has
a 'ref dchar' argument, which should be set to whatever the replacement
character is.

Strict, Ignore, and Replace could be implemented like this:

static this {
    Strict = register_error(
        delegate void(string encoding, dchar c, ref ubyte[] dest) {
            throw new EncodeError(format(
                "Could not encode character \\u%x in encoding '%s'.",
                c, encoding));
        },
        delegate void(string encoding, ubyte[] buf, size_t idx,
                      ref dchar dest) {
            throw new DecodeError(format(
                "Could not decode \\x%x from encoding '%s'.",
                buf[idx], encoding));
        }
    );
    Ignore = register_error(
        delegate void(string encoding, dchar c, ref ubyte[] dest) {
            dest = null;
        },
        delegate void(string encoding, ubyte[] buf, size_t idx,
                      ref dchar dest) {
            dest = 0; // This would probably have to be special-cased.
        }
    );
    Replace = register_error(
        delegate void(string encoding, dchar c, ref ubyte[] dest) {
            dest.length = 1;
            dest[0] = '?';
        },
        delegate void(string encoding, ubyte[] buf, size_t idx,
                      ref dchar dest) {
            dest = '\uFFFD'; // The Unicode REPLACEMENT CHARACTER
        }
    );
}
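
To show how a decoder might consult such a handler, and how a user could
supply a replacement character other than the one "replace" provides, here
is a rough sketch. decodeAscii is a hypothetical ASCII-only decoder that
takes the delegate directly instead of an error_t ID, and treating U+0000
as "emit nothing" is just one assumed way to special-case Ignore.

```d
import std.stdio : writeln;

alias void delegate(string encoding, ubyte[], size_t, ref dchar) decode_error_handler;

// Hypothetical ASCII-only decoder: bytes >= 0x80 are routed to the handler.
dchar[] decodeAscii(ubyte[] buf, decode_error_handler onError)
{
    dchar[] result;
    foreach (idx, b; buf)
    {
        if (b < 0x80)
        {
            result ~= cast(dchar) b;
            continue;
        }
        dchar replacement = 0;
        onError("ascii", buf, idx, replacement);
        // Assumed convention: U+0000 means "emit nothing" (what Ignore needs).
        if (replacement != 0)
            result ~= replacement;
    }
    return result;
}

void main()
{
    // A user-defined handler that substitutes '#' instead of U+FFFD.
    decode_error_handler hashMark =
        delegate void(string encoding, ubyte[] buf, size_t idx, ref dchar dest) {
            dest = '#';
        };
    ubyte[] data = [0x68, 0x69, 0xFF, 0x21]; // "hi", an invalid byte, "!"
    writeln(decodeAscii(data, hashMark)); // prints "hi#!"
}
```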
> I think you still want to use an enum to represent the cases the
> callback needs to handle (assuming there is more than one) the same
> handler function could be used for both encode and decode then.
>
> I think you want to pass the destination buffers, allowing
> re-use/preallocation for efficiency.
>
The implementation could use doEncode and doDecode functions, analogous
to doFormat, for efficiency.

void doEncode(void delegate(ubyte[]) dg, char[], string encoding,
              error_t handler);
void doEncode(void delegate(ubyte[]) dg, wchar[], string encoding,
              error_t handler);
void doEncode(void delegate(ubyte[]) dg, dchar[], string encoding,
              error_t handler);
void doDecode(void delegate(dchar str) dg, ubyte[], string encoding,
              error_t handler);

The ubyte[] arguments in the callbacks could be slices over an internal
buffer. No allocation is necessary.
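
As a rough illustration of the internal-buffer idea, here is a sketch of
the dchar[] case for a single encoding. doEncodeLatin1 is a hypothetical
name; a real doEncode would dispatch on the encoding string and take an
error_t ID rather than the delegate itself.

```d
import std.stdio : writeln;

alias void delegate(string encoding, dchar, ref ubyte[]) encode_error_handler;

// Hypothetical Latin-1-only variant of doEncode.
void doEncodeLatin1(void delegate(ubyte[]) sink, dchar[] str,
                    encode_error_handler onError)
{
    ubyte[4] scratch; // fixed internal buffer reused for every character
    foreach (dchar c; str)
    {
        ubyte[] piece = scratch[]; // slice over the buffer, no allocation
        if (c <= 0xFF)
        {
            piece = piece[0 .. 1];
            piece[0] = cast(ubyte) c;
        }
        else
        {
            // The handler shrinks or nulls `piece` to describe the replacement.
            onError("latin-1", c, piece);
        }
        if (piece.length > 0)
            sink(piece);
    }
}

void main()
{
    ubyte[] collected;
    auto sink = delegate void(ubyte[] chunk) { collected ~= chunk; };
    encode_error_handler replace =
        delegate void(string encoding, dchar c, ref ubyte[] dest) {
            dest.length = 1; // shrinking a slice does not reallocate
            dest[0] = cast(ubyte) '?';
        };
    dchar[] text = "caf\u00E9 \u20AC"d.dup;
    doEncodeLatin1(sink, text, replace);
    writeln(collected); // prints [99, 97, 102, 233, 32, 63]
}
```

The sink has to copy whatever it wants to keep, since the slice it
receives points into a buffer that is overwritten for the next character.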
--
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org