Encoding problems with dsss.exe and implib.exe
Vitaly Kulich
vit_klich at list.ru
Mon Jan 24 03:33:04 PST 2011
I have to add the following to my post:
In Phobos for dmd version 1 there is only one
obvious source of this exception, namely the
function 'decode' in the std.utf module.
Here is its listing:
/***************
 * Decodes and returns character starting at s[idx]. idx is advanced past the
 * decoded character. If the character is not well formed, a UtfException is
 * thrown and idx remains unchanged.
 */
dchar decode(char[] s, inout size_t idx)
    in
    {
        assert(idx >= 0 && idx < s.length);
    }
    out (result)
    {
        assert(isValidDchar(result));
    }
    body
    {
        size_t len = s.length;
        dchar V;
        size_t i = idx;
        char u = s[i];

        if (u & 0x80)
        {   uint n;
            char u2;

            /* The following encodings are valid, except for the 5 and 6 byte
             * combinations:
             *  0xxxxxxx
             *  110xxxxx 10xxxxxx
             *  1110xxxx 10xxxxxx 10xxxxxx
             *  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
             *  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
             *  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
             */
            for (n = 1; ; n++)
            {
                if (n > 4)
                    goto Lerr;          // only do the first 4 of 6 encodings
                if (((u << n) & 0x80) == 0)
                {
                    if (n == 1)
                        goto Lerr;
                    break;
                }
            }

            // Pick off (7 - n) significant bits of B from first byte of octet
            V = cast(dchar)(u & ((1 << (7 - n)) - 1));

            if (i + (n - 1) >= len)
                goto Lerr;              // off end of string

            /* The following combinations are overlong, and illegal:
             *  1100000x (10xxxxxx)
             *  11100000 100xxxxx (10xxxxxx)
             *  11110000 1000xxxx (10xxxxxx 10xxxxxx)
             *  11111000 10000xxx (10xxxxxx 10xxxxxx 10xxxxxx)
             *  11111100 100000xx (10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx)
             */
            u2 = s[i + 1];
            if ((u & 0xFE) == 0xC0 ||
                (u == 0xE0 && (u2 & 0xE0) == 0x80) ||
                (u == 0xF0 && (u2 & 0xF0) == 0x80) ||
                (u == 0xF8 && (u2 & 0xF8) == 0x80) ||
                (u == 0xFC && (u2 & 0xFC) == 0x80))
                goto Lerr;              // overlong combination

            for (uint j = 1; j != n; j++)
            {
                u = s[i + j];
                if ((u & 0xC0) != 0x80)
                    goto Lerr;          // trailing bytes are 10xxxxxx
                V = (V << 6) | (u & 0x3F);
            }
            if (!isValidDchar(V))
                goto Lerr;
            i += n;
        }
        else
        {
            V = cast(dchar) u;
            i++;
        }

        idx = i;
        return V;

      Lerr:
        //printf("\ndecode: idx = %d, i = %d, length = %d s = \n'%.*s'\n%x\n'%.*s'\n",
        //       idx, i, s.length, s, s[i], s[i .. length]);
        throw new UtfException("4invalid UTF-8 sequence", i);
    }
The text "4invalid UTF-8 sequence" was not found in any other place,
therefore this function needs a revision.
So I have answered my own question about the behaviour of dsss,
since dsss is written in D. But the strange behaviour of implib is still unexplained.
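
For illustration only, here is a minimal sketch of input that reaches the
Lerr label in the decode listing above. It is written against the D1 Phobos
names shown there (decode, UtfException) and has not been compiled here; the
byte values are made up for the example and are not the data that dsss
actually failed on. With D2 Phobos the exception class is spelled
UTFException instead.

```d
import std.stdio;
import std.utf;

void main()
{
    // Hypothetical input: 0xE0 announces a 3-byte sequence (1110xxxx),
    // but the next byte 'a' (0x61) is not a 10xxxxxx continuation byte.
    char[] s = [cast(char) 0xE0, 'a', 'b'];
    size_t idx = 0;

    try
    {
        // decode fails the "trailing bytes are 10xxxxxx" check
        // and jumps to Lerr, leaving idx unchanged.
        dchar c = decode(s, idx);
        writefln("decoded U+%04X", cast(uint) c);
    }
    catch (UtfException e)
    {
        // e.idx is the byte offset that decode passed to the exception.
        writefln("caught: %s (index %d)", e.msg, e.idx);
    }
}
```

Any text that is not UTF-8 (for example a file name or source file in a
single-byte code page) can contain such bytes, so catching the exception
around the call at least lets a program report the offending offset.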