Behavior of strings with invalid unicode...
monarch_dodra
monarchdodra at gmail.com
Wed Nov 21 05:25:00 PST 2012
I made a commit meant to better document which functions in
std.utf can throw.
In doing so, I noticed that some of our functions are unsafe.
For example:
string s = [0b1100_0000]; // 1st byte of a 2-byte sequence
s.popFront(); // Assertion error because of the invalid
              // slice s[2 .. $]
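A minimal sketch of why the assertion fires (Python used purely for illustration; `utf8_stride` is a hypothetical helper, not a std.utf function): the stride is computed from the lead byte alone, so a truncated sequence yields a stride past the end of the string, and the `s[2 .. $]` slice is out of bounds.

```python
def utf8_stride(lead: int) -> int:
    """Number of bytes the lead byte claims, per UTF-8 encoding rules."""
    if lead < 0x80:
        return 1            # ASCII: single-byte sequence
    if lead >> 5 == 0b110:
        return 2            # lead of a 2-byte sequence
    if lead >> 4 == 0b1110:
        return 3            # lead of a 3-byte sequence
    if lead >> 3 == 0b11110:
        return 4            # lead of a 4-byte sequence
    return 1                # invalid lead byte

s = bytes([0b1100_0000])    # 1st byte of a 2-byte sequence, rest missing
stride = utf8_stride(s[0])  # claims 2 bytes...
# ...but len(s) == 1, so a bounds-checked slice s[2 .. $] must fail
```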
"pop" is nothrow, so throwing exception is out of the question,
and the implementation seems to imply that "invalid unicode
sequences are removed".
This is a bug, right?
--------
Things get more complicated once you take "partial invalidity"
into account. For example:
string s = [0b1100_0000, 'a', 'b'];
Here, the first byte starts an invalid sequence, since the
second byte is not of the form 0b10XX_XXXX. What's more, that
second byte is itself a valid one-byte sequence. We do not
detect this, though, and produce this output:
s.popFront(); => s == "b";
*Arguably*, the correct behavior would be:
s.popFront(); => s == "ab";
where only the single invalid first byte is removed.
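For comparison, a decoder that consumes only the single invalid lead byte behaves exactly this way; Python's UTF-8 codec is one such implementation (shown here only as a reference point, not as a claim about what std.utf does):

```python
# Same byte sequence as the D example above: an invalid lead byte
# followed by two valid one-byte sequences.
s = bytes([0b1100_0000, ord('a'), ord('b')])

# Python replaces the lone invalid byte with U+FFFD and then resumes
# decoding at the very next byte, so both 'a' and 'b' survive.
decoded = s.decode('utf-8', errors='replace')
# decoded == '\ufffdab'
```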
The problem is that doing this would actually be much more
expensive, especially for a rare case. Worse yet, chances are
you end up validating the same character again, and again (and
again).
--------
So here are my 2 questions:
1. Is there, or does anyone know of, a standardized behavior to
follow when decoding UTF that contains invalid code units?
2. Do we even really support invalid UTF once we "leave" the
std.utf.decode layer? E.g. do we simply assume the string is
valid?