Behavior of strings with invalid unicode...

monarch_dodra monarchdodra at gmail.com
Wed Nov 21 05:25:00 PST 2012


I made a commit that was meant to better document which functions 
in std.utf can throw.

I thus noticed that some of our functions are unsafe. For 
example:

string s = [cast(char) 0b1100_0000]; // lone lead byte of a 2-byte sequence
s.popFront();                        // assertion error, because popFront
                                     // slices past the end: s[2 .. $]

"pop" is nothrow, so throwing exception is out of the question, 
and the implementation seems to imply that "invalid unicode 
sequences are removed".

This is a bug, right?
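A minimal sketch of the mechanism, assuming the behavior of std.utf.stride at the time (it computes the sequence length from the lead byte alone, without checking how many bytes actually remain):

```d
import std.utf : stride;

void main()
{
    // A lone lead byte that announces a 2-byte sequence.
    string s = [cast(char) 0b1100_0000];

    // stride looks only at the lead byte: 0b110X_XXXX means "2 bytes",
    // even though only 1 byte is actually present in the string.
    assert(stride(s, 0) == 2);

    // popFront then effectively does s = s[2 .. $], which is out of
    // bounds for a 1-byte array -- hence the assertion/range error.
}
```

So the failure is not in the decoding itself but in the subsequent slice, which trusts the stride computed from a byte that was never validated against the string's actual length.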

--------
Things get more complicated if you take into account "partial 
invalidity". For example:

string s = [cast(char) 0b1100_0000, 'a', 'b'];

Here, the first byte starts an invalid sequence, since the 
second byte is not a continuation byte of the form 0b10XX_XXXX. 
What's more, the second byte is itself a valid one-byte sequence 
('a'). We do not detect this, though, and produce this output:
s.popFront(); => s == "b";
*arguably*, the correct behavior would be:
s.popFront(); => s == "ab";
Where only the single invalid first byte is removed.

The problem is that doing this would be much more expensive, 
especially for a rare case. Worse yet, chances are you would end 
up validating the same bytes again and again.
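To make the cost concrete, here is one way the byte-accurate skipping could look. invalidStride is a hypothetical helper (not in Phobos): instead of trusting the stride announced by the lead byte, it consumes bytes only while they really are continuation bytes, so a valid character that follows is never swallowed:

```d
// Hypothetical helper: advance past the current sequence, dropping
// only the bytes that are actually part of the invalid sequence.
size_t invalidStride(const(char)[] s)
{
    immutable char lead = s[0];
    if (lead < 0x80)           return 1; // ASCII: always 1 byte
    if ((lead & 0xC0) == 0x80) return 1; // stray continuation byte
    // Number of bytes the lead byte announces (2, 3 or 4).
    immutable size_t n = lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : 2;
    // Consume bytes only while they really are continuation bytes
    // (0b10XX_XXXX) and we have not run off the end of the string.
    size_t i = 1;
    while (i < n && i < s.length && (s[i] & 0xC0) == 0x80)
        ++i;
    return i;
}

void main()
{
    auto s = [cast(char) 0b1100_0000, 'a', 'b'];
    // Only the lone invalid lead byte is skipped; "ab" survives.
    assert(invalidStride(s) == 1);
    assert(s[invalidStride(s) .. $] == "ab");
}
```

Note that this inspects every continuation byte on the invalid path, checks the extra per-byte bounds, and would still re-examine the same bytes on the next popFront, which is exactly the overhead described above.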

--------
So here are my 2 questions:
1. Is there, or does anyone know of, a standardized behavior to 
follow when decoding UTF that contains invalid code units?

2. Do we even really support invalid UTF once we "leave" the 
std.utf.decode layer? E.g. do we simply assume that the string is 
valid?
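For what it's worth, Phobos already offers a way to draw that line explicitly: std.utf.validate throws a UTFException on malformed input, so code past that check could legitimately assume well-formedness. A small sketch:

```d
import std.utf : validate, UTFException;
import std.exception : assertThrown;

void main()
{
    // Passes: a well-formed UTF-8 string.
    validate("valid UTF-8");

    // Throws: a lone lead byte with no continuation byte.
    assertThrown!UTFException(validate([cast(char) 0b1100_0000]));
}
```

If "assume valid after decode" is the intended contract, validate would be the natural choke point at which invalid input is rejected.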


