Handling invalid UTF sequences

Jonathan M Davis jmdavisProg at gmx.com
Fri Mar 21 12:17:12 PDT 2014


On Thursday, March 20, 2014 15:39:50 Walter Bright wrote:
> Currently we do it by throwing a UTFException. This has problems:
> 
> 1. about anything that deals with UTF cannot be made nothrow
> 
> 2. turns innocuous errors into major problems, such as DOS attack vectors
> http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
> 
> One option to fix this is to treat invalid sequences as:
> 
> 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)
> 
> 2. U+FFFD
> 
> I kinda like option 1.
> 
> What do you think?

After a discussion on this a few weeks back (where I was in favor of the
current behavior when the discussion started), I'm now completely in favor
of making it so that std.utf.decode simply replaces invalid sequences with
U+FFFD per the standard. Most code won't care and will continue to work as
before. The main difference is that invalid Unicode would then fall in the
same category as when a program is given a string with characters that it's
not supposed to be given. Any code that checks for that sort of thing will
then treat invalid Unicode as it would have treated other invalid strings,
and code that doesn't care will continue to not care except that now it will
work with invalid Unicode instead of throwing.
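
As a rough illustration of the two behaviours, here is a sketch in Python
rather than D. It is not Phobos code: the function names decode_throwing and
decode_replacing are made up for this example, and Python's "strict" and
"replace" codec error handlers merely stand in for the current and proposed
behaviour of std.utf.decode.

```python
# Illustration only (Python, not D). The helper names are hypothetical.

def decode_throwing(data: bytes) -> str:
    """Analogue of the current behaviour: invalid input raises an exception."""
    return data.decode("utf-8", errors="strict")


def decode_replacing(data: bytes) -> str:
    """Analogue of the proposed behaviour: invalid input becomes U+FFFD."""
    return data.decode("utf-8", errors="replace")


# 0xFF can never appear in well-formed UTF-8.
bad = b"abc\xffdef"

try:
    decode_throwing(bad)
    threw = False
except UnicodeDecodeError:
    threw = True

replaced = decode_replacing(bad)
```

With the replacing variant the caller gets a usable string back, with U+FFFD
marking the spot where the bad byte was, and no exception path to handle.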

A prime example is something like find. What does it care if it's given 
invalid Unicode? It will simply look for what you tell it to look for, and if 
it's not there, it won't find it. U+FFFD will just be one more character that 
doesn't match what it's looking for.
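
The same point can be sketched in Python (again as a stand-in, not D's
std.algorithm.find): once the invalid byte has been decoded to U+FFFD, a
plain substring search neither knows nor cares that the input was malformed.
The sample bytes are invented for the example.

```python
# Illustration only (Python, not D): searching text that contains U+FFFD.

# One invalid byte (0xFF) in the middle of otherwise valid UTF-8.
raw = b"abc\xffdef"

# Decode with replacement, so the invalid byte becomes U+FFFD.
text = raw.decode("utf-8", errors="replace")

# U+FFFD is just one more character that does not match the needle.
found_at = text.find("def")      # needle is present after the bad byte
missing_at = text.find("xyz")    # needle is absent, str.find returns -1
```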

The few programs that really care about whether a string that they're given 
contains any invalid Unicode can simply validate the string ahead of time. The 
main problem there is that we need to replace std.utf.validate with something 
like std.utf.isValidUnicode, because validate makes the horrendous decision of 
throwing rather than returning a bool (which is what triggered the previous 
discussion on the topic IIRC).
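
To show the difference in shape between a throwing validator and one that
returns a bool, here is a small Python sketch. Both function names are
hypothetical and chosen only to mirror std.utf.validate and the suggested
std.utf.isValidUnicode; neither is the Phobos implementation.

```python
# Illustration only (Python, not D). Both helpers are hypothetical.

def validate(data: bytes) -> None:
    """Throwing style: returns nothing, raises on invalid UTF-8."""
    data.decode("utf-8", errors="strict")


def is_valid_unicode(data: bytes) -> bool:
    """Bool-returning style: the caller branches instead of catching."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


good = "caf\u00e9".encode("utf-8")
bad = b"\xc3\x28"   # lead byte 0xC3 followed by a non-continuation byte

# A program that cares about validity can check once, up front.
results = [is_valid_unicode(good), is_valid_unicode(bad)]
```

The bool-returning form lets the caller decide what invalid input means for
it, without an exception being part of the normal control flow.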

There may be some concern about this change silently altering behavior, but I
think that the reality is that the vast majority of programs will continue to 
work just fine, and our string processing code will be that much cleaner and 
faster as a result. So, I'm very much inclined to take the path of making this 
change and putting a warning about it in the changelog rather than not making 
the change or trying to do this alongside what we currently have.

- Jonathan M Davis

