Handling invalid UTF sequences

Thu Mar 20 15:51:26 PDT 2014

On Thursday, 20 March 2014 at 22:39:47 UTC, Walter Bright wrote:
> Currently we do it by throwing a UTFException. This has 
> problems:
>
> 1. about anything that deals with UTF cannot be made nothrow
>
> 2. turns innocuous errors into major problems, such as DOS 
> attack vectors
> http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
>
> One option to fix this is to treat invalid sequences as:
>
> 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)
>
> 2. U+FFFD
>
> I kinda like option 1.
>
> What do you think?

I had thought of this before, and had an idea along the lines of:
1. strings "inside" the program are always valid.
2. encountering invalid strings "inside" the program is an Error.
3. strings from the "outside" world must be validated before use.

The advantage is not just a nothrow guarantee, but also a 
performance guarantee in release builds, since internal code never 
has to re-check its input. And it *is* a pretty sane approach to the 
problem:
- User data: validate before use.
- Internal data: if it's bad, your program is in a failure state.
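
A rough sketch of what I mean, not a finished design. The helper names 
fromOutside and countCodePoints are made up for illustration; only 
std.utf.validate and UTFException are existing Phobos symbols. 
Validation happens once at the boundary, and the internal routine 
assumes valid input, so it can be nothrow:

```d
import std.utf : validate, UTFException;

/// Hypothetical boundary helper: the only place that may throw.
string fromOutside(const(ubyte)[] raw)
{
    auto s = cast(string) raw.idup;
    validate(s); // throws UTFException on an invalid sequence
    return s;
}

/// Hypothetical internal routine: assumes `s` is already valid UTF-8,
/// so it never decodes and never throws.
size_t countCodePoints(string s) nothrow pure
{
    size_t n;
    foreach (char c; s)          // iterate code units, no decoding
        if ((c & 0xC0) != 0x80)  // skip UTF-8 continuation bytes
            ++n;
    return n;
}

void main()
{
    import std.exception : assertThrown;

    // "h\u00E9llo" is 6 bytes but 5 code points.
    auto ok = fromOutside(cast(const(ubyte)[]) "h\u00E9llo");
    assert(countCodePoints(ok) == 5);

    // 0xFF can never appear in well-formed UTF-8.
    ubyte[] bad = [0x61, 0xFF, 0x62];
    assertThrown!UTFException(fromOutside(bad));
}
```

If invalid data does reach countCodePoints, it silently returns a 
wrong count; under this scheme that is a program bug, which a debug 
build could catch with an assert.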

----

As for your proposal, I can't really say. Silently accepting 
invalid sequences sounds nice at first, but it's kind of just 
squelching the problem, isn't it?

----

In any case, both proposals would be major breaking changes...