Handling invalid UTF sequences

Thu Mar 20 15:50:50 PDT 2014

On Thursday, 20 March 2014 at 22:39:47 UTC, Walter Bright wrote:
> Currently we do it by throwing a UTFException. This has 
> problems:
>
> 1. about anything that deals with UTF cannot be made nothrow
>
> 2. turns innocuous errors into major problems, such as DoS 
> attack vectors
> http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
>
> One option to fix this is to treat invalid sequences as:
>
> 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)
>
> 2. U+FFFD
>
> I kinda like option 1.
>
> What do you think?

Sweeping errors under the carpet is not a good strategy. These 
sequences are invalid, and bound to blow up at some point. I'm 
not sure what the right solution is, but the .init option does not 
seem like the right one to me.
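
To make the trade-off concrete, here is a rough sketch of the two behaviours being compared: throwing on invalid input versus substituting U+FFFD. It uses Python's built-in codecs purely as a stand-in, so it says nothing about how D's std.utf is or should be implemented; the helper names `decode_strict` and `decode_replacing` are made up for illustration.

```python
# Illustration only: Python's UTF-8 codec stands in for the decoding
# routines discussed in the thread. Helper names are hypothetical.

data = b"abc\xffdef"  # the byte 0xFF never occurs in well-formed UTF-8


def decode_strict(raw: bytes) -> str:
    """Throwing behaviour: invalid input raises an exception."""
    return raw.decode("utf-8", errors="strict")


def decode_replacing(raw: bytes) -> str:
    """Option 2: each invalid sequence becomes U+FFFD, decoding continues."""
    return raw.decode("utf-8", errors="replace")


# Strict decoding forces the caller to deal with the error right away.
try:
    decode_strict(data)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Replacing never fails; the bad byte is only visible as U+FFFD afterwards.
replaced = decode_replacing(data)
error_still_visible = "\ufffd" in replaced
```

With replacement the error is not lost entirely, since the caller can still look for U+FFFD, but nothing forces that check, and a U+FFFD that was legitimately present in the input cannot be told apart from one that marks a decoding failure.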