Handling invalid UTF sequences

Steven Schveighoffer schveiguy at yahoo.com
Thu Mar 20 18:44:28 PDT 2014


On Thu, 20 Mar 2014 18:39:50 -0400, Walter Bright  
<newshound2 at digitalmars.com> wrote:

> Currently we do it by throwing a UTFException. This has problems:
>
> 1. about anything that deals with UTF cannot be made nothrow
>
> 2. turns innocuous errors into major problems, such as DoS attack vectors
> http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
>
> One option to fix this is to treat invalid sequences as:
>
> 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)
>
> 2. U+FFFD
>
> I kinda like option 1.
>
> What do you think?

Can't say I like it. Especially since current code expects a throw.
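
To spell out what "expects a throw" means in practice, here is a minimal  
sketch (the function name is made up for illustration; it just walks a  
string with std.utf.decode):

import std.utf : decode;

// Because decode throws UTFException on a bad sequence, nothing that
// walks a string this way can be marked nothrow.
size_t countCodePoints(string s)   // can't be nothrow as-is
{
    size_t n, i;
    while (i < s.length)
    {
        decode(s, i);   // advances i; throws UTFException on invalid UTF-8
        ++n;
    }
    return n;
}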

I understand the need. What about creating a different type which decodes  
invalid sequences into a known sentinel code point and doesn't throw?  
Something along the lines of the sketch below. That leaves the choice of  
throwing or not up to the type, which is generally decided at declaration,  
instead of having to change all your calls.
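
Just a sketch of the idea, not a worked-out design: the type name is made  
up, and the replacement is U+FFFD here, though it could as easily be the  
.init value from option 1.

import std.utf : decode, UTFException;

// Hypothetical wrapper: an input range of dchar over a string, where an
// invalid sequence comes out as a replacement character instead of a throw.
struct ReplacingUTF
{
    string str;
    enum dchar replacement = '\uFFFD';   // or dchar.init, per option 1

    bool empty() const { return str.length == 0; }

    dchar front()
    {
        size_t i = 0;
        try
        {
            return decode(str, i);
        }
        catch (UTFException)
        {
            return replacement;
        }
    }

    void popFront()
    {
        size_t i = 0;
        try
        {
            decode(str, i);
        }
        catch (UTFException)
        {
            i = 1;   // skip one bad code unit and keep going
        }
        str = str[i .. $];
    }
}

Client code then picks the policy when it picks the type: iterating a  
ReplacingUTF never throws, while plain strings keep today's throwing  
behavior, so existing call sites don't have to change.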

-Steve

