Handling invalid UTF sequences

Thu Mar 20 16:34:00 PDT 2014

On Thursday, 20 March 2014 at 22:51:27 UTC, monarch_dodra wrote:
> On Thursday, 20 March 2014 at 22:39:47 UTC, Walter Bright wrote:
>> Currently we do it by throwing a UTFException. This has 
>> problems:
>>
>> 1. about anything that deals with UTF cannot be made nothrow
>>
>> 2. turns innocuous errors into major problems, such as DoS 
>> attack vectors
>> http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
>>
>> One option to fix this is to treat invalid sequences as:
>>
>> 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)
>>
>> 2. U+FFFD
>>
>> I kinda like option 1.
>>
>> What do you think?
>
> I had thought of this before, and had an idea along the lines 
> of:
> 1. strings "inside" the program are always valid.
> 2. encountering invalid strings "inside" the program is an 
> Error.
> 3. strings from the "outside" world must be validated before 
> use.
>
> The advantage is *more* than just a nothrow guarantee, but also 
> a performance guarantee in release. And it *is* a pretty sane 
> approach to the problem:
> - User data: validate before use.
> - Internal data: if it's bad, your program is in a failure state.
>

I'm a fan of this approach, but when I wrote about it once, Timon 
pointed out that it's rather trivial to produce an invalid string 
by slicing mid-code point, so now I'm not so sure. I think I'm 
still in favor of it, though: if that happens, you've obviously 
got a logic error, so your program isn't correct anyway (it's not 
a matter of bad user input).
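
To make the slicing case concrete, here is a minimal sketch, 
assuming std.utf.validate as the check. isValidUtf8 is a 
hypothetical helper name used only for illustration; it is not a 
Phobos function.

```d
import std.utf : UTFException, validate;

/// Hypothetical helper (not part of Phobos): returns true if s is
/// well-formed UTF-8, using std.utf.validate as the check.
bool isValidUtf8(string s)
{
    try
    {
        validate(s);   // throws UTFException on a malformed sequence
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // U+00E9 is encoded in UTF-8 as two code units: 0xC3 0xA9.
    string s = "\u00E9";
    assert(s.length == 2);        // .length counts code units, not code points
    assert(isValidUtf8(s));

    // Slicing works on code units, so this cuts the code point in half.
    // Nothing complains at the point of the slice.
    string broken = s[0 .. 1];
    assert(!isValidUtf8(broken)); // the damage only shows up on validation
}
```

The slice itself is silent, which is why I'd call this a logic 
error in the program rather than a case of bad input.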