Handling invalid UTF sequences

Fri Mar 21 12:34:11 PDT 2014

On 3/21/2014 10:14 AM, Dmitry Olshansky wrote:
> 21-Mar-2014 02:39, Walter Bright wrote:
>> Currently we do it by throwing a UTFException. This has problems:
>>
>> 1. just about anything that deals with UTF cannot be made nothrow
>>
>> 2. turns innocuous errors into major problems, such as DoS attack vectors
>> http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
>>
>> One option to fix this is to treat invalid sequences as:
>>
>> 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)
>
> If we're talking about decoding, then only dchar is relevant.
> If transcoding, then emitting 0xFF produces broken UTF-8, so I see no
> sense in going for it.
>
>>
>> 2. U+FFFD
>>
>
> Also has the benefit of being recommended by the standard specifically for the
> case of substitution for bad encoding.
>
> Details:
> https://d.puremagic.com/issues/show_bug.cgi?id=12113

Ah, that's what I was looking for. The Wikipedia article was a bit wishy-washy
about the whole thing.
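
To make option 2 concrete, here is a minimal sketch in C of a decoder that
never fails: every ill-formed sequence comes back as U+FFFD instead of an
error. This is an illustration only, not Phobos code. The names decode_lossy
and REPLACEMENT_CHAR are made up for this example, and the number of bytes it
consumes per bad sequence is a simplification that may not match the Unicode
standard's recommended practice exactly.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu  /* hypothetical name for U+FFFD */

/* Decode one code point from s[*i .. n), advancing *i.
   Precondition: *i < n.
   Never reports an error: any ill-formed sequence yields U+FFFD and
   consumes at least one byte, so a caller's loop always makes progress. */
uint32_t decode_lossy(const unsigned char *s, size_t n, size_t *i)
{
    unsigned char b0 = s[(*i)++];
    if (b0 < 0x80)
        return b0;                          /* ASCII */

    int extra;          /* number of continuation bytes expected */
    uint32_t cp, min;   /* min = smallest value that needs this length */
    if ((b0 & 0xE0) == 0xC0)      { extra = 1; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; min = 0x10000; }
    else
        return REPLACEMENT_CHAR;            /* stray continuation or invalid lead byte */

    for (int k = 0; k < extra; k++) {
        /* Truncated sequence: leave the offending byte for the next call. */
        if (*i >= n || (s[*i] & 0xC0) != 0x80)
            return REPLACEMENT_CHAR;
        cp = (cp << 6) | (s[(*i)++] & 0x3F);
    }

    /* Reject overlong forms, values past U+10FFFF, and surrogates. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return REPLACEMENT_CHAR;
    return cp;
}
```

Since there is no error path, the D equivalent of such a function could be
marked nothrow, and bad input costs no more than good input.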

>> I kinda like option 1.
>>
>
> Not enough of an argument ;)
>
>