Handling invalid UTF sequences

Dmitry Olshansky dmitry.olsh at gmail.com
Fri Mar 21 10:14:21 PDT 2014


21-Mar-2014 02:39, Walter Bright wrote:
> Currently we do it by throwing a UTFException. This has problems:
>
> 1. about anything that deals with UTF cannot be made nothrow
>
> 2. turns innocuous errors into major problems, such as DOS attack vectors
> http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
>
> One option to fix this is to treat invalid sequences as:
>
> 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)

If we are talking about decoding, then only dchar is relevant.
If transcoding, then emitting 0xFF produces broken UTF-8 output, so I
see no sense in going for it.
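
To illustrate the transcoding point, here is a minimal sketch in Python
(not D, and not anything from Phobos). The helper name is_valid_utf8 and
the sample bytes are made up for the example. It only shows that a byte
0xFF can never appear in well-formed UTF-8, so output patched with it is
itself invalid, while the encoded form of U+FFFD is ordinary valid UTF-8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data is well-formed UTF-8."""
    try:
        data.decode("utf-8")  # strict decoding raises on any bad sequence
        return True
    except UnicodeDecodeError:
        return False


# Output of an imagined transcoder that wrote 0xFF in place of a bad sequence.
patched_with_ff = b"abc" + bytes([0xFF]) + b"def"

# The same output with the bad sequence replaced by U+FFFD instead.
patched_with_fffd = b"abc" + "\uFFFD".encode("utf-8") + b"def"

print(is_valid_utf8(patched_with_ff))    # the 0xFF byte makes this invalid
print(is_valid_utf8(patched_with_fffd))  # U+FFFD encodes as EF BF BD
```

The first result is still broken input for the next consumer, which is
the problem with using 0xFF in a transcoding path.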

>
> 2. U+FFFD
>

It also has the benefit of being recommended by the Unicode standard
specifically as the substitute for badly encoded input.

Details:
https://d.puremagic.com/issues/show_bug.cgi?id=12113
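
As a rough illustration of the two behaviours under discussion, here is
a small Python sketch (again not D code; decode_strict and decode_lenient
are names invented for the example). Python's built-in decoder happens to
offer both policies: raising on bad input, or substituting U+FFFD.

```python
REPLACEMENT_CHAR = "\uFFFD"  # U+FFFD REPLACEMENT CHARACTER


def decode_strict(data: bytes) -> str:
    """Throwing policy: any invalid sequence raises UnicodeDecodeError."""
    return data.decode("utf-8")


def decode_lenient(data: bytes) -> str:
    """Substituting policy: invalid sequences become U+FFFD, never raises."""
    return data.decode("utf-8", errors="replace")


bad_input = b"a\xffb"  # 0xFF is never valid in UTF-8

try:
    decode_strict(bad_input)
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

lenient_result = decode_lenient(bad_input)
print(strict_raised)                       # True
print(lenient_result == "a\uFFFDb")        # True
```

With the substituting policy the caller needs no exception handling, and
the bad spot stays visible in the result as U+FFFD.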

> I kinda like option 1.
>

Not enough of an argument ;)


-- 
Dmitry Olshansky

