Handling invalid UTF sequences

Regan Heath regan at netmail.co.nz
Fri Mar 21 02:34:31 PDT 2014


On Thu, 20 Mar 2014 22:39:50 -0000, Walter Bright  
<newshound2 at digitalmars.com> wrote:

> Currently we do it by throwing a UTFException. This has problems:
>
> 1. about anything that deals with UTF cannot be made nothrow
>
> 2. turns innocuous errors into major problems, such as DOS attack vectors
> http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
>
> One option to fix this is to treat invalid sequences as:
>
> 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)
>
> 2. U+FFFD
>
> I kinda like option 1.
>
> What do you think?

On Windows/Win32:

WideCharToMultiByte has flags for a bunch of similar behaviours and allows  
you to define a default char to use as a replacement in such cases.

swprintf, when passed %S, converts a wchar_t UTF-16 argument into ASCII,  
and replaces invalid characters with ? as it does so.

swprintf_s (the safe version), IIRC, will invoke the invalid parameter  
handler for sequences which cannot be converted.
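
As a rough illustration of those three behaviours (replace with ?, replace with a caller-chosen default character, or fail via a handler), here is an analogue using Python's codec error handlers. This is not Win32 code. The handler name "underscore" is made up for this sketch, and the mapping to each Win32 function is only approximate.

```python
import codecs

text = "na\u00efve"  # contains U+00EF, which ASCII cannot represent

# Roughly like the swprintf/%S behaviour described above: substitute '?'.
replaced = text.encode("ascii", errors="replace")

# Roughly like a caller-supplied default character: a custom error handler.
# "underscore" is a hypothetical name chosen for this example.
def _underscore(err):
    # Return the replacement text and the index at which encoding resumes.
    return ("_" * (err.end - err.start), err.end)

codecs.register_error("underscore", _underscore)
custom = text.encode("ascii", errors="underscore")

# Roughly like swprintf_s invoking the invalid parameter handler: fail loudly.
try:
    text.encode("ascii", errors="strict")
    failed = False
except UnicodeEncodeError:
    failed = True
```

The point is only that all three policies can sit behind one conversion routine, selected by a flag or argument.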

I think, ideally, we want some sensible default behaviour but also the  
ability to alter it globally, and even better in specific calls where it  
makes sense to do so (where flags/arguments can be passed to that effect).

So, the default behaviour could be to throw (therefore no breaking change)  
and we provide a function to change this to one of the other options, and  
another to select a replacement character (which would default to .init or  
U+FFFD).
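
A minimal sketch of that shape, written in Python rather than D purely for illustration. Every name here (set_invalid_utf_policy, set_replacement_char, decode_utf8) is hypothetical and not an existing Phobos or Python API. The replacement defaults to U+FFFD, one of the two defaults mentioned above.

```python
# Hypothetical sketch: a throwing default, a global switch, and per-call overrides.
_policy = "throw"          # default keeps today's behaviour, so no breaking change
_replacement = "\ufffd"    # assumed default replacement character

def set_invalid_utf_policy(policy):
    """Change the global policy to "throw" or "replace"."""
    global _policy
    if policy not in ("throw", "replace"):
        raise ValueError("unknown policy: %r" % (policy,))
    _policy = policy

def set_replacement_char(ch):
    """Change the global replacement character."""
    global _replacement
    _replacement = ch

def decode_utf8(data, policy=None, replacement=None):
    """Decode bytes as UTF-8; per-call arguments override the global settings."""
    policy = _policy if policy is None else policy
    replacement = _replacement if replacement is None else replacement
    out = []
    pos = 0
    while pos < len(data):
        try:
            out.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            if policy == "throw":
                raise
            # err.start and err.end are offsets into the slice data[pos:].
            out.append(data[pos:pos + err.start].decode("utf-8"))
            out.append(replacement)
            pos += err.end
    return "".join(out)
```

With this layout, decode_utf8(b"ab\xffcd") raises by default, while decode_utf8(b"ab\xffcd", policy="replace") substitutes the replacement character for the bad byte. One drawback of the global switch is that it is shared mutable state, so the per-call argument is the safer route wherever it is available.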

R
