The Case For Autodecode

Fri Jun 3 14:40:14 PDT 2016

On 06/03/2016 11:13 PM, Steven Schveighoffer wrote:
> No, but I like the idea of preserving the erroneous character you tried
> to convert.

Makes sense.

> But is there an invalid wchar? I looked through the Wikipedia article on
> UTF-16, and it didn't seem to say there was one.
>
> If we use U+FFFD, that signifies a coding problem but is still a valid
> code point. However, doing a wchar in the D800 - DBFF range without
> being followed by a code unit in the DC00 - DFFF range is an invalid
> sequence. D throws if it encounters such a thing.

The Unicode FAQ has an answer to this exact question, but it also only 
says that "[u]npaired surrogates are invalid" [1].
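To make the "unpaired surrogate" rule concrete, here is a rough sketch in Python (not D, and not how Phobos does it). It scans a list of UTF-16 code units and reports the positions of surrogates that have no partner. The function name find_unpaired_surrogates is made up for illustration, and the high-surrogate range is taken as D800 - DBFF.

```python
def find_unpaired_surrogates(units):
    """Return the indices of UTF-16 code units that are unpaired surrogates.

    Hypothetical helper for illustration only; `units` is a list of ints.
    """
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: only valid if a low surrogate follows directly.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both units
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate that was not consumed as part of a pair.
            bad.append(i)
        i += 1
    return bad
```

So any single wchar in D800 - DFFF is "invalid" in the sense that it cannot stand alone, which is as close as UTF-16 gets to an invalid code unit.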

It also mentions "noncharacters" which are "permanently reserved [...]
for internal use". "For example, they might be used internally as a
particular kind of object placeholder in a string." [2] Not too bad.
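For reference, the set of noncharacters is small and easy to test for. A sketch in Python, assuming the usual definition (U+FDD0 to U+FDEF, plus the last two code points of each of the 17 planes); is_noncharacter is a made-up name:

```python
def is_noncharacter(cp):
    """True if the code point `cp` is one of the Unicode noncharacters.

    Hypothetical helper for illustration only.
    """
    if not 0 <= cp <= 0x10FFFF:
        return False
    # The contiguous block U+FDD0..U+FDEF in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+xxFFFE and U+xxFFFF at the end of every plane.
    return (cp & 0xFFFE) == 0xFFFE
```

The ones in the BMP (U+FDD0 to U+FDEF, U+FFFE, U+FFFF) each fit in a single wchar, which is what would make them usable as a placeholder here.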

And then there is the replacement character, of course. "[U]sed to 
replace an incoming character whose value is unknown or unrepresentable 
in Unicode" [3].
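
As an illustration of the replacement-character approach (using Python's built-in codecs rather than anything in D), decoding UTF-16 that contains a lone high surrogate either throws or substitutes U+FFFD, depending on the error handler. The byte string below is just a made-up test input.

```python
# "A", a lone high surrogate D800, then "B", encoded as UTF-16 little-endian.
data = b"\x41\x00\x00\xd8\x42\x00"

# Strict decoding rejects the unpaired surrogate.
try:
    data.decode("utf-16-le")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors="replace", the bad sequence is swapped for U+FFFD instead.
decoded = data.decode("utf-16-le", errors="replace")
```

Note that the substitution loses the original code unit, which is the trade-off against preserving the erroneous character.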

[1] http://www.unicode.org/faq/utf_bom.html#utf16-7
[2] http://www.unicode.org/faq/private_use.html#noncharacters
[3] http://www.fileformat.info/info/unicode/char/0fffd/index.htm