The Case For Autodecode

Fri Jun 3 13:39:15 PDT 2016

On 06/03/2016 10:18 PM, Steven Schveighoffer wrote:
> But you can get a standalone code unit that is part of a coded sequence
> quite easily
>
> void foo(string s)
> {
>     auto x = s[0];
>     dchar d = x;
> }

I don't think we're disagreeing on anything.

I'm calling UTF-8 code units below 0x80 "standalone" code units. They're 
never part of multibyte sequences. Your _dchar_convert returns them 
unscathed.

Higher code units are always part of multibyte sequences (or invalid 
already). Your function returns invalid code points for them.

_dchar_convert does exactly what I meant, except that I had in mind 
returning the replacement character for non-standalone code units. But I 
see that that may not be feasible, and it's probably not necessary.
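
To make the "standalone" distinction concrete, here is a minimal sketch of the variant I had in mind. The name toDcharOrReplacement is made up for illustration; it is not the _dchar_convert from this thread. It passes code units below 0x80 through and maps everything else to U+FFFD, the replacement character.

```d
import std.stdio : writefln;

// Hypothetical helper for illustration only (an assumption, not code
// from this thread).
dchar toDcharOrReplacement(char c) pure nothrow @nogc @safe
{
    // Code units below 0x80 are ASCII. In UTF-8 they never occur
    // inside a multibyte sequence, so they convert losslessly.
    if (c < 0x80)
        return c;

    // Everything else is a lead byte, a continuation byte, or a byte
    // that is never valid in UTF-8. On its own it has no code point.
    return '\uFFFD';
}

void main()
{
    string s = "a\u00E9"; // 'a' is 0x61; U+00E9 is 0xC3 0xA9 in UTF-8
    foreach (char c; s)
        writefln("0x%02X -> U+%04X", cast(uint) c,
                 cast(uint) toDcharOrReplacement(c));
}
```

The extra branch is the cost; whether it matters in practice would have to be measured.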

[...]
> So we need most efficient logic that does this:
>
> if(c & 0x80)
>      return wchar(0xd800 + c);

Is this going to be faster than returning a constant invalid wchar?

> else
>      return wchar(c);
>
> More expensive, but more correct!
>
> wchar to dchar conversion is pretty sound, as the surrogate pairs are
> invalid code points for dchar.
>
> -Steve
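
For comparison, the two char-to-wchar options can be sketched side by side. Both function names are made up for illustration, and the choice of 0xD800 as the constant is my assumption; any lone surrogate would do. Either way, every non-ASCII code unit becomes a high surrogate, which is invalid without a following low surrogate, and neither function ever produces a low surrogate.

```d
import std.stdio : writefln;

// Hypothetical names for illustration; neither function is from this thread.

// Quoted variant: each non-ASCII code unit maps to a distinct lone
// high surrogate in the range 0xD880 .. 0xD8FF.
wchar toWcharSurrogate(char c) pure nothrow @nogc @safe
{
    if (c & 0x80)
        return cast(wchar)(0xD800 + c);
    return c;
}

// Alternative: a single constant invalid wchar for every non-ASCII
// code unit (0xD800 is an assumed choice).
wchar toWcharConstant(char c) pure nothrow @nogc @safe
{
    if (c & 0x80)
        return cast(wchar) 0xD800;
    return c;
}

void main()
{
    foreach (char c; "a\u00E9")
        writefln("0x%02X -> 0x%04X / 0x%04X", cast(uint) c,
                 cast(uint) toWcharSurrogate(c),
                 cast(uint) toWcharConstant(c));
}
```

The first variant keeps the original byte recoverable from the result; the second drops it and saves one addition.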