The Case For Autodecode
ag0aep6g via Digitalmars-d
digitalmars-d at puremagic.com
Fri Jun 3 13:39:15 PDT 2016
On 06/03/2016 10:18 PM, Steven Schveighoffer wrote:
> But you can get a standalone code unit that is part of a coded sequence
> quite easily
>
> foo(string s)
> {
> auto x = s[0];
> dchar d = x;
> }
I don't think we're disagreeing on anything.
I'm calling UTF-8 code units below 0x80 "standalone" code units. They're
never part of multibyte sequences. Your _dchar_convert returns them
unscathed.
Higher code units are always part of multibyte sequences (or are already
invalid). Your function returns invalid code points for them.
_dchar_convert does exactly what I meant, except that I had in mind
returning the replacement character for non-standalone code units. But I
see that that may not be feasible, and it's probably not necessary.
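To make the variant I had in mind concrete, here is a minimal sketch. The
name toDcharOrReplacement is made up for illustration; it is not the
_dchar_convert discussed above, and I'm assuming U+FFFD as the replacement
character.

```d
// Hypothetical helper: converts one UTF-8 code unit to a dchar.
dchar toDcharOrReplacement(char c) pure nothrow @nogc @safe
{
    // Code units below 0x80 are complete code points on their own,
    // so they pass through unchanged.
    if (c < 0x80)
        return c;

    // Anything else is a lead byte, a continuation byte, or an invalid
    // byte. None of these is a code point by itself, so return U+FFFD.
    return '\uFFFD';
}

unittest
{
    assert(toDcharOrReplacement('a') == 'a');
    assert(toDcharOrReplacement(cast(char) 0xC3) == '\uFFFD');
}
```

The cost over a plain widening conversion is one compare and branch per
code unit, which is the part that may not be worth it.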
[...]
> So we need most efficient logic that does this:
>
> if(c & 0x80)
> return wchar(0xd800 + c);
Is this going to be faster than returning a constant invalid wchar?
> else
> return wchar(c);
>
> More expensive, but more correct!
>
> wchar to dchar conversion is pretty sound, as the surrogate pairs are
> invalid code points for dchar.
>
> -Steve
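For comparison, the two char-to-wchar options side by side, as a rough
sketch. Both function names are hypothetical, and using wchar.init (0xFFFF)
as the "constant invalid wchar" is my assumption; any fixed invalid value
would do.

```d
// Variant from the quoted message: shift a non-ASCII code unit into the
// surrogate range, so the result is a lone (and thus invalid) surrogate
// that still records which byte it came from.
wchar toWcharSurrogate(char c) pure nothrow @nogc @safe
{
    if (c & 0x80)
        return cast(wchar)(0xD800 + c);
    return c;
}

// Alternative: return one fixed invalid value for every non-ASCII code
// unit. wchar.init is 0xFFFF, which is not a valid character.
wchar toWcharConstant(char c) pure nothrow @nogc @safe
{
    if (c & 0x80)
        return wchar.init;
    return c;
}

unittest
{
    assert(toWcharSurrogate('a') == 'a');
    assert(toWcharSurrogate(cast(char) 0xC3) == 0xD8C3);
    assert(toWcharConstant('a') == 'a');
    assert(toWcharConstant(cast(char) 0xC3) == wchar.init);
}
```

Both variants need the same test on the high bit; they differ only in an
add versus a constant load, so I wouldn't expect a measurable difference
without benchmarking.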