The Case For Autodecode
Steven Schveighoffer via Digitalmars-d
digitalmars-d at puremagic.com
Fri Jun 3 13:18:31 PDT 2016
On 6/3/16 3:52 PM, ag0aep6g wrote:
> On 06/03/2016 09:09 PM, Steven Schveighoffer wrote:
>> Except many chars *do* properly convert. This should work:
>>
>> char c = 'a';
>> dchar d = c;
>> assert(d == 'a');
>
> Yeah, that's what I meant by "standalone code unit". Code units that on
> their own represent a code point would not be touched.
But you can get a standalone code unit that is part of a coded sequence
quite easily:
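void foo(string s)
{
    auto x = s[0]; // a char: the first UTF-8 code unit, possibly a lead byte
    dchar d = x;   // compiles, but d need not be the code point s starts with
}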
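To make the problem concrete, here is a minimal sketch of what that widening does to a non-ASCII string. The helper name firstUnitAsDchar is made up for illustration; the string holds U+00E9, which UTF-8 encodes as the two code units 0xC3 0xA9.

```d
import std.stdio : writefln;

// Hypothetical helper mirroring foo above: widen the first code unit.
dchar firstUnitAsDchar(string s)
{
    char x = s[0]; // a UTF-8 code unit, not a code point
    dchar d = x;   // implicit widening, accepted by the compiler
    return d;
}

void main()
{
    // "\u00E9" is stored as the UTF-8 code units 0xC3 0xA9.
    dchar d = firstUnitAsDchar("\u00E9");
    assert(d == 0xC3);     // the lead byte, now read as U+00C3
    assert(d != '\u00E9'); // not the character the string holds
    writefln("U+%04X", cast(uint) d);
}
```

The result is a perfectly valid dchar, which is why nothing downstream can tell that it came from half of a sequence.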
>
>> As I mentioned in my earlier reply, some kind of "bounds checking" for
>> the conversion could be a possibility.
>>
>> Hm... an interesting possibility:
>>
>> dchar _dchar_convert(char c)
>> {
>> return cast(int)cast(byte)c; // get sign extension for non-ASCII
>> }
>
> So when the char's most significant bit is set, this fills the upper
> bits of the dchar with 1s, right? And a set most significant bit in a
> char means it's part of a multibyte sequence, while in a dchar it means
> that the dchar is invalid, because they only go up to U+10FFFF. Huh. Neat.
An interesting thing is that I think the CPU can do this for us.
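A small sketch of the sign-extension idea, to check the claim that non-ASCII code units end up outside the valid dchar range. The function name is hypothetical, and the explicit cast(dchar) is an addition here, on the assumption that the compiler will not implicitly convert a possibly negative int to dchar. std.utf.isValidDchar is the Phobos validity check.

```d
import std.utf : isValidDchar;

// Hypothetical name; same idea as the _dchar_convert sketch quoted above.
dchar signExtendingConvert(char c)
{
    // char -> byte reinterprets the bits as signed, byte -> int sign-extends,
    // and the final cast keeps the resulting 32-bit pattern.
    return cast(dchar) cast(int) cast(byte) c;
}

void main()
{
    // ASCII code units are unchanged.
    assert(signExtendingConvert('a') == 'a');

    // 0xC3 has its top bit set, so the upper 24 bits become 1s.
    dchar d = signExtendingConvert(cast(char) 0xC3);
    assert(d == 0xFFFF_FFC3);
    assert(!isValidDchar(d)); // far above U+10FFFF
}
```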
> Does it work for char -> wchar, too?
It does not. 0xffff is a valid code point, and I think so are all the
other values that would result. In fact, I think there are no invalid
code units for wchar. Of course, a surrogate needs a matching partner
code unit to be valid, so we can at least promote a non-ASCII char to a
wchar in the surrogate range (and always within the same half, low or
high, so that a naive transcoding of a char range to wchar results in an
invalid sequence if there are any non-ASCII characters).
So we need the most efficient logic that does this:
if(c & 0x80)
    return wchar(0xd800 + c); // 0xd880..0xd8ff: always a high surrogate
else
    return wchar(c);
More expensive, but more correct!
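As a rough check of that logic, the sketch below wraps it in a function and runs a naive code-unit-by-code-unit transcoding through std.utf.validate. The names promote and naiveTranscode are invented for this example, and the explicit cast(wchar) is a stylistic choice, not a requirement of the idea.

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

// Hypothetical wrapper around the branch shown above.
wchar promote(char c)
{
    if (c & 0x80)
        return cast(wchar)(0xD800 + c); // lands in 0xD880..0xD8FF
    else
        return c; // ASCII widens unchanged
}

// Naive transcoding: one wchar per UTF-8 code unit, no decoding.
wstring naiveTranscode(string s)
{
    wchar[] result;
    foreach (char c; s)
        result ~= promote(c);
    return result.idup;
}

void main()
{
    // Pure ASCII survives intact.
    assert(naiveTranscode("abc") == "abc"w);

    // U+00E9 is 0xC3 0xA9 in UTF-8, so it becomes two high surrogates
    // in a row, which is not valid UTF-16.
    assertThrown!UTFException(validate(naiveTranscode("\u00E9")));
}
```

Because every promoted value is a high surrogate, no two of them can ever pair up, so the error cannot be masked by an accidental valid pair.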
wchar to dchar conversion is pretty sound, as surrogate code units are
invalid code points for dchar.
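A short sketch of why that direction is safe, again leaning on std.utf.isValidDchar. The helper name widen is hypothetical, and 0xD83D is just an arbitrary high surrogate picked for the example.

```d
import std.utf : isValidDchar;

// Hypothetical helper: plain implicit widening from wchar to dchar.
dchar widen(wchar w)
{
    return w;
}

void main()
{
    // A BMP character that fits in one UTF-16 code unit stays valid.
    assert(isValidDchar(widen('\u00E9')));

    // Half of a surrogate pair widens to a value that is detectably
    // invalid as a code point.
    wchar high = 0xD83D;
    assert(!isValidDchar(widen(high)));
}
```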
-Steve