The Case For Autodecode
Patrick Schluter via Digitalmars-d
digitalmars-d at puremagic.com
Sat Jun 4 01:57:50 PDT 2016
On Friday, 3 June 2016 at 20:18:31 UTC, Steven Schveighoffer wrote:
> On 6/3/16 3:52 PM, ag0aep6g wrote:
>> On 06/03/2016 09:09 PM, Steven Schveighoffer wrote:
>>> Except many chars *do* properly convert. This should work:
>>>
>>> char c = 'a';
>>> dchar d = c;
>>> assert(d == 'a');
>>
>> Yeah, that's what I meant by "standalone code unit". Code units
>> that on their own represent a code point would not be touched.
>
> But you can get a standalone code unit that is part of a coded
> sequence quite easily:
>
> void foo(string s)
> {
>     auto x = s[0];
>     dchar d = x;
> }
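A self-contained sketch of that trap (the literal "é" is my own
example; its UTF-8 encoding is the two code units 0xC3 0xA9):

void main()
{
    string s = "é";    // stored as the code units 0xC3 0xA9
    auto x = s[0];     // a lone char, 0xC3: half of a sequence
    dchar d = x;       // compiles without complaint today
    assert(d == 0xC3); // silently became U+00C3 'Ã', not 'é'
}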
>
>>
>>> As I mentioned in my earlier reply, some kind of "bounds
>>> checking" for the conversion could be a possibility.
>>>
>>> Hm... an interesting possibility:
>>>
>>> dchar _dchar_convert(char c)
>>> {
>>>     return cast(int)cast(byte)c; // get sign extension for non-ASCII
>>> }
>>
>> So when the char's most significant bit is set, this fills the
>> upper bits of the dchar with 1s, right? And a set most significant
>> bit in a char means it's part of a multibyte sequence, while in a
>> dchar it means that the dchar is invalid, because they only go up
>> to U+10FFFF. Huh. Neat.
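That observation can be checked exhaustively; a quick sketch of mine,
reusing _dchar_convert from the quote above:

void main()
{
    // Every code unit with the most significant bit set lands above U+10FFFF...
    foreach (c; 0x80 .. 0x100)
        assert(_dchar_convert(cast(char)c) > 0x10FFFF);
    // ...while every ASCII code unit maps to itself.
    foreach (c; 0x00 .. 0x80)
        assert(_dchar_convert(cast(char)c) == c);
}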
>
> An interesting thing is that I think the CPU can do this for us.
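(Presumably this refers to sign extension being a single hardware
instruction, e.g. movsx on x86, so the cast pair above would be
essentially free.)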
>
>> Does it work for char -> wchar, too?
>
> It does not. 0xffff is a valid code point, and I think so are all
> the other values that would result. In fact, I think there are no
> invalid code units for wchar.
https://codepoints.net/specials
U+FFFF would be fine, or at least better than a surrogate.
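To see why the 16-bit variant gives no error signal, a minimal sketch
(_wchar_convert is my own name, mirroring _dchar_convert above):

wchar _wchar_convert(char c)
{
    return cast(short)cast(byte)c; // same sign-extension idea, 16 bits wide
}

void main()
{
    wchar w = _wchar_convert(cast(char)0xC3); // a UTF-8 lead byte
    assert(w == 0xFFC3); // a well-formed BMP code point: nothing to detect
}

Sign-extending any non-ASCII char lands in 0xFF80..0xFFFF, none of
which are surrogates, so the trick cannot mark errors in a wchar.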