The Case For Autodecode

Patrick Schluter via Digitalmars-d digitalmars-d at puremagic.com
Sat Jun 4 01:57:50 PDT 2016


On Friday, 3 June 2016 at 20:18:31 UTC, Steven Schveighoffer wrote:
> On 6/3/16 3:52 PM, ag0aep6g wrote:
>> On 06/03/2016 09:09 PM, Steven Schveighoffer wrote:
>>> Except many chars *do* properly convert. This should work:
>>>
>>> char c = 'a';
>>> dchar d = c;
>>> assert(d == 'a');
>>
>> Yeah, that's what I meant by "standalone code unit". Code units
>> that on their own represent a code point would not be touched.
>
> But you can get a standalone code unit that is part of a coded
> sequence quite easily:
>
> void foo(string s)
> {
>     auto x = s[0];   // a single code unit, possibly mid-sequence
>     dchar d = x;     // the implicit conversion compiles regardless
> }
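
To make the hazard concrete, a minimal sketch (the string "é" is my
own example): indexing a multi-byte UTF-8 string yields a lone code
unit, and the implicit char -> dchar conversion silently produces a
wrong but valid-looking code point.

void main()
{
    string s = "é";    // encoded in UTF-8 as the two bytes 0xC3 0xA9
    auto x = s[0];     // x is the char 0xC3, only part of a code point
    dchar d = x;       // compiles; d is U+00C3 'Ã', not 'é'
    assert(d == 0xC3);
}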
>
>>
>>> As I mentioned in my earlier reply, some kind of "bounds
>>> checking" for the conversion could be a possibility.
>>>
>>> Hm... an interesting possibility:
>>>
>>> dchar _dchar_convert(char c)
>>> {
>>>     return cast(int)cast(byte)c; // get sign extension for non-ASCII
>>> }
>>
>> So when the char's most significant bit is set, this fills the
>> upper bits of the dchar with 1s, right? And a set most significant
>> bit in a char means it's part of a multibyte sequence, while in a
>> dchar it means that the dchar is invalid, because they only go up
>> to U+10FFFF. Huh. Neat.
>
> An interesting thing is that I think the CPU can do this for us.
>
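
To sanity-check that analysis, a minimal sketch reusing Steven's
_dchar_convert (the test byte 0xC3 is my own choice): ASCII code
units convert unchanged, while any code unit with its high bit set
lands above U+10FFFF and is therefore detectably invalid.

dchar _dchar_convert(char c)
{
    return cast(int)cast(byte)c; // sign extension for non-ASCII
}

void main()
{
    assert(_dchar_convert('a') == 'a');  // ASCII passes through
    dchar bad = _dchar_convert('\xC3');  // lead byte of a sequence
    assert(bad == 0xFFFFFFC3);           // upper bits filled with 1s
    assert(bad > 0x10FFFF);              // outside the Unicode range
}
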
>> Does it work for char -> wchar, too?
>
> It does not. 0xFFFF is a valid code point, and I think so are all
> the other values that would result. In fact, I think there are no
> invalid code units for wchar.

https://codepoints.net/specials

U+FFFF would be fine, or at least better than a surrogate: it is a
noncharacter, so it should never appear in well-formed interchanged
text.
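
The same sketch adapted to wchar shows why the trick stops working
there (the helper _wchar_convert is my own, mirroring
_dchar_convert): 16-bit sign extension of a non-ASCII char always
lands in 0xFF80 .. 0xFFFF, which are valid code points (at worst
noncharacters like U+FFFF), and the lone surrogates 0xD800 .. 0xDFFF
can never be produced this way.

wchar _wchar_convert(char c)
{
    return cast(wchar)cast(byte)c; // 16-bit sign extension
}

void main()
{
    wchar w = _wchar_convert('\xC3');
    assert(w == 0xFFC3);                   // a valid code point
    assert(!(w >= 0xD800 && w <= 0xDFFF)); // never a surrogate
}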


