The Case For Autodecode

Steven Schveighoffer via Digitalmars-d digitalmars-d at puremagic.com
Fri Jun 3 14:13:26 PDT 2016


On 6/3/16 4:39 PM, ag0aep6g wrote:
> On 06/03/2016 10:18 PM, Steven Schveighoffer wrote:
>> But you can get a standalone code unit that is part of a coded sequence
>> quite easily
>>
>> void foo(string s)
>> {
>>     auto x = s[0];  // x is a char: a single UTF-8 code unit
>>     dchar d = x;    // implicit char -> dchar conversion, no decoding
>> }
>
> I don't think we're disagreeing on anything.
>
> I'm calling UTF-8 code units below 0x80 "standalone" code units. They're
> never part of multibyte sequences. Your _dchar_convert returns them
> unscathed.

Ah, I thought you meant standalone as in it was assigned to a standalone 
char variable vs. part of an array or range. My mistake.

Re-reading your original message, I see that should have been clear to me...

>> So we need most efficient logic that does this:
>>
>> if(c & 0x80)
>>      return wchar(0xd800 + c);
>
> Is this going to be faster than returning a constant invalid wchar?

No, but I like the idea of preserving the erroneous character you tried 
to convert.

But is there an invalid wchar? I looked through the Wikipedia article on 
UTF-16, and it didn't seem to say there was one.

If we use U+FFFD, that signifies a coding problem but is still a valid 
code point. However, a wchar in the D800 - D8FF range (a subset of the 
high surrogates, D800 - DBFF) that is not followed by a code unit in the 
DC00 - DFFF range is an invalid sequence. D throws when it tries to 
decode such a thing.
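
Here is a minimal sketch of the difference. The helper names 
toWcharPreserving and decodeThrows are made up for illustration; they 
are not in Phobos and not part of any proposal in this thread. The only 
library calls assumed are std.utf.decode and std.utf.UTFException.

```d
import std.utf : decode, UTFException;

// Hypothetical helper: ASCII code units pass through unchanged. Any code
// unit with the high bit set is mapped into 0xD880 .. 0xD8FF, so the
// original byte survives in the low 8 bits of the result.
wchar toWcharPreserving(char c) pure nothrow @nogc @safe
{
    if (c & 0x80)
        return cast(wchar)(0xD800 + c);
    return c;
}

// Hypothetical helper: reports whether decoding the first code point of s
// throws a UTFException.
bool decodeThrows(const(wchar)[] s)
{
    size_t i = 0;
    try
        decode(s, i);
    catch (UTFException)
        return true;
    return false;
}

unittest
{
    // "Standalone" code units (below 0x80) come back unscathed.
    assert(toWcharPreserving('a') == 'a');

    // 0xC3 is a UTF-8 lead byte; it becomes a lone high surrogate.
    wchar w = toWcharPreserving(cast(char) 0xC3);
    assert(w == 0xD8C3);
    assert((w & 0xFF) == 0xC3); // the erroneous byte is still recoverable

    // A lone high surrogate is an invalid sequence, so decoding throws.
    assert(decodeThrows([w]));

    // U+FFFD is a valid code point, so decoding it does not throw.
    assert(!decodeThrows("\uFFFD"w));

    // A high surrogate followed by a low surrogate decodes fine.
    const(wchar)[] pair = [0xD83D, 0xDE00];
    assert(!decodeThrows(pair));
}
```

Compile with "dmd -unittest -main" to run the checks. The trade-off the 
sketch shows: U+FFFD decodes cleanly but loses the original byte, while 
the surrogate mapping keeps the byte but produces a sequence that 
decoding rejects.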

-Steve

