The Case For Autodecode
Steven Schveighoffer via Digitalmars-d
digitalmars-d at puremagic.com
Fri Jun 3 14:13:26 PDT 2016
On 6/3/16 4:39 PM, ag0aep6g wrote:
> On 06/03/2016 10:18 PM, Steven Schveighoffer wrote:
>> But you can get a standalone code unit that is part of a coded sequence
>> quite easily
>>
>> void foo(string s)
>> {
>> auto x = s[0];
>> dchar d = x;
>> }
>
> I don't think we're disagreeing on anything.
>
> I'm calling UTF-8 code units below 0x80 "standalone" code units. They're
> never part of multibyte sequences. Your _dchar_convert returns them
> unscathed.
Ah, I thought you meant standalone as in it was assigned to a standalone
char variable vs. part of an array or range. My mistake.
Re-reading your original message, I see that should have been clear to me...
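To make the two readings of "standalone" concrete, here is a minimal
sketch (my own illustration, not code from the thread). It assumes
nothing beyond the implicit char -> dchar conversion the foo example
above relies on:

```d
void main()
{
    string s = "\u00E9";     // 'é', stored in UTF-8 as the code units 0xC3 0xA9
    assert(s.length == 2);

    auto x = s[0];           // a lone char holding the lead byte 0xC3
    dchar d = x;             // implicit conversion, no decoding takes place
    assert(d == 0xC3);       // this is U+00C3, not the U+00E9 we started with

    char a = "abc"[0];       // below 0x80: never part of a multibyte sequence
    dchar e = a;
    assert(e == 'a');        // converts unscathed
}
```

The first case is the problem: a code unit pulled out of a multibyte
sequence silently becomes a different code point. The second case is
what ag0aep6g calls a standalone code unit.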
>> So we need the most efficient logic that does this:
>>
>> if(c & 0x80)
>> return wchar(0xd800 + c);
>
> Is this going to be faster than returning a constant invalid wchar?
No, but I like the idea of preserving the erroneous character you tried
to convert.
But is there an invalid wchar? I looked through the Wikipedia article on
UTF-16, and it didn't seem to say there was one.
If we use U+FFFD, that signifies a coding problem but is still a valid
code point. However, a wchar in the D800 - D8FF range (a high surrogate)
that is not followed by a code unit in the DC00 - DFFF range is an
invalid sequence. D throws if it encounters such a thing.
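A rough sketch of the difference, under some assumptions: the helper
name toWcharPreserving is made up for illustration, and "D throws" is
taken here to mean that std.utf.decode raises a UTFException. The
explicit cast is only there to keep the narrowing visible.

```d
import std.exception : assertThrown;
import std.utf : decode, UTFException;

// Hypothetical helper (the name is not from Phobos or from this thread).
wchar toWcharPreserving(char c)
{
    if (c & 0x80)
        return cast(wchar)(0xD800 + c); // 0xD880 .. 0xD8FF: a lone high surrogate
    return c;                           // ASCII converts unchanged
}

void main()
{
    assert(toWcharPreserving('a') == 'a');

    wchar bad = toWcharPreserving(cast(char) 0xC3);
    assert(bad == 0xD8C3);
    assert((bad & 0xFF) == 0xC3);       // the offending byte is still recoverable

    // A high surrogate with no low surrogate after it is rejected on decode.
    wchar[] lone = [bad];
    size_t i = 0;
    assertThrown!UTFException(decode(lone, i));

    // U+FFFD, by contrast, decodes without complaint: it is a valid code point.
    wchar[] replaced = [cast(wchar) 0xFFFD];
    i = 0;
    assert(decode(replaced, i) == 0xFFFD);
}
```

So the surrogate trick keeps the bad byte around and still fails loudly
when decoded, while U+FFFD drops the byte and passes through as ordinary
text.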
-Steve