Challenge: write a really really small front() for UTF8
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Sun Mar 23 21:46:00 PDT 2014
On 3/23/14, 9:37 PM, Michel Fortin wrote:
> On 2014-03-24 02:25:17 +0000, Andrei Alexandrescu
> <SeeWebsiteForEmail at erdani.org> said:
>
>> On 3/23/14, 6:53 PM, Michel Fortin wrote:
>>> On 2014-03-23 21:22:58 +0000, Andrei Alexandrescu
>>> <SeeWebsiteForEmail at erdani.org> said:
>>>
>>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>
>>> Optimizing for smallest assembly size:
>>>
>>> dchar front(char[] s)
>>> {
>>> size_t bytesize;
>>> dchar result;
>>> switch (s[0])
>>> {
>>> case 0b00000000: .. case 0b01111111:
>>> return s[0];
>>> case 0b11000000: .. case 0b11011111:
>>> return ((s[0] & 0b00011111) << 6) | (s[1] & 0b00011111);
>>> case 0b11100000: .. case 0b11101111:
>>> result = s[0] & 0b00001111;
>>> bytesize = 3;
>>> break;
>>> case 0b11110000: .. case 0b11110111:
>>> result = s[0] & 0b00000111;
>>> bytesize = 4;
>>> default:
>>> return dchar.init;
>>> }
>>> foreach (i; 1..bytesize)
>>> result = (result << 6) | (s[i] & 0b00111111);
>>> return result;
>>> }
>>
>> Nice, thanks! I'd hope for a short path for the ASCII subset, could
>> you achieve that?
>
> Unfortunately, there's a bug in the above. A missing "break" results in
> a fallthrough to default case. That's why the optimizer is so good, it
> just omits the four-byte branch entirely. I noticed something was wrong
> by looking at the assembly. I really wish D had no implicit fallthrough.
-w
Probably time to move that from a warning to an error. In the short time
we've used D at Facebook (forgot to add "-w" to the implicit flags)
we've already have not one, but two bugs related to switch's implicit
fallthrough.
> But try this instead, the result is even shorter:
>
> dchar front(char[] s)
> {
> if (s[0] < 0b1000000)
> return s[0]; // ASCII
>
> // pattern indicator tailLength
> // 0b100xxxxx 0b00 (0) 1
> // 0b101xxxxx 0b01 (1) 1 == indicator
> // 0b110xxxxx 0b10 (2) 2 == indicator
> // 0b111xxxxx 0b11 (3) 3 == indicator
> // note: undefined result for illegal 0b1111xxxx case
>
> auto indicator = (s[0] >> 5) & 0b11;
> auto tailLength = indicator ? indicator : 1;
>
> dchar result = s[0] & (0b00111111 >> tailLength);
> foreach (i; 0..tailLength)
> result = (result << 6) | (s[1+i] & 0b00111111);
> return result;
> }
>
> (Disclaimer: not tested, but I did check that all the expected code
> paths are present in the assembly this time.)
Nicely done. I collected what we have at http://goo.gl/W9V6E7.
Andrei
More information about the Digitalmars-d
mailing list