Challenge: write a really really small front() for UTF8
Michel Fortin
michel.fortin at michelf.ca
Sun Mar 23 21:37:22 PDT 2014
On 2014-03-24 02:25:17 +0000, Andrei Alexandrescu
<SeeWebsiteForEmail at erdani.org> said:
> On 3/23/14, 6:53 PM, Michel Fortin wrote:
>> On 2014-03-23 21:22:58 +0000, Andrei Alexandrescu
>> <SeeWebsiteForEmail at erdani.org> said:
>>
>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>
>> Optimizing for smallest assembly size:
>>
>> dchar front(char[] s)
>> {
>> size_t bytesize;
>> dchar result;
>> switch (s[0])
>> {
>> case 0b00000000: .. case 0b01111111:
>> return s[0];
>> case 0b11000000: .. case 0b11011111:
>> return ((s[0] & 0b00011111) << 6) | (s[1] & 0b00011111);
>> case 0b11100000: .. case 0b11101111:
>> result = s[0] & 0b00001111;
>> bytesize = 3;
>> break;
>> case 0b11110000: .. case 0b11110111:
>> result = s[0] & 0b00000111;
>> bytesize = 4;
>> default:
>> return dchar.init;
>> }
>> foreach (i; 1..bytesize)
>> result = (result << 6) | (s[i] & 0b00111111);
>> return result;
>> }
>
> Nice, thanks! I'd hope for a short path for the ASCII subset, could you
> achieve that?
Unfortunately, there's a bug in the above. A missing "break" results in
a fallthrough to default case. That's why the optimizer is so good, it
just omits the four-byte branch entirely. I noticed something was wrong
by looking at the assembly. I really wish D had no implicit fallthrough.
But try this instead, the result is even shorter:
dchar front(char[] s)
{
if (s[0] < 0b1000000)
return s[0]; // ASCII
// pattern indicator tailLength
// 0b100xxxxx 0b00 (0) 1
// 0b101xxxxx 0b01 (1) 1 == indicator
// 0b110xxxxx 0b10 (2) 2 == indicator
// 0b111xxxxx 0b11 (3) 3 == indicator
// note: undefined result for illegal 0b1111xxxx case
auto indicator = (s[0] >> 5) & 0b11;
auto tailLength = indicator ? indicator : 1;
dchar result = s[0] & (0b00111111 >> tailLength);
foreach (i; 0..tailLength)
result = (result << 6) | (s[1+i] & 0b00111111);
return result;
}
(Disclaimer: not tested, but I did check that all the expected code
paths are present in the assembly this time.)
--
Michel Fortin
michel.fortin at michelf.ca
http://michelf.ca
More information about the Digitalmars-d
mailing list