Challenge: write a really really small front() for UTF8
Michel Fortin
michel.fortin at michelf.ca
Sun Mar 23 22:27:43 PDT 2014
On 2014-03-24 05:09:15 +0000, Andrei Alexandrescu
<SeeWebsiteForEmail at erdani.org> said:
> On 3/23/14, 9:49 PM, Michel Fortin wrote:
>>
> http://goo.gl/y9EVFr
>
> Andrei
As I said earlier, that front2 version is broken. It's missing a break.
Adding the break makes it non-interesting from an instruction count
point of view.
Also, there's still an issue in my code above (found by safetyOff): the
first "if" for ASCII should use 0b1000_0000, not 0b1000000 (missing one
zero). Here's the corrected (again) version of front1:
dchar front1(char[] s)
{
if (s[0] < 0b1000_0000)
return s[0]; // ASCII
// pattern indicator tailLength
// 0b1100xxxx 0b00 (0) 1
// 0b1101xxxx 0b01 (1) 1 == indicator
// 0b1110xxxx 0b10 (2) 2 == indicator
// 0b1111xxxx 0b11 (3) 3 == indicator
// note: undefined result for illegal 0b11111xxx case
auto indicator = (s[0] >> 4) & 0b11;
auto tailLength = indicator ? indicator : 1;
dchar result = s[0] & (0b0011_1111 >> tailLength);
foreach (i; 0..tailLength)
result = (result << 6) | (s[1+i] & 0b0011_1111);
return result;
}
And I'm going to suggest a variant of the above with one less branch
(but three more instructions): look at how tailLenght is computed by
or'ing with the negative of bit 2 of indicator. I suspect it'll be
faster with non-ASCII input, unless it gets inlined less.
dchar front2(char[] s)
{
if (s[0] < 0b1000_0000)
return s[0]; // ASCII
// pattern indicator tailLength
// 0b1100xxxx 0b00 (0) 1
// 0b1101xxxx 0b01 (1) 1
// 0b1110xxxx 0b10 (2) 2
// 0b1111xxxx 0b11 (3) 3
// note: undefined result for illegal 0b11111xxx case
auto indicator = ((s[0] >> 4) & 0b11);
auto tailLength = indicator | ((~indicator >> 1) & 0b1);
dchar result = s[0] & (0b0011_1111 >> tailLength);
foreach (i; 0..tailLength)
result = (result << 6) | (s[1+i] & 0b0011_1111);
return result;
}
--
Michel Fortin
michel.fortin at michelf.ca
http://michelf.ca
More information about the Digitalmars-d
mailing list