Reducing the cost of autodecoding

Patrick Schluter via Digitalmars-d digitalmars-d at puremagic.com
Sat Oct 15 23:51:30 PDT 2016


On Saturday, 15 October 2016 at 21:21:22 UTC, Stefan Koch wrote:
> On Saturday, 15 October 2016 at 19:42:03 UTC, Uplink_Coder 
> wrote:
>> On Saturday, 15 October 2016 at 19:07:50 UTC, Patrick Schluter 
>> wrote:
>>> At least with the lookup table below, you can detect the 
>>> lead bytes that only start overlong sequences (192 and 193) 
>>> and invalid codes (above 244).
>>>
>>> __gshared static immutable ubyte[] charWidthTab = [
>>>             1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
>>>             2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
>>>             3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
>>>             4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
>>> ];
>>>
>>> Lengths 5 and 6 need not be tested specifically for your 
>>> goto.
>>
>> If you use 0 instead of 1 the length check will suffice for 
>> throwing on invalid.
>
> __gshared static immutable ubyte[] charWidthTab = [
>     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
>     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
>     3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
>     4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0
> ];
>
> dchar myFront2(ref char[] str) pure
> {
>     auto c1 = str.ptr[0];
>     if (c1 & 128)
>     {
>         if (c1 & 64)
>         {
>             int idx = 0;
>             int l = charWidthTab.ptr[c1 - 192];
>             if (str.length < l)
>                 goto Linvalid;
>             dchar c = 0;
>             l--;
>             while (l)
>             {
>                 l--;
>                 immutable cc = str.ptr[idx++];
>                 debug if (cc & 64) goto Linvalid;
>                 c |= cc;
>                 c <<= 6;
>             }
>             c |= str.ptr[idx];
>             return c;
>
>         }
>     Linvalid:
>         throw new Exception("yadayada");
>
>     }
>     else
>     {
>         return c1;
>     }
> }
>
> This code proves to be the fastest so far,
> on both UTF and non-UTF text.
> It's also fairly small.

What does "debug if" do? When I replace it with a plain "if", the 
code generated by "LDC 1.1.0-beta2 -release -O3 -boundscheck=off" 
is 15 lines shorter.
That was my first question, but I think I see the issue now. With 
the debug removed, the condition is always compiled in, and the 
compiler short-circuits the whole loop and jumps straight to 
Linvalid.
The bug is that cc is loaded the first time with the same value 
as c1, because idx is 0 and is post-incremented. Since c1 always 
has bit 6 set on this path, the check fails on the first 
iteration. idx should be pre-incremented, and c would have to be 
initialised with the data bits of c1, not 0.
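
To make the two fixes concrete, here is a minimal sketch of how the 
decoder could look with them applied. It is not Stefan's code and has 
not been benchmarked. The name frontSketch is made up for 
illustration, the table uses 0 for every invalid lead byte (including 
192, 193 and everything above 244), and the continuation bytes are 
masked with 0x3F before being merged, which the quoted loop does not 
do. It still does not reject overlong 3- and 4-byte forms or 
surrogates.

```d
// Sequence length indexed by (lead byte - 0xC0); 0 marks an invalid lead byte.
// Assumption: 0xC0, 0xC1 and 0xF5..0xFF are all treated as invalid.
static immutable ubyte[64] charWidthTab = [
    0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
];

// Hypothetical helper: decodes the first code point of str.
dchar frontSketch(const(char)[] str) pure
{
    if (str.length == 0)
        throw new Exception("empty input");

    immutable uint c1 = str[0];
    if (c1 < 0x80)
        return cast(dchar) c1;          // ASCII fast path
    if (c1 < 0xC0)
        throw new Exception("isolated continuation byte");

    immutable uint len = charWidthTab[c1 - 0xC0];
    if (len == 0 || str.length < len)
        throw new Exception("invalid lead byte or truncated sequence");

    // Start from the data bits of the lead byte, not from 0:
    // 5, 4 or 3 low bits for lengths 2, 3 and 4.
    uint c = c1 & (0x7F >> len);

    // Begin at index 1, so the lead byte is never re-read as a continuation.
    foreach (size_t i; 1 .. len)
    {
        immutable uint cc = str[i];
        if ((cc & 0xC0) != 0x80)
            throw new Exception("expected continuation byte");
        c = (c << 6) | (cc & 0x3F);     // keep only the 6 payload bits
    }
    return cast(dchar) c;
}
```

With this version the continuation check is always compiled in, so 
there is no difference between "debug if" and "if" to trip over. The 
cost is one extra compare per continuation byte.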


