Reducing the cost of autodecoding
Stefan Koch via Digitalmars-d
digitalmars-d at puremagic.com
Sat Oct 15 14:21:22 PDT 2016
On Saturday, 15 October 2016 at 19:42:03 UTC, Uplink_Coder wrote:
> On Saturday, 15 October 2016 at 19:07:50 UTC, Patrick Schluter
> wrote:
>> At least with that lookup table below, you can detect isolated
>> continuation bytes (192 and 193) and invalid codes (above 244).
>>
>> __gshared static immutable ubyte[] charWidthTab = [
>> 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
>> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
>> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
>> 4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
>> ];
>>
>> length 5 and 6 need not to be tested specifically for your
>> goto.
>
> If you use 0 instead of 1 the length check will suffice for
> throwing on invalid.
__gshared static immutable ubyte[] charWidthTab = [2, 2, 2, 2, 2,
2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,
4, 4, 4, 0, 0, 0, 0,
0, 0, 0, 0];
dchar myFront2(ref char[] str) pure
{
auto c1 = str.ptr[0];
if (c1 & 128)
{
if (c1 & 64)
{
int idx = 0;
int l = charWidthTab.ptr[c1 - 192];
if (str.length < l)
goto Linvalid;
dchar c = 0;
l--;
while (l)
{
l--;
immutable cc = str.ptr[idx++];
debug if (cc & 64) goto Linvalid;
c |= cc;
c <<= 6;
}
c |= str.ptr[idx];
return c;
}
Linvalid:
throw new Exception("yadayada");
}
else
{
return c1;
}
}
This code proofs to be the fastest so far.
On UTF and non-UTF text.
It's also fairly small.
More information about the Digitalmars-d
mailing list