Reducing the cost of autodecoding

Uplink_Coder via Digitalmars-d digitalmars-d at puremagic.com
Sun Oct 16 01:43:23 PDT 2016


On Sunday, 16 October 2016 at 07:59:16 UTC, Patrick Schluter 
wrote:
> Here my version. It's probably not the shortest (100 ligns of 
> assembly with LDC) but it is correct and has following 
> properties:
> - Performance proportional to the encoding length
> - Detects Invalid byte sequences
> - Detects Overlong encodings
> - Detects Invalid code points
>
> I put the exception to be comparable to other routines but 
> Unicode specifies that it is preferable to not abort on 
> encoding errors (to avoid denial of service attacks).
>
> dchar myFront2(ref char[] str)
> {
>   dchar c0 = str.ptr[0];
>   if(c0 < 0x80) {
>     return c0;
>   }
>   else if(str.length > 1) {
>     dchar c1 = str.ptr[1];
>     if(c0 < 0xE0 && (c1 & 0xC0) == 0x80) {
>       c1 = ((c0 & 0x1F) << 6)|(c1 & 0x3F);
>       if(c1 < 0x80) goto Linvalid;
>       return c1;
>     }
>     else if(str.length > 2) {
>       dchar c2 = str.ptr[2];
>       if(c0 < 0xF0 && (c1 & 0xC0) == 0x80 && (c2 & 0xC0) == 
> 0x80) {
>         c2 = ((c0 & 0x0F) << 12)|((c1 & 0x3F) << 6)|(c2 & 0x3F);
>         if(c2 < 0x800) goto Linvalid;
>         return c2;
>       }
>       else if(str.length > 3) {
>         dchar c3 = str.ptr[3];
>         if(c0 < 0xF5 && (c1 & 0xC0) == 0x80 && (c2 & 0xC0) == 
> 0x80 && (c3 & 0xC0) == 0x80) {
>           c3 = ((c0 & 0x07) << 16)|((c1 & 0x3F) << 12)|((c2 & 
> 0x3F) << 6)|(c3 & 0x3F);
>           if(c3 < 0x10000  || c3 > 0x10ffff) goto Linvalid;
>           return c3;
>         }
>       }
>     }
>   }
>   Linvalid:
>      throw new Exception("yadayada");
> //assert(myFront2(['\xC2','\xA2'])==0xA3);
> }

This looks quite slow.
We already have a correct version in utf.decodeImpl.
The goal here was to find a small and fast alternative.



More information about the Digitalmars-d mailing list