New UTF-8 stride function

Dmitry Olshansky dmitry.olsh at gmail.com
Tue May 28 08:31:02 PDT 2013


28-May-2013 00:42, Martin Nowak wrote:
> On 05/27/2013 09:21 PM, Martin Nowak wrote:
>>  > See unittest/benchmark here:
>>  > https://gist.github.com/blackwhale/5653927
>>  >
>> Looks promising.
>
> This will not detect 0xFF as invalid UTF-8 sequence.
> For sequences with 5 or 6 bytes, that aren't used for unicode, it will
> return a stride of 4.
>

First of all, there is a minor bug in std.utf: it accepts sequences of
5 and 6 bytes. These are explicitly not defined by the Unicode standard
and should be rejected as invalid UTF-8 as well.

OK, I just need to consider the next bit as well, making the whole mask
4 bits wide. Thus I need 16 slots in a register.

The 64-bit version will fit just fine in a register: 4 bits per slot *
16 slots = 64 bits.
The 32-bit version will have to pack 2 bits per slot and add 1
afterwards.
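
To make the packing concrete, here is a rough sketch of how such a table-in-a-register lookup could look. It is not the code from the linked file. The names strideTable64, strideTable32, strideSketch64 and strideSketch32 are made up for illustration, and so are the choices of bits 6..3 of the lead byte as the slot index and of 0 as the "invalid" marker. It only classifies the lead byte by its prefix; it does not reject overlong lead bytes such as 0xC0 or validate continuation bytes.

```d
// Hypothetical sketch, not the implementation from the linked repository.
// Slot index = bits 6..3 of a non-ASCII lead byte (16 possible slots).
// A stored value of 0 marks an invalid lead byte (assumption of this sketch).

// 64-bit variant: 16 slots * 4 bits, each slot holds the stride directly.
enum ulong strideTable64 =
      (2UL << (8 * 4)) | (2UL << (9 * 4))
    | (2UL << (10 * 4)) | (2UL << (11 * 4))   // 110xxxxx -> 2 bytes
    | (3UL << (12 * 4)) | (3UL << (13 * 4))   // 1110xxxx -> 3 bytes
    | (4UL << (14 * 4));                      // 11110xxx -> 4 bytes
// Slots 0..7 (10xxxxxx, continuation bytes) and slot 15 (11111xxx,
// i.e. 5/6-byte forms, 0xFE and 0xFF) are left at 0.

uint strideSketch64(ubyte c) pure nothrow @safe
{
    if (c < 0x80)
        return 1;                             // ASCII fast path
    immutable uint slot = (c >> 3) & 0xF;
    return cast(uint)((strideTable64 >> (slot * 4)) & 0xF);
}

// 32-bit variant: 16 slots * 2 bits, each slot holds (stride - 1).
// Stride 1 never reaches the table, so a stored 0 can mean "invalid".
enum uint strideTable32 =
      (1u << (8 * 2)) | (1u << (9 * 2))
    | (1u << (10 * 2)) | (1u << (11 * 2))     // 110xxxxx -> 2 - 1
    | (2u << (12 * 2)) | (2u << (13 * 2))     // 1110xxxx -> 3 - 1
    | (3u << (14 * 2));                       // 11110xxx -> 4 - 1

uint strideSketch32(ubyte c) pure nothrow @safe
{
    if (c < 0x80)
        return 1;                             // ASCII fast path
    immutable uint slot = (c >> 3) & 0xF;
    immutable uint packed = (strideTable32 >> (slot * 2)) & 0x3;
    return packed == 0 ? 0 : packed + 1;      // the "+1 afterwards" step
}

unittest
{
    assert(strideSketch64(0xC3) == 2);
    assert(strideSketch64(0xFF) == 0);        // reported as invalid
    assert(strideSketch32(0xF0) == 4);
}
```

In this sketch a caller would treat a result of 0 as "throw invalid UTF"; whether the real version signals errors that way is not something the sketch can tell you.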

Here is an updated version that I'm testing again:
https://github.com/blackwhale/gsoc-bench-2012/blob/master/fast_stride.d

-- 
Dmitry Olshansky

