Challenge: write a really really small front() for UTF8

Daniel N ufo at orbiting.us
Mon Mar 24 05:21:54 PDT 2014


On Monday, 24 March 2014 at 11:48:00 UTC, Dmitry Olshansky wrote:
>> RFC 3629 (http://tools.ietf.org/html/rfc3629) restricted UTF-8
>> to conform to constraints in UTF-16, removing all 5- and 6-byte
>> sequences.
>
> More importantly, the Unicode standard explicitly fixed the range
> of code points to what is representable in UTF-16, starting with
> the 5th version of the standard if memory serves me right.

I did some hacks in C at work using _pext_u32. The underlying
instruction (pext) is absolutely wonderful, as is its counterpart
pdep:
http://software.intel.com/sites/landingpage/IntrinsicsGuide/

It is also ridiculously fast according to Agner (latency 3, throughput 1):
http://www.agner.org/optimize/instruction_tables.pdf

I think we should add this as an intrinsic to D as well (if it
isn't there already; I couldn't find it). It could do wonders for
UTF decoding.
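
A minimal sketch of how pext might apply to UTF-8 decoding, assuming the
input is already known to be a valid sequence of 1 to 4 bytes. The names
pext32_soft, payload_mask and utf8_decode_valid are hypothetical and only
for illustration. The software loop stands in for the hardware instruction
so the sketch runs anywhere; on a CPU with BMI2 the call could be replaced
by _pext_u32(packed, mask) from <immintrin.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for _pext_u32 (hypothetical helper): gathers the bits
   of src selected by mask into the low bits of the result, in order. */
static uint32_t pext32_soft(uint32_t src, uint32_t mask)
{
    uint32_t result = 0;
    uint32_t out_bit = 1;
    while (mask != 0) {
        uint32_t lowest = mask & (0u - mask); /* isolate lowest set bit */
        if (src & lowest)
            result |= out_bit;
        out_bit <<= 1;
        mask &= mask - 1;                     /* clear lowest set bit */
    }
    return result;
}

/* Payload-bit masks, indexed by sequence length, for bytes packed with the
   lead byte in the most significant position. */
static const uint32_t payload_mask[5] = {
    0,
    0x0000007Fu, /* 0xxxxxxx */
    0x00001F3Fu, /* 110xxxxx 10xxxxxx */
    0x000F3F3Fu, /* 1110xxxx 10xxxxxx 10xxxxxx */
    0x073F3F3Fu  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
};

/* Decodes one code point from a sequence that is ASSUMED to be valid UTF-8
   of the given length; no validation is performed here. */
static uint32_t utf8_decode_valid(const unsigned char *s, size_t len)
{
    assert(len >= 1 && len <= 4);
    uint32_t packed = 0;
    for (size_t i = 0; i < len; i++)
        packed = (packed << 8) | s[i];
    /* One extract replaces the usual per-byte mask, shift and OR. */
    return pext32_soft(packed, payload_mask[len]);
}
```

For example, the two bytes C3 A9 pack to 0xC3A9, and extracting with the
mask 0x1F3F yields 0xE9 (U+00E9). Determining the sequence length and
rejecting invalid input are left out of this sketch.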

I'm currently too busy to submit a complete solution, but please 
feel free to use my idea if you think it sounds promising.

