UTF8 and unary encoding

Andrei Alexandrescu via Digitalmars-d digitalmars-d at puremagic.com
Mon Sep 12 04:37:05 PDT 2016


While looking at https://en.wikipedia.org/wiki/Unary_coding I found that 
UTF-8 uses unary coding for the length of multibyte sequences. 
Investigating further at https://en.wikipedia.org/wiki/UTF-8 reveals 
that indeed "The number of high-order 1s in the leading byte of a 
multi-byte sequence indicates the number of bytes in the sequence. When 
reading from a stream, a reader can process all fully received sequences 
without first having to wait for either the leading byte of a next 
sequence or an end-of-stream indication."
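
To make the quoted rule concrete, here is a minimal sketch in C of reading 
the length from the leading byte alone. The names utf8_seq_len and 
utf8_count_complete are hypothetical helpers made up for illustration; 
they are not Phobos functions and say nothing about how Phobos decodes 
today. The sketch only counts the leading 1 bits. It does not validate 
continuation bytes, and it does not reject the lead bytes 0xC0, 0xC1 and 
0xF5..0xF7, which a strict decoder would refuse.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: total length in bytes of the UTF-8 sequence that
   starts with `lead`, read from the unary-coded run of high-order 1 bits.
   Returns 0 when `lead` cannot start a sequence: a continuation byte
   (10xxxxxx) or a byte with more than four leading 1s. */
static unsigned utf8_seq_len(uint8_t lead)
{
    unsigned ones = 0;

    /* Count consecutive 1 bits starting from the most significant bit. */
    while (ones < 8 && (lead & (0x80u >> ones)))
        ++ones;

    if (ones == 0)
        return 1;               /* 0xxxxxxx: single-byte (ASCII) */
    if (ones == 1 || ones > 4)
        return 0;               /* continuation byte or invalid lead */
    return ones;                /* 110x.., 1110.., 11110..: 2, 3, 4 bytes */
}

/* Hypothetical helper: number of fully received sequences in buf[0..n).
   Hops from lead byte to lead byte using the length above, and stops at
   the first invalid lead byte or at a sequence whose tail has not
   arrived yet. Continuation bytes are skipped, not checked. */
static size_t utf8_count_complete(const uint8_t *buf, size_t n)
{
    size_t i = 0, count = 0;

    while (i < n) {
        unsigned len = utf8_seq_len(buf[i]);
        if (len == 0 || len > n - i)
            break;
        i += len;
        ++count;
    }
    return count;
}
```

For example, utf8_seq_len(0xE2) yields 3, so a reader that has seen the 
first byte of a three-byte sequence already knows how many more bytes to 
expect. The bit-counting loop could presumably be replaced by a 
count-leading-zeros instruction applied to the inverted byte, where the 
compiler exposes one.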

We don't use that explicitly; instead, we load each byte of 
multibyte sequences. Who'd be interested in looking into whether Phobos' 
primitives can be made faster on multibyte-rich text?


Andrei

