UTF8 and unary encoding
Andrei Alexandrescu via Digitalmars-d
digitalmars-d at puremagic.com
Mon Sep 12 04:37:05 PDT 2016
While looking at https://en.wikipedia.org/wiki/Unary_coding I found that
UTF-8 uses unary encoding for the length of multibyte sequences.
Investigating further at https://en.wikipedia.org/wiki/UTF-8 reveals
that indeed "The number of high-order 1s in the leading byte of a
multi-byte sequence indicates the number of bytes in the sequence. When
reading from a stream, a reader can process all fully received sequences
without first having to wait for either the leading byte of a next
sequence or an end-of-stream indication."
We don't use that explicitly; instead, we load and inspect each byte of
a multibyte sequence. Who'd be interested in looking into whether Phobos'
primitives can be made faster on multibyte-rich text?
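
To make the idea concrete, here is a minimal sketch in C (not Phobos code;
the names utf8_seq_len and utf8_count are hypothetical). It reads the unary
length prefix from the leading byte and then skips the whole sequence in one
step. It assumes the input is already known to be valid UTF-8: continuation
bytes are not inspected, and overlong or out-of-range leading bytes are not
rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Sequence length taken from the unary prefix of the leading byte.
   Returns 0 for a byte that cannot start a sequence (a continuation
   byte 10xxxxxx, or a prefix of five or more 1s). */
static size_t utf8_seq_len(uint8_t lead)
{
    size_t ones = 0;
    /* Count the high-order 1 bits, stopping at the first 0. */
    while (ones < 8 && (lead & (0x80u >> ones)))
        ++ones;
    if (ones == 0)
        return 1;               /* 0xxxxxxx: single-byte (ASCII) */
    if (ones == 1 || ones > 4)
        return 0;               /* not a valid leading byte */
    return ones;                /* 110..., 1110..., 11110... */
}

/* Count code points by hopping from leading byte to leading byte.
   Continuation bytes are skipped without being examined (assumption:
   the caller has validated the input). Returns (size_t)-1 on a bad
   leading byte or a sequence truncated by the end of the buffer. */
static size_t utf8_count(const uint8_t *s, size_t n)
{
    size_t i = 0, count = 0;
    while (i < n) {
        size_t len = utf8_seq_len(s[i]);
        if (len == 0 || len > n - i)
            return (size_t)-1;
        i += len;
        ++count;
    }
    return count;
}
```

Whether this beats a byte-at-a-time loop would need measuring. The counting
loop could also be replaced by a count-leading-zeros intrinsic applied to
the inverted byte, where the platform offers one.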
Andrei