UTF8 and unary encoding

Jonathan M Davis via Digitalmars-d digitalmars-d at puremagic.com
Mon Sep 12 08:59:31 PDT 2016


On Monday, September 12, 2016 07:37:05 Andrei Alexandrescu via Digitalmars-d 
wrote:
> While looking at https://en.wikipedia.org/wiki/Unary_coding I found that
> UTF8 uses unary encoding for the length of multibyte sequences.
> Investigating further at https://en.wikipedia.org/wiki/UTF-8 reveals
> that indeed "The number of high-order 1s in the leading byte of a
> multi-byte sequence indicates the number of bytes in the sequence. When
> reading from a stream, a reader can process all fully received sequences
> without first having to wait for either the leading byte of a next
> sequence or an end-of-stream indication."
>
> We don't use that explicitly; instead, we load each byte of
> multi-sequences. Who'd be interested in looking whether Phobos'
> primitives can be faster with multibyte-rich text?

Aren't we already doing that with stride? It reads the number of code units
in a code point from the first code unit, and if we're dealing with a
random-access range of char or an array of char, we then skip that many code
units without reading them. Because we auto-decode in many cases, all of the
bytes still get read in a number of places where they wouldn't need to be if
we were dealing with ranges of char. But where we aren't auto-decoding, we
should already be taking advantage of this in general via stride (though
obviously, there could be specific places where the code is not skipping
bytes like it should).
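
To make the skipping concrete, here is a minimal sketch of the idea in
Python. It is not D and it is not how Phobos' stride is actually written;
utf8_stride and count_code_points are hypothetical names used only for
illustration. The sketch assumes well-formed input and reads only the lead
byte of each sequence, so continuation bytes are never inspected.

```python
def utf8_stride(lead: int) -> int:
    """Return the byte length of the UTF-8 sequence starting with `lead`.

    The count of high-order 1 bits in the lead byte is the unary-coded
    length. Hypothetical helper; overlong leads such as 0xC0 are not rejected.
    """
    if lead < 0x80:
        # 0xxxxxxx: a single-byte (ASCII) sequence.
        return 1
    ones = 0
    mask = 0x80
    while lead & mask:
        # Count leading 1 bits until the first 0 bit.
        ones += 1
        mask >>= 1
    if ones == 1 or ones > 4:
        # 10xxxxxx is a continuation byte; 5 or more ones is never valid.
        raise ValueError(f"not a valid UTF-8 lead byte: {lead:#04x}")
    return ones


def count_code_points(data: bytes) -> int:
    """Count code points by jumping from lead byte to lead byte.

    Assumes `data` is well-formed UTF-8; continuation bytes are skipped
    without being read.
    """
    i = 0
    count = 0
    while i < len(data):
        i += utf8_stride(data[i])
        count += 1
    return count
```

Calling count_code_points on the UTF-8 bytes of a string touches one byte per
code point, which is the access pattern described above for arrays and
random-access ranges of char.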

Or am I misunderstanding what you're talking about doing here?

- Jonathan M Davis


