Why UTF-8/16 character encodings?

H. S. Teoh hsteoh at quickfur.ath.cx
Sat May 25 07:34:28 PDT 2013


On Sat, May 25, 2013 at 03:47:41PM +0200, Joakim wrote:
> On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev wrote:
> >On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
> >>>If you want to split a string by ASCII whitespace (newlines,
> >>>tabs and spaces), it makes no difference whether the string is
> >>>in ASCII or UTF-8 - the code will behave correctly in either
> >>>case, variable-width-encodings regardless.
> >>Except that a variable-width encoding will take longer to decode
> >>while splitting, when compared to a single-byte encoding.
> >
> >No. Are you sure you understand UTF-8 properly?
> Are you sure _you_ understand it properly?  Both encodings have to
> check every single character to test for whitespace, but the
> single-byte encoding simply has to load each byte in the string and
> compare it against the whitespace-signifying bytes, while the
> variable-length code has to first load and parse potentially 4 bytes
> before it can compare, because it has to go through the state
> machine that you linked to above.  Obviously the constant-width
> encoding will be faster.  Did I really need to explain this?
[...]

Have you actually tried to write a whitespace splitter for UTF-8? Do you
realize that you can use an ASCII whitespace splitter for UTF-8 and it
will work correctly?

There is no need to decode UTF-8 for whitespace splitting at all, and
nothing has to be parsed. You just iterate over the bytes and split on
0x20 (plus 0x09 and 0x0A if you also want tabs and newlines). This works
because every byte of a multi-byte UTF-8 sequence has its high bit set,
so an ASCII byte such as 0x20 can never occur inside an encoded code
point. There is no performance difference compared to ASCII.

As Dmitry said, UTF-8 is self-synchronizing. While current Phobos code
tries to play it safe by decoding every character, this is not necessary
in many cases.
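
For illustration, a small sketch of what self-synchronization buys you:
from any byte offset you can find the start of the next code point just
by skipping continuation bytes, without decoding anything
(nextCodePointStart is a hypothetical name):

    // UTF-8 continuation bytes always match 10xxxxxx (0x80..0xBF),
    // so resynchronizing from an arbitrary offset is a simple scan.
    size_t nextCodePointStart(string s, size_t i)
    {
        while (i < s.length && (s[i] & 0xC0) == 0x80)
            ++i; // skip continuation bytes
        return i;
    }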


T

-- 
The best compiler is between your ears. -- Michael Abrash
