Why UTF-8/16 character encodings?

Joakim joakim at airpost.net
Sat May 25 02:40:35 PDT 2013


On Saturday, 25 May 2013 at 08:58:57 UTC, Vladimir Panteleev 
wrote:
> Another thing I noticed: sometimes when you think you really 
> need to operate on individual characters (and that your code 
> will not be correct unless you do that), the assumption will be 
> incorrect due to the existence of combining characters in 
> Unicode. Two of the often-quoted use cases of working on 
> individual code points is calculating the string width 
> (assuming a fixed-width font), and slicing the string - both of 
> these will break with combining characters if those are not 
> accounted for. I believe the proper way to approach such tasks 
> is to implement the respective Unicode algorithms for it, which 
> I believe are non-trivial and for which the relative impact for 
> the overhead of working with a variable-width encoding is 
> acceptable.
Combining characters are examples of complexity baked into the 
various languages, so there's no way around that.  I'm arguing 
against layering more complexity on top, through UTF-8.
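To see the combining-character point concretely, here's a small Python sketch (Python only for illustration; the same holds for code points in any language): the same rendered glyph can be one code point or two, so slicing at code-point boundaries can still split a user-perceived character.

```python
# "é" can be one code point (U+00E9) or two (U+0065 + U+0301):
precomposed = "\u00e9"   # 'é' as a single precomposed code point
combining = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

# They render identically but compare unequal, with different lengths:
assert precomposed != combining
assert len(precomposed) == 1 and len(combining) == 2

# Slicing the combining form after one code point strands the accent:
assert combining[:1] == "e"
```

This is why code-point-level slicing alone isn't "correct" regardless of encoding width.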

> Can you post some specific cases where the benefits of a 
> constant-width encoding are obvious and, in your opinion, make 
> constant-width encodings more useful than all the benefits of 
> UTF-8?
Let's take one you listed above, slicing a string.  You have to 
either transcode the entire string to UTF-32 so it's 
constant-width, which is apparently what Phobos does, or decode 
every UTF-8 sequence along the way, every single time.  A 
constant-width, single-byte encoding would be much easier to 
slice, while still using at most half the space.
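To make that cost concrete, here's a hedged Python sketch (names are mine, not from any library) of what slicing by code-point index forces on UTF-8: finding the n-th code point means a linear scan over the bytes, where a fixed-width encoding would be plain arithmetic.

```python
def utf8_offset(data: bytes, n: int) -> int:
    # Walk the bytes, counting code-point starts (any byte that is
    # not a 0b10xxxxxx continuation byte) until the n-th is found.
    seen = 0
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:  # start of a new code point
            if seen == n:
                return i
            seen += 1
    raise IndexError(n)

s = "naïve café".encode("utf-8")
# Slicing code points 2..7 takes two linear scans over the bytes;
# with UTF-32 the offsets would simply be 4*2 and 4*7.
start, end = utf8_offset(s, 2), utf8_offset(s, 7)
assert s[start:end].decode("utf-8") == "naïve café"[2:7]
```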

> Also, I don't think this has been posted in this thread. Not 
> sure if it answers your points, though:
>
> http://www.utf8everywhere.org/
That seems to be a call to use UTF-8 on Windows, with a lot of 
info on how best to do so, but little justification for why you'd 
want to do so in the first place.  For example,

"Q: But what about performance of text processing algorithms, 
byte alignment, etc?

A: Is it really better with UTF-16? Maybe so."

Not exactly a considered analysis of the two. ;)

> And here's a simple and correct UTF-8 decoder:
>
> http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
You cannot honestly look at those multiple state diagrams and 
tell me it's "simple."  That said, the difficulty of _using_ 
UTF-8 is a much bigger problem than implementing a decoder in a 
library.
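For concreteness, here is roughly the branching even the happy path of a decoder performs (a simplified Python sketch of the standard UTF-8 byte patterns; the linked DFA additionally validates overlong forms, surrogates, and truncated sequences by folding all of this into a table):

```python
def decode_one(data: bytes, i: int) -> tuple[int, int]:
    # Decode one code point starting at byte i;
    # return (code_point, index of the next sequence).
    b = data[i]
    if b < 0x80:               # 1-byte sequence: plain ASCII
        return b, i + 1
    if b & 0xE0 == 0xC0:       # 2-byte sequence: 110xxxxx 10xxxxxx
        return ((b & 0x1F) << 6) | (data[i + 1] & 0x3F), i + 2
    if b & 0xF0 == 0xE0:       # 3-byte sequence
        cp = ((b & 0x0F) << 12) | ((data[i + 1] & 0x3F) << 6) \
             | (data[i + 2] & 0x3F)
        return cp, i + 3
    if b & 0xF8 == 0xF0:       # 4-byte sequence
        cp = ((b & 0x07) << 18) | ((data[i + 1] & 0x3F) << 12) \
             | ((data[i + 2] & 0x3F) << 6) | (data[i + 3] & 0x3F)
        return cp, i + 4
    raise ValueError("invalid leading byte")
```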


More information about the Digitalmars-d mailing list