Why UTF-8/16 character encodings?
Joakim
joakim at airpost.net
Sat May 25 02:40:35 PDT 2013
On Saturday, 25 May 2013 at 08:58:57 UTC, Vladimir Panteleev
wrote:
> Another thing I noticed: sometimes when you think you really
> need to operate on individual characters (and that your code
> will not be correct unless you do that), the assumption will be
> incorrect due to the existence of combining characters in
> Unicode. Two of the often-quoted use cases of working on
> individual code points are calculating the string width
> (assuming a fixed-width font), and slicing the string - both of
> these will break with combining characters if those are not
> accounted for. I believe the proper way to approach such tasks
> is to implement the respective Unicode algorithms for it, which
> I believe are non-trivial and for which the relative impact for
> the overhead of working with a variable-width encoding is
> acceptable.
Combining characters are examples of complexity baked into the
various languages, so there's no way around that. I'm arguing
against layering more complexity on top, through UTF-8.
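To make the combining-character point concrete, here is a minimal Python sketch. It assumes a fixed-width font in which combining marks occupy no column, and it ignores wide East Asian characters and other zero-width code points; the helper name fixed_width_columns is made up for illustration. It shows that counting code points gives the wrong width for a decomposed string regardless of the encoding, constant-width or not.

```python
import unicodedata

def fixed_width_columns(text: str) -> int:
    """Count columns, assuming combining marks take no column (simplification)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

composed = "caf\u00e9"      # 'e with acute' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Code point counts differ even though both render as the same four characters.
print(len(composed), len(decomposed))

# A constant-width encoding (UTF-32) still stores five units for the second form.
print(len(decomposed.encode("utf-32-le")) // 4)

# Counting only non-combining code points gives the same width for both.
print(fixed_width_columns(composed), fixed_width_columns(decomposed))

# Slicing by code point cuts the accent off its base letter.
print(decomposed[:4])
```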
> Can you post some specific cases where the benefits of a
> constant-width encoding are obvious and, in your opinion, make
> constant-width encodings more useful than all the benefits of
> UTF-8?
Let's take one you listed above, slicing a string. You have to
either translate your entire string into UTF-32 so it's
constant-width, which is apparently what Phobos does, or decode
every single UTF-8 character along the way, every single time. A
constant-width, single-byte encoding would be much easier to
slice, while still using at most half the space.
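A rough Python sketch of the slicing comparison, under some assumptions: ISO-8859-7 stands in for "a constant-width, single-byte encoding" (no particular encoding is named above), and utf8_offset and utf8_slice are hypothetical helpers, not Phobos functions. The UTF-8 version has to walk the bytes to find code point boundaries; the single-byte version indexes directly.

```python
def utf8_offset(data: bytes, n: int) -> int:
    """Byte offset of the n-th code point (0-based) in valid UTF-8 data.

    n equal to the number of code points returns len(data).
    """
    seen = 0
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:      # not a continuation byte: a code point starts here
            if seen == n:
                return i
            seen += 1
    if seen == n:
        return len(data)
    raise IndexError("code point index out of range")

def utf8_slice(data: bytes, start: int, stop: int) -> bytes:
    """Slice by code point index; costs a linear scan for each bound."""
    return data[utf8_offset(data, start):utf8_offset(data, stop)]

# Greek word followed by ASCII digits, written with escapes to avoid ambiguity.
text = "\u0395\u03bb\u03bb\u03ac\u03b4\u03b1 2013"

utf8 = text.encode("utf-8")
single = text.encode("iso8859_7")   # assumption: stand-in single-byte encoding

# Same logical slice, two ways.
a = utf8_slice(utf8, 2, 6).decode("utf-8")
b = single[2:6].decode("iso8859_7")  # direct indexing, no scan
print(a == b == text[2:6])
print(len(utf8), len(single))        # storage in bytes for each encoding
```

Note that both slices are by code point, so neither accounts for combining characters.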
> Also, I don't think this has been posted in this thread. Not
> sure if it answers your points, though:
>
> http://www.utf8everywhere.org/
That seems to be a call to use UTF-8 on Windows, with a lot of
info on how best to do so, but little justification for why
you'd want to do so in the first place. For example,
"Q: But what about performance of text processing algorithms,
byte alignment, etc?
A: Is it really better with UTF-16? Maybe so."
Not exactly a considered analysis of the two. ;)
> And here's a simple and correct UTF-8 decoder:
>
> http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
You cannot honestly look at those multiple state diagrams and
tell me it's "simple." That said, the difficulty of _using_
UTF-8 is a much bigger problem than implementing a decoder
in a library.
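For a sense of what any strict decoder has to check, here is a plain branch-based sketch in Python. It is not the DFA decoder linked above, and the name decode_utf8 is made up for illustration. It is meant to reject overlong forms, surrogates, values above U+10FFFF, and truncated sequences.

```python
def decode_utf8(data: bytes) -> list:
    """Decode UTF-8 into a list of code points; raise ValueError if malformed."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        # need: number of continuation bytes; lo: smallest value allowed
        # for this sequence length (anything lower is an overlong form).
        if b < 0x80:
            need, cp, lo = 0, b, 0
        elif 0xC2 <= b <= 0xDF:
            need, cp, lo = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:
            need, cp, lo = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF4:
            need, cp, lo = 3, b & 0x07, 0x10000
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + need >= n:
            raise ValueError(f"truncated sequence at offset {i}")
        for k in range(1, need + 1):
            c = data[i + k]
            if (c & 0xC0) != 0x80:
                raise ValueError(f"invalid continuation byte at offset {i + k}")
            cp = (cp << 6) | (c & 0x3F)
        if cp < lo or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"overlong, surrogate, or out-of-range value at offset {i}")
        out.append(cp)
        i += need + 1
    return out

sample = "h\u00e9llo \u20ac \U0001F600"
print(decode_utf8(sample.encode("utf-8")) == [ord(ch) for ch in sample])
```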