Why UTF-8/16 character encodings?

Joakim joakim at airpost.net
Fri May 24 13:37:57 PDT 2013


On Friday, 24 May 2013 at 17:43:03 UTC, Peter Alexander wrote:
> Simple: backwards compatibility with all ASCII APIs (e.g. most 
> C libraries), and because I don't want my strings to consume 
> multiple bytes per character when I don't need it.
And yet here we are today, where an early decision made solely to 
accommodate the authors of then-dominant all-ASCII APIs has now 
foisted an unnecessarily complex encoding on all of us, with 
reduced performance as a result.  You do realize that my 
encoding would encode almost all languages' characters in single 
bytes, unlike UTF-8, right?  As for your second point, not 
wanting multiple bytes per character is precisely an argument 
against UTF-8, not against my scheme.

> Your language header idea is no good for at least three reasons:
>
> 1. What happens if I want to take a substring slice of your 
> string? I'll need to allocate a new string to add the header in.
Good point.  The solution that comes to mind right now is that 
you'd parse my format and store it in memory as a String class, 
storing the chars in an internal array with the header stripped 
out and the language stored in a property.  That way, even a 
slice could be made to refer to the same language, by referring 
to the language of the containing array.
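
To make that concrete, here is a rough D sketch of the kind of 
struct I have in mind (the name LangString and the 
single-language assumption are just mine for illustration, not a 
finished design):

struct LangString
{
    ubyte lang;      // language code taken from the stripped header
    ubyte[] data;    // one byte per character, no header in here

    // Slicing reuses the same underlying array, so the slice picks
    // up the language of the containing string for free.
    LangString opSlice(size_t lo, size_t hi)
    {
        return LangString(lang, data[lo .. hi]);
    }
}

A slice like s[1 .. 5] would then just be another LangString that 
points into the same array and carries the same lang value, with 
no copying and no per-character overhead.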

Strictly speaking, this solution could also be implemented with 
UTF-8, simply by changing the in-memory data structure to the 
one I've outlined, rather than using the UTF-8 encoding for both 
transmission and processing.  But if you're going to use my 
format for processing, you might as well use it for transmission 
too, since it is much smaller for non-ASCII text.

Before you ridicule my solution as somehow unworkable, let me 
remind you of the current monstrosity.  Currently, the language 
is effectively stored in every single UTF-8 character, by having 
each character's length vary from one to four bytes depending on 
the language.  This leads to Phobos decoding every UTF-8 string 
to UTF-32, so that it can easily run its algorithms on a 
constant-width 32-bit character set, with the resulting 
performance penalties.  Perhaps the biggest loss is that 
programmers everywhere are pushed to wrap their heads around 
this mess, predictably leading to either ignorance or broken 
code.
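
To make the decoding cost concrete, here is a small D example 
(the string is arbitrary, I'm only illustrating the mechanism):

import std.stdio;

void main()
{
    string s = "naïve";    // 5 code points, but 6 bytes of UTF-8
    writeln(s.length);     // prints 6: length counts UTF-8 code units
    foreach (dchar c; s)   // each step decodes 1-2 bytes into a 32-bit dchar
        writef("U+%04X ", cast(uint) c);
    writeln();
}

Phobos' range primitives do the same decoding behind your back: 
front on a string hands you a decoded dchar, not a byte.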

Which seems more unworkable to you?

> 2. What if I have a long string with the ASCII header and want 
> to append a non-ASCII character on the end? I'll need to 
> reallocate the whole string and widen it with the new header.
How often does this happen in practice?  I suspect that this 
almost never happens.  But if it does, it would be solved by the 
String class I outlined above, as the header isn't stored in the 
array anymore.

> 3. Even if I have a string that is 99% ASCII then I have to pay 
> extra bytes for every character just because 1% wasn't ASCII. 
> With UTF-8, I only pay the extra bytes when needed.
I don't understand what you mean here.  If your string has a 
thousand non-ASCII characters, the UTF-8 version will be one or 
two thousand bytes bigger, i.e. 1 or 2 KB more.  My format would 
only add a couple of header bytes for each non-ASCII language 
used, and that's it.  It's a clear win for my format.
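
Just to put rough numbers on your 99%/1% example (assuming 
100,000 characters total; the two header bytes and the 
two-byte-per-character script are also my assumptions):

import std.stdio;

void main()
{
    enum asciiChars    = 99_000;  // the "99% ASCII" part
    enum nonAsciiChars =  1_000;  // the 1% that isn't

    // UTF-8: one byte per ASCII char, two bytes per char in a
    // two-byte script such as Cyrillic or Greek.
    auto utf8Bytes = asciiChars + nonAsciiChars * 2;

    // Header scheme: one byte per char plus, say, two header
    // bytes for the one extra language.
    auto myBytes = asciiChars + nonAsciiChars + 2;

    writefln("UTF-8: %s bytes, header scheme: %s bytes",
             utf8Bytes, myBytes);  // 101000 vs. 100002
}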

In any case, I just came up with the simplest format I could 
off the top of my head, so maybe there are gaping holes in it.  
But my point is that we should be able to come up with a much 
simpler format that keeps most characters to a single byte, not 
that my particular format is the best.  All I want to argue is 
that UTF-8 is the worst. ;)

