Why UTF-8/16 character encodings?
Joakim
joakim at airpost.net
Fri May 24 13:37:57 PDT 2013
On Friday, 24 May 2013 at 17:43:03 UTC, Peter Alexander wrote:
> Simple: backwards compatibility with all ASCII APIs (e.g. most
> C libraries), and because I don't want my strings to consume
> multiple bytes per character when I don't need it.
And yet here we are today, where an early decision made solely to
accommodate the authors of then-dominant all-ASCII APIs has foisted
an unnecessarily complex encoding on all of us, with reduced
performance as a result. You do realize that my encoding would
represent almost all languages' characters as single bytes, unlike
UTF-8, right? Your second argument, about not wanting multiple bytes
per character, is really an argument against UTF-8.
> Your language header idea is no good for at least three reasons:
>
> 1. What happens if I want to take a substring slice of your
> string? I'll need to allocate a new string to add the header in.
Good point. The solution that comes to mind right now is that
you'd parse my format and store it in memory as a String class,
storing the chars in an internal array with the header stripped
out and the language stored in a property. That way, even a
slice could be made to refer to the same language, by referring
to the language of the containing array.
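Here is a rough sketch in D of what I mean. The names
(SingleByteString, LanguageId) are made up for illustration; nothing
like this exists in Phobos:

enum LanguageId : ushort { ascii = 0, cyrillic = 1, greek = 2 /* and so on */ }

struct SingleByteString
{
    LanguageId language;      // parsed from the transmission header, then stripped
    immutable(ubyte)[] data;  // one byte per character, no header bytes inside

    // A slice shares the underlying array, so it simply carries along
    // the language of the containing string.
    SingleByteString slice(size_t lo, size_t hi)
    {
        return SingleByteString(language, data[lo .. hi]);
    }

    // Constant-width characters: counting and indexing are O(1).
    size_t length() { return data.length; }
    ubyte opIndex(size_t i) { return data[i]; }
}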
Strictly speaking, this solution could also be implemented with
UTF-8, simply by changing the in-memory data structure to the one
I've outlined, as opposed to using the UTF-8 encoding for both
transmission and processing. But if you're going to use my format for
processing, you might as well use it for transmission too, since it
is much smaller for non-ASCII text.
Before you ridicule my solution as somehow unworkable, let me
remind you of the current monstrosity. Today, the language is
effectively stored in every single UTF-8 character, by having its
length vary from one to four bytes depending on the language. This
leads to Phobos decoding every UTF-8 string to UTF-32 so that it can
run its algorithms on a constant-width 32-bit representation, with
the resulting performance penalties. Perhaps the biggest loss is that
programmers everywhere are pushed to wrap their heads around this
mess, which predictably leads to either ignorance or broken code.
Which seems more unworkable to you?
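To make the decoding cost concrete, here is what happens today with
plain Phobos (nothing hypothetical in this part):

import std.range : walkLength;

void main()
{
    string s = "héllo";        // 'é' takes two bytes in UTF-8
    assert(s.length == 6);     // .length counts code units (bytes)
    assert(s.walkLength == 5); // counting characters requires decoding each one

    // Range operations over a string decode every character to a 32-bit
    // dchar on the fly; that decoding is the performance penalty I mean.
    foreach (dchar c; s)
    {
        // each iteration decodes 1 to 4 bytes into one dchar
    }
}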
> 2. What if I have a long string with the ASCII header and want
> to append a non-ASCII character on the end? I'll need to
> reallocate the whole string and widen it with the new header.
How often does this happen in practice? I suspect that this
almost never happens. But if it does, it would be solved by the
String class I outlined above, as the header isn't stored in the
array anymore.
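To spell that out with the hypothetical SingleByteString sketch from
earlier: since the language lives in a separate field rather than
inside the byte array, an append never has to rewrite an in-array
header. (How one string would mix several languages is exactly the
kind of hole I admit to below; this only shows the mechanics.)

// Builds on the hypothetical SingleByteString sketch above.
void append(ref SingleByteString s, ubyte ch, LanguageId lang)
{
    s.language = lang;  // e.g. ascii -> cyrillic; no in-array header to rewrite
    s.data ~= ch;       // amortized O(1), like any D array append
}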
> 3. Even if I have a string that is 99% ASCII then I have to pay
> extra bytes for every character just because 1% wasn't ASCII.
> With UTF-8, I only pay the extra bytes when needed.
I don't understand what you mean here. If your string has a
thousand non-ASCII characters, the UTF-8 version will be one or two
thousand bytes larger, i.e. 1 or 2 KB more. My format would add a
couple of bytes to the header for each non-ASCII language used, and
that's it. It's a clear win for my format.
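The arithmetic, using real UTF-8 byte counts for a Cyrillic example
(the single-byte side is, again, hypothetical):

import std.array : replicate;
import std.range : walkLength;

void main()
{
    auto cyrillic = "я".replicate(1000);  // 1000 Cyrillic characters
    assert(cyrillic.walkLength == 1000);  // 1000 code points...
    assert(cyrillic.length == 2000);      // ...but 2000 bytes in UTF-8

    // The hypothetical single-byte format: roughly 1000 bytes for the
    // characters plus a few header bytes for the language tag, i.e.
    // about 1 KB saved on this one string.
}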
In any case, I just came up with the simplest format I could off
the top of my head, and maybe there are gaping holes in it. But my
point is that we should be able to come up with a much simpler
format, one that keeps most characters to a single byte, not that my
format is best. All I want to argue is that UTF-8 is the worst. ;)